LexBib bibliodata workflow overview: Difference between revisions

*# The article title recorded in Zotero.
* The script also lemmatizes the text bodies using [https://spacy.io/ spaCy]; this currently works for English and Spanish.
* [https://github.com/elexis-eu/elexifinder/blob/master/wikibase/buildtermindex.py buildtermindex.py] finds labels (lexicalisations) of [[LexVoc]] terms in the full text JSON file. Term labels are also searched for in a lemmatized version of the text, which is relevant for many multiword terms. Term labels that produce many false positives, because they are ambiguous or also common in general language ("article", "case", "example", etc.), are [[LexVoc#Stop-labels|filtered using a stoplist]]. This currently works for English and Spanish.
* The script also collects frequency data:
** Mention counts (hits) for the label(s) of each term in each article
** Relative frequency for the label(s) of each term in each article (hits/tokens).
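
The term indexing step described above can be illustrated with a minimal Python sketch. This is not the code of buildtermindex.py: the function names, the term IDs (<code>T1</code>, <code>T2</code>, <code>T3</code>), the regex tokenizer, and the hand-written lemma list are all assumptions made for illustration. In particular, it assumes that the token list and the lemma list are aligned (same length), and that a position counts as one hit if the label matches either the surface tokens or the lemmas there.

```python
import re

# Hypothetical stoplist; the three labels are the examples named in the text.
STOP_LABELS = {"article", "case", "example"}


def tokenize(text):
    """Lowercase word tokenizer (a simple stand-in, not the real pipeline)."""
    return re.findall(r"\w+", text.lower())


def term_stats(tokens, lemmas, term_labels):
    """Return {term_id: (hits, relative_frequency)} for one article.

    tokens and lemmas are assumed to be aligned lists of equal length.
    A position is counted once if the label matches the surface tokens
    or the lemmas starting at that position.
    """
    total = len(tokens)
    stats = {}
    for term_id, labels in term_labels.items():
        hits = 0
        for label in labels:
            if label.lower() in STOP_LABELS:
                continue  # skip labels that are too ambiguous to count
            label_tokens = tokenize(label)
            n = len(label_tokens)
            if n == 0:
                continue
            for i in range(total - n + 1):
                if tokens[i:i + n] == label_tokens or lemmas[i:i + n] == label_tokens:
                    hits += 1
        if hits:
            # Relative frequency as described above: hits / tokens.
            stats[term_id] = (hits, hits / total)
    return stats


text = "Learner dictionaries differ from a bilingual dictionary in this case."
tokens = tokenize(text)
# Hand-written lemmas for this example; a real run would take them from a lemmatizer.
lemmas = ["learner", "dictionary", "differ", "from", "a",
          "bilingual", "dictionary", "in", "this", "case"]

# Hypothetical term IDs mapped to their labels.
term_labels = {
    "T1": ["learner dictionary"],
    "T2": ["bilingual dictionary"],
    "T3": ["case"],
}

stats = term_stats(tokens, lemmas, term_labels)
```

In this example, "learner dictionary" is only found through the lemmatized version (the surface form is "Learner dictionaries"), which shows why lemma matching matters for multiword terms, while "case" is dropped by the stoplist.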