Line 58: | Line 58: | ||
*# The article title recorded in Zotero. | *# The article title recorded in Zotero. | ||
* The script also lemmatizes the text bodies (this currently works for English and Spanish, using [https://spacy.io/ spaCy]). | * The script also lemmatizes the text bodies (this currently works for English and Spanish, using [https://spacy.io/ spaCy]). | ||
* [https://github.com/elexis-eu/elexifinder/blob/master/wikibase/buildtermindex.py buildtermindex.py] finds labels (lexicalisations) of [[LexVoc]] terms in the full text JSON file. | * [https://github.com/elexis-eu/elexifinder/blob/master/wikibase/buildtermindex.py buildtermindex.py] finds labels (lexicalisations) of [[LexVoc]] terms in the full text JSON file. Term labels are also searched for in a lemmatized version of the text (this is relevant for many multiword terms). Term labels that produce many false positives, because they are ambiguous or also common in general language ("article", "case", "example", etc.), are [[LexVoc#Stop-labels|filtered using a stoplist]]. Lemmatized matching and stoplist filtering currently work for English and Spanish. | ||
* The script also collects frequency data: | |||
** Mention counts (hits) for the label(s) of each term in each article | ** Mention counts (hits) for the label(s) of each term in each article | ||
** Relative frequency for the label(s) of each term in each article (hits/tokens). | ** Relative frequency for the label(s) of each term in each article (hits/tokens). |
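The two frequency measures above can be illustrated with a minimal sketch. It is not taken from buildtermindex.py: the function names, the term IDs (<code>Q1</code>, <code>Q2</code>, <code>Q3</code>), the sample sentence and the stoplist are all hypothetical, and the text is assumed to be already tokenized (and, where needed, lemmatized). Multiword labels are matched as contiguous token sequences, and stoplisted labels are skipped before counting.

```python
from collections import Counter


def count_label_hits(tokens, labels, stoplist=frozenset()):
    """Count occurrences of each (possibly multiword) label in a token list.

    Hypothetical helper: matching is case-insensitive and labels found in
    the stoplist are ignored.
    """
    lowered = [t.lower() for t in tokens]
    hits = Counter()
    for label in labels:
        if label.lower() in stoplist:
            continue  # label is too ambiguous, skip it
        parts = label.lower().split()
        n = len(parts)
        # Slide a window of n tokens over the text and compare.
        for i in range(len(lowered) - n + 1):
            if lowered[i:i + n] == parts:
                hits[label] += 1
    return hits


def term_frequencies(tokens, term_labels, stoplist=frozenset()):
    """Return hits and relative frequency (hits/tokens) per term for one article."""
    total = len(tokens)
    result = {}
    for term_id, labels in term_labels.items():
        hits = sum(count_label_hits(tokens, labels, stoplist).values())
        if hits:  # terms without hits are left out
            result[term_id] = {"hits": hits, "relfreq": hits / total}
    return result


# Hypothetical sample data: 10 tokens, three terms, one stoplisted label.
tokens = "the bilingual dictionary lists each headword and the dictionary entry".split()
term_labels = {
    "Q1": ["dictionary"],
    "Q2": ["bilingual dictionary"],
    "Q3": ["entry", "example"],
}
stats = term_frequencies(tokens, term_labels, stoplist={"example"})
```

In this sketch, hits for all labels of a term are summed, so a term with several lexicalisations gets one combined count per article. Whether the actual script aggregates in the same way is an assumption.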