LexBib bibliodata workflow overview: Difference between revisions

304 bytes added, 2 years ago

@@ Line 58: / Line 58: @@
 *# The article title recorded in Zotero.
 * The script also lemmatizes the text bodies (this currently works for English and Spanish, using [https://spacy.io/ spaCy]).
-* [https://github.com/elexis-eu/elexifinder/blob/master/wikibase/buildtermindex.py buildtermindex.py] finds labels (lexicalisations) of [[LexVoc]] terms in the full text JSON file. This works now for English and Spanish. It also collects frequency data:
+* [https://github.com/elexis-eu/elexifinder/blob/master/wikibase/buildtermindex.py buildtermindex.py] finds labels (lexicalisations) of [[LexVoc]] terms in the full text JSON file. Term labels are also searched for in the lemmatized version of the text, which matters for many multiword terms. Term labels that produce many false positives, because they are ambiguous or also common in general language ("article", "case", "example", etc.), are [[LexVoc#Stop-labels|filtered using a stoplist]]. This currently works for English and Spanish.
+* The script also collects frequency data:
 ** Mention counts (hits) for the label(s) of each term in each article
 ** Relative frequency for the label(s) of each term in each article (hits/tokens).
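The label matching and frequency counting described above can be illustrated with a minimal sketch. This is not the code of buildtermindex.py: the function names, the term identifiers (<code>T1</code>, <code>T2</code>, <code>T3</code>) and the input layout (parallel lists of surface tokens and lemmas) are assumptions made for illustration only.

```python
def find_label_positions(sequence, label_tokens):
    """Return the start positions where label_tokens occurs in sequence."""
    n = len(label_tokens)
    return {
        i
        for i in range(len(sequence) - n + 1)
        if sequence[i:i + n] == label_tokens
    }


def term_frequencies(tokens, lemmas, term_labels, stop_labels=frozenset()):
    """Count label mentions per term in one article (hypothetical helper).

    tokens and lemmas are parallel lists for the same text body.
    term_labels maps a term identifier to a list of its labels.
    Labels listed in stop_labels are skipped entirely.
    """
    tokens = [t.lower() for t in tokens]
    lemmas = [l.lower() for l in lemmas]
    stats = {}
    for term_id, labels in term_labels.items():
        positions = set()
        for label in labels:
            if label.lower() in stop_labels:
                continue
            label_tokens = label.lower().split()
            if not label_tokens:
                continue
            # Search the surface text and the lemmatized text; a set of start
            # positions keeps a mention found in both from being counted twice.
            positions |= find_label_positions(tokens, label_tokens)
            positions |= find_label_positions(lemmas, label_tokens)
        hits = len(positions)
        if hits:
            # Relative frequency is hits divided by the article's token count.
            stats[term_id] = {"hits": hits, "relfreq": hits / len(tokens)}
    return stats


tokens = "Bilingual dictionaries list headwords ; each dictionary entry has an example".split()
lemmas = "bilingual dictionary list headword ; each dictionary entry have a example".split()
term_labels = {
    "T1": ["bilingual dictionary"],
    "T2": ["headword"],
    "T3": ["example"],
}
stats = term_frequencies(tokens, lemmas, term_labels, stop_labels={"example"})
```

In this example the multiword label "bilingual dictionary" is only found through the lemmatized text, because the surface text has the plural "dictionaries", and the stop-label "example" produces no entry at all.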