LexBib bibliodata workflow overview: Difference between revisions

* In cases where GROBID fails, we manually isolate the text body from the Zotero-produced TXT ([https://lexbib.org/blog/grobid-txt-validation/ tutorial]).


=LexBib wikibase: Conversion into Linked Data=
 
LexBib wikibase is the central data repository, where Zotero literal values (text strings) are disambiguated to ontology entities, and where bibliographic items and LexVoc terms (as content indicators) come together. Anybody can access the wikibase content (through the GUI, the API, or SPARQL), and registered users can edit it (manually or through the API).


==Zotero export==


==Author Disambiguation: Open Refine==
* For Elexifinder version 2 (spring 2021), we reduced the roughly 5,000 distinct person names then present in the database to around 4,000 unique person items, using the clustering algorithms in [http://openrefine.org OpenRefine]. Persons in LexBib have up to six name variants (see query at [[Main_Page#See_what.27s_in_the_database|Main Page]]).
* For subsequent updates, we use our own [https://github.com/wetneb/openrefine-wikibase wikibase reconciliation service with OpenRefine]. This means that person name literals are matched against the person items that already exist in LexBib wikibase. [https://github.com/elexis-eu/elexifinder/blob/master/wikibase/sparql/authorliteralsforopenrefine.rq This query] exports the wikibase statements that point to unmatched persons; [https://github.com/elexis-eu/elexifinder/blob/master/wikibase/maintenance/newcreatorsfromopenrefine.py newcreatorsfromopenrefine.py] then re-imports the matched person items, creates new items for names that have remained unmatched, and updates the statements.
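
To illustrate how clustering can collapse name variants, here is a minimal Python sketch of a key-collision "fingerprint" method, one of the clustering approaches OpenRefine offers. The text above does not say which OpenRefine method was used for LexBib, so this is an assumption for illustration only. The function names and the sample names are hypothetical.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(name: str) -> str:
    """Build a normalized key; names sharing a key are candidate variants."""
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    unaccented = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Lowercase and remove ASCII punctuation.
    cleaned = unaccented.lower().translate(
        str.maketrans("", "", string.punctuation)
    )
    # Sort and deduplicate tokens so that word order does not matter.
    return " ".join(sorted(set(cleaned.split())))


def cluster_names(names):
    """Group name literals by fingerprint (hypothetical helper)."""
    clusters = defaultdict(list)
    for name in names:
        clusters[fingerprint(name)].append(name)
    return dict(clusters)


# Hypothetical sample literals, not taken from LexBib.
sample = ["Doe, Jane", "Jane Doe", "Müller, Hans", "Hans Muller", "Roe, Richard"]
clusters = cluster_names(sample)
for key, variants in clusters.items():
    print(key, variants)
```

A key-collision method like this only proposes candidate clusters. Initials, transliterations, and homonymous persons still need human review before variants are merged into one person item.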


==Indexation of bibliographical items with [[LexVoc]] terms==
* [https://github.com/elexis-eu/elexifinder/blob/master/wikibase/buildbodytxts.py buildbodytxts.py] produces a large JSON file containing the full text bodies, as needed for the indexation process and for the Elexifinder export (see below). Each full text body is taken from the first available of the following sources, in descending order of priority:
*# Manually produced full text body TXT.
*# GROBID-produced full text body TXT.
*# Zotero-produced "pdf2txt" raw TXT.
*# The abstract recorded in Zotero.
*# The article title recorded in Zotero.
* The script also lemmatizes the text bodies (currently for English and Spanish, using [https://spacy.io/ spaCy]).
* [https://github.com/elexis-eu/elexifinder/blob/master/wikibase/buildtermindex.py buildtermindex.py] finds labels (lexicalisations) of [[LexVoc]] terms in the full text JSON file. This currently works for English and Spanish. It also collects frequency data:
** Mention counts (hits) for the label(s) of each term in each article.
** Relative frequency for the label(s) of each term in each article (hits/tokens).
* [https://github.com/elexis-eu/elexifinder/blob/master/wikibase/writefoundterms.py writefoundterms.py] can upload this information to LexBib wikibase. This is an expensive process; it will be run soon for this version of LexBib, as it was for the [http://data.lexbib.org previous (experimental) version] (see [https://data.lexbib.org/wiki/Item:Q385 example]).
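
The source fallback and the frequency figures described in this section can be sketched in a few lines of Python. This is not the code of buildbodytxts.py or buildtermindex.py. The dictionary keys, function names, and sample texts are hypothetical. Lemmatization is left out, so matching here runs on surface forms only.

```python
import re

# Source keys in descending priority, mirroring the ranking listed above.
# The key names are hypothetical; they are not taken from buildbodytxts.py.
SOURCE_PRIORITY = ["manual_txt", "grobid_txt", "zotero_pdf2txt", "abstract", "title"]


def pick_text_body(sources):
    """Return (source_key, text) for the highest-ranked non-empty source."""
    for key in SOURCE_PRIORITY:
        text = sources.get(key)
        if text and text.strip():
            return key, text
    return None, ""


def term_stats(text, labels):
    """Return (hits, hits/tokens) for the label(s) of one term in one text."""
    lowered = text.lower()
    tokens = re.findall(r"\w+", lowered)
    hits = 0
    for label in labels:
        # Whole-word match of a (possibly multiword) label.
        # Overlapping labels of the same term would be counted separately.
        pattern = r"\b" + re.escape(label.lower()) + r"\b"
        hits += len(re.findall(pattern, lowered))
    rel_freq = hits / len(tokens) if tokens else 0.0
    return hits, rel_freq


# Hypothetical record: no manual or GROBID text, so the abstract is used.
record = {
    "grobid_txt": "",
    "abstract": "A learner dictionary is a dictionary for learners.",
    "title": "Learner dictionaries",
}
source, body = pick_text_body(record)
hits, rel_freq = term_stats(body, ["dictionary"])
print(source, hits, rel_freq)
```

Dividing hits by the token count makes the figures comparable between a full article body and a short fallback such as an abstract or title, which raw hit counts alone would not allow.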


==Elexifinder export==