LexBib bibliodata workflow overview

 
** '':container Qxx'' points to a containing item (a [[Item:Q12|BibCollection]] item describing a journal issue, an edited volume)
** '':event Qxx'' points to a corresponding event (an item describing a conference iteration, a workshop). A property pointing to the event location is attached to the LexBib wikibase [[Item:Q6|Event]] item.
** '':abstractLanguage en'' indicates that the abstract contained in the Zotero record is given in [[Item:Q201|English]] (and not in the language of the article, as stated in the "language" field).
** '':collection x'' points to an Elexifinder collection number.
** '':type Review'' classifies the item as [[Item:Q15|review article]].
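
As a minimal illustration (a sketch, not the actual workflow code), the following Python snippet shows one way such '':key value'' tags could be interpreted; the function name and the assumption that the tags arrive as plain strings are illustrative, since this section does not show how the tags are stored in Zotero.

<syntaxhighlight lang="python">
import re

# Illustrative sketch: interpret LexBib-style ":key value" tags attached to a
# Zotero record. How the tags are stored is not shown in this section, so the
# function simply takes a list of tag strings.
TAG_PATTERN = re.compile(r"^:(\w+)\s+(.+)$")

def parse_lexbib_tags(tag_strings):
    """Return a dict mapping directive names to lists of values."""
    directives = {}
    for raw in tag_strings:
        match = TAG_PATTERN.match(raw.strip())
        if match:
            key, value = match.groups()
            directives.setdefault(key, []).append(value)
    return directives

print(parse_lexbib_tags([":container Q1234", ":abstractLanguage en", ":type Review"]))
# {'container': ['Q1234'], 'abstractLanguage': ['en'], 'type': ['Review']}
</syntaxhighlight>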
=LexBib Wikibase=


LexBib wikibase is the central data repository, where Zotero literal values (text strings) are disambiguated to ontology entities, and where bibliographic items and LexVoc terms (as content indicators) come together. Wikibase content can be accessed (GUI, API, SPARQL) by everybody, and edited by registered users (manually or via API).
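
As a minimal illustration of programmatic access, the sketch below sends a SPARQL query to the query service from Python; the endpoint URL is a placeholder (not the confirmed LexBib address), and the query itself is only an example.

<syntaxhighlight lang="python">
import requests

# Placeholder endpoint: replace with the actual LexBib SPARQL service URL.
ENDPOINT = "https://lexbib.example.org/query/sparql"

QUERY = """
SELECT ?item ?itemLabel WHERE {
  ?item rdfs:label ?itemLabel .
  FILTER(LANG(?itemLabel) = "en")
} LIMIT 10
"""

def run_query(query):
    """Send a SPARQL query and return the JSON result bindings."""
    response = requests.get(
        ENDPOINT,
        params={"query": query},
        headers={"Accept": "application/sparql-results+json"},
    )
    response.raise_for_status()
    return response.json()["results"]["bindings"]

for row in run_query(QUERY):
    print(row["item"]["value"], row["itemLabel"]["value"])
</syntaxhighlight>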


==Zotero export==
==Author Disambiguation: Open Refine==


* For Elexifinder version 2 (spring 2021), we reduced the roughly 5,000 distinct person names present in the database at that time to around 4,000 unique person items, using clustering algorithms in [http://openrefine.org Open Refine]. Persons in LexBib have up to six name variants (see query at [[Main_Page#See_what.27s_in_the_database|Main Page]]).
* For subsequent updates, we use our own [https://github.com/wetneb/openrefine-wikibase wikibase reconciliation service with Open Refine]. This means that person name literals are matched against person items existing in LexBib wikibase, where all name literals previously matched to a person item are stored (see the sketch below this list). [https://github.com/elexis-eu/elexifinder/blob/master/wikibase/sparql/authorliteralsforopenrefine.rq This query] exports wikibase statements pointing to unmatched persons, and [https://github.com/elexis-eu/elexifinder/blob/master/wikibase/maintenance/newcreatorsfromopenrefine.py newcreatorsfromopenrefine.py] processes the reconciliation results, creates new items for names that have remained unmatched, and updates the statements and the literals associated with persons.
* This part of the workflow will soon be simplified, as the [http://wikibase.cloud wikibase.cloud] developers are about to build Open Refine into wikibase, i.e. a wikibase.cloud wikibase will by default ship its own Open Refine instance for the reconciliation of literal values (i.e. matching literals to wikibase items) and for uploading reconciliation results to wikibase. This provides a shortcut for the export-reconciliation-import process described above, which still involves manual configuration of the Open Refine tool and of our own reconciliation service, as well as of the upload process for reconciled data.
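
For illustration, the sketch below sends a single person name literal to a reconciliation service speaking the Open Refine reconciliation protocol; the endpoint URL is a placeholder, and [[Item:Q5|Q5]] is used as the person class.

<syntaxhighlight lang="python">
import json
import requests

# Placeholder endpoint: the real reconciliation service is configured per wikibase.
RECON_ENDPOINT = "https://lexbib-reconcile.example.org/en/api"

def reconcile_person(name_literal):
    """Return candidate person items for a name literal, best matches first."""
    queries = {"q0": {"query": name_literal, "type": "Q5"}}  # Q5 = LexBib person class
    response = requests.post(RECON_ENDPOINT, data={"queries": json.dumps(queries)})
    response.raise_for_status()
    return response.json()["q0"]["result"]

for candidate in reconcile_person("Lindemann, David"):
    # Each candidate carries an item id, a label, a score, and a "match" flag.
    print(candidate["id"], candidate["name"], candidate["score"], candidate.get("match"))
</syntaxhighlight>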


==Indexation of bibliographical items with [[LexVoc]] terms==
*# Manually produced full text body TXT.
*# GROBID-produced full text body TXT.
*# Zotero-produced "pdf2txt" raw TXT.
*# The abstract recorded in Zotero.
*# The article title recorded in Zotero.
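
The numbered list above suggests an order of preference among text sources; the sketch below shows one way such a fallback could be implemented. The dictionary keys and the function name are illustrative and not taken from the actual pipeline.

<syntaxhighlight lang="python">
# Illustrative source names, ordered from highest to lowest priority.
SOURCE_PRIORITY = [
    "manual_txt",   # 1. manually produced full text body TXT
    "grobid_txt",   # 2. GROBID-produced full text body TXT
    "pdf2txt_raw",  # 3. Zotero-produced "pdf2txt" raw TXT
    "abstract",     # 4. abstract recorded in Zotero
    "title",        # 5. article title recorded in Zotero
]

def best_text_source(item_texts):
    """Return (source_name, text) for the highest-priority non-empty source."""
    for source in SOURCE_PRIORITY:
        text = item_texts.get(source)
        if text and text.strip():
            return source, text
    return None, ""

# Example: only an abstract and a title are available for this item.
print(best_text_source({"abstract": "This paper presents ...", "title": "A study"})[0])
# abstract
</syntaxhighlight>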
* For all items:
** Setting of item descriptions (schema:description) according to class. For example, a BibItem (class [[Item:Q3|Q3]]) receives a description containing author last names and year, such as "Publication by Kosem & Lindemann (2021)". This is useful for visual disambiguation of items in LexBib search results (see the sketch after this list).
** Updating of statements pointing to redirect items, i.e. to items that have been merged into another item.
* Related to LexVoc:
** Updating of skos:narrower ([[Property:P73|P73]]) relations according to skos:broader ([[Property:P72|P72]]), the inverse relation.
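
As an illustration of the description-setting routine, the sketch below composes such a description from author last names and a year; it is a simplification, and the abbreviation rule for longer author lists is an assumption rather than the bot's documented behaviour.

<syntaxhighlight lang="python">
def bibitem_description(last_names, year):
    """Build a schema:description string for a BibItem,
    e.g. "Publication by Kosem & Lindemann (2021)"."""
    if not last_names:
        authors = "unknown author"
    elif len(last_names) == 1:
        authors = last_names[0]
    elif len(last_names) == 2:
        authors = " & ".join(last_names)
    else:
        # Assumed abbreviation for longer author lists.
        authors = f"{last_names[0]} et al."
    return f"Publication by {authors} ({year})"

print(bibitem_description(["Kosem", "Lindemann"], 2021))
# Publication by Kosem & Lindemann (2021)
</syntaxhighlight>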
A [https://github.com/elexis-eu/elexifinder/tree/master/wikibase/lexvoc-lexonomy set of python scripts] performs transformations from and to Lexonomy XML format. This is needed for [[LexVoc translation on Lexonomy]].
* [https://github.com/elexis-eu/elexifinder/blob/master/wikibase/lexvoc-lexonomy/buildlexonomy.py buildlexonomy.py] builds [[LexVoc Lexonomy|38 bilingual Lexonomy XML dictionaries]] out of LexVoc SKOS data.
* [https://github.com/elexis-eu/elexifinder/blob/master/wikibase/lexvoc-lexonomy/mergeddict2lwb.py mergeddict2lwb.py] collects translation equivalents from Lexonomy XML ([https://lexonomy.elex.is/LexVoc/ merged on Lexonomy server]), and writes them to LexBib wikibase.
* [https://github.com/elexis-eu/elexifinder/blob/master/wikibase/lexvoc-lexonomy/getdicts.py getdicts.py] collects translation equivalents from Lexonomy XML (single dictionary).
* [https://github.com/elexis-eu/elexifinder/blob/master/wikibase/lexvoc-lexonomy/statsfrommergeddict.py statsfrommergeddict.py] and getstats.py produce data rows about translation progress.
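
For illustration, the sketch below extracts translation equivalents from a Lexonomy-style XML fragment; the element and attribute names are assumed for the example and may differ from the actual LexVoc dictionary schema.

<syntaxhighlight lang="python">
import xml.etree.ElementTree as ET

# Assumed XML structure, for illustration only.
SAMPLE_XML = """<dictionary>
  <entry id="Q20000">
    <headword xml:lang="en">dictionary</headword>
    <translation xml:lang="de">Wörterbuch</translation>
  </entry>
</dictionary>"""

def collect_equivalents(xml_text):
    """Return (entry_id, target_language, translation) tuples from the XML."""
    root = ET.fromstring(xml_text)
    rows = []
    for entry in root.iter("entry"):
        for trans in entry.iter("translation"):
            lang = trans.get("{http://www.w3.org/XML/1998/namespace}lang")
            rows.append((entry.get("id"), lang, (trans.text or "").strip()))
    return rows

print(collect_equivalents(SAMPLE_XML))
# [('Q20000', 'de', 'Wörterbuch')]
</syntaxhighlight>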
* Indexation of bibliographical items written in languages other than English and Spanish
** As soon as LexVoc translation is completed and a lemmatization procedure for other languages is implemented.
* Evaluation of [[LexVoc]] terms as content-describing indicators
** Idea: Authors rate the content descriptors (LexVoc terms) assigned to their articles. The rating can be used to improve the indexation process (e.g. discard descriptors repeatedly marked as irrelevant, or prioritize descriptors according to a certain frequency threshold).
* Alignment of [[Item:Q5|person]] (and [[Item:Q11|organization]]) items to Wikidata and VIAF: