LexBib bibliodata workflow overview: Difference between revisions

(15 intermediate revisions by the same user not shown)
Line 10: Line 10:
* For scraping publication metadata from web pages (e.g. article 'landing pages' in journal or publisher portals), the Zotero software includes so-called [https://www.zotero.org/support/translators translators], which ingest bibliodata as single items or in batches. Zotero will also try to harvest the PDF. If it finds a PDF, it also produces a TXT version.
* We transform bibliodata that reaches us as tabular data to [https://en.wikipedia.org/wiki/RIS_(file_format) RIS format], with [https://github.com/elexis-eu/elexifinder/tree/master/BibDataConverters own converters]. RIS is straightforwardly imported by Zotero, and, if needed, exported, manipulated using regular expressions, and re-imported.
* We can update the Zotero database using the Zotero API. For example, we can update author first and last names according to their preferred form in LexBib wikibase.
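
A minimal sketch of such a name update, under stated assumptions: the <code>PREFERRED</code> lookup and both function names are hypothetical, and the example names are invented. In practice the preferred forms would come from LexBib wikibase. The helper only builds the partial item data (the <code>creators</code> list); sending it to the Zotero API with an API key is left out.

```python
from typing import Dict, List, Tuple

# Hypothetical lookup: (firstName, lastName) variant -> preferred form.
# In practice this would be derived from LexBib wikibase person items.
PREFERRED: Dict[Tuple[str, str], Tuple[str, str]] = {
    ("J.", "Doe"): ("Jane", "Doe"),
}


def normalize_creators(creators: List[dict]) -> List[dict]:
    """Return a copy of a Zotero 'creators' list with preferred name forms."""
    updated = []
    for creator in creators:
        key = (creator.get("firstName", ""), creator.get("lastName", ""))
        new = dict(creator)  # copy, so the input record is left untouched
        if key in PREFERRED:
            new["firstName"], new["lastName"] = PREFERRED[key]
        updated.append(new)
    return updated


def build_patch(item_data: dict) -> dict:
    """Build the partial JSON body for updating one item's creators."""
    return {"creators": normalize_creators(item_data.get("creators", []))}
```

Creators that are not in the lookup, including single-field names, pass through unchanged.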


==Manual curation==


* The editing team uses [https://www.zotero.org/groups/ Zotero group synchronization] ([https://lexbib.org/blog/getting-started-with-zotero/ tutorial]).
* Completeness of publication metadata is manually checked.
* Every item is annotated with the first author's location, which is required for the dataset to be exported to [[Elexifinder]]. The URL of an English Wikipedia page (as an unambiguous identifier) is placed in the Zotero "extra" field; zotexport.py (see below) maps it to the corresponding LexBib place item ([https://lexbib.org/blog/author-and-article-location-tutorial/ tutorial]).
* The Zotero "language" field (publication language) must contain a two-letter ISO 639-1 or a three-letter ISO 639-3 language code.
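
The two field conventions above can be checked mechanically before export. The following is only a sketch: the function names are hypothetical, the language check tests the shape of the code (two or three lowercase letters, lowercase being an assumption) rather than its existence in ISO 639, and it is not taken from zotexport.py.

```python
import re
from typing import Optional

# Shape check only: two or three lowercase letters (assumption: lowercase codes).
LANG_CODE = re.compile(r"[a-z]{2,3}")
# An English Wikipedia page URL anywhere in the "extra" field.
WIKIPEDIA_URL = re.compile(r"https?://en\.wikipedia\.org/wiki/\S+")


def valid_language(value: str) -> bool:
    """True if the 'language' field looks like an ISO 639-1 or 639-3 code."""
    return LANG_CODE.fullmatch(value.strip()) is not None


def author_location_url(extra: str) -> Optional[str]:
    """Return the first English Wikipedia URL found in 'extra', if any."""
    match = WIKIPEDIA_URL.search(extra or "")
    return match.group(0) if match else None
```

Items for which <code>author_location_url</code> returns <code>None</code> would still need manual annotation.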
Line 20: Line 22:
** '':container Qxx'' points to a containing item (a [[Item:Q12|BibCollection]] item describing a journal issue, an edited volume)
** '':event Qxx'' points to a corresponding event (an item describing a conference iteration, a workshop). A property pointing to the event location is attached to the LexBib wikibase [[Item:Q6|Event]] item.
** '':abstractLanguage en'' indicates that the abstract contained in the Zotero record is given in [[Item:Q201|English]] (and not in the language of the article, as stated in the "language" field).
** '':collection x'' points to an Elexifinder collection number.
** '':type Review'' classifies the item as [[Item:Q15|review article]].
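
Tags of this kind are easy to separate from ordinary keyword tags, since they start with a colon and carry one value. A minimal sketch of such a parser follows; <code>parse_tags</code> is a hypothetical name, and how zotexport.py actually reads the tags is not shown here.

```python
from typing import Dict, List


def parse_tags(tags: List[str]) -> Dict[str, List[str]]:
    """Split colon-prefixed Zotero tags into {name: [values]}."""
    parsed: Dict[str, List[str]] = {}
    for tag in tags:
        tag = tag.strip()
        if not tag.startswith(":"):
            continue  # ordinary tag without a leading colon: ignored here
        # ":container Q123" -> name "container", value "Q123"
        name, _, value = tag[1:].partition(" ")
        if value.strip():
            parsed.setdefault(name, []).append(value.strip())
    return parsed
```

Values are collected in lists, so that an item carrying the same tag name more than once does not lose information.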
Line 29: Line 31:


* We use the [https://grobid.readthedocs.io GROBID tool]. zotexport.py (see below) leaves a copy of all PDFs in a folder, which is processed by GROBID; GROBID produces a TEI-XML representation of the PDF content. The article body (i.e. the part usually starting after the abstract and ending before the references section) is enclosed in a tag called <body>.
* In cases where GROBID fails, we manually isolate the text body from the Zotero-produced TXT ([https://lexbib.org/blog/grobid-txt-validation/ tutorial]). Full texts that do not follow a standard structure, most typically because they don't contain an abstract (this is usual in book chapters), are often not properly parsed by GROBID.
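
Extracting the body from a GROBID TEI-XML file can be sketched as follows, using only the Python standard library. The function name <code>body_text</code> is hypothetical; the namespace is the standard TEI namespace. An empty result can serve as a signal that the document needs the manual fallback.

```python
import xml.etree.ElementTree as ET

# Standard TEI namespace, used by GROBID's TEI-XML output.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def body_text(tei_xml: str) -> str:
    """Return the plain text inside the TEI <body> element, or '' if absent."""
    root = ET.fromstring(tei_xml)
    body = root.find(".//tei:body", TEI_NS)
    if body is None:
        return ""
    # itertext() yields the text of all nested elements in document order.
    # Joining with spaces keeps headings and paragraphs apart (at the cost of
    # a space around inline elements); split/join collapses whitespace runs.
    return " ".join(" ".join(body.itertext()).split())
```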


=LexBib Wikibase=


LexBib wikibase is the central data repository, where Zotero literal values (text strings) are disambiguated to ontology entities, and where bibliographic items and LexVoc terms (as content indicators) come together. Wikibase content can be accessed (GUI, API, SPARQL) by everybody, and edited by registered users (manually or via the API).


==Zotero export==
Line 47: Line 49:
==Author Disambiguation: Open Refine==


* For Elexifinder version 2 (spring 2021), we reduced the roughly 5,000 different person names then present in the database to around 4,000 unique person items, using clustering algorithms in [http://openrefine.org Open Refine]. Persons in LexBib have up to six name variants (see query at [[Main_Page#See_what.27s_in_the_database|Main Page]]).
* For subsequent updates, we use our own [https://github.com/wetneb/openrefine-wikibase wikibase reconciliation service with open refine]. This means that person name literals are matched against person items existing in LexBib wikibase, where all name literals previously matched to a person item are stored. [https://github.com/elexis-eu/elexifinder/blob/master/wikibase/sparql/authorliteralsforopenrefine.rq This query] exports wikibase statements pointing to unmatched persons, and [https://github.com/elexis-eu/elexifinder/blob/master/wikibase/maintenance/newcreatorsfromopenrefine.py newcreatorsfromopenrefine.py] processes the reconciliation results, creates new items for those names that have remained unmatched, and updates the statements and the literals associated with persons.
* This part of the workflow will soon be simplified, as [http://wikibase.cloud wikibase.cloud] developers are about to build OpenRefine into wikibase, i.e. a wikibase.cloud wikibase will by default ship its own Open Refine instance for reconciliation of literal values (i.e. matching literals to wikibase items), and for uploading reconciliation results to wikibase. This offers a shortcut for the export-reconciliation-import process described above, which still involves manual configuration of the Open Refine tool and of our own reconciliation service, as well as of the upload process for reconciled data.
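
To illustrate what clustering name variants means, here is a sketch of key-collision clustering with a "fingerprint" key, one of the methods Open Refine offers. Which clustering methods were actually applied to the LexBib person names is not stated above, so this is an illustration of the technique only; the example names in the usage note are invented.

```python
import re
import unicodedata
from collections import defaultdict
from typing import Dict, List


def fingerprint(name: str) -> str:
    """Normalize a name to a key: lowercase, no accents or punctuation,
    tokens deduplicated and sorted."""
    text = unicodedata.normalize("NFKD", name.strip().lower())
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^\w\s]", "", text)  # drop punctuation such as commas
    return " ".join(sorted(set(text.split())))


def cluster(names: List[str]) -> Dict[str, List[str]]:
    """Group name literals that share a fingerprint (candidate duplicates)."""
    groups = defaultdict(list)
    for name in names:
        groups[fingerprint(name)].append(name)
    # Only keys with more than one literal are clusters worth reviewing.
    return {key: members for key, members in groups.items() if len(members) > 1}
```

With this key, "Müller, Anna", "Anna Muller" and "anna müller" fall into one cluster, while abbreviated forms such as "A. Müller" do not and would need a different method or manual review.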


==Indexation of bibliographical items with [[LexVoc]] terms==
Line 56: Line 58:
*# Manually produced full text body TXT.
*# GROBID-produced full text body TXT.
*# Zotero-produced "pdf2txt" raw TXT.
*# The abstract recorded in Zotero.
*# The article title recorded in Zotero.
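
Assuming the numbered list above is an order of preference (the first available source wins), the selection can be sketched as below. The dictionary keys and the function name are hypothetical labels for the five sources, not names used in the LexBib scripts.

```python
from typing import Dict, Optional, Tuple

# Hypothetical keys for the five text sources, in the order listed above.
SOURCE_PRIORITY = ["manual_txt", "grobid_txt", "pdf2txt", "abstract", "title"]


def best_text(sources: Dict[str, Optional[str]]) -> Tuple[Optional[str], str]:
    """Return (source name, text) for the first non-empty source."""
    for key in SOURCE_PRIORITY:
        text = sources.get(key)
        if text and text.strip():
            return key, text
    return None, ""  # nothing usable for this item
```

Returning the source name along with the text makes it possible to record which kind of text a given indexation result was based on.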
Line 64: Line 66:
** Mention counts (hits) for the label(s) of each term in each article
** Relative frequency for the label(s) of each term in each article (hits/tokens).
* [https://github.com/elexis-eu/elexifinder/blob/master/wikibase/writefoundterms.py writefoundterms.py] uploads this information to LexBib wikibase.
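
The two measures named above (hits, and hits/tokens) can be sketched as follows. This is a simplified illustration, not the project's code: tokenization is a plain regular expression, no lemmatization is applied, and all function names are hypothetical.

```python
import re
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase word tokens; a stand-in for a real tokenizer/lemmatizer."""
    return re.findall(r"\w+", text.lower())


def count_label(tokens: List[str], label: str) -> int:
    """Count occurrences of a (possibly multi-word) label as a token sequence."""
    target = tokenize(label)
    n = len(target)
    if n == 0:
        return 0
    return sum(1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == target)


def term_stats(text: str, labels: List[str]) -> Dict[str, float]:
    """Hits for all labels of one term, and hits relative to text length."""
    tokens = tokenize(text)
    # Each label is counted independently, so overlapping labels add up.
    hits = sum(count_label(tokens, label) for label in labels)
    rel = hits / len(tokens) if tokens else 0.0
    return {"hits": hits, "tokens": len(tokens), "relative_frequency": rel}
```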


==Elexifinder export==
Line 70: Line 72:
* [https://github.com/elexis-eu/elexifinder/blob/master/wikibase/export-elexifinder.py export-elexifinder.py] generates a dataset as needed for [[Elexifinder]], based on LexBib wikibase output obtained using SPARQL and API calls.
* The Elexifinder export contains one JSON object for each bibliographical item. Following the [https://github.com/elexis-eu/elexifinder/blob/master/rdf2er/lexbib_rdf_elexifinder_json_mapping.json instructions], it contains the following:
** As disambiguated entities:
*** authors, author locations, event locations, languages, containing items
*** LexVoc terms found in the full text, as [[LexVoc#Elexifinder_Categories|Elexifinder "categories"]].
** Publication title and date
** URL of the corresponding Zotero item
Line 85: Line 88:
* For all items:
** Setting of item descriptions (schema:description) according to class. For example, a BibItem (class [[Item:Q3|Q3]]) receives a description containing author last names and year, such as "Publication by Kosem & Lindemann (2021)". This is useful for visual disambiguation of items in LexBib search results.
** Updating of statements pointing to redirect items, i.e. to items that have been merged into another item.
* Related to LexVoc:
** Updating of skos:narrower ([[Property:P73|P73]]) relations according to skos:broader ([[Property:P72|P72]]), the inverse relation.
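
Deriving skos:narrower from skos:broader amounts to inverting a relation. A minimal sketch, assuming the broader statements have already been fetched as (term, broader term) pairs of item IDs; the function name and the IDs in the usage note are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def narrower_from_broader(broader: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Given (term, broader_term) pairs, derive {term: set of narrower terms}."""
    narrower = defaultdict(set)
    for term, parent in broader:
        # "term has broader parent" implies "parent has narrower term".
        narrower[parent].add(term)
    return dict(narrower)
```

For the pairs <code>("Q2", "Q1")</code> and <code>("Q3", "Q1")</code>, the result maps Q1 to the set of Q2 and Q3; comparing this with the existing P73 statements shows which ones to add or remove.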
Line 93: Line 96:
A [https://github.com/elexis-eu/elexifinder/tree/master/wikibase/lexvoc-lexonomy set of python scripts] performs transformations from and to Lexonomy XML format. This is needed for [[LexVoc translation on Lexonomy]].
* [https://github.com/elexis-eu/elexifinder/blob/master/wikibase/lexvoc-lexonomy/buildlexonomy.py buildlexonomy.py] builds [[LexVoc Lexonomy|38 bilingual Lexonomy XML dictionaries]] out of LexVoc SKOS data.
* [https://github.com/elexis-eu/elexifinder/blob/master/wikibase/lexvoc-lexonomy/mergeddict2lwb.py mergeddict2lwb.py] collects translation equivalents from Lexonomy XML ([https://lexonomy.elex.is/LexVoc/ merged on Lexonomy server]), and writes them to LexBib wikibase.
* [https://github.com/elexis-eu/elexifinder/blob/master/wikibase/lexvoc-lexonomy/getdicts.py getdicts.py] collects translation equivalents from Lexonomy XML (single dictionary).
* [https://github.com/elexis-eu/elexifinder/blob/master/wikibase/lexvoc-lexonomy/statsfrommergeddict.py statsfrommergeddict.py] and getstats.py produce data rows about translation progress
Line 102: Line 105:
* Indexation of bibliographical items written in languages other than English and Spanish
** As soon as LexVoc translation is completed, and a lemmatization procedure for other languages is implemented.
* Evaluation of [[LexVoc]] terms as content-describing indicators
** Idea: Authors rate the content descriptors (LexVoc terms) assigned to their articles. The rating can be used to improve the indexation process (e.g. discard descriptors repeatedly marked as irrelevant, or prioritize descriptors according to a certain frequency threshold).
* Alignment of [[Item:Q5|person]] (and [[Item:Q11|organization]]) items to Wikidata and VIAF: