LexBib bibliodata workflow overview: Difference between revisions

 
(24 intermediate revisions by the same user not shown)
Line 4: Line 4:
* The LexBib [[Project:About|About page]] contains references to written publications and presentations about LexBib and Elexifinder.


=Collection of bibliodata: Zotero=
=Zotero=
==Collection of bibliodata==


* All bibliodata is stored in [[LexBib Zotero]], which is a "group" on the Zotero platform. The group page is public, but item attachments (PDF, TXT) are restricted to registered group members. Member registration is restricted to members of the project.
* All bibliodata is stored in [[LexBib Zotero]], which is a "group library" on the Zotero platform. The group library is public, but item attachments (PDF, TXT) are restricted to registered group members (project members only).
* For scraping publication metadata from web pages (e.g. article 'landing pages' in journal or publisher portals), the Zotero software includes so-called [https://www.zotero.org/support/translators translators], which ingest bibliodata as single items or in batches. Zotero will also try to harvest the PDF. If it finds a PDF, it also produces a TXT version.
* We convert bibliodata that reaches us as tabular data to [https://en.wikipedia.org/wiki/RIS_(file_format) RIS format] using our [https://github.com/elexis-eu/elexifinder/tree/master/BibDataConverters own converters]. Zotero imports RIS directly; if needed, the data can be exported, manipulated using regular expressions, and re-imported.
* We can update the Zotero database using the Zotero API. For example, we can update author first and last names according to their preferred form in LexBib wikibase.
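The tabular-to-RIS step described above can be sketched as follows. This is a minimal illustration, not the project's actual converter: the column names (<code>authors</code>, <code>title</code>, <code>year</code>, <code>journal</code>, <code>language</code>, <code>type</code>) and the semicolon-separated author convention are assumptions made for the example.

```python
def row_to_ris(row):
    """Convert one tabular record (a dict) into an RIS entry string.

    The column names used here are hypothetical; real source tables
    will need their own mapping.
    """
    lines = ["TY  - " + row.get("type", "JOUR")]
    # Assumption: several authors are separated by semicolons.
    for author in row.get("authors", "").split(";"):
        if author.strip():
            lines.append("AU  - " + author.strip())
    # Map the remaining columns to standard RIS tags.
    for column, tag in (("title", "TI"), ("year", "PY"),
                        ("journal", "JO"), ("language", "LA")):
        value = row.get(column, "").strip()
        if value:
            lines.append(f"{tag}  - {value}")
    lines.append("ER  - ")  # end-of-record marker
    return "\n".join(lines)


example = {
    "authors": "Doe, Jane; Roe, Richard",
    "title": "Example title",
    "year": "2021",
    "language": "en",
}
ris_entry = row_to_ris(example)
print(ris_entry)
```

Each RIS line consists of a two-letter tag, two spaces, a hyphen and a space, followed by the value, which is why such files are easy to post-process with regular expressions.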


==Manual curation==


* Completeness of publication metadata is manually checked. The editing team uses [https://www.zotero.org/groups/ Zotero group synchronization] ([https://lexbib.org/blog/getting-started-with-zotero/ tutorial]).
* The editing team uses [https://www.zotero.org/groups/ Zotero group synchronization] ([https://lexbib.org/blog/getting-started-with-zotero/ tutorial]).
* Completeness of publication metadata is manually checked.
* Every item is annotated with the first author's location; the location of the first author is a requirement for the dataset to be exported to [[Elexifinder]]. An English Wikipedia page URL (as unambiguous identifier) is placed in the Zotero "extra" field. zotexport.py (see below) maps that to the corresponding LexBib place item ([https://lexbib.org/blog/author-and-article-location-tutorial/ tutorial]).
* The Zotero "language" field (publication language) must contain a two-letter ISO 639-1 or a three-letter ISO 639-3 language code.
Line 19: Line 22:
** '':container Qxx'' points to a containing item (a [[Item:Q12|BibCollection]] item describing a journal issue, an edited volume)
** '':event Qxx'' points to a corresponding event (an item describing a conference iteration, a workshop). A property pointing to the event location is attached to the LexBib wikibase [[Item:Q6|Event]] item.
** '':abstractLanguage en'' indicates that the abstract contained in the dataset is given in [[Item:Q201|English]] (and not in the language of the article)
** '':abstractLanguage en'' indicates that the abstract contained in the Zotero record is given in [[Item:Q201|English]] (and not in the language of the article, as stated in the "language" field).
** '':collection x'' points to an Elexifinder collection number.
** '':type Review'' classifies the item as [[Item:Q15|review article]].
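The colon-prefixed Zotero tags listed above could be interpreted roughly as in the following sketch. It is not taken from zotexport.py; the function name and the Q-numbers in the example are hypothetical, and the only assumption about the format is that a tag consists of a colon, a keyword, a space and a value.

```python
def parse_lexbib_tags(tags):
    """Collect colon-prefixed tags into a keyword -> list of values mapping.

    Tags that do not start with a colon are treated as ordinary
    Zotero tags and ignored.
    """
    parsed = {}
    for tag in tags:
        if not tag.startswith(":"):
            continue
        # Split ":container Q1234" into keyword "container" and value "Q1234".
        keyword, _, value = tag[1:].partition(" ")
        parsed.setdefault(keyword, []).append(value.strip())
    return parsed


# Hypothetical tag list of a single Zotero item.
item_tags = [":container Q1234", ":event Q5678",
             ":abstractLanguage en", ":type Review", "to-check"]
parsed_tags = parse_lexbib_tags(item_tags)
print(parsed_tags)
```

Values are kept as lists because nothing in the tag convention rules out an item carrying the same keyword more than once.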
Line 28: Line 31:


* We use the [https://grobid.readthedocs.io GROBID tool]. zotexport.py (see below) leaves a copy of every PDF in a folder, which is processed by GROBID; GROBID produces a TEI-XML representation of the PDF content. The article body (i.e. the part usually starting after the abstract and ending before the references section) is enclosed in a tag called <body>.
* In cases where GROBID fails, we manually isolate the text body from the Zotero-produced TXT ([https://lexbib.org/blog/grobid-txt-validation/ tutorial]). Full texts that do not follow a standard structure, most typically because they don't contain an abstract (this is usual in book chapters), are often not propertly parsed by GROBID.
* In cases where GROBID fails, we manually isolate the text body from the Zotero-produced TXT ([https://lexbib.org/blog/grobid-txt-validation/ tutorial]). Full texts that do not follow a standard structure, most typically because they don't contain an abstract (this is usual in book chapters), are often not properly parsed by GROBID.
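As an illustration of how the article body can be pulled out of GROBID's TEI-XML, here is a minimal sketch using only the Python standard library. It is not the project's own code; the sample document is invented, and the function simply concatenates all text found inside the body element, so inline markup may introduce extra spaces.

```python
import xml.etree.ElementTree as ET

# TEI elements live in this XML namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def extract_body_text(tei_xml):
    """Return the plain text of the <body> element, or "" if there is none."""
    root = ET.fromstring(tei_xml)
    body = root.find(".//tei:text/tei:body", TEI_NS)
    if body is None:
        return ""
    # Join all text nodes and normalize whitespace.
    return " ".join(" ".join(body.itertext()).split())


# Invented, heavily shortened TEI document for demonstration.
sample_tei = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt><title>Example</title></titleStmt></fileDesc></teiHeader>
  <text>
    <body><div><head>Introduction</head><p>First paragraph.</p></div></body>
    <back><div><listBibl/></div></back>
  </text>
</TEI>"""

body_text = extract_body_text(sample_tei)
print(body_text)
```

An empty return value is a cheap signal that the document may need the manual TXT route described above.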


=LexBib wikibase: Conversion into Linked Data=
=LexBib Wikibase=


LexBib wikibase is the central data repository, where Zotero literal values (text strings) are disambiguated to ontology entities, and bibliographic items and LexVoc terms (as content indicators) come together. Wikibase content can be accessed (GUI, API, SPARQL) by anybody, and edited by registered users (manually or API).
LexBib wikibase is the central data repository, where Zotero literal values (text strings) are disambiguated to ontology entities, and bibliographic items and LexVoc terms (as content indicators) come together. Wikibase content can be accessed (GUI, API, SPARQL) by everybody, and edited by registered users (manually or via the API).


==Zotero export==


* Items are exported from Zotero using our own [https://github.com/elexis-eu/elexifinder/blob/master/Zotero/LexBib_JSON.js JSON exporter].
* That export is processed using [https://github.com/elexis-eu/elexifinder/blob/master/wikibase/zotexport.py zotexport.py]. This script prepares the upload of the items to Wikibase:
* That export is processed using [https://github.com/elexis-eu/elexifinder/blob/master/wikibase/zotexport.py zotexport.py]. The script prepares the upload of the items to Wikibase:
** Author locations and Zotero tags are interpreted. Unknown places and container items are created, so that the bibliographical item can be linked to them.
** PDFs are stored for GROBID.
Line 46: Line 49:
==Author Disambiguation: Open Refine==


* For Elexifinder version 2 (spring 2021), we reduced the around 5,000 different person names present by that time in the database to around 4,000 unique person items, using clustering algorithms in [http://openrefine.org Open Refine]. Persons in LexBib have up to six name variants (see query at [[Main_Page#See_what.27s_in_the_database|Main Page]]).
* For Elexifinder version 2 (spring 2021), we reduced the roughly 5,000 different person names present in the database at that time to around 4,000 unique person items, using clustering algorithms in [http://openrefine.org Open Refine]. Persons in LexBib have up to six name variants (see query at [[Main_Page#See_what.27s_in_the_database|Main Page]]).
* For subsequent updates, we use our own [https://github.com/wetneb/openrefine-wikibase wikibase reconciliation service with open refine]. That means, that person name literals are matched against person items existing in LexBib wikibase, where all name literals previously matched to a person items are stored. [https://github.com/elexis-eu/elexifinder/blob/master/wikibase/sparql/authorliteralsforopenrefine.rq This query] exports wikibase statements pointing to unmatched persons, and [https://github.com/elexis-eu/elexifinder/blob/master/wikibase/maintenance/newcreatorsfromopenrefine.py newcreatorsfromopenrefine.py] processes the reconciliation results, creates new items for those names that have remained unmatched, and updates the statements and the literals associated to persons.
* For subsequent updates, we use our own [https://github.com/wetneb/openrefine-wikibase wikibase reconciliation service with open refine]. This means that person name literals are matched against person items existing in LexBib wikibase, where all name literals previously matched to a person item are stored. [https://github.com/elexis-eu/elexifinder/blob/master/wikibase/sparql/authorliteralsforopenrefine.rq This query] exports wikibase statements pointing to unmatched persons, and [https://github.com/elexis-eu/elexifinder/blob/master/wikibase/maintenance/newcreatorsfromopenrefine.py newcreatorsfromopenrefine.py] processes the reconciliation results, creates new items for those names that have remained unmatched, and updates the statements and the literals associated with persons.
* This part of the workflow will soon be simplyfied, as [http://wikibase.cloud wikibase.cloud] developers are about to build OpenRefine into wikibase, i.e. a wikibase.cloud wikibase will by default ship its own Open Refine instance for reconciliation of literal values (i.e. their matching to LexBib wikibase ontology entities), and for uploading reconciliation results to wikibase. This means a shortcut for the export-reconciliation-import process described above, wich still involves manual configuration of the Open Refine tool and the own reconciliation service, as well as the upload process for reconciled data.
* This part of the workflow will soon be simplified, as [http://wikibase.cloud wikibase.cloud] developers are about to build OpenRefine into wikibase, i.e. a wikibase.cloud wikibase will by default ship its own Open Refine instance for reconciliation of literal values (i.e. matching literals to wikibase items), and for uploading reconciliation results to wikibase. This means a shortcut for the export-reconciliation-import process described above, which still involves manual configuration of the Open Refine tool and our own reconciliation service, as well as of the upload process for reconciled data.


==Indexation of bibliographical items with [[LexVoc]] terms==
Line 55: Line 58:
*# Manually produced full text body TXT.
*# GROBID-produced full text body TXT.
*# Zotero-produced "pfd2txt" raw TXT.
*# Zotero-produced "pdf2txt" raw TXT.
*# The abstract recorded in Zotero.
*# The article title recorded in Zotero.
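Assuming the numbered list above expresses an order of preference (the best available text source is used, falling back to the next one when it is missing), the selection could look like the sketch below. The dictionary keys are hypothetical labels chosen for this example, not names used by the project's scripts.

```python
# Hypothetical source labels, ordered from most to least preferred.
SOURCE_PRIORITY = (
    "manual_body_txt",   # manually produced full text body
    "grobid_body_txt",   # GROBID-produced full text body
    "zotero_pdf2txt",    # raw TXT produced by Zotero
    "abstract",          # abstract recorded in Zotero
    "title",             # article title recorded in Zotero
)


def pick_text_source(sources):
    """Return (label, text) for the first non-empty source, or (None, "")."""
    for label in SOURCE_PRIORITY:
        text = (sources.get(label) or "").strip()
        if text:
            return label, text
    return None, ""


available = {
    "grobid_body_txt": "",
    "zotero_pdf2txt": "Raw text of the article ...",
    "title": "Example title",
}
chosen_label, chosen_text = pick_text_source(available)
print(chosen_label)
```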
Line 63: Line 66:
** Mention counts (hits) for the label(s) of each term in each article
** Relative frequency for the label(s) of each term in each article (hits/tokens).
* [https://github.com/elexis-eu/elexifinder/blob/master/wikibase/writefoundterms.py writefoundterms.py] can upload this information to LexBib wikibase. This is an expensive process, and will be done soon in this version of LexBib, as it had been done in the [http://data.lexbib.org previous (experimental) version] (see [https://data.lexbib.org/wiki/Item:Q385 example]).
* [https://github.com/elexis-eu/elexifinder/blob/master/wikibase/writefoundterms.py writefoundterms.py] uploads this information to LexBib wikibase.
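The two measures named above, mention counts (hits) and relative frequency (hits/tokens), can be illustrated with a small sketch. It assumes a naive lowercase word tokenizer and matches surface forms only, so it performs no lemmatization; the function names are invented for this example.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def term_stats(text, labels):
    """Return (hits, relative_frequency) for all labels of one term."""
    tokens = tokenize(text)
    hits = 0
    for label in labels:
        label_tokens = tokenize(label)
        n = len(label_tokens)
        if n == 0:
            continue
        # Slide a window of n tokens over the text to catch multiword labels.
        hits += sum(
            1 for i in range(len(tokens) - n + 1)
            if tokens[i:i + n] == label_tokens
        )
    relative_frequency = hits / len(tokens) if tokens else 0.0
    return hits, relative_frequency


text = "A dictionary entry links to another dictionary entry."
hits, rel_freq = term_stats(text, ["dictionary entry"])
print(hits, rel_freq)
```

In this example the text has eight tokens and the label occurs twice, so the relative frequency is 2/8.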


==Elexifinder export==


* [https://github.com/elexis-eu/elexifinder/blob/master/wikibase/export-elexifinder.py elexifinder-export.py] generates dataset as needed for [[Elexifinder]], based on LexBib wikibase output obtained using SPARQL and API calls.
* [https://github.com/elexis-eu/elexifinder/blob/master/wikibase/export-elexifinder.py export-elexifinder.py] generates a dataset as needed for [[Elexifinder]], based on LexBib wikibase output obtained using SPARQL and API calls.
* The Elexifinder export contains one JSON object for each bibliographical item, with the following:
* The Elexifinder export contains one JSON object for each bibliographical item. Following the [https://github.com/elexis-eu/elexifinder/blob/master/rdf2er/lexbib_rdf_elexifinder_json_mapping.json instructions], it contains the following:
** As disambiguated entities: authors, author locations, event locations, languages, containing items
** As disambiguated entities:
** LexVoc terms found in the full text, as [[LexVoc#Elexifinder_Categories|Elexifinder "categories"]].  
*** authors, author locations, event locations, languages, containing items
*** LexVoc terms found in the full text, as [[LexVoc#Elexifinder_Categories|Elexifinder "categories"]].
** Publication title and date
** URL of the corresponding Zotero item
Line 84: Line 88:
* For all items:
** Setting of item descriptions (schema:description) according to class. For example, a BibItem (class [[Item:Q3|Q3]]) receives a description containing author last names and year, such as "Publication by Kosem & Lindemann (2021)". This is useful for visual disambiguation of items in LexBib search results.
** Updating of properties pointing to redirect items, i.e. to items that have been merged to another item.
** Updating of statements pointing to redirect items, i.e. to items that have been merged into another item.
* Related to LexVoc:
** Updating of skos:narrower ([[Property:P73|P73]]) relations according to skos:broader ([[Property:P72|P72]]), the inverse relation.
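Deriving skos:narrower from skos:broader amounts to inverting a mapping. The sketch below shows the idea on plain Python dictionaries; it does not talk to the wikibase, and the item identifiers in the example are hypothetical.

```python
def narrower_from_broader(broader):
    """Invert a child -> set of parents mapping into parent -> set of children."""
    narrower = {}
    for child, parents in broader.items():
        for parent in parents:
            narrower.setdefault(parent, set()).add(child)
    return narrower


# Hypothetical term items: Q9001 and Q9002 both have Q9000 as broader term.
broader_relations = {
    "Q9001": {"Q9000"},
    "Q9002": {"Q9000"},
    "Q9003": {"Q9001"},
}
narrower_relations = narrower_from_broader(broader_relations)
print(narrower_relations)
```

Treating skos:broader as the maintained relation and regenerating skos:narrower from it keeps the two directions from drifting apart.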
Line 91: Line 95:


A [https://github.com/elexis-eu/elexifinder/tree/master/wikibase/lexvoc-lexonomy set of python scripts] performs transformations from and to Lexonomy XML format. This is needed for [[LexVoc translation on Lexonomy]].
* [https://github.com/elexis-eu/elexifinder/blob/master/wikibase/lexvoc-lexonomy/buildlexonomy.py buildlexonomy.py] builds 38 bilingual Lexonomy XML dictionaries out of LexVoc SKOS data.
* [https://github.com/elexis-eu/elexifinder/blob/master/wikibase/lexvoc-lexonomy/buildlexonomy.py buildlexonomy.py] builds [[LexVoc Lexonomy|38 bilingual Lexonomy XML dictionaries]] out of LexVoc SKOS data.
* [https://github.com/elexis-eu/elexifinder/blob/master/wikibase/lexvoc-lexonomy/mergeddict2lwb.py mergeddict3lwb.py] collects translation equivalents from Lexonomy XML (merged on Lexonomy server), and writes them to LexBib wikibase.
* [https://github.com/elexis-eu/elexifinder/blob/master/wikibase/lexvoc-lexonomy/mergeddict2lwb.py mergeddict2lwb.py] collects translation equivalents from Lexonomy XML ([https://lexonomy.elex.is/LexVoc/ merged on Lexonomy server]), and writes them to LexBib wikibase.
* [https://github.com/elexis-eu/elexifinder/blob/master/wikibase/lexvoc-lexonomy/getdicts.py getdicts.py] collects translation equivalents from Lexonomy XML (single dictionary).
* [https://github.com/elexis-eu/elexifinder/blob/master/wikibase/lexvoc-lexonomy/statsfrommergeddict.py statsfrommergeddict.py] and getstats.py produce data rows about translation progress
Line 99: Line 103:


The following tasks are planned, and awaiting a detailed workflow design:
* Indexation of bibliographical items written in languages other than English and Spanish
** As soon as LexVoc translation is completed, and a lemmatization procedure for other languages is implemented.
* Evaluation of [[LexVoc]] terms as content-describing indicators
** Idea: Authors rate the content descriptors (LexVoc terms) assigned to their articles. The rating can be used to improve the indexation process (e.g. discard descriptors repeatedly marked as irrelevant, or prioritize descriptors according to a certain frequency threshold).
* Alignment of [[Item:Q5|person]] (and [[Item:Q11|organization]]) items to Wikidata and VIAF:
** This can be done using Open Refine. An experiment using a subset of LexBib showed that about 25% of LexBib persons are found on Wikidata, and around 40% on VIAF. Person entity data on Wikidata contains ORCID identifiers, among other person metadata, like birth (and death) date, affiliations, etc. Person entity data on VIAF contains reference to authored publications (of all domains), birth (and death) date, etc.
** This can be done using Open Refine. An experiment using a subset of LexBib showed that about 25% of LexBib persons are found on Wikidata, and around 40% on VIAF. Person entity data on Wikidata ([https://www.wikidata.org/wiki/Q14981932 example] for an incomplete Wikidata entry) contains ORCID identifiers, among other person metadata, like birth (and death) date, affiliations, etc. Person entity data on VIAF contains references to authored publications (of all domains), birth (and death) date, etc.
** For persons not found on Wikidata, new Wikidata person items can be created.
** Matching person items on Wikidata can be enriched with LexBib data (authorship relations).
* Alignment of [[Item:Q6|event]] items to Wikidata:
** This has been done for EURALEX and eLex conference series (Wikidata items have been created and described, [https://www.wikidata.org/wiki/Q100594538 example]).