LexBib bibliodata workflow overview

Revision as of 01:00, 12 December 2021

This page contains a summary of how bibliographical data (bibliodata) is processed in LexBib.

  • This table contains information about the status of item collections.
  • On the LexBib main page, a set of queries shows the contents of the LexBib wikibase.
  • The LexBib About page contains references to written publications and presentations about LexBib and Elexifinder.

Collection of bibliodata: Zotero

  • All bibliodata is stored in LexBib Zotero, which is a "group" on the Zotero platform. The group page is public, but item attachments (PDF, TXT) are restricted to registered group members. Member registration is restricted to members of the project.
  • For scraping publication metadata from web pages (e.g. article 'landing pages' in journal or publisher portals), the Zotero software includes so-called translators, which ingest bibliodata as single items or in batches. Zotero will also try to harvest the PDF. If it finds a PDF, it also produces a TXT version.
  • We convert bibliodata that reaches us as tabular data to RIS format, using our own converters. Zotero imports RIS directly; if needed, the data can be exported again, manipulated using regular expressions, and re-imported.
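
The tabular-to-RIS step could look roughly like the sketch below. It is not the project's converter: the column names (`authors`, `title`, `year`, `journal`, `doi`), the `;` separator for multiple authors, and the default reference type `JOUR` are assumptions made for illustration. Only the RIS tags themselves (`TY`, `AU`, `TI`, `PY`, `JO`, `DO`, `ER`) come from the RIS format.

```python
import csv
import io

# Hypothetical mapping from spreadsheet column names to RIS tags.
COLUMN_TO_RIS = {
    "title": "TI",
    "year": "PY",
    "journal": "JO",
    "doi": "DO",
}


def row_to_ris(row, ref_type="JOUR"):
    """Convert one table row (a dict) into a RIS record string."""
    lines = [f"TY  - {ref_type}"]
    # Assumed convention: several authors share one cell, separated by ';'.
    for author in (row.get("authors") or "").split(";"):
        if author.strip():
            lines.append(f"AU  - {author.strip()}")
    for column, tag in COLUMN_TO_RIS.items():
        value = (row.get(column) or "").strip()
        if value:
            lines.append(f"{tag}  - {value}")
    lines.append("ER  - ")  # every RIS record ends with an ER line
    return "\n".join(lines)


def table_to_ris(csv_text):
    """Convert CSV text with a header row into RIS records separated by blank lines."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return "\n\n".join(row_to_ris(row) for row in reader)


sample = 'authors,title,year\n"Doe, Jane; Roe, Richard",An example title,2020\n'
ris = table_to_ris(sample)
print(ris)
```

Each author gets its own `AU` line, which is how RIS represents multiple authors; empty cells are skipped rather than written as empty tags.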

Manual curation

  • Completeness of publication metadata is manually checked. The editing team uses Zotero group synchronization (tutorial).
  • Every item is annotated with the first author's location. An English Wikipedia page URL (as an unambiguous identifier) is placed in the Zotero "extra" field. zotexport.py (see below) maps that to the corresponding LexBib place item (tutorial).
  • The Zotero "language" field (publication language) must contain a two-letter ISO 639-1 or a three-letter ISO 639-3 language code.
  • In the sources, person names (author, editor) are often disordered or incomplete. We try to establish correct name forms already at this stage. Proper disambiguation (with an unambiguous ID) is not possible in Zotero.
  • Items are annotated with Zotero tags that contain shortcodes, which are interpreted by zotexport.py. The shortcodes point either to LexBib wikibase items (Q-ID), or to pre-defined values:
    • :container Qxx points to a containing item (a journal issue, an edited volume)
    • :event Qxx points to a corresponding event (a conference iteration, a workshop)
    • :abstractLanguage en indicates that the abstract contained in the dataset is given in English (and not in the language of the article)
    • :collection x points to an Elexifinder collection number
    • :type Review classifies the item as a review article
    • :type Community classifies the item as a piece of community communication (anniversaries, obituaries, etc.)
    • :type Report classifies the item as an event report

Full text TXT cleaning

  • We use the GROBID tool. zotexport.py (see below) leaves a copy of all PDFs in a folder, which is processed by GROBID; GROBID produces a TEI-XML representation of the PDF content. The article body (i.e. the part usually starting after the abstract and ending before the references section) is enclosed in a tag called <body>.
  • In cases where GROBID fails, we manually isolate the text body from the Zotero-produced TXT (tutorial).
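
Extracting the body from a TEI-XML file can be sketched with the Python standard library as follows. This is an illustration and not the project's code: the function name is hypothetical, and collecting only `<p>` elements (and thereby dropping section headings) is an assumption about what counts as body text. The namespace `http://www.tei-c.org/ns/1.0` is the standard TEI namespace.

```python
import xml.etree.ElementTree as ET

TEI = "http://www.tei-c.org/ns/1.0"


def extract_body_text(tei_xml):
    """Return the paragraphs inside <body> as plain text, or None if there is no <body>."""
    root = ET.fromstring(tei_xml)
    body = root.find(f".//{{{TEI}}}body")
    if body is None:
        return None  # a case for the manual fallback
    paragraphs = []
    for p in body.iter(f"{{{TEI}}}p"):
        # itertext() also collects text inside inline children such as <ref>.
        text = " ".join("".join(p.itertext()).split())
        if text:
            paragraphs.append(text)
    return "\n".join(paragraphs)


sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt><title>T</title></titleStmt></fileDesc></teiHeader>
  <text>
    <body><div><head>Intro</head><p>First <ref>paragraph</ref>.</p><p>Second paragraph.</p></div></body>
    <back><div><p>References here.</p></div></back>
  </text>
</TEI>"""

print(extract_body_text(sample))
```

Text in `<back>` (where the references end up in this sample) is ignored because the search starts from the `<body>` element.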

LexBib wikibase: Conversion into Linked Data

LexBib wikibase is the central data repository, where Zotero literal values (text strings) are disambiguated to ontology entities, and where bibliographic items and LexVoc terms (as content indicators) come together. Wikibase content can be accessed (GUI, API, SPARQL) by anybody, and edited by registered users (manually or via the API).

Zotero export

  • Items are exported from Zotero using our own JSON exporter.
  • That export is processed using zotexport.py. This script prepares the upload of the items to Wikibase:
    • Author locations and Zotero tags are interpreted. Unknown places and container items are created, so that the bibliographical item can be linked to them.
    • PDFs are stored for GROBID.
    • Zotero fields are mapped to LexBib wikibase properties.
    • New items are assigned a LexBib URI, which is attached to the Zotero item as a "link attachment" and stored in the "archive location" field. The Zotero URI of the item is mapped to LexBib wikibase property P16; the Zotero URIs of the PDF and TXT attachments are added to that P16 statement as qualifiers.
  • bibimport.py uploads the resulting semantic triples ("wikibase statements") to LexBib wikibase.

Author Disambiguation: Open Refine

  • For Elexifinder version 2 (spring 2021), we reduced the roughly 5,000 different person names present in the database at that time to around 4,000 unique person items, using clustering algorithms in Open Refine. Persons in LexBib have up to six name variants (see query at Main Page).
  • For subsequent updates, we use our own wikibase reconciliation service with Open Refine. That means that person name literals are matched against person items existing in LexBib wikibase, where all name literals previously matched to a person item are stored. This query exports wikibase statements pointing to unmatched persons, and newcreatorsfromopenrefine.py re-imports matched person items or creates new items for those that have remained unmatched, and updates the statements.
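
The idea of matching name literals against stored name variants can be illustrated with a small key-collision sketch. The page does not say which Open Refine clustering method was used, and the reconciliation service scores candidates rather than doing an exact lookup. The fingerprint function below only follows the general recipe of fingerprint keying (case-fold, strip accents and punctuation, sort unique tokens). All names, Q-IDs, and function names are made up.

```python
import re
import unicodedata


def fingerprint(name):
    """Normalize a name: lowercase, strip accents and punctuation, sort unique tokens."""
    decomposed = unicodedata.normalize("NFKD", name.strip().lower())
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    tokens = re.sub(r"[^\w\s]", " ", no_accents).split()
    return " ".join(sorted(set(tokens)))


def build_index(persons):
    """Map the fingerprint of every known name variant to its person item ID."""
    index = {}
    for person_id, variants in persons.items():
        for variant in variants:
            index[fingerprint(variant)] = person_id
    return index


def reconcile(literals, index):
    """Return (matched literal -> person ID, list of unmatched literals)."""
    matched, unmatched = {}, []
    for literal in literals:
        person_id = index.get(fingerprint(literal))
        if person_id is None:
            unmatched.append(literal)
        else:
            matched[literal] = person_id
    return matched, unmatched


# Made-up person items with their stored name variants.
persons = {"Q1": ["Müller, Anna", "A. Müller"], "Q2": ["Doe, Jane"]}
index = build_index(persons)
matched, unmatched = reconcile(["Anna Müller", "MÜLLER, Anna", "Smith, John"], index)
print(matched)
print(unmatched)
```

In this sketch, the unmatched literals are the ones for which new person items would be created, and storing each newly matched literal as a further variant is what makes later runs match more names.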

Indexation of bibliographical items with LexVoc terms

  • buildbodytxts.py produces a large JSON file containing the full text bodies, as needed for the indexation process, and for Elexifinder export (see below). Full text bodies are taken from one of the different sources, with the following priority ranking, upon availability:
    1. Manually produced full text body TXT.
    2. GROBID-produced full text body TXT.
    3. Zotero-produced "pdf2txt" raw TXT.
    4. The abstract recorded in Zotero.
    5. The article title recorded in Zotero.
  • The script also lemmatizes the text bodies (this currently works for English and Spanish, using spaCy).
  • buildtermindex.py finds labels (lexicalisations) of LexVoc terms in the full text JSON file. Term labels are also searched for in a lemmatized version (this is relevant for many multiword terms). Term labels that produce many false positives due to ambiguity or parallel use in general language ("article", "case", "example", etc.) are filtered using a stoplist. This currently works for English and Spanish.
  • The script also collects frequency data:
    • Mention counts (hits) for the label(s) of each term in each article
    • Relative frequency for the label(s) of each term in each article (hits/tokens).
  • writefoundterms.py can upload this information to LexBib wikibase. This is an expensive process; it will be done soon for this version of LexBib, as it was for the previous (experimental) version (see example).
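
The source priority and the frequency counts described above can be sketched as follows. This is a simplified illustration, not the code of buildbodytxts.py or buildtermindex.py: the dictionary keys, function names, and Q-IDs are hypothetical, the tokenizer is a plain regular expression, and the lemmatization step is left out.

```python
import re

# Priority order as listed above; the keys are hypothetical field names.
SOURCE_PRIORITY = ["manual_txt", "grobid_txt", "zotero_txt", "abstract", "title"]


def pick_body(item):
    """Return (source, text) for the first non-empty source in priority order."""
    for source in SOURCE_PRIORITY:
        text = (item.get(source) or "").strip()
        if text:
            return source, text
    return None, ""


def tokenize(text):
    return re.findall(r"\w+", text.lower())


def count_label(tokens, label_tokens):
    """Count occurrences of a (possibly multiword) label in a token list."""
    n = len(label_tokens)
    if n == 0:
        return 0
    return sum(1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == label_tokens)


def term_frequencies(text, term_labels, stoplist=frozenset()):
    """Per term: mention count (hits) and relative frequency (hits / tokens)."""
    tokens = tokenize(text)
    results = {}
    for term_id, labels in term_labels.items():
        hits = sum(
            count_label(tokens, tokenize(label))
            for label in labels
            if label.lower() not in stoplist
        )
        if hits:
            results[term_id] = {"hits": hits, "rel_freq": hits / len(tokens)}
    return results


item = {
    "manual_txt": "",
    "grobid_txt": "A bilingual dictionary is, for example, a dictionary with two languages.",
    "abstract": "Short abstract.",
}
source, body = pick_body(item)
# Made-up term IDs and labels.
terms = {"Q10": ["dictionary"], "Q11": ["bilingual dictionary"], "Q12": ["example"]}
freqs = term_frequencies(body, terms, stoplist={"example"})
print(source, freqs)
```

Note that in this sketch a multiword label and its head word are counted independently, so "bilingual dictionary" also contributes a hit for "dictionary"; whether the real index treats nested labels this way is not stated on this page.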

Elexifinder export

  • elexifinder-export.py generates the dataset needed for Elexifinder, based on LexBib wikibase output obtained using SPARQL and API calls.
  • The Elexifinder export contains one JSON object for each bibliographical item, with the following:
    • As disambiguated entities: authors, author locations, event locations, languages, containing items
    • LexVoc terms found in the full text, as Elexifinder "categories".
    • Publication title and date
    • URL of the corresponding Zotero item
    • URL for full text access (a direct download (preferred), a 'landing page', a doi.org link, etc.).
    • The whole full text body, which the Elexifinder architecture processes for Wikification (Elexifinder "concepts").

Maintenance tasks