LexBib bibliodata workflow overview
This page contains a summary of how bibliographical data (bibliodata) is processed in LexBib.
- This table contains information about the status of item collections.
- On the LexBib main page, a set of queries shows the contents of the LexBib wikibase.
- The LexBib About page contains references to publications and presentations about LexBib and Elexifinder.
Collection of bibliodata: Zotero
- All bibliodata is stored in LexBib Zotero, which is a "group" on the Zotero platform. The group page is public, but item attachments (PDF, TXT) are restricted to registered group members. Member registration is restricted to members of the project.
- The Zotero software includes web scrapers, so-called translators, which ingest bibliodata as single items or in batches. Zotero will also try to harvest the PDF. If it finds a PDF, it also produces a TXT version.
- Using our own converters, we transform other bibliodata representation formats, or tabular data, to RIS format, which Zotero ingests seamlessly.
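The conversion of tabular data to RIS can be sketched as follows. This is a minimal illustration and not the project's actual converter: the column names, the ";" author separator, and the function names are assumptions; only the RIS tags themselves (TY, AU, TI, PY, T2, DO, LA, ER) are standard.

```python
import csv
import io

# Hypothetical mapping from the column names of a tabular source to RIS tags.
COLUMN_TO_RIS = {
    "title": "TI",
    "year": "PY",
    "journal": "T2",
    "doi": "DO",
    "language": "LA",
}


def row_to_ris(row, ris_type="JOUR"):
    """Turn one table row (a dict) into a single RIS record."""
    lines = [f"TY  - {ris_type}"]
    # Assumption: several authors share one "authors" column, separated by ";".
    for author in (row.get("authors") or "").split(";"):
        if author.strip():
            lines.append(f"AU  - {author.strip()}")
    for column, tag in COLUMN_TO_RIS.items():
        value = (row.get(column) or "").strip()
        if value:  # empty cells produce no RIS line
            lines.append(f"{tag}  - {value}")
    lines.append("ER  - ")  # every RIS record ends with the ER tag
    return "\n".join(lines)


def csv_to_ris(csv_text):
    """Convert CSV text with a header row into RIS records separated by blank lines."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return "\n\n".join(row_to_ris(row) for row in reader)
```

The resulting text can be saved with a .ris extension and imported into Zotero.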
Manual curation
- Completeness of publication metadata is manually checked. The editing team uses Zotero group synchronization.
- Every item is annotated with the first author's location. An English Wikipedia page URL (as an unambiguous identifier) is placed in the Zotero "extra" field. zotexport.py (see below) maps that URL to the corresponding LexBib place item.
- The Zotero "language" field (publication language) must contain a two-letter ISO 639-1 or a three-letter ISO 639-3 language code.
- In the sources, person names (author, editor) are often disordered or incomplete. We try to establish correct name forms already at this stage. Proper disambiguation (with an unambiguous ID) is not possible in Zotero.
- Items are annotated with Zotero tags that contain shortcodes, which are interpreted by zotexport.py. The shortcodes point either to LexBib wikibase items (Q-ID), or to pre-defined values:
  - :container Qxx points to a containing item (a journal issue, an edited volume)
  - :event Qxx points to a corresponding event (a conference iteration, a workshop)
  - :abstractLanguage en indicates that the abstract contained in the dataset is given in English (and not in the language of the article)
  - :collection x points to an Elexifinder collection number
  - :type Review classifies the item as a review article
  - :type Community classifies the item as a piece of community communication (anniversaries, obituaries, etc.)
  - :type Report classifies the item as an event report
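How such shortcode tags could be interpreted is sketched below. This is not the code of zotexport.py: the function name, the regular expression, and the rule that :container and :event values must be Q-IDs are assumptions derived from the list above.

```python
import re

# Shortcodes whose value is expected to be a LexBib wikibase Q-ID (assumption).
QID_SHORTCODES = {"container", "event"}
# A shortcode tag starts with ":" followed by a keyword, whitespace, and a value.
TAG_PATTERN = re.compile(r"^:(\w+)\s+(.+)$")


def parse_tags(tags):
    """Collect shortcode tags into a dict of lists; ordinary tags are skipped."""
    parsed = {}
    for tag in tags:
        match = TAG_PATTERN.match(tag.strip())
        if not match:
            continue  # an ordinary Zotero tag, not a shortcode
        key, value = match.group(1), match.group(2).strip()
        if key in QID_SHORTCODES and not re.fullmatch(r"Q\d+", value):
            raise ValueError(f"{tag!r}: expected a Q-ID such as Q123")
        parsed.setdefault(key, []).append(value)
    return parsed
```

For example, parse_tags([":container Q123", ":type Review", "lexicography"]) returns {"container": ["Q123"], "type": ["Review"]}; Q123 is a made-up ID used only for illustration.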
Full text TXT cleaning
- We use the GROBID tool. zotexport.py (see below) saves a copy of every PDF in a folder, which is processed by GROBID; GROBID produces a TEI-XML representation of the PDF content. The article body (i.e. the part that usually starts after the abstract and ends before the references section) is enclosed in a <body> element.
- In cases where GROBID fails, we manually isolate the text body from the Zotero-produced TXT (tutorial).
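Isolating the article body from a GROBID TEI-XML file can be sketched with the Python standard library. The function name is hypothetical, and the choice to keep only paragraph text (dropping section headings) is an assumption for illustration; the namespace is the standard TEI namespace.

```python
import xml.etree.ElementTree as ET

TEI = "http://www.tei-c.org/ns/1.0"  # standard TEI namespace
TEI_NS = {"tei": TEI}


def extract_body_text(tei_xml):
    """Return the paragraph text inside <body>, or None if there is no <body>."""
    root = ET.fromstring(tei_xml)
    body = root.find(".//tei:body", TEI_NS)
    if body is None:
        return None
    paragraphs = []
    # Only <p> elements are collected, so section headings (<head>) are dropped.
    for p in body.iter(f"{{{TEI}}}p"):
        # itertext() also picks up text nested in inline elements such as <ref>.
        text = " ".join("".join(p.itertext()).split())
        if text:
            paragraphs.append(text)
    return "\n".join(paragraphs)
```

Because the search is limited to the <body> element, the abstract in the TEI header and the references in the back matter are left out. A None result is a simple signal that the file needs the manual fallback.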
Migration to LexBib wikibase: Conversion into Linked Data
Zotero export
- Items are exported from Zotero using our own JSON exporter.
- That export is processed using zotexport.py. This script prepares the upload of the items to Wikibase:
  - Author locations and Zotero tags are interpreted. Unknown places and container items are created, so that the bibliographical item can be linked to them.
  - PDFs are stored for GROBID.
  - Zotero fields are mapped to LexBib wikibase properties.
  - New items are assigned a LexBib URI, which is attached to the Zotero item as a "link attachment" and written to the "archive location" field. The Zotero URI of the item is mapped to LexBib wikibase property P16; the Zotero URIs of the PDF and TXT attachments are attached to that P16 statement as qualifiers.
- bibimport.py uploads the resulting semantic triples ("wikibase statements") to LexBib wikibase.
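The mapping step can be pictured as building a list of statements per item, with the Zotero URI statement carrying the attachment URIs as qualifiers. The sketch below is an illustration only: P16 is the sole property ID taken from the text above, while all other property IDs, the item keys, and the statement layout are placeholders and do not reflect the actual data model of zotexport.py or bibimport.py.

```python
# Only P16 (Zotero URI) comes from the workflow description.
ZOTERO_URI_PROPERTY = "P16"

# Placeholder property IDs (hypothetical, for illustration only).
FIELD_TO_PROPERTY = {
    "title": "P_TITLE",
    "date": "P_DATE",
    "language": "P_LANGUAGE",
}
PDF_QUALIFIER = "P_PDF"
TXT_QUALIFIER = "P_TXT"


def build_statements(item):
    """Map a simplified Zotero item (a dict with hypothetical keys) to statements."""
    statements = []
    for field, prop in FIELD_TO_PROPERTY.items():
        value = item.get(field)
        if value:
            statements.append({"property": prop, "value": value, "qualifiers": {}})

    # The PDF and TXT attachment URIs become qualifiers of the P16 statement.
    qualifiers = {}
    if item.get("pdf_uri"):
        qualifiers[PDF_QUALIFIER] = item["pdf_uri"]
    if item.get("txt_uri"):
        qualifiers[TXT_QUALIFIER] = item["txt_uri"]
    statements.append({
        "property": ZOTERO_URI_PROPERTY,
        "value": item["zotero_uri"],
        "qualifiers": qualifiers,
    })
    return statements
```

Keeping the attachment URIs as qualifiers, rather than as separate statements, ties them to the specific Zotero record they belong to.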
Author Disambiguation: Open Refine
Indexation of bibliographical items with LexVoc terms
Elexifinder export
- elexifinder-export.py generates a JSON export as needed for Elexifinder, based on LexBib wikibase output obtained using SPARQL and API calls.
- The Elexifinder export contains the following as disambiguated entities: authors, author locations, event locations, languages, containing items, and LexVoc terms as Elexifinder "categories". Also exported are the publication title and date, links to the corresponding Zotero item and to a full-text download location (at the publisher, via DOI, etc.), and the whole full text, which is processed for Wikification (Elexifinder "concepts").
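A SPARQL endpoint returns results in the standard SPARQL JSON results format, with one binding per result row. The sketch below shows how such rows could be flattened and grouped into one record per bibliographical item. The variable names (item, title, author) and the output layout are assumptions; they do not reproduce the actual queries of elexifinder-export.py or the Elexifinder JSON schema.

```python
def bindings_to_records(sparql_json):
    """Flatten SPARQL JSON results into plain dicts of variable name -> value."""
    return [
        {var: cell["value"] for var, cell in binding.items()}
        for binding in sparql_json["results"]["bindings"]
    ]


def group_by_item(records, key="item"):
    """Merge the rows of one item; multi-valued authors yield several rows."""
    grouped = {}
    for record in records:
        entry = grouped.setdefault(record[key], {"uri": record[key], "authors": []})
        if "title" in record:
            entry["title"] = record["title"]
        author = record.get("author")
        if author and author not in entry["authors"]:
            entry["authors"].append(author)
    return list(grouped.values())
```

Grouping is needed because a query that returns one row per author repeats the item and its title in every row.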