LexBib bibliodata workflow overview
This page contains a summary of how bibliographical data (bibliodata) is processed in LexBib.
- This table contains information about the status of item collections.
- On the LexBib main page, a set of queries shows the contents of the LexBib wikibase.
Collection of bibliodata: Zotero
- All bibliodata is stored in LexBib Zotero, which is a "group" on the Zotero platform. The group page is public, but item attachments (PDF, TXT) are restricted to registered group members. Member registration is restricted to members of the project.
- The Zotero software includes so-called translators (web scrapers), which ingest bibliodata as single items or in batches. Zotero also tries to harvest the PDF; if it finds one, it also produces a TXT version.
- Using our own converters, we transform other bibliodata representation formats, as well as tabular data, to RIS format, which Zotero ingests seamlessly.
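The conversion of tabular data to RIS can be illustrated with a minimal sketch. This is not the LexBib converter itself: the column names (title, authors, year, journal, doi), the semicolon-separated author column, and the function names row_to_ris and csv_to_ris are assumptions made for illustration. Only the RIS line layout (a two-letter tag, two spaces, a hyphen, a space, then the value, with each record opened by TY and closed by ER) follows the RIS format.

```python
import csv
import io

# Hypothetical mapping from column names in a tabular source to RIS tags.
COLUMN_TO_RIS = {
    "title": "TI",
    "year": "PY",
    "journal": "JO",
    "doi": "DO",
}


def row_to_ris(row, item_type="JOUR"):
    """Render one table row (a dict) as a single RIS record."""
    lines = [f"TY  - {item_type}"]
    # Assumption: several authors share one column, separated by semicolons.
    for author in (row.get("authors") or "").split(";"):
        if author.strip():
            lines.append(f"AU  - {author.strip()}")
    for column, tag in COLUMN_TO_RIS.items():
        value = (row.get(column) or "").strip()
        if value:  # skip empty cells instead of writing empty RIS fields
            lines.append(f"{tag}  - {value}")
    lines.append("ER  - ")
    return "\n".join(lines)


def csv_to_ris(csv_text):
    """Convert CSV text with a header row into a string of RIS records."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return "\n\n".join(row_to_ris(row) for row in reader) + "\n"
```

The resulting text can be saved as a .ris file and imported into Zotero. A real converter would probably need a per-source column mapping and more item types than the default used here.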
Manual curation
- Completeness of publication metadata is manually checked. The editing team uses Zotero group synchronization.
- Every item is annotated with the first author's location. An English Wikipedia page URL (serving as an unambiguous identifier) is placed in the Zotero "extra" field. zotexport.py (see below) maps that URL to the corresponding LexBib place item.
- Items are annotated with Zotero tags that contain shortcodes for the following:
  - :container points to a containing item (a journal issue, an edited volume)
  - :event points to a corresponding event (a conference iteration, a workshop)
  - :abstractLanguage indicates that the abstract contained in the dataset is in a language different from that of the article
  - :collection points to an Elexifinder collection number
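The two kinds of manual annotation described above (a Wikipedia URL in the "extra" field, and shortcode tags) could be read back from an exported item roughly as follows. This is a sketch, not the logic of zotexport.py: the tag syntax ":shortcode value" is an assumption, since the exact format of the tag values is not specified here, and the function names author_location and parse_tags are hypothetical. Mapping the URL to a LexBib place item is not covered.

```python
import re

# Matches an English Wikipedia page URL anywhere in the "extra" field.
WIKIPEDIA_URL = re.compile(r"https?://en\.wikipedia\.org/wiki/[^\s]+")

# The four shortcodes listed above.
SHORTCODES = {"container", "event", "abstractLanguage", "collection"}


def author_location(extra):
    """Return the first English Wikipedia URL in the extra field, or None."""
    match = WIKIPEDIA_URL.search(extra or "")
    return match.group(0) if match else None


def parse_tags(tags):
    """Collect shortcode tags into a dict of shortcode -> list of values.

    Assumption: a shortcode tag looks like ':shortcode value'.
    """
    parsed = {}
    for tag in tags:
        if not tag.startswith(":"):
            continue  # ordinary keyword tag, not a shortcode
        code, _, value = tag[1:].partition(" ")
        if code in SHORTCODES and value.strip():
            parsed.setdefault(code, []).append(value.strip())
    return parsed
```

Unknown shortcodes are silently skipped in this sketch. A curation workflow might prefer to report them, since a mistyped shortcode would otherwise go unnoticed.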
Full text TXT cleaning
- We use the GROBID tool. zotexport.py (see below) leaves a copy of every PDF in a folder that GROBID then processes, producing a TEI-XML representation of the PDF content. The article body (i.e. the part that usually starts after the abstract and ends before the references section) is enclosed in a <body> element.
- In cases where GROBID fails, we manually isolate the text body from the Zotero-produced TXT (tutorial).
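Pulling the article body out of a GROBID TEI-XML file can be sketched with Python's standard library. The function name extract_body_text is hypothetical, and the choice to keep only paragraph text (dropping section headings) is an assumption, not necessarily what the LexBib scripts do. The namespace URI is the standard TEI namespace.

```python
import xml.etree.ElementTree as ET

TEI_URI = "http://www.tei-c.org/ns/1.0"
TEI_NS = {"tei": TEI_URI}


def extract_body_text(tei_xml):
    """Return the paragraph text inside <text>/<body>, or None if absent."""
    root = ET.fromstring(tei_xml)
    body = root.find(".//tei:text/tei:body", TEI_NS)
    if body is None:
        return None  # no body found: a candidate for manual cleaning
    paragraphs = []
    for p in body.iter(f"{{{TEI_URI}}}p"):
        # Join the text of nested elements and normalize whitespace.
        text = " ".join("".join(p.itertext()).split())
        if text:
            paragraphs.append(text)
    return "\n".join(paragraphs)
```

Because the lookup is restricted to the body element, the abstract (in the TEI header) and the references section (in the back matter) are left out. A None result can be used to flag items that need the manual fallback.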