LexBib bibliodata workflow overview
This page contains a summary of how bibliographical data (bibliodata) is processed in LexBib.
- This table contains information about the status of item collections.
- On the LexBib main page, a set of queries shows the contents of the LexBib wikibase.
- The LexBib About page contains references to publications and presentations about LexBib and Elexifinder.
Collection of bibliodata
- All bibliodata is stored in LexBib Zotero, which is a "group library" on the Zotero platform. The group library is public, but item attachments (PDF, TXT) are restricted to registered group members (project members only).
- For scraping publication metadata from web pages (e.g. article 'landing pages' in journal or publisher portals), the Zotero software includes so-called translators, which ingest bibliodata as single items or in batches. Zotero will also try to harvest the PDF. If it finds a PDF, it also produces a TXT version.
- Bibliodata that reaches us as tabular data is transformed to RIS format using our own converters. Zotero imports RIS directly; if needed, the data can also be exported as RIS, manipulated using regular expressions, and re-imported.
- We can update the Zotero database using the Zotero API. For example, we can update author first and last names according to their preferred form in LexBib wikibase.
- The editing team uses Zotero group synchronization (tutorial).
- Completeness of publication metadata is manually checked.
- Every item is annotated with the first author's location, which is a requirement for the dataset to be exported to Elexifinder. An English Wikipedia page URL (as an unambiguous identifier) is placed in the Zotero "extra" field. zotexport.py (see below) maps that URL to the corresponding LexBib place item (tutorial).
- The Zotero "language" field (publication language) must contain a two-letter ISO 639-1 or a three-letter ISO 639-3 language code.
- In the sources, person names (author, editor) are often disordered or incomplete. We try to validate correct name forms already at this stage. Proper disambiguation (with an unambiguous ID) is not possible in Zotero.
- Items are annotated with Zotero tags that contain shortcodes, which are interpreted by zotexport.py. The shortcodes point either to LexBib wikibase items (Q-ID), or to pre-defined values:
  - :container Qxx points to a containing item (a BibCollection item describing a journal issue or an edited volume).
  - :event Qxx points to a corresponding event (an item describing a conference iteration or a workshop). A property pointing to the event location is attached to the LexBib wikibase Event item.
  - :abstractLanguage en indicates that the abstract contained in the dataset is given in English (and not in the language of the article).
  - :collection x points to an Elexifinder collection number.
  - :type Review classifies the item as a review article.
  - :type Community classifies the item as a piece of community communication (anniversaries, obituaries, etc.).
  - :type Report classifies the item as an event report.
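The shortcode tags listed above could be interpreted along the following lines. This is a minimal sketch and not the actual logic of zotexport.py: the function name `parse_shortcodes`, the regular expression, and the returned dictionary layout are assumptions made for illustration; only the shortcode keys themselves come from the list above.

```python
import re

# Shortcode keys named in the list above. How zotexport.py actually parses
# them is not documented here, so the pattern below is an assumption.
KNOWN_KEYS = {"container", "event", "abstractLanguage", "collection", "type"}
TAG_PATTERN = re.compile(r"^:(\w+)\s+(\S.*)$")


def parse_shortcodes(tags):
    """Map each known shortcode key to the list of values found in the tags."""
    parsed = {}
    for tag in tags:
        match = TAG_PATTERN.match(tag.strip())
        if not match:
            continue  # an ordinary Zotero tag, not a shortcode
        key, value = match.group(1), match.group(2).strip()
        if key in KNOWN_KEYS:
            parsed.setdefault(key, []).append(value)
    return parsed


# "Q1234" is a made-up Q-ID used only for this example.
example = parse_shortcodes(
    [":container Q1234", ":type Review", "lexicography", ":abstractLanguage en"]
)
print(example)
```

Collecting values in lists keeps the sketch tolerant of items that carry the same shortcode more than once, for example two `:type` tags.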
Full text TXT cleaning
- We use the GROBID tool. zotexport.py (see below) stores a copy of every PDF in a folder, which is processed by GROBID; GROBID produces a TEI-XML representation of the PDF content. The article body (i.e. the part usually starting after the abstract and ending before the references section) is enclosed in a <body> element.
- In cases where GROBID fails, we manually isolate the text body from the Zotero-produced TXT (tutorial). Full texts that do not follow a standard structure, most typically because they don't contain an abstract (this is usual in book chapters), are often not properly parsed by GROBID.
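Isolating the article body from a GROBID TEI-XML file can be sketched as follows. This is an illustration under assumptions, not the project's own code: the function name `extract_body_text` and the sample document are made up, and the choice to keep only `<head>` and `<p>` elements is a simplification.

```python
import xml.etree.ElementTree as ET

# TEI namespace used in GROBID output.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def extract_body_text(tei_xml):
    """Return the text of headings and paragraphs inside <body>, or None."""
    root = ET.fromstring(tei_xml)
    body = root.find(".//tei:body", TEI_NS)
    if body is None:
        return None  # no body found: fall back to manual cleaning
    lines = []
    for element in body.iter():
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name in ("head", "p"):
            # itertext() also collects text nested in inline elements.
            text = " ".join("".join(element.itertext()).split())
            if text:
                lines.append(text)
    return "\n".join(lines)


# Made-up miniature TEI document for demonstration.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><profileDesc><abstract><p>Abstract text.</p></abstract></profileDesc></teiHeader>
  <text>
    <body>
      <div><head>Introduction</head><p>Dictionaries are <ref>useful</ref> tools.</p></div>
    </body>
    <back><div><p>References go here.</p></div></back>
  </text>
</TEI>"""

print(extract_body_text(SAMPLE))
```

The abstract (in the header) and the references (in `<back>`) are skipped because only descendants of `<body>` are visited.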
LexBib wikibase is the central data repository, where Zotero literal values (text strings) are disambiguated to ontology entities, and where bibliographic items and LexVoc terms (as content indicators) come together. Wikibase content can be accessed (GUI, API, SPARQL) by anybody, and edited by registered users (manually or via the API).
- Items are exported from Zotero using our own JSON exporter.
- That export is processed using zotexport.py. The script prepares the upload of the items to Wikibase:
  - Author locations and Zotero tags are interpreted. Unknown places and container items are created, so that the bibliographical item can be linked to them.
  - PDFs are stored for GROBID processing.
  - Zotero fields are mapped to LexBib wikibase properties.
  - New items are assigned a LexBib URI, which is attached to the Zotero item as a "link attachment" and written to the "archive location" field. The Zotero URI of the item is mapped to LexBib wikibase property P16; the Zotero URIs of the PDF and TXT are attached to that P16 statement as qualifiers.
- bibimport.py uploads the resulting semantic triples ("wikibase statements") to LexBib wikibase.
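The mapping step described above might look roughly like this. It is a sketch under stated assumptions: only P16 is named in the text, so every other property identifier below is a hypothetical placeholder, as are the input keys (`zotero_uri`, `pdf_uri`, `txt_uri`) and the statement layout.

```python
# Only P16 is taken from the text; all other identifiers are placeholders.
ZOTERO_URI_PROPERTY = "P16"
FIELD_TO_PROPERTY = {
    "title": "P_TITLE_HYPOTHETICAL",
    "date": "P_DATE_HYPOTHETICAL",
    "language": "P_LANGUAGE_HYPOTHETICAL",
}
PDF_QUALIFIER = "P_PDF_HYPOTHETICAL"
TXT_QUALIFIER = "P_TXT_HYPOTHETICAL"


def build_statements(item):
    """Turn one exported Zotero item (a dict) into a list of statement dicts."""
    statements = []
    for field, prop in FIELD_TO_PROPERTY.items():
        value = item.get(field)
        if value:
            statements.append({"property": prop, "value": value, "qualifiers": []})

    # The attachment URIs become qualifiers on the Zotero URI statement.
    qualifiers = []
    if item.get("pdf_uri"):
        qualifiers.append({"property": PDF_QUALIFIER, "value": item["pdf_uri"]})
    if item.get("txt_uri"):
        qualifiers.append({"property": TXT_QUALIFIER, "value": item["txt_uri"]})
    statements.append(
        {
            "property": ZOTERO_URI_PROPERTY,
            "value": item["zotero_uri"],
            "qualifiers": qualifiers,
        }
    )
    return statements


# Made-up item; the example.org URIs are placeholders.
statements = build_statements(
    {
        "title": "An example article",
        "date": "2021",
        "zotero_uri": "https://example.org/zotero/items/ABC",
        "pdf_uri": "https://example.org/zotero/items/PDF1",
    }
)
for statement in statements:
    print(statement)
```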
Author Disambiguation: OpenRefine
- For Elexifinder version 2 (spring 2021), we reduced the roughly 5,000 distinct person names then present in the database to around 4,000 unique person items, using clustering algorithms in OpenRefine. Persons in LexBib have up to six name variants (see query at Main Page).
- For subsequent updates, we use our own wikibase reconciliation service with OpenRefine. That is, person name literals are matched against the person items existing in LexBib wikibase, where all name literals previously matched to a person item are stored. This query exports wikibase statements pointing to unmatched persons; newcreatorsfromopenrefine.py then processes the reconciliation results, creates new items for the names that remain unmatched, and updates the statements and the literals associated with persons.
- This part of the workflow will soon be simplified, as wikibase.cloud developers are about to build OpenRefine into wikibase: a wikibase.cloud wikibase will by default ship its own OpenRefine instance for reconciling literal values (i.e. matching them to LexBib wikibase ontology entities) and for uploading reconciliation results to wikibase. This shortcuts the export-reconciliation-import process described above, which still involves manual configuration of the OpenRefine tool and of our own reconciliation service, as well as the upload process for reconciled data.
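To illustrate how name clustering of this kind works, here is a simplified fingerprint-style key-collision sketch, similar in spirit to one of the clustering methods OpenRefine offers. The text does not say which method was actually applied to the LexBib names, so this is only an example of the general technique; the function names and sample names are made up.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(name):
    """Build a normalized key: ASCII, lowercase, no punctuation, sorted tokens."""
    # Decompose accented characters and drop the combining marks.
    ascii_name = (
        unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    )
    cleaned = re.sub(r"[^\w\s]", "", ascii_name.lower())
    return " ".join(sorted(set(cleaned.split())))


def cluster_names(names):
    """Group name variants that share a fingerprint; singletons are dropped."""
    clusters = defaultdict(set)
    for name in names:
        clusters[fingerprint(name)].add(name)
    return [sorted(variants) for variants in clusters.values() if len(variants) > 1]


# Made-up names for demonstration.
names = ["Müller, Anna", "Anna Muller", "Smith, John", "Doe, Jane", "Jane Doe"]
for cluster in cluster_names(names):
    print(cluster)
```

A key-collision approach like this only merges variants that differ in order, case, punctuation, or diacritics; abbreviated first names or misspellings would need a distance-based method or manual review.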
Indexation of bibliographical items with LexVoc terms
- buildbodytxts.py produces a large JSON file containing the full text bodies, as needed for the indexation process and for the Elexifinder export (see below). Each full text body is taken from the first available of the following sources, in order of priority:
  - Manually produced full text body TXT.
  - GROBID-produced full text body TXT.
  - Zotero-produced "pdf2txt" raw TXT.
  - The abstract recorded in Zotero.
  - The article title recorded in Zotero.
- The script also lemmatizes the text bodies (this currently works for English and Spanish, using spaCy).
- buildtermindex.py finds labels (lexicalisations) of LexVoc terms in the full text JSON file. Term labels are also searched for in a lemmatized version (this is relevant for many multiword terms). Term labels that produce many false positives due to ambiguity or parallel use in general language ("article", "case", "example", etc.) are filtered using a stoplist. This currently works for English and Spanish.
- The script also collects frequency data:
  - Mention counts (hits) for the label(s) of each term in each article.
  - Relative frequency for the label(s) of each term in each article (hits/tokens).
- writefoundterms.py uploads this information to LexBib wikibase.
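The source priority ranking and the frequency counts described above can be sketched as follows. This is a simplified illustration, not the code of buildbodytxts.py or buildtermindex.py: the dictionary keys, the function names, and the term identifiers (`TERM_A` etc.) are hypothetical, tokenization is a plain regular expression, and lemmatization is left out.

```python
import re

# Hypothetical keys, ordered by the priority ranking given above.
SOURCE_PRIORITY = ["manual_txt", "grobid_txt", "zotero_txt", "abstract", "title"]


def pick_text(item):
    """Return (source_key, text) for the first non-empty source."""
    for key in SOURCE_PRIORITY:
        text = item.get(key)
        if text:
            return key, text
    return None, ""


def term_frequencies(text, term_labels, stoplist=frozenset()):
    """Count hits and relative frequency (hits/tokens) per term."""
    tokens = re.findall(r"\w+", text.lower())
    results = {}
    for term_id, labels in term_labels.items():
        hits = 0
        for label in labels:
            label_tokens = label.lower().split()
            if label.lower() in stoplist or not label_tokens:
                continue  # ambiguous label filtered by the stoplist
            n = len(label_tokens)
            hits += sum(
                1
                for i in range(len(tokens) - n + 1)
                if tokens[i : i + n] == label_tokens
            )
        if hits:
            results[term_id] = {"hits": hits, "rel_freq": hits / len(tokens)}
    return results


item = {
    "grobid_txt": "",
    "abstract": "A bilingual dictionary maps headwords. "
    "Each bilingual dictionary article has an example.",
}
source, text = pick_text(item)
terms = {
    "TERM_A": ["bilingual dictionary"],
    "TERM_B": ["headword", "headwords"],
    "TERM_C": ["example"],
}
frequencies = term_frequencies(text, terms, stoplist={"example"})
print(source, frequencies)
```

Matching on token sequences rather than raw substrings avoids counting "headword" inside "headwords" twice and handles multiword labels uniformly.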
- elexifinder-export.py generates a dataset as needed for Elexifinder, based on LexBib wikibase output obtained using SPARQL and API calls.
- The Elexifinder export contains one JSON object for each bibliographical item. Following the instructions, each object contains:
  - As disambiguated entities:
    - Authors, author locations, event locations, languages, containing items.
    - LexVoc terms found in the full text, as Elexifinder "categories".
  - Publication title and date.
  - URL of the corresponding Zotero item.
  - URL for full text access (direct download (preferred), 'landing page', doi.org link, etc.).
  - The whole full text body, which the Elexifinder architecture processes for Wikification (Elexifinder "concepts").
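As an illustration of what one export object might look like, the sketch below assembles a record and picks the full text URL, preferring a direct download. All JSON keys and input field names are hypothetical and do not reflect the actual Elexifinder schema; the fallback order after the direct download is also an assumption.

```python
import json


def pick_fulltext_url(item):
    """Prefer a direct download; fall back to landing page, then DOI link."""
    if item.get("download_url"):
        return item["download_url"]
    if item.get("landing_page"):
        return item["landing_page"]
    if item.get("doi"):
        return "https://doi.org/" + item["doi"]
    return None


def build_export_record(item):
    """Assemble one JSON-serializable record (hypothetical key names)."""
    return {
        "title": item["title"],
        "date": item["date"],
        "authors": item.get("authors", []),
        "categories": item.get("lexvoc_terms", []),
        "zotero_url": item["zotero_url"],
        "fulltext_url": pick_fulltext_url(item),
        "body": item.get("body", ""),
    }


# Made-up item; the DOI and URL are placeholders.
record = build_export_record(
    {
        "title": "An example article",
        "date": "2021",
        "zotero_url": "https://example.org/zotero/items/ABC",
        "doi": "10.1000/example",
        "body": "Full text body goes here.",
    }
)
print(json.dumps(record, indent=2))
```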
A set of Python scripts performs database maintenance tasks:
- For LexBib wikibase items aligned with Wikidata items (using property P2):
  - Import of preferred labels (rdfs:label) and alias labels (skos:altLabel) from Wikidata.
  - Import of values of Wikidata-aligned properties (see lists of properties and Wikidata alignment using these queries).
- For all items:
  - Setting of item descriptions (schema:description) according to class. For example, a BibItem (class Q3) receives a description containing author last names and year, such as "Publication by Kosem & Lindemann (2021)". This is useful for visual disambiguation of items in LexBib search results.
  - Updating of statements pointing to redirect items, i.e. to items that have been merged into another item.
- Related to LexVoc:
  - buildlexonomy.py builds 38 bilingual Lexonomy XML dictionaries out of LexVoc SKOS data.
  - mergeddict2lwb.py collects translation equivalents from Lexonomy XML (merged on the Lexonomy server), and writes them to LexBib wikibase.
  - getdicts.py collects translation equivalents from Lexonomy XML (single dictionary).
  - statsfrommergeddict.py and getstats.py produce data rows about translation progress.
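The description pattern for BibItems mentioned above can be sketched as a small helper. The function name is made up, and the handling of a single author or of three and more authors ("et al.") is an assumption; the text only documents the two-author example.

```python
def build_bibitem_description(last_names, year):
    """Build a description such as 'Publication by Kosem & Lindemann (2021)'."""
    if not last_names:
        authors = "unknown author"  # assumption: not covered by the text
    elif len(last_names) <= 2:
        authors = " & ".join(last_names)
    else:
        authors = last_names[0] + " et al."  # assumption for 3+ authors
    return "Publication by {} ({})".format(authors, year)


print(build_bibitem_description(["Kosem", "Lindemann"], 2021))
```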
The following tasks are planned, and awaiting a detailed workflow design:
- Indexation of bibliographical items written in languages other than English and Spanish:
  - This will be possible as soon as the LexVoc translation is completed and a lemmatization procedure for other languages is implemented.
- Evaluation of LexVoc terms as content-describing indicators:
  - Idea: Authors rate the content descriptors (LexVoc terms) assigned to their articles. The rating can be used to improve the indexation process (e.g. discard descriptors repeatedly marked as irrelevant, or prioritize descriptors according to a certain frequency threshold).
- Alignment of person (and organization) items to Wikidata and VIAF:
  - This can be done using OpenRefine. An experiment using a subset of LexBib showed that about 25% of LexBib persons are found on Wikidata, and around 40% on VIAF. Person entity data on Wikidata (example for an incomplete Wikidata entry) contains ORCID identifiers, among other person metadata such as birth (and death) date, affiliations, etc. Person entity data on VIAF contains references to authored publications (of all domains), birth (and death) date, etc.
  - For persons not found on Wikidata, new Wikidata person items can be created.
  - Matching person items on Wikidata can be enriched with LexBib data (authorship relations).
- Alignment of event items to Wikidata:
  - This has been done for the EURALEX and eLex conference series (Wikidata items have been created and described, example).
- Alignment of bibliographical items to Wikidata (creation of Wikidata items and transfer of bibliodata):
  - A DOI-matching experiment has revealed that so far less than 1% of LexBib bibliographical items are found on Wikidata.
  - A transfer of bibliodata and author items to Wikidata enables the use of tools like Scholia for lexicographical articles.
  - A transfer of bibliodata to Wikidata enables its inclusion in WikiCite and OpenCitations, i.e. in open citation graphs.
  - Registering DOIs (via Crossref or DataCite, see comparison) for LexBib articles that do not have such an identifier (the vast majority) would include LexBib articles in citation graphs maintained by commercial providers (Web of Science, Scopus).
- Development of a metadata model for Lexical Resources such as dictionaries, and cataloguing of dictionaries:
  - Regarding the data model, work is in progress, see Dictionaries in LexBib, and LexMeta.
  - Regarding cataloguing, first experiments have been carried out, e.g. using datasets from Glottolog, an open repository that contains several thousand dictionary metadata sets, and Obelex-dict, a catalogue of e-dictionaries.
- Definition and representation of bibliographic item relations:
  - Citation relation (BibItem A cites BibItem B)
  - Review relation (BibItem A reviews BibItem B or Lexical Resource C)
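A DOI match against Wikidata, as in the experiment mentioned above, could start from a SPARQL query like the one built below. This sketch only constructs the query text and does not show how the experiment was actually run; the function name and the sample DOIs are made up. It relies on Wikidata's DOI property P356, whose values are conventionally stored in upper case, and on the `wdt:` prefix that the Wikidata Query Service predefines.

```python
def build_doi_query(dois):
    """Build a SPARQL query that looks up Wikidata items by DOI (P356)."""
    literals = []
    for doi in dois:
        # Upper-case the DOI and escape characters that would break the literal.
        value = doi.strip().upper().replace("\\", "\\\\").replace('"', '\\"')
        literals.append('"{}"'.format(value))
    return (
        "SELECT ?item ?doi WHERE {\n"
        "  VALUES ?doi { " + " ".join(literals) + " }\n"
        "  ?item wdt:P356 ?doi .\n"
        "}"
    )


# Placeholder DOIs for demonstration.
query = build_doi_query(["10.1000/example.1", "10.1000/example.2"])
print(query)
```

Sending DOIs in batches through a `VALUES` clause keeps the number of requests low compared with one query per item.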