
LexBib bibliodata workflow overview: Difference between revisions

From LexBib
* Completeness of publication metadata is manually checked. The editing team uses [https://www.zotero.org/groups/ Zotero group synchronization].
* Every item is annotated with the first author's location. An English Wikipedia page URL (as an unambiguous identifier) is placed in the Zotero "extra" field. zotexport.py (see below) maps that URL to the corresponding LexBib place item.
* The Zotero "language" field (publication language) must contain a two-letter ISO 639-1 or a three-letter ISO 639-3 language code.
* In the sources, person names (author, editor) are often disordered or incomplete. We try to validate the correct name forms already at this stage. Proper disambiguation (with an unambiguous ID) is not possible in Zotero.
* Items are annotated with Zotero tags that contain shortcodes, which are interpreted by zotexport.py. The shortcodes point either to LexBib wikibase items (Q-ID), or to pre-defined values:
** '':container Qxx'' points to a containing item (a journal issue, an edited volume)
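
The tag and language conventions above can be checked mechanically before export. The following is a minimal sketch, not the code used in zotexport.py: it assumes that a shortcode tag has the shape '':shortcode value'' (as in '':container Qxx''), and the function names and the '':type'' example tag are hypothetical. The language check only tests the shape of the code (two or three lowercase letters); it does not look the code up in the ISO 639 tables.

```python
import re

# Assumed tag shape: ":<shortcode> <value>", e.g. ":container Q123".
TAG_PATTERN = re.compile(r"^:(?P<code>[A-Za-z]+)\s+(?P<value>\S.*)$")
# A wikibase item ID: "Q" followed by a positive integer.
QID_PATTERN = re.compile(r"^Q[1-9]\d*$")
# Shape of an ISO 639-1 (two-letter) or ISO 639-3 (three-letter) code.
LANG_PATTERN = re.compile(r"^[a-z]{2,3}$")


def parse_shortcode_tag(tag):
    """Return (shortcode, value, is_qid), or None for an ordinary tag."""
    match = TAG_PATTERN.match(tag.strip())
    if match is None:
        return None
    value = match.group("value").strip()
    return match.group("code"), value, bool(QID_PATTERN.match(value))


def has_language_code_shape(value):
    """True if the value looks like a two- or three-letter language code."""
    return bool(LANG_PATTERN.match(value.strip()))


if __name__ == "__main__":
    for tag in [":container Q123", ":type review", "lexicography"]:
        print(repr(tag), "->", parse_shortcode_tag(tag))
    for lang in ["en", "eng", "English"]:
        print(repr(lang), "->", has_language_code_shape(lang))
```

Tags that do not start with a colon are returned as <code>None</code>, so ordinary keyword tags pass through untouched.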
* We use the [https://grobid.readthedocs.io GROBID tool]. zotexport.py (see below) leaves a copy of every PDF in a folder, which is then processed by GROBID; GROBID produces a TEI-XML representation of the PDF content. The article body (i.e. the part that usually starts after the abstract and ends before the references section) is enclosed in a <body> element.
* In cases where GROBID fails, we manually isolate the text body from the Zotero-produced TXT ([https://lexbib.org/blog/grobid-txt-validation/ tutorial]).
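
Reading the article body out of a GROBID result can be sketched as follows. This is an illustration using only the Python standard library, not the LexBib code itself; the function name <code>extract_body_text</code> is hypothetical. It assumes that the TEI document uses the standard TEI namespace and that body text sits in <code>p</code> elements; section headings are skipped.

```python
import xml.etree.ElementTree as ET

# Standard TEI namespace, as used in TEI-XML documents.
TEI_URI = "http://www.tei-c.org/ns/1.0"
TEI_NS = {"tei": TEI_URI}


def extract_body_text(tei_xml):
    """Return the paragraphs inside <text>/<body>, one per line.

    Returns None when the document has no <body> element at all.
    """
    root = ET.fromstring(tei_xml)
    body = root.find(".//tei:text/tei:body", TEI_NS)
    if body is None:
        return None
    paragraphs = []
    for p in body.iter("{%s}p" % TEI_URI):
        # Join nested inline elements and collapse runs of whitespace.
        text = " ".join("".join(p.itertext()).split())
        if text:
            paragraphs.append(text)
    return "\n".join(paragraphs)


if __name__ == "__main__":
    sample = (
        '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>'
        "<div><head>Introduction</head><p>Body paragraph.</p></div>"
        "</body></text></TEI>"
    )
    print(extract_body_text(sample))
```

A result of <code>None</code> or an empty string is a reasonable signal that the item needs the manual fallback described above.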
=Migration to Wikibase=
* Items are exported from Zotero using a custom [https://github.com/elexis-eu/elexifinder/blob/master/Zotero/LexBib_JSON.js JSON exporter].
* That export is processed using [https://github.com/elexis-eu/elexifinder/blob/master/wikibase/zotexport.py zotexport.py]. This script prepares the upload of the items to Wikibase:
** Author locations and Zotero tags are interpreted. Unknown places and container items are created, so that the bibliographical item can be linked to them.
** PDFs are stored for GROBID.
** Zotero fields are mapped to LexBib wikibase properties.
** New items are assigned a LexBib URI, which is attached to the Zotero item as a "link attachment" and also written to the "archive location" field. The Zotero URI of the item is mapped to LexBib wikibase property [[Property:P16|P16]]; the Zotero URIs of the PDF and TXT attachments are added to that P16 statement as qualifiers.
* [https://github.com/elexis-eu/elexifinder/blob/master/wikibase/bibimport.py bibimport.py] uploads the resulting semantic triples ("wikibase statements") to LexBib wikibase.
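
The P16 statement with its attachment qualifiers can be pictured as a small nested structure. The sketch below is only an illustration: the dictionary layout is not the Wikibase JSON format and not the format exchanged between zotexport.py and bibimport.py, and the qualifier property IDs are deliberately left as placeholders (<code>P_PDF</code>, <code>P_TXT</code>) because they are not given here. The function name and the example URIs are likewise hypothetical.

```python
def build_zotero_statement(zotero_uri, qualifiers=None):
    """Build a P16 statement for an item's Zotero URI.

    qualifiers maps a qualifier property ID to an attachment URI;
    entries without a URI are skipped.
    """
    statement = {"property": "P16", "value": zotero_uri, "qualifiers": []}
    for prop, uri in (qualifiers or {}).items():
        if uri:
            statement["qualifiers"].append({"property": prop, "value": uri})
    return statement


if __name__ == "__main__":
    # Placeholder qualifier IDs and made-up example URIs.
    statement = build_zotero_statement(
        "https://example.org/zotero/items/ITEM",
        {"P_PDF": "https://example.org/zotero/items/PDF", "P_TXT": None},
    )
    print(statement)
```

Keeping the attachment URIs as qualifiers of the same statement, rather than as separate statements, ties each PDF and TXT file to the Zotero record it came from.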
=Author Disambiguation: Open Refine=
=Elexifinder Export=
=Maintenance tasks=