LexBib bibliodata workflow overview: Difference between revisions



* We use the [https://grobid.readthedocs.io GROBID tool]. zotexport.py (see below) places a copy of every PDF in a folder, which GROBID then processes, producing a TEI-XML representation of the PDF content. The article body (i.e. the part that usually starts after the abstract and ends before the references section) is enclosed in a tag called <body>.
* In cases where GROBID fails, we manually isolate the text body from the Zotero-produced TXT ([https://lexbib.org/blog/grobid-txt-validation/ tutorial]). Full texts that do not follow a standard structure, most typically because they lack an abstract (as is common in book chapters), are often not parsed properly by GROBID.
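
As an illustration of the body-extraction step, here is a minimal Python sketch that pulls the plain text out of the <body> element of a TEI-XML string. It is not the code LexBib actually runs: the function name <code>extract_body_text</code> and the sample document are hypothetical, and the sketch assumes the output uses the standard TEI namespace (<code>http://www.tei-c.org/ns/1.0</code>). It returns <code>None</code> when no usable body is found, which could serve as the signal to fall back to the manual TXT procedure.

```python
import xml.etree.ElementTree as ET

# Assumption: the TEI-XML uses the standard TEI namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def extract_body_text(tei_xml):
    """Return the plain text inside the TEI <body>, or None if it is missing or empty.

    Hypothetical helper for illustration only.
    """
    root = ET.fromstring(tei_xml)
    body = root.find(".//tei:text/tei:body", TEI_NS)
    if body is None:
        return None
    # Join all text nodes and collapse runs of whitespace into single spaces.
    text = " ".join(" ".join(body.itertext()).split())
    return text or None


# Hypothetical, heavily simplified TEI document (not real GROBID output).
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><profileDesc><abstract><p>Short abstract.</p></abstract></profileDesc></teiHeader>
  <text>
    <body><div><head>Introduction</head><p>Dictionaries are useful.</p></div></body>
    <back><div type="references"><listBibl/></div></back>
  </text>
</TEI>"""

if __name__ == "__main__":
    print(extract_body_text(SAMPLE))
```

In this sample, the abstract (in the header) and the references (in <back>) are left out of the result, matching the description of the body as the part between the two.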


=LexBib wikibase: Conversion into Linked Data=