epoz edited this page Feb 6, 2012 · 15 revisions

Some design thoughts on the async uploads, parsers, scrapers and large recordset support.

Separate the web ui from the process doing the actual ingesting into the dao storage.

Modular: webui .. ingest .. parsercollection

The webui is part of the base BibServer. When a user submits an upload request, an 'uploadticket' is created and stored in the queue of the ingest module. An uploadticket contains: bibserver-username, collection, source_url (basically the fields in the current upload form). The upload form in the web ui returns an uploadticket ID, which can be stored for the user so that progress can be queried and displayed while the ticket is being processed.
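As a rough sketch of what an uploadticket could look like, the three fields named above can be held in a small record. The `ticket_id`, `state` and `errors` fields, and the class name itself, are assumptions made for illustration and are not part of any existing BibServer code.

```python
import uuid
from dataclasses import dataclass, field, asdict


@dataclass
class UploadTicket:
    # Fields named in the design notes (the fields of the upload form).
    username: str
    collection: str
    source_url: str
    # Hypothetical bookkeeping fields, assumed for this sketch.
    ticket_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    state: str = "queued"
    errors: list = field(default_factory=list)


# The web ui would create a ticket and hand back its ID to the user.
ticket = UploadTicket("alice", "my_papers", "http://example.org/refs.bib")
ticket_as_dict = asdict(ticket)  # plain dict, easy to serialise to JSON
```

Returning only `ticket.ticket_id` to the browser keeps the web ui decoupled from the ingest process: the ui only needs a way to look a ticket up by ID and read its `state`.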

The ingest module runs as a separate daemon process from the webui. It monitors a queue of uploadtickets; when a new ticket is submitted, the source_url is downloaded and ingested into the index. Progress and any error messages are recorded on the ticket.
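The ingest loop described above might be sketched as follows. This is a minimal illustration, not the BibServer implementation: tickets are plain dicts, the `download` and `index` callables are hypothetical stand-ins passed in by the caller, and the loop stops when the queue is empty, whereas a real daemon would block waiting for new tickets.

```python
import queue


def ingest_worker(tickets, download, index):
    """Process every queued ticket, recording progress and errors on it."""
    while True:
        try:
            ticket = tickets.get_nowait()
        except queue.Empty:
            break  # a real daemon would wait for the next ticket instead
        try:
            ticket["state"] = "downloading"
            data = download(ticket["source_url"])
            ticket["state"] = "indexing"
            index(ticket["collection"], data)
            ticket["state"] = "done"
        except Exception as exc:
            # Never fail silently: keep the message on the ticket.
            ticket["state"] = "failed"
            ticket["errors"].append(str(exc))


# Stand-ins for the real download and index steps (assumptions).
def fake_download(url):
    if "missing" in url:
        raise IOError("404 for " + url)
    return b"@article{x, title={A Paper}}"


indexed = []
ok = {"source_url": "http://example.org/a.bib", "collection": "c1",
      "state": "queued", "errors": []}
bad = {"source_url": "http://example.org/missing.bib", "collection": "c2",
       "state": "queued", "errors": []}

tickets = queue.Queue()
tickets.put(ok)
tickets.put(bad)
ingest_worker(tickets, fake_download, lambda coll, data: indexed.append((coll, data)))
```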

The parsercollection is a set of parser executables which accept the required format to be parsed on stdin, and output BibJSON on stdout. This means that parsers can be written in any language, and be submitted by users. There will be a minimum set of parsers supplied by BibServer to bootstrap the process: JSON, BibTex, CSV, RIS. Having the parser function as a script makes it possible to unify 'parsers' and 'scrapers', as a site-specific parser/scraper can be part of this collection. As each parser is simply an executable, BibServer users could also run it on their own desktop machines without installing an entire BibServer environment.
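To illustrate the stdin-to-stdout contract, here is a minimal sketch of a CSV parser written as a function over two streams. The output shape (a top-level `records` list, with CSV column names used directly as keys) is an assumption about the BibJSON layout, and the function name is hypothetical. A standalone script would simply call `csv_to_bibjson(sys.stdin, sys.stdout)`.

```python
import csv
import io
import json


def csv_to_bibjson(infile, outfile):
    """Read CSV with a header row from infile, write JSON records to outfile."""
    reader = csv.DictReader(infile)
    records = [dict(row) for row in reader]
    # Assumed layout: one object holding a list of records.
    json.dump({"records": records}, outfile, indent=2)


# In-memory streams stand in for stdin and stdout in this demonstration.
source = io.StringIO("title,author,year\nA Paper,Jane Doe,2011\n")
out = io.StringIO()
csv_to_bibjson(source, out)
parsed = json.loads(out.getvalue())
```

Because the function only touches the two streams it is given, the same code works inside the ingest pipeline and on a user's desktop.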

Static file storage - When the ingest module downloads data from a source_url, it is stored on disk in a cache, where the filename of the stored data is a hash of the file itself. In the uploadticket we store the date/time of the download and the hash of the downloaded data. If we need to refer to the data in future, we can go back to the original source data (for debugging of error messages, renewed ingest etc.).
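A content-addressed cache of this kind could be sketched as below. The choice of SHA-256, the flat directory layout, and the names `store_in_cache`, `data_hash` and `downloaded_at` are all assumptions; the notes do not specify a hash function or field names.

```python
import hashlib
import os
import tempfile
from datetime import datetime, timezone


def store_in_cache(cache_dir, data):
    """Write data to cache_dir under its own hash; return ticket metadata."""
    digest = hashlib.sha256(data).hexdigest()  # hash choice is an assumption
    path = os.path.join(cache_dir, digest)
    if not os.path.exists(path):  # identical downloads are stored only once
        with open(path, "wb") as fh:
            fh.write(data)
    return {
        "data_hash": digest,
        "downloaded_at": datetime.now(timezone.utc).isoformat(),
    }


cache_dir = tempfile.mkdtemp()
payload = b"@article{x, title={A Paper}}"
first = store_in_cache(cache_dir, payload)
second = store_in_cache(cache_dir, payload)  # same bytes, same filename
```

Naming files by hash means a repeated download of unchanged data costs no extra disk space, and the hash stored on the ticket is enough to find the exact bytes that were ingested.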

Error handling

The current version of BibServer fails silently if a problem occurs with an upload. When an error occurs, as much information as possible needs to be stored with the uploadticket so that users can use it for debugging.

If a parser/scraper encounters any problems, it writes them to stderr. The ingest pipeline stores this output with the uploadticket for diagnostic purposes.
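Capturing a parser's stderr might look like the sketch below. The `run_parser` helper and the dict-shaped ticket are hypothetical, and the "parser" here is a one-line Python command that always fails, used only so the example runs without any real parser installed.

```python
import subprocess
import sys


def run_parser(cmd, data):
    """Feed data to a parser executable; return (exit code, stdout, stderr)."""
    proc = subprocess.run(cmd, input=data,
                          stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    return proc.returncode, proc.stdout, proc.stderr.decode("utf-8", "replace")


# Stand-in parser (assumption): reads stdin, complains on stderr, exits 1.
fake_parser = [
    sys.executable, "-c",
    "import sys; sys.stdin.read(); "
    "sys.stderr.write('line 3: missing title\\n'); sys.exit(1)",
]

ticket = {"state": "queued", "errors": []}
code, out, err = run_parser(fake_parser, b"not really bibtex")
if code != 0 or err:
    # Keep the parser's own diagnostics with the ticket.
    ticket["state"] = "failed"
    ticket["errors"].append(err.strip())
```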

Parsers/Scrapers

BibJSON, BibTex, CSV, RIS, OAI-PMH

Medline, PubMed, NARCIS

Mendeley, Zotero

See: http://lists.okfn.org/pipermail/openbiblio-dev/2012-February/000618.html on why we need site-specific parsing/scraping.

See: https://gist.github.com/1731588 for a convertor script doing BNB RDF-XML to JSON

Relevant tickets

See #159

Links

https://github.com/vkholodkov/nginx-upload-module/tree/2.2 (nginx upload module): consider this if we ever need to upload huge files
