AsyncUploadDesign
Some design thoughts on the async uploads, parsers, scrapers and large recordset support.
Separate the web ui from the process doing the actual ingesting into the dao storage.
Modular: webui .. ingest .. parsercollection
The webui is part of the base BibServer. When a user submits an upload request, an 'uploadticket' is created and stored in the queue of the ingest module. An uploadticket contains: bibserver-username, collection, source_url (basically the fields in the current upload form). The upload form in the web ui returns an uploadticket ID, which can be stored for the user so that progress can be queried and displayed while the ticket is being processed.
The ingest module runs as a separate daemon process from the webui. It monitors a queue of ingest tickets. When a new ticket is submitted, the source_url is downloaded and ingested into the index. Progress and any error messages are recorded on the ticket.
IngestTicket
owner
collection
source_url
format
bibjson_url
data_md5
state - [new, downloading, downloaded, failed, parsing, parsed, populating_index, done]
errors - []
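The ticket fields above could be modelled roughly as follows. This is a minimal sketch, not the BibServer implementation: the class layout, the set_state and fail helpers, and the uuid-based id are assumptions made for illustration, since the design only says that a ticket ID is returned to the web ui.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from typing import List, Optional

# States as listed in the design; "failed" can be entered from any step.
STATES = ["new", "downloading", "downloaded", "failed",
          "parsing", "parsed", "populating_index", "done"]


@dataclass
class IngestTicket:
    owner: str
    collection: str
    source_url: str
    format: str
    bibjson_url: Optional[str] = None
    data_md5: Optional[str] = None
    state: str = "new"
    errors: List[str] = field(default_factory=list)
    # Hypothetical: the design does not say how ticket IDs are generated.
    id: str = field(default_factory=lambda: uuid.uuid4().hex)

    def set_state(self, state: str) -> None:
        """Move the ticket to another known state."""
        if state not in STATES:
            raise ValueError("unknown state: %s" % state)
        self.state = state

    def fail(self, message: str) -> None:
        """Record an error message and mark the ticket as failed."""
        self.errors.append(message)
        self.state = "failed"

    def to_json(self) -> str:
        """Serialise the ticket so it can be stored in the queue."""
        return json.dumps(asdict(self), sort_keys=True)
```

Keeping errors as a list lets several problems from one ingest run accumulate on the same ticket.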
The parsercollection is a set of parser executables which accept the required format to be parsed on stdin, and output BibJSON on stdout. This means that parsers can be written in any language, and be submitted by users. There will be a minimum set of parsers supplied by BibServer to bootstrap the process: JSON, BibTex, CSV, RIS. Having the parser function as a script makes it possible to unify 'parsers' and 'scrapers', as we can have a site-specific parser/scraper be part of this collection. As it is simply an executable, BibServer users could also run it on their own desktop machines without installing an entire BibServer environment.
Static file storage - When the ingest module downloads data from a source_url, it is stored on disk in a cache, where the filename of the stored data is a hash of the file itself. In the uploadticket we store the date/time of the download and the hash of the downloaded data. If in future we need to refer to the data, we can refer to the original source (for debugging of error messages, renewed ingest etc.).
The current version of BibServer fails silently if a problem occurs with an upload. When an error occurs, as much information as possible should be stored with the uploadticket so that users can use it for debugging.
If a parser/scraper encounters any problems, they are output to stderr. The ingest pipeline stores this output with the uploadticket for diagnostic purposes.
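A sketch of the hash-named cache described above. MD5 is used because the ticket has a data_md5 field. The function name store_in_cache, the cache_dir parameter and the downloaded_at key are hypothetical names chosen for this example.

```python
import hashlib
import os
from datetime import datetime, timezone


def store_in_cache(data: bytes, cache_dir: str) -> dict:
    """Write data to cache_dir under its MD5 hex digest and return ticket fields."""
    digest = hashlib.md5(data).hexdigest()
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, digest)
    if not os.path.exists(path):  # identical content is only written once
        with open(path, "wb") as f:
            f.write(data)
    return {
        "data_md5": digest,
        # Hypothetical key: the design only says the date/time is stored.
        "downloaded_at": datetime.now(timezone.utc).isoformat(),
        "path": path,
    }
```

Because the filename is derived from the content, downloading the same unchanged source twice yields the same cache file.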
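One way the ingest side could capture stderr is sketched below. The run_parser name and the use of a plain dict for the ticket are illustrative, and treating a non-zero exit code as failure is an assumption that the design does not spell out.

```python
import subprocess
import sys


def run_parser(parser_cmd, raw_data: bytes, ticket: dict):
    """Feed raw_data to the parser on stdin; return its stdout, or None on failure."""
    ticket["state"] = "parsing"
    proc = subprocess.run(parser_cmd, input=raw_data,
                          stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stderr_text = proc.stderr.decode("utf-8", errors="replace").strip()
    if stderr_text:
        # Keep whatever the parser reported, even if it still succeeded.
        ticket["errors"].append(stderr_text)
    # Assumption: a non-zero exit code means the parse failed.
    if proc.returncode != 0:
        ticket["state"] = "failed"
        return None
    ticket["state"] = "parsed"
    return proc.stdout
```

For example, run_parser(["./bibtex_parser"], data, ticket) would run a (hypothetical) executable named bibtex_parser from the current directory.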
A BibServer Ingest parser is an executable file that accepts a file to parse on stdin, processes the file and outputs valid BibJSON output on stdout. For certain types of parsers it is possible to specify some parameters to use as input, for example to do a search query on an external website, and convert the output from that website.
For auto-discovery of installed parsers by the BibServer ingest process, a parser MUST accept at least one command-line option: -bibserver. The -bibserver option outputs a JSON dict containing the BibJSON version output by this parser, and other keys that are used to indicate that this is a valid parser plugin.
When the ingest.py script is started, it scans the parser plugin directory specified by the PARSER_PLUGIN_PATH config entry. For each file found in that directory, it attempts to run the file as a subprocess with the -bibserver command-line option and captures stdout.
Example -bibserver output
{"display_name": "BibTex", "contact": "[email protected]", "bibserver_plugin": true, "BibJSON_version": "0.81", "does_download":true}
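The discovery scan could look roughly like this. It is a sketch under assumptions rather than the actual ingest.py code: the discover_parsers name and the 10 second timeout are invented for the example, and a file is accepted only if it prints a JSON dict with "bibserver_plugin" set to true.

```python
import json
import os
import subprocess


def discover_parsers(plugin_dir: str) -> dict:
    """Return {path: plugin_info} for every file that answers -bibserver."""
    plugins = {}
    for name in sorted(os.listdir(plugin_dir)):
        path = os.path.join(plugin_dir, name)
        if not os.path.isfile(path):
            continue
        try:
            proc = subprocess.run([path, "-bibserver"],
                                  stdout=subprocess.PIPE,
                                  stderr=subprocess.PIPE,
                                  timeout=10)  # assumed limit
            info = json.loads(proc.stdout.decode("utf-8"))
        except (OSError, subprocess.TimeoutExpired, ValueError):
            continue  # not executable, hung, or did not print JSON
        if isinstance(info, dict) and info.get("bibserver_plugin") is True:
            plugins[path] = info
    return plugins
```

Files that cannot be executed or that print something other than JSON are skipped silently here; a real scan might want to log them.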
What does the "does_download" flag do? It signals that this plugin can download content itself when called with the -bibserver_download option. In this mode the plugin does not parse, but fetches the data specified on stdin and writes the downloaded content to stdout. For example, a Wikipedia plugin could accept a Wikipedia page name in place of the input URL. The plugin downloads the raw Wikipedia content itself, since it knows how to build the URL for that page.
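Putting the three modes together, a plugin might be structured as in the sketch below. Everything here is illustrative: the CSV handling, the PLUGIN_INFO values, the {"records": [...]} output shape (not validated against the BibJSON spec) and the choice to read a plain URL from stdin in download mode are all assumptions. A site-specific plugin such as the Wikipedia example would build its own URL instead.

```python
import csv
import io
import json
import sys
import urllib.request

# Illustrative values; only the key names come from the example output.
PLUGIN_INFO = {
    "display_name": "CSV (sketch)",
    "bibserver_plugin": True,
    "BibJSON_version": "0.81",
    "does_download": True,
}


def download(source: bytes) -> bytes:
    """Fetch the URL given on stdin and return the raw content."""
    url = source.decode("utf-8").strip()
    with urllib.request.urlopen(url) as response:
        return response.read()


def parse(raw: bytes) -> dict:
    """Turn CSV with a header row into one record per row."""
    rows = csv.DictReader(io.StringIO(raw.decode("utf-8")))
    return {"records": [dict(row) for row in rows]}


def main(argv, stdin, stdout) -> int:
    """stdin and stdout are binary streams, e.g. sys.stdin.buffer."""
    if "-bibserver" in argv:
        stdout.write(json.dumps(PLUGIN_INFO).encode("utf-8"))
        return 0
    data = stdin.read()
    try:
        if "-bibserver_download" in argv:
            stdout.write(download(data))
        else:
            stdout.write(json.dumps(parse(data)).encode("utf-8"))
    except Exception as exc:
        # Problems go to stderr so the ingest process can store them.
        sys.stderr.write("error: %s\n" % exc)
        return 1
    return 0
```

To run this as an executable plugin, the file would end with sys.exit(main(sys.argv[1:], sys.stdin.buffer, sys.stdout.buffer)) and carry a suitable shebang line.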
Samples provided by BibServer: BibJSON, BibTex, CSV, RIS
New parsers to write: OAI-PMH, MARC, GoogleDocs Spreadsheet
Possible sites to support scraping: Medline, PubMed, NARCIS
Tools to support: Mendeley, Zotero
Why do we need site-specific parsing/scraping?
A convertor script doing BNB RDF-XML to JSON
Supporting uploads in Nginx, consider this if we ever need to upload huge files