Bot Specification
Max Klein edited this page May 20, 2014 · 16 revisions
This is a simple specification (spec) for the design and implementation of our new OA-signal bot. The spec is intended to remain more or less human-readable. Key terms should either be defined here or be easy to check via Wikipedia, etc.
We're making some assumptions here about what tools we're going to use:
- PubMedCentral API - archive of article source files and meta-data.
- CrossRef API - article license data by DOI.
- Linux (Unix-like) server
- Python programming language
- Virtual Environment for managing packages with pip and a requirements.txt file
- Modular development (python style)
- Object-oriented development
- Multi-threading inside a single Python process (pseudo-parallelism)
- Core Python data structures and persistence modules (shelve, pickle, etc.)
- Deque from Python's core collections module for the queue system (enables working on both ends of the queue)
- Publisher/Subscriber (Pub/Sub) paradigm (internally referred to as Producer/Consumer for clarity)
- Pywikibot for the MediaWiki (Wikimedia project) interface
- Various other appropriate libraries (Python modules), specified in the requirements.txt file
- JATS-to-MediaWiki XSLT converter
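As a minimal sketch of why a double-ended queue suits the tool list above: `collections.deque` supports cheap appends and pops at both ends, so routine work can join at the back while urgent work is placed at the front. The article IDs below are made-up placeholders, not real DOIs.

```python
from collections import deque

# Hypothetical article IDs (placeholders, for illustration only).
queue = deque()

# Routine work joins at the right-hand end.
queue.append("10.0000/example.1")
queue.append("10.0000/example.2")

# Urgent work is placed at the left-hand end, ahead of everything else.
queue.appendleft("10.0000/urgent.1")

# Workers take from the left, so the urgent item is handled first.
first = queue.popleft()
remaining = list(queue)
print(first, remaining)
```

A plain list would also work, but inserting and removing at the front of a list costs time proportional to its length, whereas a deque handles both ends in constant time.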
For the application itself, we'll use the following layers of abstraction:
- Data
  - Store as plain text or in-memory (shelve, pickle, deque)
- Logging and Error-handling
  - Log useful, fully specified messages.
  - Handle errors gracefully with try/except, timeouts, max attempts, etc.
- Queue
  - Use "Double-ended Queue" (Deque)
  - The Queue manages the list of Articles (identified merely by some ID) to be handled by the application.
- Producers
  - Run multiple threads to handle input streams, feed into Queue
    - Primary stream - "Listen for New Citations" probably by making a regular, narrow API query for:
      - New uses of {{cite doi}} template
      - Updates to existing uses of {{cite doi}} template.
    - Secondary stream - "Jump the Queue" by user-submitted POST request (or similar), e.g. through on-wiki web form requesting a pass over a particular citation.
- Consumers
  - Run multiple threads to source from Queue and handle output streams (only one planned for now).
    - Primary stream - "Publish Article Reference, Source Content, and Meta-data" to MediaWiki instances (namely WikiSource, Wikipedia, etc). Requires the following distinct functions:
      - Download
        - Use JATS-to-MediaWiki handler script, port to internal class.
      - Convert
        - Use JATS-to-MediaWiki converter, port handler script to internal class.
      - Upload
        - Include and extend upload script by @notconfusing
Download
- Primary stream - "Publish Article Reference, Source Content, and Meta-data" to MediaWiki instances (namely WikiSource, Wikipedia, etc). Requires the following distinct functions:
- Run multiple threads to source from Queue and handle output streams (only one planned for now).
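The layers above could fit together roughly as follows. This is only a sketch under stated assumptions: `ArticleQueue`, `with_retries`, `producer`, `consumer`, and the `download`, `convert`, and `upload` stubs are hypothetical names introduced for illustration. The real stages would call PubMed Central, the JATS-to-MediaWiki converter, and the upload script, none of which are shown here.

```python
import logging
import threading
import time
from collections import deque

log = logging.getLogger("oa-signal-bot-sketch")

class ArticleQueue:
    """Thread-safe deque of article IDs (hypothetical helper)."""
    def __init__(self):
        self._items = deque()
        self._cond = threading.Condition()

    def put(self, article_id, urgent=False):
        with self._cond:
            # "Jump the Queue" requests go to the front, the rest to the back.
            if urgent:
                self._items.appendleft(article_id)
            else:
                self._items.append(article_id)
            self._cond.notify()

    def get(self, timeout=0.1):
        """Return the next ID, or None if nothing arrives within timeout."""
        with self._cond:
            if not self._cond.wait_for(lambda: self._items, timeout=timeout):
                return None
            return self._items.popleft()

def with_retries(func, *args, max_attempts=3, delay=0.0):
    """Call func(*args), retrying on any exception up to max_attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func(*args)
        except Exception as exc:
            log.warning("%s failed (attempt %d/%d): %s",
                        func.__name__, attempt, max_attempts, exc)
            if attempt == max_attempts:
                raise
            time.sleep(delay)

# Placeholder stages: stand-ins for the real Download/Convert/Upload code.
def download(article_id):
    return "<article id='%s'/>" % article_id

def convert(jats_xml):
    return "wikitext for " + jats_xml

def upload(article_id, wikitext, published):
    published.append(article_id)

def producer(queue, article_ids, urgent=False):
    for article_id in article_ids:
        queue.put(article_id, urgent=urgent)

def consumer(queue, published, stop):
    while True:
        article_id = queue.get()
        if article_id is None:
            if stop.is_set():
                return  # queue drained and no more input expected
            continue
        try:
            xml = with_retries(download, article_id)
            wikitext = with_retries(convert, xml)
            with_retries(upload, article_id, wikitext, published)
        except Exception:
            log.error("giving up on %s", article_id)

def run_demo():
    queue, published, stop = ArticleQueue(), [], threading.Event()
    routine = threading.Thread(target=producer, args=(queue, ["doi-1", "doi-2"]))
    urgent = threading.Thread(target=producer, args=(queue, ["doi-urgent"], True))
    worker = threading.Thread(target=consumer, args=(queue, published, stop))
    # Producers finish before the worker starts only to keep the demo deterministic.
    routine.start(); routine.join()
    urgent.start(); urgent.join()
    worker.start()
    stop.set()
    worker.join()
    return published

if __name__ == "__main__":
    print(run_demo())
```

In this sketch the urgent ID is published first because it was placed at the front of the deque. A long-running bot would keep producers and consumers alive concurrently rather than joining them in sequence.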