# Bot Specification
This is a simple specification (spec) for the design and implementation of our new OA-signal bot. This spec is intended to remain more or less human readable. Key terms should be defined or easily checked via Wikipedia, etc.
## Flow
- A DOI appears somewhere on Wikipedia, or a manual request to process a DOI is made (see Producers below).
- We get the full text and media from PubMed Central.
- We use the JATS-to-MediaWiki conversion library to make wikitext.
- We upload the full text to Wikisource.
- We upload images and other files (limited to accepted file types) to Commons.
- We start a Wikidata item with article metadata and suitable statements
- Lastly, we signal the availability of the Wikisource/Commons/Wikidata materials in references cited elsewhere on Wikimedia projects, starting with the English Wikipedia.
- We do this by making a template that looks like this mockup.
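Put together, the flow could look roughly like the sketch below. Every function name here is a hypothetical placeholder (stubbed out so the example runs), not the bot's actual API; the real work is split across the Producer and Consumer threads described later in this spec.

```python
# Rough end-to-end sketch of the flow above; all names are hypothetical stubs.

def check_license_via_crossref(doi):
    return True                        # stub: query the CrossRef API for license data

def fetch_from_pmc(doi):
    return "<article/>", []            # stub: JATS XML plus a list of media files

def convert_jats_to_mediawiki(jats_xml):
    return "converted wikitext"        # stub: JATS-to-MediaWiki XSLT conversion

def upload_to_wikisource(doi, wikitext): ...
def upload_media_to_commons(doi, media_files): ...
def create_wikidata_item(doi): ...
def signal_availability_on_wikipedia(doi): ...

def process_doi(doi):
    """Run one DOI through the whole pipeline described in the Flow list."""
    if not check_license_via_crossref(doi):
        return                                   # skip articles without an open license
    jats_xml, media_files = fetch_from_pmc(doi)
    wikitext = convert_jats_to_mediawiki(jats_xml)
    upload_to_wikisource(doi, wikitext)
    upload_media_to_commons(doi, media_files)    # accepted file types only
    create_wikidata_item(doi)                    # article metadata and suitable statements
    signal_availability_on_wikipedia(doi)        # e.g. the mockup template in references
```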
We're making some assumptions here about what tools we're going to use:
- PubMedCentral API - archive of article source files and meta-data (making use of existing OAMI code as appropriate).
- CrossRef API - article license data by DOI.
- Linux (Unix-like) server
- Python programming language
- Virtual Environment for managing packages with pip and a requirements.txt file
- Modular development (python style)
- Object-oriented development
- Multi-threading inside a single Python process.
- Core python data structures (shelve, pickle, etc)
- The deque class from Python's core collections module for the queue system (it supports working on both ends of the queue; see the sketch after this list)
- Publisher/Subscriber (Pub/Sub) paradigm (internally referred to as Producer/Consumer for clarity)
- PyWikiBot for Mediawiki (Wikimedia project) interface
- Other various appropriate libraries (python modules), specified in requirements.txt file
- JATS-to-MediaWiki XSLT converter
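As a concrete (but hypothetical) illustration of the threading, deque, and Producer/Consumer assumptions above, a minimal sketch might look like this; `poll_for_new_citations` and `handle_article` are placeholders for the real input and output streams.

```python
# Minimal Producer/Consumer sketch: multiple threads sharing a collections.deque
# inside a single Python process. All function names are placeholders.

import threading
import time
from collections import deque

queue = deque()   # article IDs (e.g. DOIs) waiting to be handled

def poll_for_new_citations():
    """Placeholder for an input stream, e.g. a WMFLabs SQL-replica query."""
    return []

def handle_article(doi):
    """Placeholder for an output stream (download, convert, upload)."""
    print("handling", doi)

def producer():
    """Append newly seen article IDs to the right end of the queue."""
    while True:
        for doi in poll_for_new_citations():
            queue.append(doi)        # append()/popleft() are atomic in CPython
        time.sleep(60)               # polling interval chosen arbitrarily here

def consumer():
    """Pop article IDs from the left end of the queue and process them."""
    while True:
        try:
            doi = queue.popleft()
        except IndexError:           # queue is empty
            time.sleep(5)
            continue
        handle_article(doi)

# Daemon threads so this sketch exits together with the main program.
for target in (producer, consumer):
    threading.Thread(target=target, daemon=True).start()
```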
For the application itself, we'll use the following layers of abstraction:
- Data
  - Store as plain text or in-memory (shelve, pickle, deque); see the sketch after this list.
- Logging and Error-handling
  - Log useful, fully specified messages.
  - Handle errors gracefully with try/except, timeouts, max attempts, etc. (see the sketch after this list).
- Queue
  - Use a "Double-ended Queue" (deque).
  - The queue manages the backlog of articles (identified merely by some ID) to be handled by the application.
- Producers
  - Run multiple threads to handle input streams and feed into the Queue.
  - Primary stream - "Listen for New Citations", probably by making a regular, narrow WMFLabs SQL-replica query for:
    - New uses of the {{cite doi}} template, and perhaps other new citations of DOIs
    - Updates to existing uses of the {{cite doi}} template
    - Probably on this sub-stream: https://en.wikipedia.org/wiki/Special:RecentChangesLinked/Category:Cite_doi_templates
  - Secondary stream - "Jump the Queue" via a user-submitted POST request (or similar), e.g. through an on-wiki web form requesting a pass over a particular citation.
- Consumers
  - Run multiple threads to source from the Queue and handle output streams (only one planned for now).
  - Primary stream - "Publish Article Reference, Source Content, and Meta-data" to MediaWiki instances (namely Wikisource, Wikipedia, etc.). This requires the following distinct functions (see the Convert/Upload sketch after this list):
    - Download - use the JATS-to-MediaWiki handler script, ported to an internal class.
    - Convert - use the JATS-to-MediaWiki converter, porting its handler script to an internal class.
    - Upload - use the OAMI bot (or a custom fork) to upload media to Commons; upload to Wikisource and extend the upload script by @notconfusing.
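To make the Convert and Upload functions concrete, here is a hedged sketch that applies the JATS-to-MediaWiki XSLT with lxml and publishes the result with PyWikiBot. The stylesheet path, page title handling, and edit summary are assumptions, and PyWikiBot must already be configured (user-config.py) for the target wiki.

```python
# Illustrative only: convert a downloaded JATS file and publish it to Wikisource.
# The stylesheet filename and edit summary are assumptions, not settled choices.

from lxml import etree
import pywikibot

def convert_jats(jats_path, xslt_path="jats-to-mediawiki.xsl"):
    """Transform a JATS XML file into MediaWiki wikitext via the XSLT converter."""
    transform = etree.XSLT(etree.parse(xslt_path))
    return str(transform(etree.parse(jats_path)))

def publish_to_wikisource(title, wikitext):
    """Create or update a Wikisource page with the converted article text."""
    site = pywikibot.Site("en", "wikisource")   # assumes a configured user-config.py
    page = pywikibot.Page(site, title)
    page.text = wikitext
    page.save(summary="Open Access Signalling: import article full text")
```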
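For the Data and Logging/Error-handling layers, the intended style is roughly the following: state persisted with shelve, fully specified log messages, and network calls wrapped with a timeout, a retry limit, and backoff. The file name, logger name, and limits below are placeholders.

```python
# Sketch of the Data and Logging/Error-handling layers; names and limits are
# placeholders rather than settled choices.

import logging
import shelve
import time
import urllib.request

logging.basicConfig(format="%(asctime)s %(levelname)s %(name)s: %(message)s",
                    level=logging.INFO)
log = logging.getLogger("oa-signal-bot")

MAX_ATTEMPTS = 3
TIMEOUT_SECONDS = 30

def fetch_with_retries(url):
    """Fetch a URL, logging each failure and giving up after MAX_ATTEMPTS."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            with urllib.request.urlopen(url, timeout=TIMEOUT_SECONDS) as response:
                return response.read()
        except Exception as exc:
            log.warning("attempt %d/%d for %s failed: %s",
                        attempt, MAX_ATTEMPTS, url, exc)
            time.sleep(2 ** attempt)             # simple exponential backoff
    log.error("giving up on %s after %d attempts", url, MAX_ATTEMPTS)
    return None

def mark_processed(doi):
    """Record a handled DOI on disk so a restart does not repeat work."""
    db = shelve.open("processed_articles.db")
    try:
        db[doi] = time.time()
    finally:
        db.close()
```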
- The Wikiproject Open Access Signalling team is funded to maintain the bot through September 2014. Thereafter, our commitment to Open Access will drive us to maintain the bot in an unofficial capacity. We will also do our best to document the bot and make it easier for other volunteer developers to maintain it.