Gabriel Vîjială edited this page May 10, 2017 · 25 revisions

High Priority

Incremental updates (needed for the Liquid Investigations setup)

  • Management command that re-walks a collection to find files that are new / modified / deleted and adds them to the digest queue (snoop)
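
The re-walk described above could be sketched roughly as follows. This is only an illustration, not snoop's actual code: it assumes a previous snapshot of the collection is stored somewhere, and the names `snapshot` and `diff_snapshots`, as well as the choice of size plus modification time as a fingerprint, are hypothetical.

```python
import os
from typing import Dict, List, Tuple

# Hypothetical cheap fingerprint: (size in bytes, mtime in nanoseconds).
Fingerprint = Tuple[int, int]


def snapshot(root: str) -> Dict[str, Fingerprint]:
    """Walk `root` and record a fingerprint for every file, keyed by relative path."""
    result: Dict[str, Fingerprint] = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            stat = os.stat(full)
            result[os.path.relpath(full, root)] = (stat.st_size, stat.st_mtime_ns)
    return result


def diff_snapshots(
    old: Dict[str, Fingerprint], new: Dict[str, Fingerprint]
) -> Dict[str, List[str]]:
    """Compare two snapshots and list the paths that are new, modified or deleted."""
    old_paths, new_paths = set(old), set(new)
    return {
        "new": sorted(new_paths - old_paths),
        "modified": sorted(p for p in old_paths & new_paths if old[p] != new[p]),
        "deleted": sorted(old_paths - new_paths),
    }
```

A management command built on this idea would then push the paths under `new` and `modified` to the digest queue and mark the `deleted` ones as removed. A content hash would be more reliable than size and mtime, at the cost of reading every file.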

Filter search results (ui)

  • Create pill-like filters that can be manipulated in the search bar like here
  • Make a new search every time the filters change
  • Select / De-select all collections
  • Filter by subfolder
  • Filter emails by sender / receiver
  • Filter by filetype [boxes to tick for one or multiple filetypes before searching]
  • Filter by date or date range [date of file creation, date of email sent, date of file modification]
  • Filter by language
  • Filter for money-related terms: IBAN, sum of money (USD, CHF, EUR etc) (use regex searches)
  • Filter for web addresses and related terms: telephone, website, email, other contact info (use regex searches)
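
As a rough idea of what the regex-based money filters might look like, here is a sketch in Python. The patterns `IBAN_RE` and `MONEY_RE` and the helper names are assumptions made for illustration; they are deliberately loose, cover only the three currencies named above, and would need adapting if the search runs as Elasticsearch regexp queries, which use a different syntax.

```python
import re

# Loose IBAN shape: country code, two check digits, then 11 to 30
# alphanumerics, optionally separated by single spaces.
IBAN_RE = re.compile(r"\b[A-Z]{2}\d{2}(?: ?[A-Z0-9]){11,30}\b")

# Amount followed or preceded by one of a few currency codes.
MONEY_RE = re.compile(
    r"\b(?:(?:USD|CHF|EUR)\s?\d+(?:[.,]\d+)*|\d+(?:[.,]\d+)*\s?(?:USD|CHF|EUR))\b"
)


def iban_checksum_ok(candidate: str) -> bool:
    """Standard IBAN mod-97 check: move the first four characters to the end,
    map letters to numbers (A=10 ... Z=35) and test the remainder."""
    compact = candidate.replace(" ", "")
    rearranged = compact[4:] + compact[:4]
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


def find_ibans(text: str) -> list:
    """Return regex matches that also pass the checksum, without spaces."""
    return [
        m.group(0).replace(" ", "")
        for m in IBAN_RE.finditer(text)
        if iban_checksum_ok(m.group(0))
    ]


def find_money(text: str) -> list:
    """Return amounts such as '1,250.00 EUR' or 'USD 15' found in the text."""
    return MONEY_RE.findall(text)
```

The checksum step is there because the regex alone matches many strings that merely look like IBANs. One known weakness of this sketch: an IBAN directly followed by an upper-case word can be over-matched and then rejected by the checksum.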

Deduplicate search results

  • Data migration to make Document.sha1 unique and store all fields in another table named DocumentInstance described here
  • OR: Group results by hash (maybe use elasticsearch field collapsing?) (ui)
  • Display if a document is duplicated (ui)
  • Page to list all occurrences of duplicates for a document (ui)
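
For the "group results by hash" option, an Elasticsearch request body using field collapsing might look like the sketch below. The function name, the index field name `sha1` and the inner-hits name `duplicates` are assumptions; collapsing also requires the field to be indexed as a keyword or numeric field with doc values.

```python
def collapsed_search_body(query_text: str, hash_field: str = "sha1", size: int = 10) -> dict:
    """Build a search request body that returns one hit per distinct hash.

    `inner_hits` asks Elasticsearch to also return a few of the documents
    that were folded into each hit, which the UI could list as duplicates.
    """
    return {
        "query": {"query_string": {"query": query_text}},
        "collapse": {
            "field": hash_field,
            "inner_hits": {"name": "duplicates", "size": 5},
        },
        "size": size,
    }
```

The resulting dictionary would be sent as the JSON body of a `_search` request. Note that the total hit count in the response still refers to uncollapsed documents, so the UI would need another way to count distinct documents.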

Unify search and document preview

Batch search

  • implement for batch search (ui)

Normal Priority

Visualize results

  • pie chart by filetype
  • pie chart by language
  • date histogram
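
The data behind these three charts is just a set of counts. The sketch below computes them in plain Python over a list of result documents; in practice the same numbers would more likely come from Elasticsearch aggregations. The field names `filetype`, `lang` and `date`, and the monthly bucket size, are assumptions.

```python
from collections import Counter
from datetime import datetime


def chart_data(docs: list) -> dict:
    """Count documents by filetype, by language and by month of their date."""
    by_filetype = Counter(d.get("filetype", "unknown") for d in docs)
    by_language = Counter(d.get("lang", "unknown") for d in docs)
    by_month = Counter(
        datetime.fromisoformat(d["date"]).strftime("%Y-%m")
        for d in docs
        if d.get("date")  # documents without a date are left out of the histogram
    )
    return {
        "filetype": dict(by_filetype),
        "language": dict(by_language),
        "date_histogram": sorted(by_month.items()),
    }
```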

Various UI improvements

  • Separate scroll boxes for search results and document preview
  • Highlight the document that's currently being previewed
  • Click on the search icon to perform a search
  • For images, the document preview should load the image, if it's not too large
  • Make it easy to copy text from document preview
  • Embed hypothesis javascript snippet in document preview (ui)

Email threads

  • Tag all emails in a thread with the same ID (https://cr.yp.to/immhf/thread.html) (snoop)
  • Return the most recent result from an email thread, show the number of messages in the thread (ui)
  • Provide a way to see other messages in the same thread (ui)
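
One simple way to tag all emails in a thread with the same ID is to treat the `Message-ID`, `In-Reply-To` and `References` headers as links and group connected messages with a union-find structure. The sketch below does that; it is a simplification rather than the scheme from the page linked above, and the dictionary keys `message_id`, `in_reply_to` and `references` are assumed names for already-parsed headers.

```python
def assign_thread_ids(emails: list) -> dict:
    """Map each message ID to a thread ID shared by all linked messages.

    The thread ID is the lexicographically smallest message ID in the
    thread, which keeps the result deterministic.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            if root_b < root_a:
                root_a, root_b = root_b, root_a
            parent[root_b] = root_a  # the smaller ID stays as root

    for email in emails:
        msg_id = email["message_id"]
        find(msg_id)  # register messages that have no links at all
        linked = list(email.get("references", []))
        if email.get("in_reply_to"):
            linked.append(email["in_reply_to"])
        for other in linked:
            union(msg_id, other)

    return {email["message_id"]: find(email["message_id"]) for email in emails}
```

With thread IDs stored on each email, the UI could collapse results on that field to show only the most recent message and a count per thread.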

Permalinks with document id, md5 and sha1

  • If multiple documents share the same hash, present a menu with all of them
  • All permalink versions should contain <link rel=canonical> pointing to the document id permalink
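
A minimal sketch of the canonical tag, assuming a hypothetical URL scheme of `/doc/<id>` for the document id permalink (the real routes may differ):

```python
from html import escape


def canonical_link(base_url: str, doc_id) -> str:
    """Return the <link rel="canonical"> tag pointing at the id-based permalink."""
    href = f"{base_url.rstrip('/')}/doc/{doc_id}"
    return f'<link rel="canonical" href="{escape(href, quote=True)}">'
```

The md5 and sha1 permalink pages would all emit this same tag, so that search engines and bookmarking tools treat the id-based URL as the primary one.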

Links to quick searches from the document preview page

  • Email address (ui)
  • Names of people and companies (ui)
  • Bank accounts, phone numbers (ui)

Others

  • In the collections sidebar menu, show number of documents for each
  • Scan indexed files for viruses (ClamAV?)
  • Generalize access to dataset repositories
  • Read files from HTTP server in addition to local filesystem (nginx/apache directory listing? WebDAV?) (snoop)
  • System metrics (load, cpu, swap, disk free, memory of various services) - use code from github.com/python-diamond/Diamond/tree/master/src/collectors
  • Collection access permissions - map groups instead of users (search)

Datasets for demo server

  • 5secunde
  • Romanian gazette (MOFs)
  • Luxembourg gazette
  • US Embassy Cables
  • Enron dataset
  • OpenCorporates
  • EU tenders from TED
  • Offshore Leaks

Tooling

To be discussed

  • Remember which documents were visited/previewed
  • Detect entities (names, IBANs, emails, phone numbers, websites, authors of documents/PDFs, etc); normalize and index them separately or use custom tokenizer; make it easy to search for them
  • Have an approximate/similar results feature, suggestions (like Google)
  • Compare up to 3 documents in the same screen
  • Download emails as PDF

Higher-order search (TBD)

  • Use list of terms from external source (e.g. gist file)
  • Venn diagram to see overlap between sets of search results

Clipboard (TBD)

  • Highlight entities, add them to clipboard with one click
  • Use clipboard to make batch searches

Collection details page (TBD)

  • Name, description
  • Who owns and hosts the collection (collections can be indexed from a remote server hosted by someone else)
  • How many documents; breakdown by filetype, language, document dates (a few buckets of values, for example: "2013, 2014, 2015" or "May, June, July, August 2015" or "26 to 29, July 2015"), source (e.g. if the collection contains news articles from multiple publishers)
  • How to download the whole collection or import it into your own hoover

Done

  • Set filetype for image and video files (snoop)
  • Serve json data for a document (snoop)
  • Serve doc.html from ui so that documents are rendered by the front-end app (ui, search)
  • Make it easy to work on UI code using the demo server as backend
  • Allow text selection in search results
  • After choosing number of results from dropdown, auto-perform new search
  • Sort results by relevance, newest, oldest
  • Cache app.js (use webpack-generated hash and cache forever?)
  • Loading indicator for search results and document preview
  • Show document's word count in search results
  • New field parent_id, links to parent archive/email or top-level folder in the collection
  • Metrics for jobs (fields: queue name, data, start time, duration, success)
  • Descriptive search errors in UI (e.g. elasticsearch is down, query syntax error)