Roadmap

High Priority

Incremental updates (needed for the Liquid Investigations setup)

Management command that re-walks a collection to find files that are new / modified / deleted and adds them to the digest queue (snoop)
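A minimal sketch of the re-walk comparison, assuming the previous walk is stored as a mapping from relative path to size and mtime. The names `Snapshot`, `walk_snapshot` and `diff_snapshots` are hypothetical and not part of snoop; a real command would also have to enqueue the resulting paths on the digest queue.

```python
import os
from typing import Dict, Set, Tuple

# Hypothetical snapshot type: relative path -> (size in bytes, mtime in ns).
Snapshot = Dict[str, Tuple[int, int]]


def walk_snapshot(root: str) -> Snapshot:
    """Walk `root` and record size and mtime for every file found."""
    snapshot: Snapshot = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            stat = os.stat(full)
            snapshot[os.path.relpath(full, root)] = (stat.st_size, stat.st_mtime_ns)
    return snapshot


def diff_snapshots(old: Snapshot, new: Snapshot) -> Dict[str, Set[str]]:
    """Classify paths as new, modified, or deleted between two walks."""
    old_paths, new_paths = set(old), set(new)
    return {
        "new": new_paths - old_paths,
        "deleted": old_paths - new_paths,
        # A file counts as modified when its size or mtime changed.
        "modified": {p for p in old_paths & new_paths if old[p] != new[p]},
    }
```

Comparing size and mtime is cheap but can miss edits that preserve both; comparing content hashes instead would be more reliable at the cost of reading every file.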

Filter search results (ui)

Deduplicate search results

Data migration to make Document.sha1 unique and store all fields in another table named DocumentInstance described here
OR: Group results by hash (maybe use elasticsearch field collapsing?) (ui)
Display if a document is duplicated (ui)
Page to list all occurrences of duplicates for a document (ui)
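As a rough illustration of the "group results by hash" option, the sketch below builds a search body that uses Elasticsearch field collapsing, plus a client-side fallback that groups hits that were already fetched. It assumes the hash is indexed in a keyword field named `sha1`; that field name and both function names are assumptions for illustration.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List


def collapsed_query(text: str, hash_field: str = "sha1") -> Dict[str, Any]:
    """Build a search body that asks Elasticsearch for one hit per hash."""
    return {
        "query": {"query_string": {"query": text}},
        # Field collapsing needs a keyword or numeric field with doc values.
        "collapse": {"field": hash_field},
    }


def group_by_hash(
    hits: Iterable[Dict[str, Any]], hash_field: str = "sha1"
) -> Dict[str, List[Dict[str, Any]]]:
    """Client-side alternative: bucket already-fetched hits by their hash."""
    groups: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for hit in hits:
        groups[hit[hash_field]].append(hit)
    return dict(groups)
```

A group with more than one hit is what the UI could flag as duplicated, and the group itself is the list the "all occurrences" page would show.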

Unify search and document preview

Style the document preview (see mockups) (ui)
Render document tree (ui) mockups: in search page, in document preview

Batch search

implement for batch search (ui)

Normal Priority

Visualize results

pie chart by filetype
pie chart by language
date histogram

Various UI improvements

Separate scroll boxes for search results and document preview
Highlight the document that's currently being previewed
Click on the search icon to perform a search
For images, the document preview should load the image, if it's not too large
Make it easy to copy text from document preview
Embed hypothesis javascript snippet in document preview (ui)

Email threads

Tag all emails in a thread with the same ID (https://cr.yp.to/immhf/thread.html) (snoop)
Return the most recent result from an email thread, show the number of messages in the thread (ui)
Provide a way to see other messages in the same thread (ui)
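One possible way to tag a thread, sketched under simplifying assumptions: messages are linked through their `Message-ID`, `In-Reply-To` and `References` headers, and the thread ID is simply the lexicographically smallest message ID in the linked set. The function names and the choice of thread ID are hypothetical; this is not necessarily what snoop would do.

```python
import re
from email import message_from_string
from email.message import Message
from typing import Dict, Iterable, List


def _ids(header_value: str) -> List[str]:
    """Extract <message-id> tokens from a header value."""
    return re.findall(r"<[^<>\s]+>", header_value or "")


def assign_thread_ids(messages: Iterable[Message]) -> Dict[str, str]:
    """Map each Message-ID to a thread ID using union-find over references."""
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a: str, b: str) -> None:
        ra, rb = find(a), find(b)
        if ra != rb:
            # Keep the smaller ID as root so the result is deterministic.
            if rb < ra:
                ra, rb = rb, ra
            parent[rb] = ra

    own_ids: List[str] = []
    for msg in messages:
        mids = _ids(msg.get("Message-ID", ""))
        if not mids:
            continue  # messages without an ID cannot be threaded here
        mid = mids[0]
        own_ids.append(mid)
        find(mid)
        linked = _ids(msg.get("References", "")) + _ids(msg.get("In-Reply-To", ""))
        for ref in linked:
            union(mid, ref)
    return {mid: find(mid) for mid in own_ids}
```

With every email carrying such an ID, the UI could group hits by it, show the most recent message, and display the group size as the thread's message count.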

Permalinks with document id, md5 and sha1

If multiple documents share the same hash, present a menu with all of them
All permalink versions should contain <link rel=canonical> pointing to the document id permalink

Links to quick searches from the document preview page

Email address (ui)
Names of people and companies (ui)
Bank accounts, phone numbers (ui)

Others

In the collections sidebar menu, show number of documents for each
Scan indexed files for viruses (ClamAV?)
Generalize access to dataset repositories
Read files from HTTP server in addition to local filesystem (nginx/apache directory listing? WebDAV?) (snoop)
System metrics (load, cpu, swap, disk free, memory of various services) - use code from github.com/python-diamond/Diamond/tree/master/src/collectors
Collection access permissions - map groups instead of users (search)

Datasets for demo server

Tooling

Travis
Auto deploy demo server when github master is updated
Installation package modelled after homebrew
Embedding solution to help with publishing

To be discussed

Remember which documents were visited/previewed
Detect entities (names, IBANs, emails, phone numbers, websites, authors of documents/PDFs, etc); normalize and index them separately or use custom tokenizer; make it easy to search for them
Offer approximate/similar results and search suggestions (like Google)
Compare up to 3 documents in the same screen
Download emails as PDF
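For the entity detection item above, IBANs are one case where normalization and validation are well defined: strip whitespace, move the first four characters to the end, map letters to numbers (A=10 ... Z=35) and check that the result is 1 modulo 97. The sketch below is only an illustration of that check; the function names and the candidate regex are assumptions, and it does not verify country-specific lengths.

```python
import re
from typing import List

# Loose candidate pattern: country code, two check digits, then 11 to 30
# alphanumerics that may be separated by single spaces.
IBAN_CANDIDATE = re.compile(r"\b[A-Z]{2}\d{2}(?: ?[A-Z0-9]){11,30}\b")


def normalize_iban(text: str) -> str:
    """Remove whitespace and upper-case, so variants index identically."""
    return re.sub(r"\s+", "", text).upper()


def is_valid_iban(candidate: str) -> bool:
    """Apply the mod-97 checksum to a normalized candidate."""
    iban = normalize_iban(candidate)
    if not re.fullmatch(r"[A-Z]{2}\d{2}[A-Z0-9]{11,30}", iban):
        return False
    rearranged = iban[4:] + iban[:4]
    # int(ch, 36) maps '0'-'9' to 0-9 and 'A'-'Z' to 10-35.
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


def find_ibans(text: str) -> List[str]:
    """Return normalized IBANs found in free text that pass the checksum."""
    return [
        normalize_iban(match.group(0))
        for match in IBAN_CANDIDATE.finditer(text)
        if is_valid_iban(match.group(0))
    ]
```

Indexing the normalized form in a separate field would let a search for an IBAN match regardless of how it was spaced in the source document. The candidate regex is greedy, so an IBAN followed directly by upper-case words may be missed; a custom tokenizer would need to handle that.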

Higher-order search (TBD)

Use list of terms from external source (e.g. gist file)
Venn diagram to see overlap between sets of search results

Clipboard (TBD)

Highlight entities, add them to clipboard with one click
Use clipboard to make batch searches

Collection details page (TBD)

Name, description
Who owns and hosts the collection (collections can be indexed from a remote server hosted by someone else)
How many documents; breakdown by filetype, language, document dates (a few buckets of values, for example: "2013, 2014, 2015" or "May, June, July, August 2015" or "26 to 29, July 2015"), source (e.g. if the collection contains news articles from multiple publishers)
How to download the whole collection or import it into your own hoover

Done

~~Set filetype for image and video files (snoop)~~
~~Serve json data for a document (snoop)~~
~~Serve doc.html from ui so that documents are rendered by the front-end app (ui, search)~~
~~Make it easy to work on UI code using the demo server as backend~~
~~Allow text selection in search results~~
~~After choosing number of results from dropdown, auto-perform new search~~
~~Sort results by relevance, newest, oldest~~
~~Cache app.js (use webpack-generated hash and cache forever?)~~
~~Loading indicator for search results and document preview~~
~~Show document's word count in search results~~
~~New field parent_id, links to parent archive/email or top-level folder in the collection~~
~~Metrics for jobs (fields: queue name, data, start time, duration, success)~~
~~Descriptive search errors in UI (e.g. elasticsearch is down, query syntax error)~~
