Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updates to README.md #33

Open
wants to merge 2 commits into
base: master
Choose a base branch
from
Open

Updates to README.md #33

wants to merge 2 commits into from

Conversation

jackrea07
Copy link

  1. Included specific example of suffix tree and added reasons why they were chosen for this specific problem.
  2. Added general reminders for contributors about how to safely edit code in repository and make pull requests.

@art-w
Copy link
Owner

art-w commented Feb 15, 2024

Thanks! As a general rule, I would prefer if the readme only contained informations that are specific to sherlodoc rather than general knowledge that can be acquired from the internet :) (otherwise it becomes harder to know where to stop in what needs to be explained)

Regarding the contributor guide, I'm afraid we don't do anything special that requires instructions (and sorry we don't plan to have a dev branch atm).

However I agree that our use of suffix trees could be expanded upon since we do stuff that are different from the wikipedia description:

  • You are right that we are using a "compressed" suffix tree, such that there are no nodes in the tree with a single child, which makes the suffix tree size and construction linear in the size of the input (your ascii art appears slightly broken in the github preview)
  • The leaves of our suffix trees contain an array of matching results (rather than a $string_loc... our suffix trees are dictionaries/maps with string keys). Those arrays of results are sorted by the static evaluation described below in the readme, to ensure that "better" results are at the beginning of the arrays (such that we don't need to traverse the entire arrays to discover them!)
  • The children/subtrees of each node are also ordered by their minimum result (rather than by the next character in the string key). This makes querying the suffix tree for a word slower-ish (since finding each next child is O(alphabet) but the alphabet is very small), but it enables treating each subtree as a purely functional priority queue: This improves the enumeration of matching results in the static ordering. Finding the top N results for a query requires to compute the (on demand) intersection of the array leaves for each word in the query (and there can be many many leaves if a query word is short, so ordering each subtree makes the intersection algorithm much faster as we can avoid exploring subtrees that have the worst candidates)
  • Finally, we use hash-consing to detect identical subtrees (with different parents in the tree) and share their representation to save space on disk (it's about 6x smaller on a library like Base with this optim). In other words, the suffix trees are compressed into a finite state machine / automata (a directed acyclic graph (DAG)). This compression doesn't impact the rest of the code since we can traverse the DAG just as if it was a tree.

It might be interesting to use graphviz on a small ocaml library to show what the indexes ends up looking like! (to make the suffix tree illustration more spicy than banana ^^)

@EmileTrotignon
Copy link
Collaborator

EmileTrotignon commented Feb 19, 2024

I like the contribution recommendation, I think they are going to be useful in the future, but they should be in a CONTRIBUTING.md file (that can be referenced in the readme).

The recommendations are general, but I would like to expand on with more project-specific things once it is merged.

Maybe something similar for the suffix-tree explanation would make @art-w happy ? Maybe a DESIGN.md file ? I am not sure about the name nor the place were it should be, but I personally think it is a good thing, especially for explanation that are specific to the project. More general explanation I am not against, but they should be short and link to external resources.

It might be interesting to use graphviz on a small ocaml library to show what the indexes ends up looking like! (to make the suffix tree illustration more spicy than banana ^^)

Since this is a tree, an ascii representation might be enough. This could be also "expect-tested" which would be quite nice in my opinion. I'll open an issue for this.

@EmileTrotignon
Copy link
Collaborator

Regarding the contributor guide, I'm afraid we don't do anything special that requires instructions (and sorry we don't plan to have a dev branch atm).

I want a contributor guide because there might be outreachy interns working on this. Other than that, we do not plan on having a dev branch, this is not common practice in the ocaml world, so I think this part should be removed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants