Updates to README.md #33

jackrea07 · 2024-02-15T03:31:57Z

Included specific example of suffix tree and added reasons why they were chosen for this specific problem.
Added general reminders for contributors about how to safely edit code in repository and make pull requests.

art-w · 2024-02-15T11:06:30Z

Thanks! As a general rule, I would prefer if the readme only contained informations that are specific to sherlodoc rather than general knowledge that can be acquired from the internet :) (otherwise it becomes harder to know where to stop in what needs to be explained)

Regarding the contributor guide, I'm afraid we don't do anything special that requires instructions (and sorry we don't plan to have a dev branch atm).

However I agree that our use of suffix trees could be expanded upon since we do stuff that are different from the wikipedia description:

You are right that we are using a "compressed" suffix tree, such that there are no nodes in the tree with a single child, which makes the suffix tree size and construction linear in the size of the input (your ascii art appears slightly broken in the github preview)
The leaves of our suffix trees contain an array of matching results (rather than a $string_loc... our suffix trees are dictionaries/maps with string keys). Those arrays of results are sorted by the static evaluation described below in the readme, to ensure that "better" results are at the beginning of the arrays (such that we don't need to traverse the entire arrays to discover them!)
The children/subtrees of each node are also ordered by their minimum result (rather than by the next character in the string key). This makes querying the suffix tree for a word slower-ish (since finding each next child is O(alphabet) but the alphabet is very small), but it enables treating each subtree as a purely functional priority queue: This improves the enumeration of matching results in the static ordering. Finding the top N results for a query requires to compute the (on demand) intersection of the array leaves for each word in the query (and there can be many many leaves if a query word is short, so ordering each subtree makes the intersection algorithm much faster as we can avoid exploring subtrees that have the worst candidates)
Finally, we use hash-consing to detect identical subtrees (with different parents in the tree) and share their representation to save space on disk (it's about 6x smaller on a library like Base with this optim). In other words, the suffix trees are compressed into a finite state machine / automata (a directed acyclic graph (DAG)). This compression doesn't impact the rest of the code since we can traverse the DAG just as if it was a tree.

It might be interesting to use graphviz on a small ocaml library to show what the indexes ends up looking like! (to make the suffix tree illustration more spicy than banana ^^)

EmileTrotignon · 2024-02-19T13:36:50Z

I like the contribution recommendation, I think they are going to be useful in the future, but they should be in a CONTRIBUTING.md file (that can be referenced in the readme).

The recommendations are general, but I would like to expand on with more project-specific things once it is merged.

Maybe something similar for the suffix-tree explanation would make @art-w happy ? Maybe a DESIGN.md file ? I am not sure about the name nor the place were it should be, but I personally think it is a good thing, especially for explanation that are specific to the project. More general explanation I am not against, but they should be short and link to external resources.

It might be interesting to use graphviz on a small ocaml library to show what the indexes ends up looking like! (to make the suffix tree illustration more spicy than banana ^^)

Since this is a tree, an ascii representation might be enough. This could be also "expect-tested" which would be quite nice in my opinion. I'll open an issue for this.

EmileTrotignon · 2024-02-19T13:52:17Z

Regarding the contributor guide, I'm afraid we don't do anything special that requires instructions (and sorry we don't plan to have a dev branch atm).

I want a contributor guide because there might be outreachy interns working on this. Other than that, we do not plan on having a dev branch, this is not common practice in the ocaml world, so I think this part should be removed.

jackrea07 added 2 commits February 14, 2024 22:19

README updates

400fa39

fixed white spaces

9d5fe38

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Updates to README.md #33

Updates to README.md #33

jackrea07 commented Feb 15, 2024

art-w commented Feb 15, 2024

EmileTrotignon commented Feb 19, 2024 •

edited

Loading

EmileTrotignon commented Feb 19, 2024

Updates to README.md #33

Are you sure you want to change the base?

Updates to README.md #33

Conversation

jackrea07 commented Feb 15, 2024

art-w commented Feb 15, 2024

EmileTrotignon commented Feb 19, 2024 • edited Loading

EmileTrotignon commented Feb 19, 2024

EmileTrotignon commented Feb 19, 2024 •

edited

Loading