Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

importer: remove UnixfsNode from the balanced builder #5118

Merged
merged 1 commit into from
Jul 16, 2018
Merged

importer: remove UnixfsNode from the balanced builder #5118

merged 1 commit into from
Jul 16, 2018

Conversation

schomatis
Copy link
Contributor

@schomatis schomatis commented Jun 14, 2018

The builder.go file has been completely rewritten, should be reviewed as an entirely new file.

The dagbuilder.go has mostly additions and can be reviewed as a diff.

Closes #5106.


importer: remove `UnixfsNode` from the balanced builder

The `UnixfsNode` structure has multiple pointers to many (non-complementary)
mutually exclusive node types, only some of them are active (not-`nil`) at a
given time in the code path which made the code too convoluted. Specifically,
the most important distinction between node types was being hidden: leaf nodes
vs internal (non-leaf) nodes.

Remove entirely the use of `UnixfsNode` from the `balanced` package replacing it
in turn with the newly created `FSNodeOverDag` structure that represents the
UnixFS node encoded inside the DAG node, primarily used for internal node
representations. Leaf nodes are handled exclusively in the `NewLeafDataNode`
encapsulating its multiple representations (that we're previously exposed in
`UnixfsNode` as conflicting pointers).

The `builder.go` file has been completely rewritten, although the basic DAG
creation algorithm has been preserved (extending a full DAG by creating a new
root and linking the old one as its child), the most significant modification
has been in the loop of `Layout` that now only handles internal nodes (i.e.,
nodes with `depth` bigger than zero) to be able to adapt `fillNodeRec` to only
that scenario (avoiding the replace logic of the zero `depth` case with the
defective `Set` function, now removed). The `fillNodeRec` now explicitly returns
the `ipld.Node` and the size of the file data it's storing to propagate it
upwards into the DAG.

The `DagBuilderHelper` was heavily extended to incorporate `ipld.Node` functions
that would replace the `UnixfsNode` ones used by the balanced builder:
`NewLeafNode()`, `NewLeafDataNode()` and `AddNodeAndClose()`. Also, the
`ProcessFileStore` function was incorporated to encapsulate all the logic
related to the Filestore support which was scattered throughout the builder
logic, the `offset` that was being passed through most functions is now a part
of the `DagBuilderHelper`.

This has turned out to be a rather big commit, it should have been split into
more smaller and logically cohesive commits, but the `UnixfsNode` was too
entangled inside the logic and that would have required a progressive
modification of the `UnixfsNode` structure as well, which wasn't possible as it
is still being used by the balanced builder (the same reason why most of the
`UnixfsNode`-related functions cannot yet be removed, leaving the `helpers.go`
file mostly untouched).

@schomatis schomatis added the topic/files Topic files label Jun 14, 2018
@schomatis schomatis added this to the Files API Documentation milestone Jun 14, 2018
@schomatis schomatis self-assigned this Jun 14, 2018
@schomatis schomatis requested a review from Kubuxu as a code owner June 14, 2018 13:11
@ghost ghost added the status/in-progress In progress label Jun 14, 2018
@whyrusleeping
Copy link
Member

Do not review.

ACK

@schomatis
Copy link
Contributor Author

@whyrusleeping I mean, take a look if you want, but right now I've basically expanded the UnixfsNode functions in the balanced builder and removed what was not necessary for an internal node, and similarly for the leaf node (internal vs leaf was the basic distinction that UnixfsNode was hiding and hence the code was losing an important dimensionality). The result is a lot of boilerplate code, but now, if this (hopefully) passes the tests it needs to be re-encapsulated in more IPLD-oriented functions.

@schomatis schomatis changed the title [WIP] remove unixfsnode from balanced importer importer: remove UnixfsNode from the balanced builder Jun 19, 2018
@schomatis
Copy link
Contributor Author

@Mr0grog Could you take a look at the comments please to check if they make sense? Especially the commit message, the package doc, the FSNodeOverDag doc, and the fillNodeRec doc.

@schomatis schomatis removed the status/in-progress In progress label Jun 19, 2018
@Mr0grog
Copy link
Contributor

Mr0grog commented Jun 19, 2018

@schomatis I’ll try and take a look later today or tomorrow. I think it’s safe to say you know a lot more about all these parts than I do now, though!

@schomatis
Copy link
Contributor Author

@Mr0grog Thanks, there's no rush, you don't need to check that what it says it's true but if actually some coherent meaning can be grasped from those lines.

Copy link
Contributor

@Mr0grog Mr0grog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK! I didn’t review everything in too much depth, but focused on the particular docs you pointed me to @schomatis.

With code covering a conceptually complicated area like this, I think diagrams can help a lot. They take up a lot of space (and a little time to create), but they are more than worth the effort for others trying to understand the code. (I used a similar technique extensively in a couple geometry libraries I’ve worked on in the past and had lots of folks say it was very helpful.)

// first child), which proceeds to be filled up (link) to more nodes. In
// all cases, the data (chunk) is stored only at the leaves, with the
// rest of (internal) nodes only storing links to their children and
// sizes of the data stored under them.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think diagrams could be enormously helpful here.

The way I read this, it’s saying something slightly different from the old text. I’m assuming functionality hasn’t changed here, though. Here’s what I’m reading:

Old:

until the maximum number of links is reached… Then, a new root is created, and points to the old root, and incorporates a new child, which proceeds to be filled up (link) to more leaves.

Which I think means:

With maximum links = 2

     +-------------+
     |   Root 1    |
     +-------------+
            |
     +------+------+
     |             |
+=========+   +=========+   + - - - - + 
| Chunk 1 |   | Chunk 2 |   | Chunk 3 |
+=========+   +=========+   + - - - - +

                        ↯

                    +-------------+
                    |   Root 3    |
                    +-------------+
                          |       
            +-------------+-------------+
            |                           |
     +-------------+             +-------------+
     |   Root 1    |             |   Root 2    |
     +-------------+             +-------------+
            |                           |       
     +------+------+             +------+ - - -+
     |             |             |             |
+=========+   +=========+   +=========+   + - - - - + 
| Chunk 1 |   | Chunk 2 |   | Chunk 3 |   | Chunk 4 |
+=========+   +=========+   +=========+   + - - - - +

While the new text:

until the maximum number of links is reached… Then, a new root is created to point to the old root (incorporating it as its first child), which proceeds to be filled up (link) to more nodes.

Sounds more like:

With maximum links = 2

     +-------------+
     |   Root 1    |
     +-------------+
            |
     +------+------+
     |             |
+=========+   +=========+   + - - - - + 
| Chunk 1 |   | Chunk 2 |   | Chunk 3 |
+=========+   +=========+   + - - - - +

                               ↯

                                  +-------------+
                                  |   Root 4    |
                                  +-------------+
                                        |       
                          +-------------+ - - - -- - -+
                          |                           |
                    +-------------+            +=============+
                    |   Root 3    |            |   Chunk 4   |
                    +-------------+            +=============+
                          |       
            +-------------+-------------+
            |                           |
     +-------------+             +=============+
     |   Root 1    |             |   Chunk 3   |
     +-------------+             +=============+
            |            
     +------+------+     
     |             |     
+=========+   +=========+
| Chunk 1 |   | Chunk 2 |
+=========+   +=========+

I think the first interpretation is correct, so how about something like:

In a balanced DAG, nodes representing chunks of data are added to a single root until the maximum number of links is reached. Then, two new nodes are created: a second node like the first (to which we add new nodes representing chunks of data until it reaches the maximum number of links), and a new root that links to it and to the original root. In all cases, the data (chunk nodes) is stored only at the leaves, with the rest of the nodes (internal nodes) only storing links to their children and the size of the data stored under them.

[diagram]

That naturally leads to me on one more big question, though, which is: what happens when the maximum number of links in a root higher than the first level is reached? i.e., we add a 5th chunk to the diagram above?

Do we mirror the structure so the whole graph has the same depth:

                                               +-------------+
                                               |   Root 6    |
                                               +-------------+
                                                     |
                          +--------------------------+----------------------------+
                          |                                                       | 
                    +-------------+                                         +-------------+
                    |   Root 3    |                                         |   Root 5    |
                    +-------------+                                         +-------------+
                          |                                                       |        
            +-------------+-------------+                           +-------------+ - - - -
            |                           |                           |              
     +-------------+             +-------------+             +-------------+       
     |   Root 1    |             |   Root 2    |             |   Root 4    |       
     +-------------+             +-------------+             +-------------+       
            |                           |                           |              
     +------+------+             +------+------+             +------+ - - - 
     |             |             |             |             |              
+=========+   +=========+   +=========+   +=========+   +=========+  
| Chunk 1 |   | Chunk 2 |   | Chunk 3 |   | Chunk 4 |   | Chunk 5 |  
+=========+   +=========+   +=========+   +=========+   +=========+  


Or do we only make it as deep as it needs to be for the data we have:

                                          +-------------+
                                          |   Root 5    |
                                          +-------------+
                                                |
                          +---------------------+--------------------+
                          |                                          |
                    +-------------+                           +-------------+
                    |   Root 3    |                           |   Root 4    |
                    +-------------+                           +-------------+
                          |                                          |       
                          |                                   +------+ - - -
                          |                                   |             
                          |                              +=========+  
                          |                              | Chunk 5 |  
                          |                              +=========+  
                          |            
            +-------------+-------------+
            |                           |
     +-------------+             +-------------+
     |   Root 1    |             |   Root 2    |
     +-------------+             +-------------+
            |                           |       
     +------+------+             +------+------+
     |             |             |             |
+=========+   +=========+   +=========+   +=========+ 
| Chunk 1 |   | Chunk 2 |   | Chunk 3 |   | Chunk 4 |
+=========+   +=========+   +=========+   +=========+

Are leaves always at the same depth, or only as deep as they need to be? (And, in the second diagram above, when we add the 7th chunk, do we create a new root at the top of the whole graph, or do we insert a parent between root 4 and root 5?)

Copy link
Contributor Author

@schomatis schomatis Jun 21, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we mirror the structure so the whole graph has the same depth

Great question! You mirror what you already had, and that term definitely needs to go in the description (yes, one nice graph can clarify all of that), the balanced adjective already implies it (the 1st graph is balanced and the 2nd isn't) but may not be clear enough. The problem with the old explanation:

Then, a new root is created, and points to the old root, and incorporates a new child, which proceeds to be filled up (link) to more leaves.

Is that it is correct for the first iteration (depth of 1), but saying that a root's child (or any child for that matter) will be filled with leaves sound (to me) more like:

                                  +-------------+
                                  |   Root 4    |
                                  +-------------+
                                        |       
                          +-------------+ - - - -- - -+
                          |                           |
                    +-------------+            +=============+
                    |   Root 3    |            |   Chunk 4   |
                    +-------------+            +=============+
                          |       
            +-------------+-------------+
            |                           |
     +-------------+             +=============+
     |   Root 1    |             |   Chunk 3   |
     +-------------+             +=============+
            |            
     +------+------+     
     |             |     
+=========+   +=========+
| Chunk 1 |   | Chunk 2 |
+=========+   +=========+

because in deeper depths a root's grandchild won't be a chunk. This graph is actually the one you associated with the new explanation that talks about nodes (generic, not internal or leaves, because that depends on the depth), I'm interested in why you reached that conclusion. But the new explanation may also be too plain, let me iterate a bit on your version and ping you back. I think the mirror term is the key (and the graph, no question), that's why I talk about a B-tree in the package doc, but that may not be the most appropriate data structure.

Copy link
Contributor

@Mr0grog Mr0grog Jun 22, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This graph is actually the one you associated with the new explanation that talks about nodes… I'm interested in why you reached that conclusion.

Ah, sorry if I didn’t make that clear. Here’s the sticky bit:

Then, a new root is created to point to the old root (incorporating it as its first child), which proceeds to be filled up

So according to that sentence, the new root is the the thing that “gets filled up (link) to more nodes.” That’s technically true, but since it doesn’t say anything about creating intermediary parents/roots/internals/whatever, it sounds like we are filling the rest of the new root with leaf nodes.

(The old explanation was clearer for me on this point because it says “incorporates a new child, which proceeds to be filled up.” It’s explicit about it being the new child that is filled.)

I think the mirror term is the key

I do think “mirror” is useful, BUT I also wrote all of this comment before getting to the bottom where you wrote “Balanced DAGs are generalistic DAGs in which all leaves are at the same distance from the root,” which totally resolved the question for me. Just moving that from the bottom to be directly below this introductory paragraph (or below a diagram which is below the intro paragraph) would also be plenty clear, I think.

Copy link
Contributor Author

@schomatis schomatis Jun 22, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


                                               +-------------+
                                               |   Root 6    |
                                               +-------------+
                                                     |
                          +--------------------------+----------------------------+
                          |                                                       | 
                    +-------------+                                         +-------------+
                    |   Root 3    |                                         |   Root 5    |
                    +-------------+                                         +-------------+
                          |                                                       |        
            +-------------+-------------+                           +-------------+ - - - -
            |                           |                           |              
     +-------------+             +-------------+             +-------------+       
     |   Root 1    |             |   Root 2    |             |   Root 4    |       
     +-------------+             +-------------+             +-------------+       
            |                           |                           |              
     +------+------+             +------+------+             +------+ - - - 
     |             |             |             |             |              
+=========+   +=========+   +=========+   +=========+   +=========+  
| Chunk 1 |   | Chunk 2 |   | Chunk 3 |   | Chunk 4 |   | Chunk 5 |  
+=========+   +=========+   +=========+   +=========+   +=========+  

I'm just realizing that most of the graphs have too many Root nodes, if by that we're meaning a node that was at some point a root node created by Layout (and has since become an internal node when the DAG grew in depth). Root 1 is correct but Root 3 is actually Root 2 (if we're numbering in chronological order), the one drawn as 2 was never actually a root node, just an internal node generated by the fillNodeRec function at the time that (what you're drawing as) Root 3 was the root node and Root 1 was just an internal node (fillNodeRec was mirroring that internal node that used to be a root). Is that clear? Maybe we could add a more descriptive legend, e.g.,

                    +-------------+              
                    |   Root 2    |              
                    +-------------+              
                          |                      
            +-------------+---------+            
            |                       |            
     +---------------+        +--------------+   
     |   Internal 1  |        |   Internal 3 |   
     |   (ex Root 1) |        |              |   
     +---------------+        +--------------+   

(BTW, have you been doing all these graphs manually? I just did a partial redrew of one of yours and became exhausted, we definitely need a tool for this.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

most of the graphs have too many Root nodes, if by that we're meaning a node that was at some point a root node created by Layout… the one drawn as 2 was never actually a root node

Ah, I didn’t mean anything that specific by root; only that it is the root of some sub-graph. I was trying not to change the name of a node between two diagrams that represent the graphs at different points in time and also trying not to introduce too much complexity. If you found that confusing, though, maybe it was a bad approach.

Maybe just “node?” (Still keep the leaves as “chunk,” though.) Not really sure.

     +-------------+
     |   Node 1    |
     +-------------+
            |
     +------+------+
     |             |
+=========+   +=========+   + - - - - + 
| Chunk 1 |   | Chunk 2 |   | Chunk 3 |
+=========+   +=========+   + - - - - +

                        ↯

                    +------------+
                    |   Node 2   |
                    +------------+
                          |       
            +-------------+-------------+
            |                           |
     +-------------+             +-------------+
     |    Node 1   |             |    Node 3   |
     +-------------+             +-------------+
            |                           |       
     +------+------+             +------+ - - -+
     |             |             |             |
+=========+   +=========+   +=========+   + - - - - + 
| Chunk 1 |   | Chunk 2 |   | Chunk 3 |   | Chunk 4 |
+=========+   +=========+   +=========+   + - - - - +

Root 1 is correct but Root 3 is actually Root 2 (if we're numbering in chronological order)

Yeah, after reading the code, I realized this was the case. I was sort of numbering them chronologically, but it wasn’t clear from the description which was the right order (I don’t think the description actually needs to make it clear — it’s not important to a consumer of the package and would add unnecessary complex detail to a package-level explanation). It would be a good idea to renumber the chronologically anyway, though :)

have you been doing all these graphs manually?

Yes, I have probably drawn too many weird ASCII diagrams at this point in my life :P

For diagrams like this, I’m not sure they’re so complex that finding and making a tool do what you want is really all that worthwhile (any more complex and it is, though). I will say that an editor with rectangular selections (TextMate, VS Code, etc.) is essential, though. You have to be able to pick up chunks of the graph and move them around, otherwise manipulating these graphs is insanely painful.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, sorry if I didn’t make that clear. Here’s the sticky bit:

Then, a new root is created to point to the old root (incorporating it as its first child), which proceeds to be filled up

So according to that sentence, the new root is the the thing that “gets filled up (link) to more nodes.” That’s technically true, but since it doesn’t say anything about creating intermediary parents/roots/internals/whatever, it sounds like we are filling the rest of the new root with leaf nodes.

(The old explanation was clearer for me on this point because it says “incorporates a new child, which proceeds to be filled up.” It’s explicit about it being the new child that is filled.)

Yes, I see it now, thanks. I think I will also reserve the term fill for the parent of the leaf nodes to make that extra clear (although any node can be filled, but just for the sake of the explanation).

Just moving that from the bottom to be directly below this introductory paragraph (or below a diagram which is below the intro paragraph) would also be plenty clear, I think.

Ok. I think the mirror term can be especially useful in the Layout explanation when we're talking about how DAGs are built, and leave the balanced term for the presentation of the full DAG as you said.

Copy link
Contributor Author

@schomatis schomatis Jun 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be a good idea to renumber the chronologically anyway, though :)

Yes, let's use correct numbering without explaining them (at least in the intro, maybe explain them in the Layout function, which might be read more by an interested developer than by the consumer of the package).

For diagrams like this, I’m not sure they’re so complex that finding and making a tool do what you want is really all that worthwhile (any more complex and it is, though). I will say that an editor with rectangular selections (TextMate, VS Code, etc.) is essential, though. You have to be able to pick up chunks of the graph and move them around, otherwise manipulating these graphs is insanely painful.

Yes, I use the Sublime editor for that. My worry is not complexity but time consumption, as of now I haven't seen any diagrams like these in the code base, and if we want to incentivize their incorporation we need to make it damn easy for the developer to do it, otherwise it won't be done; and we need to find a tool for that, so the developer can focus exclusively (or as much as possible) in what is being done (the layout of the DAG) instead how it will be done (the editing part, not everyone is an ascii ninja like you :)).

// raw nodes). In the case the entire file fits into just one node it
// will be formatted as a (single) leaf node (without parent) with the
// possible representations already mentioned, this is the only scenario
// where the root can be of a type different that the UnixFS node.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’m not sure what “Filestore nodes (supported by raw nodes)” means.

In “possible representations already mentioned, this is the only scenario”, replace the comma with a period or put the “this is the only scenario…” part in parentheses. That helps keep each sentence to one clear idea and makes them much more readable.

It also might be helpful to say “(see the go-ipfs/unixfs package for details of UnixFS.)” at the end. Or something like that to point to more details on UnixFS.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’m not sure what “Filestore nodes (supported by raw nodes)” means.

I left a vague description on purpose because I don't know exactly how Filestore operates, just that it's stored on raw nodes, support in communication protocols terminology is which type of lower layer packet carries your data, but I can just say that it's a format stored on a raw node.

In “possible representations already mentioned, this is the only scenario”, replace the comma with a period or put the “this is the only scenario…” part in parentheses. That helps keep each sentence to one clear idea and makes them much more readable.

It also might be helpful to say “(see the go-ipfs/unixfs package for details of UnixFS.)” at the end. Or something like that to point to more details on UnixFS.

👍

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a vague description on purpose because I don't know exactly how Filestore operates

Ha! That’s reasonable, but can you point to a package where the reader can dig more into it (like I suggested with UnixFS)?

support in communication protocols terminology is which type of lower layer packet carries your data, but I can just say that it's a format stored on a raw node.

This part matches what I thought, but I think your more plain language version (“format stored on a raw node”) is better. 👍

//
// Balanced DAGs are generalistic DAGs in which all leaves
// are at the same distance from the root.
// The nodes are filled by recursion (in a DFS way) so the DAG is built
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’m not sure what DFS is (guessing “depth-first search,” but also this isn’t really a seach). Maybe spell this one out? (“DAG” is ok because it’s not ambiguous in this context and we’re repeating it a lot.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, on second thought, maybe you don’t need to say “(in a DFS way)” at all here. It doesn’t seem like it’s saying anything more useful than what “so the DAG is build from the bottom up” already makes clear.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a manner similar to a depth-first search, I think that concept can be much more familiar to the developer than the DAG term (even if we repeated a lot, it's not that common compared to a basic tree search). But yes, the "bottom up" term might be more clear here, I'll put the depth-first search at the end of the sentence to not distract the reader from the main point.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that concept can be much more familiar to the developer than the DAG term

Totally! Sorry, I meant that I don’t think the abbreviation is as clear. I’d heard of “depth-first [search/traversal/construction/whatever]” long before I’d ever learned about “directed acyclic graphs,” but I’ve never seen “DFS” used to refer to it (at least I don’t recall it).

// root). In this way the DAG acts like a B-tree (a more appropriate data
// structure may be in order for this analogy), with each internal node
// using the file size of its children as an index to perform the search
// (seek) operation.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about this rewording:

The nodes are filled recursively, so the DAG is built from the bottom up. Leaf nodes are created first using the chunked file data and its size. The size is then bubbled up to the parent (internal) node, which aggregates all the sizes of its children and bubbles that combined size up to its parent, and so on up to the root. This way, a balanced DAG acts like a B-tree when seeking to a byte offset in the file the graph represents: each internal node uses the file size of its children as an index when seeking.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perfect.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then yes, I better lose the DFS term, because that will clash with the explanation of the actual search, which is not of a DFS type, I'll just put an emphasis on the recursive aspect of the construction (that doesn't belong to the search).

// (seek) operation.
//
// Balanced DAGs are generalistic DAGs in which all leaves are at the
// same distance from the root.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah! I guess this answers my question from the top :)

Maybe a good idea to move this paragraph up there.

return node, nil
}

// FSNodeOverDag encapsulates an `ft.FSNode` that will be stored in a
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might just be my general newness to Go, but I find the ft here confusing (I know I can look up the fact that it’s referring to the unixfs package at the top, but I keep having to do that when I’m new to files AND none of that shows up in the Godoc). What if we said:

FSNodeOverDag encapsulates a UnixFS fs.FSNode

? Is that any better, or is it just confusing in a different way?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is definetely not you (see #5056).

? Is that any better, or is it just confusing in a different way?

I'd rather rename the package to its full name and have unixfs.FSNode (and yes, the FS in FSNode is also a confusing redundancy, that's what namespaces are for).

// internal `FSNode` in the process of creating a UnixFS DAG, this
// structure stores an `FSNode` cache to manipulate it (add child nodes)
// directly , and only when the node has reached its final (immutable)
// state it is committed to a single (indivisible) `ipld.Node`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

“state it is committed” → “state is it committed”

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might add that moving to its final state is done by calling Commit:

…and only when the node has reached its final state (signaled by calling Commit()) is it committed…

Otherwise it sounds like that happens automatically as a result of doing some other routine action.

// representations of data leaf nodes (that don't use raw nodes or
// Filestore).
//
// It aims at replacing the `UnixfsNode` structure which encapsulated too
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

“at replacing” → “to replace”

}

// SetFileData stores the `fileData` in the `ft.FSNode`. It
// is used only when `FSNodeOverDag` represents a leaf node
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

“is used” → “should be used” ?

This code doesn’t control how it’s called, so this should be a recommendation to the reader about how to use it, not an assertion about what some other part of the codebase might or might not do.

// that represents them: the `ft.FSNode` is encoded inside the
// `dag.ProtoNode`.
//
// TODO: Evaluate making it read-only after committing.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this idea (or having a dirty flag) :)

@schomatis
Copy link
Contributor Author

@Mr0grog 😍

@schomatis
Copy link
Contributor Author

With code covering a conceptually complicated area like this, I think diagrams can help a lot. They take up a lot of space (and a little time to create), but they are more than worth the effort for others trying to understand the code. (I used a similar technique extensively in a couple geometry libraries I’ve worked on in the past and had lots of folks say it was very helpful.)

I agree 100% on this, that's why I asked about tools to programmatically construct graphs like the ones you did (ipfs-inactive/docs#78), graphmd didn't seem to be suited for this, I'll take a closer look at Graphviz to see what kind of text graphs it can generate.

@ghost ghost added the status/in-progress In progress label Jun 25, 2018
@schomatis
Copy link
Contributor Author

schomatis commented Jun 25, 2018

@Mr0grog Could you take a look at the documentation fixes in the last commit?

I moved most of it to the Layout and fillNodeRec functions, leaving only a basic description of the balanced DAG in the package doc. Thinking about what you said, the consumer of the package will only be interested in what layout its data will be organized in, not how that layout is built, that should be left for someone reading the actual functions in charge of that.

@schomatis
Copy link
Contributor Author

I've made several changes to the comments of fillNodeRec you gave me, they weren't totally accurate with what the functions does, but my changes didn't end up being very convincing either, so take a closer look there please.

@schomatis
Copy link
Contributor Author

@Stebalien @whyrusleeping While we give the final touches to the documentation, could you take a look at the code please? Especially the added FSNodeOverDag and ProcessFileStore.

Copy link
Member

@whyrusleeping whyrusleeping left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks pretty clean to me, great job!

}

// Convert this leaf to a `FilestoreNode` if needed.
node = db.ProcessFileStore(node, dataSize)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the downside of doing things this way is that we hold the node in memory for longer than we really need to (technically just need to hash the data, could be done entirely on stack if go let us)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not following, this way refers to this PR or in general? From what I understand, we normally hold work-in-progress nodes to avoid constantly (un)packing while working on them, reducing computation time at the cost of more memory, but I'm not sure if you're referring to that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, i'm just referring to the fact here that this 'node' is a full-on dag RawNode that holds a reference to the 256k buffer where it technically doesnt need to (since its just a raw data buffer under the hood).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, something that I'm seeing throughout the entire code is this mixture of node structures that are used interchangeably for (i) reading/writing and for (ii) building/cache. Ideally, if you are retrieving content (or storing what you received from another peer) the first type is just fine, but if you are building stuff yourself (and that is the very common operation of just adding content with ipfs add) then you need an entirely different object, that yes, will look like the first one (and the first one will even be its end product) but actually will behave very differently: it will cache many things while in the process of modifying them, it will have information in a mutable state, it will allow (what the first type would see as) invalid states.

Ideally again, I would expect to have different structures (with a common base or shared methods) to reflect that. And going back to the original raw node comment, yes, someone reading (or writing) the node needs the buffer, but someone building it may only need partial information like the hash.

Sorry, I'm not sure if you're original remark was going in this direction, but it made me think of this situation that I was encountering repeatedly while working on this PR.

@schomatis
Copy link
Contributor Author

@Stebalien PTAL

1 similar comment
@schomatis
Copy link
Contributor Author

@Stebalien PTAL

// are at the same distance from the root.
// Package balanced provides methods to build balanced DAGs, which are generalistic
// DAGs in which all leaves (nodes representing chunks of data) are at the same
// distance from the root. Nodes can have only a maximum number of children, to be
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really minor, but the comma after children should be a period or a semicolon.

// | | | | |
// +=========+ +=========+ +=========+ +=========+ +=========+
// | Chunk 1 | | Chunk 2 | | Chunk 3 | | Chunk 4 | | Chunk 5 |
// +=========+ +=========+ +=========+ +=========+ +=========+
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you indent this whole block, it renders much nicer in godoc. Current:

screen shot 2018-07-06 at 12 29 34 pm

Indented:

screen shot 2018-07-06 at 12 29 19 pm

// `fillNodeRec` will return and the loop will be repeated: the `newRoot` created
// will become the old `root` and a new root will be created again to increase the
// depth of the DAG. The process is repeated until there is no more data to add
// (`Done()`).
Copy link
Contributor

@Mr0grog Mr0grog Jul 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This bit about Done() doesn’t seem totally clear. Maybe:

…no more data to add (i.e. the DagBuilderHelper’s Done() function returns true).

// and depth only increases when the tree is full, that is, when
// the root node has reached the maximum number of links.
// Layout builds a balanced DAG layout. In a balanced DAG of depth 1, leaf nodes
// with data are added to a single `root` until the maximum number of links is
Copy link
Contributor

@Mr0grog Mr0grog Jul 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at this, especially in the context of the generated godoc, I wonder if most of this should move to a comment inside the the function so it doesn’t read as public documentation (it’s definitely still helpful clarification for someone working on the balanced builder here). Otherwise, it reads as a lot of detail about how this function works that isn’t necessary (and might be confusing) for a user of the function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Totally true, but I can't find any other place in the function for this information, Layout actually has only a few code lines, none of which could be easily associated with the two introductory paragraphs, the "magic" is happening in the recursion call, which is hiding a lot behind the scenes (something I'm not really a fan of). Anyway, I will have this advice in mind when writing documentation (that was actually my motivation when moving much of the package description inside the individual functions).

//
// It returns the `ipld.Node` representation of the passed `node` filled with
// children and the `nodeFileSize` with the total size of the file chunk (leaf)
// nodes stored under this node (to propagate it upwards in the DAG to its parent).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of:

(to propagate it upwards in the DAG to its parent)

I might say:

(parent nodes store this to enable efficient seeking through the DAG when reading data later)

I think it’s self-evident that the data is being propagated upward, and this describes a little more clearly why we want that information.

@schomatis
Copy link
Contributor Author

Thanks @Mr0grog for the very thorough review and invaluable diagrams!! This is already shaping up to be a nice reading reference for a new user who wants to get a deeper understanding of how files are handled in IPFS.

@schomatis schomatis added topic/UnixFS Topic UnixFS RFM and removed status/in-progress In progress labels Jul 8, 2018
@schomatis
Copy link
Contributor Author

schomatis commented Jul 8, 2018

/cc @whyrusleeping RFM

(edit: strike that, I've just seen that 0.4.16 still hasn't been released, let's leave this PR for the next one.)

@whyrusleeping
Copy link
Member

In need of a rebase now that other PRs have landed

@ghost ghost added the status/in-progress In progress label Jul 16, 2018
@schomatis
Copy link
Contributor Author

Fixed, waiting on the tests..

@schomatis schomatis removed the status/in-progress In progress label Jul 16, 2018
@ghost ghost added the status/in-progress In progress label Jul 16, 2018
@schomatis
Copy link
Contributor Author

Broke the build.

@whyrusleeping
Copy link
Member

@schomatis whoops. You got a fix or should I handle?

@schomatis
Copy link
Contributor Author

@whyrusleeping Fixed in #5239, should be merged first.

The `UnixfsNode` structure has multiple pointers to many (non-complementary)
mutually exclusive node types, only some of them are active (not-`nil`) at a
given time in the code path which made the code too convoluted. Specifically,
the most important distinction between node types was being hidden: leaf nodes
vs internal (non-leaf) nodes.

Remove entirely the use of `UnixfsNode` from the `balanced` package replacing it
in turn with the newly created `FSNodeOverDag` structure that represents the
UnixFS node encoded inside the DAG node, primarily used for internal node
representations. Leaf nodes are handled exclusively in the `NewLeafDataNode`
encapsulating its multiple representations (that we're previously exposed in
`UnixfsNode` as conflicting pointers).

The `builder.go` file has been completely rewritten, although the basic DAG
creation algorithm has been preserved (extending a full DAG by creating a new
root and linking the old one as its child), the most significant modification
has been in the loop of `Layout` that now only handles internal nodes (i.e.,
nodes with `depth` bigger than zero) to be able to adapt `fillNodeRec` to only
that scenario (avoiding the replace logic of the zero `depth` case with the
defective `Set` function, now removed). The `fillNodeRec` now explicitly returns
the `ipld.Node` and the size of the file data it's storing to propagate it
upwards into the DAG.

The `DagBuilderHelper` was heavily extended to incorporate `ipld.Node` functions
that would replace the `UnixfsNode` ones used by the balanced builder:
`NewLeafNode()`, `NewLeafDataNode()` and `AddNodeAndClose()`. Also, the
`ProcessFileStore` function was incorporated to encapsulate all the logic
related to the Filestore support which was scattered throughout the builder
logic, the `offset` that was being passed through most functions is now a part
of the `DagBuilderHelper`.

This has turned out to be a rather big commit, it should have been split into
more smaller and logically cohesive commits, but the `UnixfsNode` was too
entangled inside the logic and that would have required a progressive
modification of the `UnixfsNode` structure as well, which wasn't possible as it
is still being used by the balanced builder (the same reason why most of the
`UnixfsNode`-related functions cannot yet be removed, leaving the `helpers.go`
file mostly untouched).

License: MIT
Signed-off-by: Lucas Molas <[email protected]>
@schomatis
Copy link
Contributor Author

@whyrusleeping RFM now

@whyrusleeping whyrusleeping merged commit 0ab36f0 into ipfs:master Jul 16, 2018
@ghost ghost removed the status/in-progress In progress label Jul 16, 2018
@whyrusleeping
Copy link
Member

@schomatis Woo! Thanks for all of these PRs. Making things easier to read and understand is so great

@schomatis schomatis deleted the feat/importer/remove-unixfsnode branch July 16, 2018 16:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
RFM topic/files Topic files topic/UnixFS Topic UnixFS
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants