-
-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
importer: remove UnixfsNode
from the balanced builder
#5118
importer: remove UnixfsNode
from the balanced builder
#5118
Conversation
ACK |
@whyrusleeping I mean, take a look if you want, but right now I've basically expanded the |
UnixfsNode
from the balanced builder
@Mr0grog Could you take a look at the comments please to check if they make sense? Especially the commit message, the package doc, the |
@schomatis I’ll try and take a look later today or tomorrow. I think it’s safe to say you know a lot more about all these parts than I do now, though! |
@Mr0grog Thanks, there's no rush, you don't need to check that what it says it's true but if actually some coherent meaning can be grasped from those lines. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK! I didn’t review everything in too much depth, but focused on the particular docs you pointed me to @schomatis.
With code covering a conceptually complicated area like this, I think diagrams can help a lot. They take up a lot of space (and a little time to create), but they are more than worth the effort for others trying to understand the code. (I used a similar technique extensively in a couple geometry libraries I’ve worked on in the past and had lots of folks say it was very helpful.)
importer/balanced/builder.go
Outdated
// first child), which proceeds to be filled up (link) to more nodes. In | ||
// all cases, the data (chunk) is stored only at the leaves, with the | ||
// rest of (internal) nodes only storing links to their children and | ||
// sizes of the data stored under them. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think diagrams could be enormously helpful here.
The way I read this, it’s saying something slightly different from the old text. I’m assuming functionality hasn’t changed here, though. Here’s what I’m reading:
Old:
until the maximum number of links is reached… Then, a new root is created, and points to the old root, and incorporates a new child, which proceeds to be filled up (link) to more leaves.
Which I think means:
With maximum links = 2
+-------------+
| Root 1 |
+-------------+
|
+------+------+
| |
+=========+ +=========+ + - - - - +
| Chunk 1 | | Chunk 2 | | Chunk 3 |
+=========+ +=========+ + - - - - +
↯
+-------------+
| Root 3 |
+-------------+
|
+-------------+-------------+
| |
+-------------+ +-------------+
| Root 1 | | Root 2 |
+-------------+ +-------------+
| |
+------+------+ +------+ - - -+
| | | |
+=========+ +=========+ +=========+ + - - - - +
| Chunk 1 | | Chunk 2 | | Chunk 3 | | Chunk 4 |
+=========+ +=========+ +=========+ + - - - - +
While the new text:
until the maximum number of links is reached… Then, a new root is created to point to the old root (incorporating it as its first child), which proceeds to be filled up (link) to more nodes.
Sounds more like:
With maximum links = 2
+-------------+
| Root 1 |
+-------------+
|
+------+------+
| |
+=========+ +=========+ + - - - - +
| Chunk 1 | | Chunk 2 | | Chunk 3 |
+=========+ +=========+ + - - - - +
↯
+-------------+
| Root 4 |
+-------------+
|
+-------------+ - - - -- - -+
| |
+-------------+ +=============+
| Root 3 | | Chunk 4 |
+-------------+ +=============+
|
+-------------+-------------+
| |
+-------------+ +=============+
| Root 1 | | Chunk 3 |
+-------------+ +=============+
|
+------+------+
| |
+=========+ +=========+
| Chunk 1 | | Chunk 2 |
+=========+ +=========+
I think the first interpretation is correct, so how about something like:
In a balanced DAG, nodes representing chunks of data are added to a single root until the maximum number of links is reached. Then, two new nodes are created: a second node like the first (to which we add new nodes representing chunks of data until it reaches the maximum number of links), and a new root that links to it and to the original root. In all cases, the data (chunk nodes) is stored only at the leaves, with the rest of the nodes (internal nodes) only storing links to their children and the size of the data stored under them.
[diagram]
That naturally leads to me on one more big question, though, which is: what happens when the maximum number of links in a root higher than the first level is reached? i.e., we add a 5th chunk to the diagram above?
Do we mirror the structure so the whole graph has the same depth:
+-------------+
| Root 6 |
+-------------+
|
+--------------------------+----------------------------+
| |
+-------------+ +-------------+
| Root 3 | | Root 5 |
+-------------+ +-------------+
| |
+-------------+-------------+ +-------------+ - - - -
| | |
+-------------+ +-------------+ +-------------+
| Root 1 | | Root 2 | | Root 4 |
+-------------+ +-------------+ +-------------+
| | |
+------+------+ +------+------+ +------+ - - -
| | | | |
+=========+ +=========+ +=========+ +=========+ +=========+
| Chunk 1 | | Chunk 2 | | Chunk 3 | | Chunk 4 | | Chunk 5 |
+=========+ +=========+ +=========+ +=========+ +=========+
Or do we only make it as deep as it needs to be for the data we have:
+-------------+
| Root 5 |
+-------------+
|
+---------------------+--------------------+
| |
+-------------+ +-------------+
| Root 3 | | Root 4 |
+-------------+ +-------------+
| |
| +------+ - - -
| |
| +=========+
| | Chunk 5 |
| +=========+
|
+-------------+-------------+
| |
+-------------+ +-------------+
| Root 1 | | Root 2 |
+-------------+ +-------------+
| |
+------+------+ +------+------+
| | | |
+=========+ +=========+ +=========+ +=========+
| Chunk 1 | | Chunk 2 | | Chunk 3 | | Chunk 4 |
+=========+ +=========+ +=========+ +=========+
Are leaves always at the same depth, or only as deep as they need to be? (And, in the second diagram above, when we add the 7th chunk, do we create a new root at the top of the whole graph, or do we insert a parent between root 4 and root 5?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we mirror the structure so the whole graph has the same depth
Great question! You mirror what you already had, and that term definitely needs to go in the description (yes, one nice graph can clarify all of that), the balanced adjective already implies it (the 1st graph is balanced and the 2nd isn't) but may not be clear enough. The problem with the old explanation:
Then, a new root is created, and points to the old root, and incorporates a new child, which proceeds to be filled up (link) to more leaves.
Is that it is correct for the first iteration (depth of 1), but saying that a root's child (or any child for that matter) will be filled with leaves sound (to me) more like:
+-------------+
| Root 4 |
+-------------+
|
+-------------+ - - - -- - -+
| |
+-------------+ +=============+
| Root 3 | | Chunk 4 |
+-------------+ +=============+
|
+-------------+-------------+
| |
+-------------+ +=============+
| Root 1 | | Chunk 3 |
+-------------+ +=============+
|
+------+------+
| |
+=========+ +=========+
| Chunk 1 | | Chunk 2 |
+=========+ +=========+
because in deeper depths a root's grandchild won't be a chunk. This graph is actually the one you associated with the new explanation that talks about nodes (generic, not internal or leaves, because that depends on the depth), I'm interested in why you reached that conclusion. But the new explanation may also be too plain, let me iterate a bit on your version and ping you back. I think the mirror term is the key (and the graph, no question), that's why I talk about a B-tree in the package doc, but that may not be the most appropriate data structure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This graph is actually the one you associated with the new explanation that talks about nodes… I'm interested in why you reached that conclusion.
Ah, sorry if I didn’t make that clear. Here’s the sticky bit:
Then, a new root is created to point to the old root (incorporating it as its first child), which proceeds to be filled up
So according to that sentence, the new root is the the thing that “gets filled up (link) to more nodes.” That’s technically true, but since it doesn’t say anything about creating intermediary parents/roots/internals/whatever, it sounds like we are filling the rest of the new root with leaf nodes.
(The old explanation was clearer for me on this point because it says “incorporates a new child, which proceeds to be filled up.” It’s explicit about it being the new child that is filled.)
I think the mirror term is the key
I do think “mirror” is useful, BUT I also wrote all of this comment before getting to the bottom where you wrote “Balanced DAGs are generalistic DAGs in which all leaves are at the same distance from the root,” which totally resolved the question for me. Just moving that from the bottom to be directly below this introductory paragraph (or below a diagram which is below the intro paragraph) would also be plenty clear, I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+-------------+
| Root 6 |
+-------------+
|
+--------------------------+----------------------------+
| |
+-------------+ +-------------+
| Root 3 | | Root 5 |
+-------------+ +-------------+
| |
+-------------+-------------+ +-------------+ - - - -
| | |
+-------------+ +-------------+ +-------------+
| Root 1 | | Root 2 | | Root 4 |
+-------------+ +-------------+ +-------------+
| | |
+------+------+ +------+------+ +------+ - - -
| | | | |
+=========+ +=========+ +=========+ +=========+ +=========+
| Chunk 1 | | Chunk 2 | | Chunk 3 | | Chunk 4 | | Chunk 5 |
+=========+ +=========+ +=========+ +=========+ +=========+
I'm just realizing that most of the graphs have too many Root
nodes, if by that we're meaning a node that was at some point a root node created by Layout
(and has since become an internal node when the DAG grew in depth). Root 1
is correct but Root 3
is actually Root 2
(if we're numbering in chronological order), the one drawn as 2 was never actually a root node, just an internal node generated by the fillNodeRec
function at the time that (what you're drawing as) Root 3
was the root node and Root 1
was just an internal node (fillNodeRec
was mirroring that internal node that used to be a root). Is that clear? Maybe we could add a more descriptive legend, e.g.,
+-------------+
| Root 2 |
+-------------+
|
+-------------+---------+
| |
+---------------+ +--------------+
| Internal 1 | | Internal 3 |
| (ex Root 1) | | |
+---------------+ +--------------+
(BTW, have you been doing all these graphs manually? I just did a partial redrew of one of yours and became exhausted, we definitely need a tool for this.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
most of the graphs have too many Root nodes, if by that we're meaning a node that was at some point a root node created by Layout… the one drawn as 2 was never actually a root node
Ah, I didn’t mean anything that specific by root; only that it is the root of some sub-graph. I was trying not to change the name of a node between two diagrams that represent the graphs at different points in time and also trying not to introduce too much complexity. If you found that confusing, though, maybe it was a bad approach.
Maybe just “node?” (Still keep the leaves as “chunk,” though.) Not really sure.
+-------------+
| Node 1 |
+-------------+
|
+------+------+
| |
+=========+ +=========+ + - - - - +
| Chunk 1 | | Chunk 2 | | Chunk 3 |
+=========+ +=========+ + - - - - +
↯
+------------+
| Node 2 |
+------------+
|
+-------------+-------------+
| |
+-------------+ +-------------+
| Node 1 | | Node 3 |
+-------------+ +-------------+
| |
+------+------+ +------+ - - -+
| | | |
+=========+ +=========+ +=========+ + - - - - +
| Chunk 1 | | Chunk 2 | | Chunk 3 | | Chunk 4 |
+=========+ +=========+ +=========+ + - - - - +
Root 1 is correct but Root 3 is actually Root 2 (if we're numbering in chronological order)
Yeah, after reading the code, I realized this was the case. I was sort of numbering them chronologically, but it wasn’t clear from the description which was the right order (I don’t think the description actually needs to make it clear — it’s not important to a consumer of the package and would add unnecessary complex detail to a package-level explanation). It would be a good idea to renumber the chronologically anyway, though :)
have you been doing all these graphs manually?
Yes, I have probably drawn too many weird ASCII diagrams at this point in my life :P
For diagrams like this, I’m not sure they’re so complex that finding and making a tool do what you want is really all that worthwhile (any more complex and it is, though). I will say that an editor with rectangular selections (TextMate, VS Code, etc.) is essential, though. You have to be able to pick up chunks of the graph and move them around, otherwise manipulating these graphs is insanely painful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, sorry if I didn’t make that clear. Here’s the sticky bit:
Then, a new root is created to point to the old root (incorporating it as its first child), which proceeds to be filled up
So according to that sentence, the new root is the the thing that “gets filled up (link) to more nodes.” That’s technically true, but since it doesn’t say anything about creating intermediary parents/roots/internals/whatever, it sounds like we are filling the rest of the new root with leaf nodes.
(The old explanation was clearer for me on this point because it says “incorporates a new child, which proceeds to be filled up.” It’s explicit about it being the new child that is filled.)
Yes, I see it now, thanks. I think I will also reserve the term fill for the parent of the leaf nodes to make that extra clear (although any node can be filled, but just for the sake of the explanation).
Just moving that from the bottom to be directly below this introductory paragraph (or below a diagram which is below the intro paragraph) would also be plenty clear, I think.
Ok. I think the mirror term can be especially useful in the Layout
explanation when we're talking about how DAGs are built, and leave the balanced term for the presentation of the full DAG as you said.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be a good idea to renumber the chronologically anyway, though :)
Yes, let's use correct numbering without explaining them (at least in the intro, maybe explain them in the Layout
function, which might be read more by an interested developer than by the consumer of the package).
For diagrams like this, I’m not sure they’re so complex that finding and making a tool do what you want is really all that worthwhile (any more complex and it is, though). I will say that an editor with rectangular selections (TextMate, VS Code, etc.) is essential, though. You have to be able to pick up chunks of the graph and move them around, otherwise manipulating these graphs is insanely painful.
Yes, I use the Sublime editor for that. My worry is not complexity but time consumption, as of now I haven't seen any diagrams like these in the code base, and if we want to incentivize their incorporation we need to make it damn easy for the developer to do it, otherwise it won't be done; and we need to find a tool for that, so the developer can focus exclusively (or as much as possible) in what is being done (the layout of the DAG) instead how it will be done (the editing part, not everyone is an ascii ninja like you :)).
importer/balanced/builder.go
Outdated
// raw nodes). In the case the entire file fits into just one node it | ||
// will be formatted as a (single) leaf node (without parent) with the | ||
// possible representations already mentioned, this is the only scenario | ||
// where the root can be of a type different that the UnixFS node. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I’m not sure what “Filestore nodes (supported by raw nodes)” means.
In “possible representations already mentioned, this is the only scenario”, replace the comma with a period or put the “this is the only scenario…” part in parentheses. That helps keep each sentence to one clear idea and makes them much more readable.
It also might be helpful to say “(see the go-ipfs/unixfs package for details of UnixFS.)” at the end. Or something like that to point to more details on UnixFS.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I’m not sure what “Filestore nodes (supported by raw nodes)” means.
I left a vague description on purpose because I don't know exactly how Filestore operates, just that it's stored on raw nodes, support in communication protocols terminology is which type of lower layer packet carries your data, but I can just say that it's a format stored on a raw node.
In “possible representations already mentioned, this is the only scenario”, replace the comma with a period or put the “this is the only scenario…” part in parentheses. That helps keep each sentence to one clear idea and makes them much more readable.
It also might be helpful to say “(see the go-ipfs/unixfs package for details of UnixFS.)” at the end. Or something like that to point to more details on UnixFS.
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a vague description on purpose because I don't know exactly how Filestore operates
Ha! That’s reasonable, but can you point to a package where the reader can dig more into it (like I suggested with UnixFS)?
support in communication protocols terminology is which type of lower layer packet carries your data, but I can just say that it's a format stored on a raw node.
This part matches what I thought, but I think your more plain language version (“format stored on a raw node”) is better. 👍
importer/balanced/builder.go
Outdated
// | ||
// Balanced DAGs are generalistic DAGs in which all leaves | ||
// are at the same distance from the root. | ||
// The nodes are filled by recursion (in a DFS way) so the DAG is built |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I’m not sure what DFS is (guessing “depth-first search,” but also this isn’t really a seach). Maybe spell this one out? (“DAG” is ok because it’s not ambiguous in this context and we’re repeating it a lot.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, on second thought, maybe you don’t need to say “(in a DFS way)” at all here. It doesn’t seem like it’s saying anything more useful than what “so the DAG is build from the bottom up” already makes clear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In a manner similar to a depth-first search, I think that concept can be much more familiar to the developer than the DAG term (even if we repeated a lot, it's not that common compared to a basic tree search). But yes, the "bottom up" term might be more clear here, I'll put the depth-first search at the end of the sentence to not distract the reader from the main point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that concept can be much more familiar to the developer than the DAG term
Totally! Sorry, I meant that I don’t think the abbreviation is as clear. I’d heard of “depth-first [search/traversal/construction/whatever]” long before I’d ever learned about “directed acyclic graphs,” but I’ve never seen “DFS” used to refer to it (at least I don’t recall it).
importer/balanced/builder.go
Outdated
// root). In this way the DAG acts like a B-tree (a more appropriate data | ||
// structure may be in order for this analogy), with each internal node | ||
// using the file size of its children as an index to perform the search | ||
// (seek) operation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about this rewording:
The nodes are filled recursively, so the DAG is built from the bottom up. Leaf nodes are created first using the chunked file data and its size. The size is then bubbled up to the parent (internal) node, which aggregates all the sizes of its children and bubbles that combined size up to its parent, and so on up to the root. This way, a balanced DAG acts like a B-tree when seeking to a byte offset in the file the graph represents: each internal node uses the file size of its children as an index when seeking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perfect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then yes, I better lose the DFS term, because that will clash with the explanation of the actual search, which is not of a DFS type, I'll just put an emphasis on the recursive aspect of the construction (that doesn't belong to the search).
importer/balanced/builder.go
Outdated
// (seek) operation. | ||
// | ||
// Balanced DAGs are generalistic DAGs in which all leaves are at the | ||
// same distance from the root. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah! I guess this answers my question from the top :)
Maybe a good idea to move this paragraph up there.
importer/helpers/dagbuilder.go
Outdated
return node, nil | ||
} | ||
|
||
// FSNodeOverDag encapsulates an `ft.FSNode` that will be stored in a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might just be my general newness to Go, but I find the ft
here confusing (I know I can look up the fact that it’s referring to the unixfs
package at the top, but I keep having to do that when I’m new to files AND none of that shows up in the Godoc). What if we said:
FSNodeOverDag encapsulates a UnixFS
fs.FSNode
? Is that any better, or is it just confusing in a different way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is definetely not you (see #5056).
? Is that any better, or is it just confusing in a different way?
I'd rather rename the package to its full name and have unixfs.FSNode
(and yes, the FS
in FSNode
is also a confusing redundancy, that's what namespaces are for).
importer/helpers/dagbuilder.go
Outdated
// internal `FSNode` in the process of creating a UnixFS DAG, this | ||
// structure stores an `FSNode` cache to manipulate it (add child nodes) | ||
// directly , and only when the node has reached its final (immutable) | ||
// state it is committed to a single (indivisible) `ipld.Node`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
“state it is committed” → “state is it committed”
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might add that moving to its final state is done by calling Commit
:
…and only when the node has reached its final state (signaled by calling
Commit()
) is it committed…
Otherwise it sounds like that happens automatically as a result of doing some other routine action.
importer/helpers/dagbuilder.go
Outdated
// representations of data leaf nodes (that don't use raw nodes or | ||
// Filestore). | ||
// | ||
// It aims at replacing the `UnixfsNode` structure which encapsulated too |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
“at replacing” → “to replace”
importer/helpers/dagbuilder.go
Outdated
} | ||
|
||
// SetFileData stores the `fileData` in the `ft.FSNode`. It | ||
// is used only when `FSNodeOverDag` represents a leaf node |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
“is used” → “should be used” ?
This code doesn’t control how it’s called, so this should be a recommendation to the reader about how to use it, not an assertion about what some other part of the codebase might or might not do.
// that represents them: the `ft.FSNode` is encoded inside the | ||
// `dag.ProtoNode`. | ||
// | ||
// TODO: Evaluate making it read-only after committing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this idea (or having a dirty flag) :)
@Mr0grog 😍 |
I agree 100% on this, that's why I asked about tools to programmatically construct graphs like the ones you did (ipfs-inactive/docs#78), |
@Mr0grog Could you take a look at the documentation fixes in the last commit? I moved most of it to the |
I've made several changes to the comments of |
@Stebalien @whyrusleeping While we give the final touches to the documentation, could you take a look at the code please? Especially the added |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks pretty clean to me, great job!
} | ||
|
||
// Convert this leaf to a `FilestoreNode` if needed. | ||
node = db.ProcessFileStore(node, dataSize) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the downside of doing things this way is that we hold the node in memory for longer than we really need to (technically just need to hash the data, could be done entirely on stack if go let us)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not following, this way refers to this PR or in general? From what I understand, we normally hold work-in-progress nodes to avoid constantly (un)packing while working on them, reducing computation time at the cost of more memory, but I'm not sure if you're referring to that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, i'm just referring to the fact here that this 'node' is a full-on dag RawNode that holds a reference to the 256k buffer where it technically doesnt need to (since its just a raw data buffer under the hood).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, something that I'm seeing throughout the entire code is this mixture of node structures that are used interchangeably for (i) reading/writing and for (ii) building/cache. Ideally, if you are retrieving content (or storing what you received from another peer) the first type is just fine, but if you are building stuff yourself (and that is the very common operation of just adding content with ipfs add
) then you need an entirely different object, that yes, will look like the first one (and the first one will even be its end product) but actually will behave very differently: it will cache many things while in the process of modifying them, it will have information in a mutable state, it will allow (what the first type would see as) invalid states.
Ideally again, I would expect to have different structures (with a common base or shared methods) to reflect that. And going back to the original raw node comment, yes, someone reading (or writing) the node needs the buffer, but someone building it may only need partial information like the hash.
Sorry, I'm not sure if you're original remark was going in this direction, but it made me think of this situation that I was encountering repeatedly while working on this PR.
@Stebalien PTAL |
1 similar comment
@Stebalien PTAL |
importer/balanced/builder.go
Outdated
// are at the same distance from the root. | ||
// Package balanced provides methods to build balanced DAGs, which are generalistic | ||
// DAGs in which all leaves (nodes representing chunks of data) are at the same | ||
// distance from the root. Nodes can have only a maximum number of children, to be |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really minor, but the comma after children should be a period or a semicolon.
importer/balanced/builder.go
Outdated
// | | | | | | ||
// +=========+ +=========+ +=========+ +=========+ +=========+ | ||
// | Chunk 1 | | Chunk 2 | | Chunk 3 | | Chunk 4 | | Chunk 5 | | ||
// +=========+ +=========+ +=========+ +=========+ +=========+ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
importer/balanced/builder.go
Outdated
// `fillNodeRec` will return and the loop will be repeated: the `newRoot` created | ||
// will become the old `root` and a new root will be created again to increase the | ||
// depth of the DAG. The process is repeated until there is no more data to add | ||
// (`Done()`). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This bit about Done()
doesn’t seem totally clear. Maybe:
…no more data to add (i.e. the DagBuilderHelper’s
Done()
function returnstrue
).
// and depth only increases when the tree is full, that is, when | ||
// the root node has reached the maximum number of links. | ||
// Layout builds a balanced DAG layout. In a balanced DAG of depth 1, leaf nodes | ||
// with data are added to a single `root` until the maximum number of links is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at this, especially in the context of the generated godoc, I wonder if most of this should move to a comment inside the the function so it doesn’t read as public documentation (it’s definitely still helpful clarification for someone working on the balanced builder here). Otherwise, it reads as a lot of detail about how this function works that isn’t necessary (and might be confusing) for a user of the function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Totally true, but I can't find any other place in the function for this information, Layout
actually has only a few code lines, none of which could be easily associated with the two introductory paragraphs, the "magic" is happening in the recursion call, which is hiding a lot behind the scenes (something I'm not really a fan of). Anyway, I will have this advice in mind when writing documentation (that was actually my motivation when moving much of the package description inside the individual functions).
importer/balanced/builder.go
Outdated
// | ||
// It returns the `ipld.Node` representation of the passed `node` filled with | ||
// children and the `nodeFileSize` with the total size of the file chunk (leaf) | ||
// nodes stored under this node (to propagate it upwards in the DAG to its parent). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of:
(to propagate it upwards in the DAG to its parent)
I might say:
(parent nodes store this to enable efficient seeking through the DAG when reading data later)
I think it’s self-evident that the data is being propagated upward, and this describes a little more clearly why we want that information.
Thanks @Mr0grog for the very thorough review and invaluable diagrams!! This is already shaping up to be a nice reading reference for a new user who wants to get a deeper understanding of how files are handled in IPFS. |
(edit: strike that, I've just seen that 0.4.16 still hasn't been released, let's leave this PR for the next one.) |
In need of a rebase now that other PRs have landed |
Fixed, waiting on the tests.. |
Broke the build. |
@schomatis whoops. You got a fix or should I handle? |
@whyrusleeping Fixed in #5239, should be merged first. |
The `UnixfsNode` structure has multiple pointers to many (non-complementary) mutually exclusive node types, only some of them are active (not-`nil`) at a given time in the code path which made the code too convoluted. Specifically, the most important distinction between node types was being hidden: leaf nodes vs internal (non-leaf) nodes. Remove entirely the use of `UnixfsNode` from the `balanced` package replacing it in turn with the newly created `FSNodeOverDag` structure that represents the UnixFS node encoded inside the DAG node, primarily used for internal node representations. Leaf nodes are handled exclusively in the `NewLeafDataNode` encapsulating its multiple representations (that we're previously exposed in `UnixfsNode` as conflicting pointers). The `builder.go` file has been completely rewritten, although the basic DAG creation algorithm has been preserved (extending a full DAG by creating a new root and linking the old one as its child), the most significant modification has been in the loop of `Layout` that now only handles internal nodes (i.e., nodes with `depth` bigger than zero) to be able to adapt `fillNodeRec` to only that scenario (avoiding the replace logic of the zero `depth` case with the defective `Set` function, now removed). The `fillNodeRec` now explicitly returns the `ipld.Node` and the size of the file data it's storing to propagate it upwards into the DAG. The `DagBuilderHelper` was heavily extended to incorporate `ipld.Node` functions that would replace the `UnixfsNode` ones used by the balanced builder: `NewLeafNode()`, `NewLeafDataNode()` and `AddNodeAndClose()`. Also, the `ProcessFileStore` function was incorporated to encapsulate all the logic related to the Filestore support which was scattered throughout the builder logic, the `offset` that was being passed through most functions is now a part of the `DagBuilderHelper`. This has turned out to be a rather big commit, it should have been split into more smaller and logically cohesive commits, but the `UnixfsNode` was too entangled inside the logic and that would have required a progressive modification of the `UnixfsNode` structure as well, which wasn't possible as it is still being used by the balanced builder (the same reason why most of the `UnixfsNode`-related functions cannot yet be removed, leaving the `helpers.go` file mostly untouched). License: MIT Signed-off-by: Lucas Molas <[email protected]>
@whyrusleeping RFM now |
@schomatis Woo! Thanks for all of these PRs. Making things easier to read and understand is so great |
The
builder.go
file has been completely rewritten, should be reviewed as an entirely new file.The
dagbuilder.go
has mostly additions and can be reviewed as a diff.Closes #5106.