-
-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DAG Import/Export #7011
DAG Import/Export #7011
Conversation
c6f9e3b
to
f5d5715
Compare
f5d5715
to
ac54fc3
Compare
ac54fc3
to
aa50194
Compare
@Stebalien @rvagg @warpfork : this is the final implementation of the import/export functionality. The only thing missing are the sharness tests, which are coming some time Saturday: I am still not entirely happy with my test cases. Sorry it took forever, there were a lot of... dusty corners to connect properly. I specifically want someone to check my concurrency work ( around the progress meters ). I have not worked with such code a whole lot, and while there was a lot to crib from existing progress meters in ther commands, nothing was quite close to what I needed. I will convert this into a proper PR once I push the sharness bits. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🏎 💨
Co-Authored-By: Will <[email protected]>
// It is not guaranteed that a root in a header is actually present in the same ( or any ) | ||
// .car file. This is the case in version 1, and ideally in further versions too | ||
// Accumulate any root CID seen in a header, and supplement its actual node if/when encountered | ||
// We will attempt a pin *only* at the end in case all car files were well formed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
attempt a pin only at the end in case all car files were well formed
I can't quite see this in the code. It looks like it's done on a per-root basis, not per-car.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct - each individual root is opportunistically pinned, to maximize the benefit of what we just ingested being pinned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
better change the comment then, it seems misleading talking about the "car files" being "well formed" since you may find one root and pin that and not find another and not pin that.
// | ||
// if err := api.Pin().Add(req.Context, rp, options.Pin.Recursive(true)); err != nil { | ||
|
||
ret := RootMeta{Cid: &c, PresentInCar: seen} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if seen
is false
but the CID exists in the blockstore already. I think this would mean that it gets pinned regardless of the "malformed" CAR. Should the next block of if/else's be contained in an if seen
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this goes back to the question of "malformed" then, are you sure you want to have a lose edge like this? might be better to be a little more strict about the contents of a car - "you don't have the root you say you do, bad luck", instead you are opening up edges that could become important features for some in the future when we decide to be more strict with a future CAR format, or could lead to unexpected behaviour for others.
core/commands/dag/dag.go
Outdated
return sess.GetBlock(req.Context, c) | ||
} | ||
sb := gipselectorbuilder.NewSelectorSpecBuilder(gipfree.NodeBuilder()) | ||
car := gocar.NewSelectiveCar( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we care about the case where you don't have a complete graph available? I believe that the *
selector is going to result in an error if it reaches an edge that sess.GetBlock()
can't return for a CID. We just can't currently disambiguate the error type from ipld-prime to know whether that may be acceptable (e.g. if a --allow-incomplete
type flag was introduced), but we could probably change it to get one if that's needed. (https://github.com/ipld/go-ipld-prime/blob/eb71617f4aeb7e73f36276f14e76e47fd90241ef/traversal/walk.go#L119)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rvagg on export we want to either output the complete graph ( potentially grabbing it from the network ), or to fail. I.e. at present we do not at all have --allow-incomplete
in spirit, so that the .car files produced by go-ipfs are exclusively "well formed and deterministic" ( deterministic as dictated by go-ipld-prime ).
The import of course is a complete free-for-all ( also by design ) as can be seen here:
https://github.com/ipfs/go-ipfs/blob/03fe9bde87462aa11775bf6e8361a56fd967f89d/test/sharness/t0054-dag-car-import-export.sh#L41-L55
based on the datasets: https://github.com/ipfs/go-ipfs/blob/03fe9bde87462aa11775bf6e8361a56fd967f89d/test/sharness/t0054-dag-car-import-export-data/README.md
Let me know if this is potentially problematic...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment above, it's not necessarily problematic, it just opens up API feature space that may be difficult to claw back in the future if we decide to be more strict. My personal inclination with such things these days is to be explicit and strict and open up feature space only in response to requests and demonstrated need (it also gives you a much more explicit space to construct tests around). Not my call here, so just a word of caution.
- Stop printing during Run: it does not execute on local tty - Use tabs for easier parsing - Ensure countstats do not appear in json events
It is not possible to stream out "binary" and emit events at the same time. Thus switch from listing out how many objects did we export, to how many *bytes* are being spat out. Move everything to PostRun to keep emitting on Stderr of the calling process.
On relatively large amount of objects things get into a livelock ( spinning fans, no progress ) specifically after importing: https://ipfs.io/ipfs/QmbnqojkPFjZtkCtzzCEmUbAVoKEkJzdY5u7Z7anR6vAte this locks up: ipfs dag export bafy2bzaced4ueelaegfs5fqu4tzsh6ywbbpfk3cxppupmxfdhbpbhzawfw5oy
Failcases to come as a separate PR
We do not lose any blocks ( we are able to re-export ), so that's great But the fact that `repo gc` comes back with non-existent Qm CIDs... ugh
@@ -54,6 +54,7 @@ require ( | |||
github.com/ipfs/go-unixfs v0.2.4 | |||
github.com/ipfs/go-verifcid v0.0.1 | |||
github.com/ipfs/interface-go-ipfs-core v0.2.6 | |||
github.com/ipld/go-car v0.0.5-0.20200316204026-3e2cf7af0fab |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Go ahead and cut a CAR release when ready.
roots = ret.roots | ||
|
||
progressTicker.Stop() | ||
if err := res.Emit(&CarImportOutput{ObjectCounts: &counts}); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't safe. Emit
can do anything it wants with the object, including send it to another goroutine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason &counts
is a pointer ( here and your next comment ) is that this is the only way to activate omitempty
higher up: https://github.com/ipfs/go-ipfs/pull/7011/files#diff-15172926c4147422f472139313467179R73
Want to make sure you agree with the rationale before I add a bunch of mutexes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's fine although I'm not sure why omitempty is important in this case. My point is that we should be copying this object safely instead of sharing it.
|
||
retCh := make(chan importResult, 1) | ||
var counts ObjectCounts | ||
go importWorker(req, &res, &api, &counts, retCh) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is definitely not safe. We can't update counts from a different thread without a lock.
for { | ||
len, readErr := r.Read(buf) | ||
if len > 0 { | ||
if err := re.Emit(bytes.NewBuffer(buf[:len])); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may happen to work (should add a check to prevent it) when not running the daemon, but simply won't work when actually running the daemon. When running the daemon, we can only return:
- Repeated results of the same type (encoded as json). To do this, we have to specify the type.
- Streaming bytes (no type specified).
We can't return both.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wasn't sure abut both not being possible, so opted for the loop. Will simplify.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I misread this the first time. We can repeatedly stream new byte readers.
However, we really shouldn't read 4MiB, write 4MiB, etc. Ideally, we'd print out a rate and amount downloaded.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do not read/write 4MiB. This is the max we can read from the pipe at once, but other than that we just block and pull through whatever slice is placed on the pipe.
https://github.com/ipfs/go-ipfs/pull/7037/files#diff-15172926c4147422f472139313467179R418-R421
Returning progress on Additionally, the current method simply won't work with the current commands library. We can't mix data blocks and objects. |
Let's just drop progress on get. Progress on add is useful and should be pretty simple to implement. |
All I am doing is copying bytes from one pipe to another ( with a complication-conditional that I now realise is not needed ).
This code works 100%, I just elected not to add tests for it. I could do that if you are unconvinced :)
I can do that, but having long-running network operations without any way to tell "is this thing alive?" is pretty bad UX. Please let me know this is really what you want to do.
Progress on add ( import ) is already in this PR... what am I missing? |
|
||
if silent || encType != "text" { | ||
// force-disable progress unless on non-silent CLI | ||
req.Options[progressOptionName] = false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I ask for progress but set the encoding type to json
, I'd expect json progress updates. I wouldn't mess with this in pre-run.
// enable progress implicitly if a TTY | ||
errStat, _ := os.Stderr.Stat() | ||
if 0 != (errStat.Mode() & os.ModeCharDevice) { | ||
req.Options[progressOptionName] = true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd just set the progress option default to true in the options. That way, the client can override it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They can override it ( slightly rewritten ): https://github.com/ipfs/go-ipfs/pull/7037/files#diff-15172926c4147422f472139313467179R363-R373
Ok, I've been mixing up parts of import/export. I'll need to re-read this but it looks more reasonable on a second pass. However, I'd still like to punt some of this logic till a future version. This is very tricky code with subtle issues and addressing them all in one PR will be frustrating for all of us.
You're right, I was mixing up import/export and https://github.com/ipfs/go-ipfs/pull/7011/files#r398257738 (I thought this case was reachable). |
c, err := cid.Decode(req.Arguments[0]) | ||
if err != nil { | ||
return fmt.Errorf( | ||
"unable to parse selector (currently only bare CIDs are supported): %s", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the format of this argument when proper selectors are allowed? I assume it'll allow parsing of @creationix's new string-form selectors at some point, but aren't we still going to need to specify a root separate to the selector? Is that going to need some extra syntax to squish CID+selector together or will this need to be 2 arguments, one for CID and one for selector? And in that case, should this argument just be renamed from "selector" to "root" for now with the expected addition of a selector argument to come later?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See also https://github.com/ipld/go-car/blob/3e2cf7af0fab7d59ac0eb2ebc3dcad1c06fa68d0/selectivecar.go#L21-L24 where these two things are distinct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 When selector support is added, I'd advocate for a tuple here, yes.
Selectors are a declarative document which, when interpreted, navigates from one starting node, to a series of reached nodes (being all those visited) and matched nodes (being those which the selector... selects).
Starting a selector at a particular CID is a common operation -- great; it composes cleanly with the above.
Appearances of func foo(CID, Selector)
would look totally reasonable and idiomatic to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed as follows: https://github.com/ipfs/go-ipfs/pull/7036/files#diff-15172926c4147422f472139313467179R254-R263
The error text: https://github.com/ipfs/go-ipfs/pull/7036/files#diff-15172926c4147422f472139313467179R269
This should alleviate the duality ( @warpfork , @rvagg please confirm )
Been pondering the import semantics all day and want to continue on from my inline comment earlier: I think the first version of this should be fairly exclusive and not attempt to handle too much novelty. Rejecting zero-root CARs is probably a good idea for now (since it's essentially baked in go CARv1) and being more picky about how to handle the root-not-in-body case is probably a good idea too - it's badly formed. Don't even bother trying to pin that root. There's two main reasons for this line of thinking:
The two areas I can see justified for not making import a strict mirror of export in terms of its exclusiveness are: well ordered, structured and complete DAG ("deterministic", too hard, but we will want to make that reasonable to check in CARv2) and >1 roots (although export could easily be expanded here to accept more than one root already). |
I have addressed all EX port pieces in a separate PR to make thins simpler to review. |
New PR's: Regarding @rvagg comment earlier: I still feel that If the group feels this is not the direction we should pursue: I will add further strictness checks as requested, and will in parallel write something else to provide the "more convenient |
@Stebalien the sharness test is a fluke ( pubsub stuff failed ). In any case this PR is now too different from the "broken up" ones, I am closing it, and killing the branch, as all suggestions here got incorporated ( aside from the discussion on "is import ok as-is", which we will do in the import PR #7038 |
Implements #6870.