-
Notifications
You must be signed in to change notification settings - Fork 4
Conversation
Does it know how not to choke itself? (i.e. trying to batch 10GB at the same time?) |
This is great but I can't help but think it doesn't belong at this layer in the stack, better in |
it should never batch large amounts of data unless the user changes the chunker, even then it would be the same as without batching. |
i know what you mean and agree, i tried to make it like take but with the current code this is the only place we can do it without adding complexity. if we extract blockstore to unixfs it would be much easier. |
After a quick look it seems that js-ipld is only used for
Adding Also I don't see much benefit in using js-ipld for unixfs. One of the points of js-ipld is, that it is codec agnostic and can work with any provided one and would do the right thing. In unixfs we exactly know which codec it is, and it is always only DagPB. So I would be in favour of using the BlockService directly. |
I think it's important to have boundaries and defined responsibilities in a system. If we have everything depend on everything else it becomes a mess. I would like this module to be responsible for creating dags from streams of buffers, I would like other modules to be responsible for serializing and persisting nodes and they are then free to apply whatever storage optimisations they see fit, including batching writes. |
i understand and agree but its called can we decouple ipld from blockservice, and in unixfs serialize manually without ipld and use the blockservice directly to persist ? ill port this to the blockservice. What you think ? |
@vmx IPLD should be making things like Unixfs importers & exporters more easy to do, not say to not use it. There are multiple operations here that Unixfs has to do manually (fetching node by node, checking which nodes it points to) that can be done by js-ipld, it's knowledge of the graph and its understanding of GraphSync. |
That won't be easy. That's planned for the next generation js-ipld, which won't be a thing soon. Is that actually needed? Why not just passing the BlockService into unixfs instead of js-ipld? |
thats what i meant ^^ |
I would agree for > UnixFSv1. But as UnixFSv1 is now opimized for perf, it just doesn't make sense to use js-ipld, as DagPB isn't even a format that supports the full IPLD Data Model.
Those things used to be considered part of IPLD, but that's not the case anymore. IPLD has a different (smaller?) scope these days. Those things would now be on a different layer (which then IPLD will probably use). |
The benefit of a performance improvement of this magnitude will impact adoption in significant ways. Thanks for digging into it! |
@vmx and @achingbrain should i go forward with this #38 (comment) ? |
Do you mean:
If so, yes please.
It's true, but we want the performance improvement to remain even after this module is retired for The responsibilities are:
We should not be optimising how we store blocks at a high level - we should pass stream of blocks to ipld, which should pass a stream of blocks to the block service, which should pull them into batches and write as appropriate, if that's the fastest way to write them to the available storage. |
An implication of this change is that importing files is no longer atomic, because we start returning CIDs before they've been written to the block store. The time window is small but applications will be able to receive CIDs from blocks that they've just added, request the block with the CID and find that it doesn't exist. If the application crashes or otherwise can't write the blocks (disk full, solar flare, I don't know), future users of the block store may not find the blocks they expect to be there. |
🔦 Great thing to highlight! Before merging this PR, there must be a test with a faulty repo that fails some of the writes and than the test compares the CIDs that got returned with the CIDs that are in the repo. All of them should be there. |
const timer = setIntervalAsync(async () => { | ||
const temp = nodesToPersist | ||
nodesToPersist = [] | ||
await ipld.putBatch(temp, options) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where does the error go if this throws? Should we retry the batch? What if retrying fails?
A few quick notes on We put a lot of thought into how we can produce libraries that are useful both in IPFS as dependencies but also useful independently in other libraries or even in something like There’s actually 2 libraries I’ll be producing.
|
Is this also going to break back-pressure during large file imports? If we’re continuing to import data without flushing to disc then we’re going to eventually run out of memory once the disc is under enough load. |
This is probably the wrong place to get into
It might minimise disruption to add
Interesting stuff. We already have something to prevent re-writing existent blocks in The existing mfs implementation uses the same unixfs importer/exporter as |
@vmx this is the kind of painpoints that any user of IPLD will hit and the reason why our Ethereum friends had to create https://github.com/ipld/js-ipld-graph-builder because the Perf of using js-ipld directly to add lots of dag nodes was terrible (which is a similar problem to what @hugomrdias is trying to fix, the only difference is that the graph-builder is designed for graphs that suffer a lot of changes, while unixfs is just dealing with big graphs). graph-builder is what lead for the consideration of ipld/js-ipld#66 |
@daviddias I agree. The problem with the current js-ipld is that it didn't get the layers of abstractions quite right. That will be fixed with future iterations of the js ipld stack. |
what is the plan on this PR? |
@hugomrdias was going to make these changes in the blockstore instead so the optimisation benefits other codecs and can live on with UnixFSv2. |
Also I don't think we got to the bottom of the conversation around whether we are ok with no longer having atomic writes. There's code in the filestore ensuring all writes are atomic, it'd all be redundant if this is how we want to do things going forward. |
Adds two new options: `fileImportConcurrency` This controls the number of files that are imported concurrently. You may wish to set this high if you are importing lots of small files. `blockWriteConcurrency` This controls how many blocks from each file we write to disk at the same time. Setting this high when writing large files will significantly increase import speed, though having it high when `fileImportConcurrency` is also high can swamp the process. It also: 1. Flattens module options because validating deep objects was clunky and the separation of access to config sub objects within this module isn't very good 1. Replaces `superstruct` and `deep-extend` with `merge-options` which is better suited for merging options and is smaller` 1. Replaces `async-iterator-*` modules with the more zeitgeisty `it-*` namespace Supersedes #38, sort of.
Adds two new options: `fileImportConcurrency` This controls the number of files that are imported concurrently. You may wish to set this high if you are importing lots of small files. `blockWriteConcurrency` This controls how many blocks from each file we write to disk at the same time. Setting this high when writing large files will significantly increase import speed, though having it high when `fileImportConcurrency` is also high can swamp the process. It also: 1. Flattens module options because validating deep objects was clunky and the separation of access to config sub objects within this module isn't very good 1. Replaces `superstruct` and `deep-extend` with `merge-options` which is better suited for merging options and is smaller` 1. Replaces `async-iterator-*` modules with the more zeitgeisty `it-*` namespace Supersedes #38, sort of.
Adds two new options: `fileImportConcurrency` This controls the number of files that are imported concurrently. You may wish to set this high if you are importing lots of small files. `blockWriteConcurrency` This controls how many blocks from each file we write to disk at the same time. Setting this high when writing large files will significantly increase import speed, though having it high when `fileImportConcurrency` is also high can swamp the process. It also: 1. Flattens module options because validating deep objects was clunky and the separation of access to config sub objects within this module isn't very good 1. Replaces `superstruct` and `deep-extend` with `merge-options` which is better suited for merging options and is smaller 1. Replaces `async-iterator-*` modules with the more zeitgeisty `it-*` namespace Supersedes #38, sort of.
* perf: concurrent file import Adds two new options: `fileImportConcurrency` This controls the number of files that are imported concurrently. You may wish to set this high if you are importing lots of small files. `blockWriteConcurrency` This controls how many blocks from each file we write to disk at the same time. Setting this high when writing large files will significantly increase import speed, though having it high when `fileImportConcurrency` is also high can swamp the process. It also: 1. Flattens module options because validating deep objects was clunky and the separation of access to config sub objects within this module isn't very good 1. Replaces `superstruct` and `deep-extend` with `merge-options` which is better suited for merging options and is smaller 1. Replaces `async-iterator-*` modules with the more zeitgeisty `it-*` namespace Supersedes #38, sort of. No batching but atomicity guarantees are maintained and performance gains are broadly similar with the right tuning.
I'm going to close this because importing speed has massively increased with #41 and will increase further with ipfs/js-ipfs#2771 neither of which compromise atomicity of writes. |
This PR adds support for batch puts of blocks into the datastore.
I duplicated the buildFile function with support for batch to be easier to benchmark, we can dedupe before merging this PR.
Two new options were added
batch
(Boolean) andbatchInterval
(number).batch
is just a toggle to enable batchingbatchInterval
is the time interval to accumulate Blocks before putting in the datastoreBatching improves performance and below you will see some numbers, but please run the benchmark files i added to the repo and test in your machines and provide feedback.
browser
node
Needs:
ipld/js-ipld#238