Article: options and tradeoffs around data import parameters #1176

Open
Tracked by #428
lidel opened this issue Jun 10, 2022 · 9 comments
Labels
need/author-input Needs input from the original author need/triage Needs initial labeling and prioritization status/blocked Unable to be worked further until needs are met

Comments

@lidel
Member

lidel commented Jun 10, 2022

Most people are fine with whatever chunker and hash function are the current defaults in commands that import data to IPFS.
In the case of go-ipfs, these are ipfs add, ipfs dag put, and ipfs block put.

However, ipfs add accepts more than a custom --chunker and --hash function: one can also produce a trickle DAG instead of the default balanced Merkle DAG by passing --trickle, enable or disable --raw-leaves, or even write one's own software that chunks, hashes, and assembles UnixFS DAGs in novel ways.

One can go beyond that and import JSON data as dag-json or dag-cbor, creating data structures beyond regular files and directories.

We need an article that explains:

  • what is the current default when importing files and why
    • chunker (why we use size-based, when to use rabin or buzhash)
    • hash (why we use sha2-256)
    • raw leaves (possible and default when cidv1 is used, but legacy implementations used cidv0 without this)
    • cid version
      • we should document cid v1 as the default, but note that legacy implementations may use v0
    • dag type (--trickle better suited for append-only data such as logs?)
  • what are the knobs one can change during import, and what is their impact/tradeoffs
  • things to hint at, but no need to go too deep
    • note dag-pb alternatives exist, mention dag-json and dag-cbor, and hint when using non-UnixFS DAGs makes sense

Prior art:

@lidel lidel added the need/triage Needs initial labeling and prioritization label Jun 10, 2022
@RubenKelevra
Contributor

Hey @lidel, feel free to assign me. Got time tomorrow for that. :)

@lidel lidel added dif/hard Having worked on the specific codebase is important P2 Medium: Good to have, but can wait until someone steps up effort/days Estimated to take multiple days, but less than a week and removed need/triage Needs initial labeling and prioritization labels Jun 10, 2022
@RubenKelevra
Contributor

@lidel wrote:

what is the current default when importing files and why

  • chunker (why we use size-based...)

I may need some input here. I actually can't think of a reasonable explanation why size-based is better than a rolling chunker.

Maybe someone like @Stebalien can chime in here and tell me why the decision was made to use a size-based chunker by default. :)
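For readers weighing the two approaches, here is a minimal sketch of the difference. It is not Kubo's implementation: the rolling chunker is a toy windowed byte sum (neither rabin nor buzhash), and the names `fixed_chunks`, `rolling_chunks`, and `reused` are made up for illustration. It only shows the general property that a one-byte insert shifts every fixed-size chunk, while content-defined boundaries resynchronise.

```python
import hashlib


def fixed_chunks(data: bytes, size: int = 256) -> list[bytes]:
    """Size-based chunking: cut every `size` bytes, regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def rolling_chunks(data: bytes, window: int = 16, divisor: int = 64) -> list[bytes]:
    """Toy content-defined chunking (hypothetical, for illustration only).

    A boundary is placed after byte i when the sum of the last `window`
    bytes is congruent to divisor - 1 modulo divisor, so each boundary
    depends only on nearby content, not on the absolute offset.
    """
    chunks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= window:
            rolling -= data[i - window]  # drop the byte that left the window
        if i + 1 >= window and rolling % divisor == divisor - 1:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes after the last boundary
    return chunks


def reused(old: list[bytes], new: list[bytes]) -> int:
    """Count chunks of `new` that already exist in `old` (deduplicated blocks)."""
    seen = set(old)
    return sum(1 for chunk in new if chunk in seen)


# Deterministic pseudo-random test data (8192 bytes), then one byte prepended.
original = b"".join(hashlib.sha256(i.to_bytes(4, "big")).digest() for i in range(256))
edited = b"!" + original

for name, chunker in (("size-based", fixed_chunks), ("rolling", rolling_chunks)):
    old, new = chunker(original), chunker(edited)
    print(f"{name}: {reused(old, new)} of {len(new)} chunks reused after the edit")
```

The trade-off this hides is cost and predictability: a size-based chunker does no per-byte work and always yields the same block sizes, which may be part of why it is a conservative default.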

@RubenKelevra
Contributor

  • dag type ( --trickle better suited for append-only data such as logs?)

Correct me if I'm wrong, but it's just a little less overhead for data that is read from front to back anyway. So any file type that needs random access will be slowed down.

Logs are not large enough to make any significant difference here, as you can easily fit a list of all chunks of a log in one block.

One might think of zip-like archives, ISO files, or videos as candidates, but that doesn't hold either. Zip files are random access, ISO files can be mounted without reading the whole image, and video streaming with seeking is pretty much the norm.

I also cannot think of a really good use case here, so I would flag it as a "stable, but experimental" option.

@RubenKelevra
Contributor

RubenKelevra commented Jun 14, 2022

  • hash (why we use sha2-256)

I feel like I may not be the right person after all to write this article :D I wrote a ticket to change this default actually – and I still think blake2b is the better default. :)

So I guess "standards?" Or "legacy stuff we don't dare to change?"
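To make the hash and raw-leaves knobs concrete, here is a small sketch of how a CIDv1 for a single raw leaf could be assembled by hand with either hash function. It assumes the multicodec codes 0x55 (raw), 0x12 (sha2-256), and 0xb220 (blake2b-256); the function names are hypothetical, and this is not how any particular IPFS implementation structures its code.

```python
import base64
import hashlib

RAW_CODEC = 0x55        # multicodec: raw bytes (what --raw-leaves produces)
SHA2_256 = 0x12         # multicodec: sha2-256
BLAKE2B_256 = 0xB220    # multicodec: blake2b-256


def varint(n: int) -> bytes:
    """Unsigned LEB128 varint, as used by multiformats."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def multihash(code: int, digest: bytes) -> bytes:
    """<hash function code><digest length><digest>."""
    return varint(code) + varint(len(digest)) + digest


def raw_leaf_cid(chunk: bytes, hash_name: str = "sha2-256") -> str:
    """CIDv1 (base32) of one chunk stored as a raw leaf, hashed directly."""
    if hash_name == "sha2-256":
        mh = multihash(SHA2_256, hashlib.sha256(chunk).digest())
    elif hash_name == "blake2b-256":
        mh = multihash(BLAKE2B_256, hashlib.blake2b(chunk, digest_size=32).digest())
    else:
        raise ValueError(f"unsupported hash in this sketch: {hash_name}")
    cid_bytes = varint(1) + varint(RAW_CODEC) + mh  # version, codec, multihash
    # Multibase prefix "b" means lowercase base32 without padding.
    return "b" + base64.b32encode(cid_bytes).decode("ascii").lower().rstrip("=")


chunk = b"hello"
print(raw_leaf_cid(chunk, "sha2-256"))
print(raw_leaf_cid(chunk, "blake2b-256"))
```

Because the hash function is recorded inside the CID itself, the same bytes get different but equally valid CIDs under each choice, which is what lets implementations pick different defaults.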

@RubenKelevra
Contributor

So overall, just the "why?" and rationale is the blocker for me to write it.

In my opinion, these should be the standards, and I don't see a good reason to use anything else. :)

  • Rolling chunker aka buzhash
  • cidv1
  • raw-leaves
  • blake2b-256

And I use them everywhere.
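As a rough illustration, the recipe above maps onto the ipfs add flags mentioned at the top of this issue (--chunker, --hash, --raw-leaves, --trickle) plus --cid-version. The helper name `ipfs_add_argv` and its parameters are hypothetical; the sketch only builds the argument list and does not run anything.

```python
def ipfs_add_argv(
    path: str,
    chunker: str = "buzhash",
    hash_name: str = "blake2b-256",
    cid_version: int = 1,
    raw_leaves: bool = True,
    trickle: bool = False,
) -> list[str]:
    """Build an `ipfs add` command line for the given import parameters."""
    argv = [
        "ipfs",
        "add",
        f"--chunker={chunker}",
        f"--hash={hash_name}",
        f"--cid-version={cid_version}",
        f"--raw-leaves={'true' if raw_leaves else 'false'}",
    ]
    if trickle:
        argv.append("--trickle")  # trickle DAG layout instead of balanced
    argv.append(path)
    return argv


# Print the command for the recipe; "backup.tar" is a placeholder file name.
print(" ".join(ipfs_add_argv("backup.tar")))
```

On a machine with Kubo installed, the returned list could be passed to `subprocess.run(argv, check=True)`. Note that changing any of these parameters changes the resulting CID for the same file.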

So @lidel if you could just give some rationale for the whys (doesn't even need to be full sentences) I'm happy to write it. Just stop me if it gets too detailed ;)

@lidel
Member Author

lidel commented Jun 14, 2022

@RubenKelevra no need to write the whole thing, it is perfectly fine if you only write sections that you care about (even if it is only chunker) and open a PR draft with that, we will fill the gaps :)

You are right, many choices like default chunker are legacy decisions – just write that and note that different implementations of IPFS are free to choose different defaults (e.g. blake2b).

Totally, it will even be useful to give some "Recipes" like the one you listed with blake2b and buzhash, and to elaborate on why one would prefer that over the "safe"/legacy defaults. :)

@RubenKelevra
Contributor

@RubenKelevra no need to write the whole thing, it is perfectly fine if you only write sections that you care about (even if it is only chunker) and open a PR draft with that, we will fill the gaps :)

Alright. :)

You are right, many choices like default chunker are legacy decisions – just write that and note that different implementations of IPFS are free to choose different defaults (e.g. blake2b).

Totally, it will even be useful to give some "Recipes" like the one you listed with blake2b and buzhash, and to elaborate on why one would prefer that over the "safe"/legacy defaults. :)

Maybe we should just add a "--use-legacy-defaults" flag to the daemon (and as a global flag for all commands), so we no longer have to worry that people rely on the current defaults.

This would also free us up for the long-discussed default ports, for example, which we also don't dare to change for similar reasons. :)

This way we can document the "legacy defaults" once and why they were chosen and then elaborate why the new defaults are better.

I feel that would make more sense when reading - and also more sense when using ipfs.

@johnnymatthews johnnymatthews moved this to Needs triage in Protocol Docs Jul 8, 2022
@johnnymatthews johnnymatthews moved this from Needs triage to Backlog in Protocol Docs Jul 8, 2022
@ElPaisano ElPaisano removed the P2 Medium: Good to have, but can wait until someone steps up label Apr 4, 2023
@ElPaisano
Contributor

@lidel triaging old issues, would you say this is still relevant?

@ElPaisano ElPaisano added need/author-input Needs input from the original author need/triage Needs initial labeling and prioritization status/blocked Unable to be worked further until needs are met and removed dif/hard Having worked on the specific codebase is important effort/days Estimated to take multiple days, but less than a week labels Aug 22, 2023
@lidel
Member Author

lidel commented Aug 22, 2023

@ElPaisano yes, I believe this is untapped potential in the IPFS ecosystem, and having some introductory docs might empower people to innovate in this area.

There is a need for two articles (or one with two sections):

  • introductory style that explains the defaults and knobs in software like Kubo and Helia
  • DIY style on writing your own data onboarding tools which do custom chunking (good example in specs here and JS code here)

The goal would be to convey that chunking details are a userland feature: anyone can use default chunking or roll their own.
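To illustrate the "roll your own" point, here is a minimal sketch of a format-aware chunker for newline-delimited logs. It is an assumption-laden toy, not taken from the specs or JS code referenced above: `line_chunks` and its `target` parameter are hypothetical, and the sample log is made up. Chunks are cut only at line boundaries, so appending new entries leaves all earlier chunks (and therefore their hashes) unchanged.

```python
import hashlib


def line_chunks(data: bytes, target: int = 64) -> list[bytes]:
    """Cut chunks only at line boundaries, once at least `target` bytes are collected."""
    chunks, current = [], bytearray()
    for line in data.splitlines(keepends=True):
        current += line
        if len(current) >= target:
            chunks.append(bytes(current))
            current = bytearray()
    if current:
        chunks.append(bytes(current))  # final, possibly short, chunk
    return chunks


# A made-up log, and the same log after one more entry is appended.
log = b"".join(f"2022-06-10 entry {i}\n".encode() for i in range(20))
grown = log + b"2022-06-11 entry 20\n"

old_chunks, new_chunks = line_chunks(log), line_chunks(grown)
for chunk in new_chunks:
    status = "reused" if chunk in old_chunks else "new"
    print(hashlib.sha256(chunk).hexdigest()[:12], len(chunk), status)
```

A real onboarding tool would still need to wrap such chunks into a DAG (UnixFS or otherwise) and hash them into CIDs; the sketch covers only the boundary decision.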
