Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merge wip-v2 into master #178

Merged
merged 101 commits into from
Jul 16, 2021
Merged

Merge wip-v2 into master #178

merged 101 commits into from
Jul 16, 2021

Conversation

mvdan
Copy link
Contributor

@mvdan mvdan commented Jul 16, 2021

We've been working on CARv2 in the wip-v2 branch for some weeks.

In preparation for the first v2 beta, merge all of those commits into master.

The commits are mostly as they were originally, minus some minor squashing to clean up the history.

masih and others added 30 commits June 10, 2021 11:46
Since we are releasing a new major version, define a new go module for
the CAR v2 implementation, following the recommendations here:
- https://blog.golang.org/v2-go-modules

The module is versioned as go 1.15 to maintain compatibility as pointed
out by comment here:
- #75 (comment)
We might want to get this fixed at source; it is valid for a module to
have no go files i think, and CI should tolerate that.
Define the magic CAR v2 prefix proposed in ipld/specs#248 and assert
that it remains a valid CAR v1 header, along with other tests that
verify the expected length, content and version number.

Define basic structs to represent CAR v2 header, along with placeholder
for the future "characteristics" bitfield, and the optional padding
between CAR v1 dump and index added to the end of a CAR v2 archive.

Relates to:
- ipld/specs#248 (comment)
Reflecting on PR review comments, no harm in exposing this and it may be
useful to the users of the library.
Reflecting on the review comment, adding this field could provide
optimisation opportunities in the future in the context of block
alignment.

See:
- https://github.com/ipld/go-car/pull/80/files#r649241583
Now that the `CarV1Offset` is added to the header, the size is increased
from `32` to `40` bytes. Make it so in the test case name.
Define a constructor with sensible defaults, and the ability to
customize the defaults conveniently.

Implement `io.WriterTo` interface for both header and characteristics to
provide a consistent standard API for writing data into a given
`io.Writer`.
Implement a CAR v2 writer, that produces binary structure corresponding
to the specification of car V2, consisting of:
 1. Version prefix
 2. CAR v2 header
 3. Dump of Car v1
 4. Carbs index

The implementation also facilities optional padding before dump of CAR
v1 and Carbs index to provide scope for future optimisations. The
padding is defined as a dedicated type that writes zero-valued bytes for
a given padding size. The CAR v2 header, then captures the offsets from
the beginning of the file for both the CAR v1 dump and Carbs index. This
allows the user to quickly skip to the part of the CAR v2 they need as
well as offset alignment, if necessary, by altering the value of
padding.

Note, this is an intermediate implementation, and does not correctly
count the number of bytes written by the writer, pending the
transfer/refactor of Carbs implementation in this repo. In the meantime,
This implementation depends on Carbs as a go module. The implementation
of extensive writer tests is postponed until Carbs is transferred and
the written byte-count can be returned from the index marshaller.

Address review comments

- Avoid representing `Characteristics` as a pointer for easier casting
of memory regions.

- Remove anonymous fields in structs as a human readable way to document
what interfaces a struct implements.

- Simplify Header writing and written byte count calculation. The change
reduces the number of lines by assuming that `writeUint64To` will always
write 8 bytes.

- Avoid `_` in package names. Use `carv1` and `carv2` to name
corresponding packages.

- Rename `Marshal` in tests to reflect `WriteTo` related tests. The
method was renamed in the implementation but tests were not renamed.

- Write padding in bulk to reduce redundant memory allocation. Write
padding in bulks of `1024` bytes to reduce large memory allocation when
padding itself is large.

- Avoid using # in docs. Use go syntax instead, i.e. `.`.

- Remove redundant type in constants, since it is implicitly converted.

- Simplify over-refactored prefix write Since it is called only once.

- Avoid `Header` pointer, since its size is small and this would reduce
unnecessary allocations.

- Use buffer in writer to store encoded Car V1. This is to avoid
reallocation of bytes buffer.

- Add TODO re optimisation of index generation. The current
implementation reads the entire CAR v1 into memory in order to index it,
because `carbs` API requires `io.Reader`. Once `a`carbs` is incorporated
into this repo consider refactoring the API to make this a streaming
operation that avoids copying all the bytes since CAR v1 can be large.

- Remove a dedicated var for empty characteristics, since, we assume the
default to be all zero, and it is easy enough to construct a zero valued
Characteristics.

- Unexport PrefixBytesSize, because we can, and if it is public it might
have to stay public forever. So, unexport until we know we need it
exported.

- Remove `.Size()` on CAR v2 structs. Because we have the constant for
them and don't want to expose them twice.

- Use CamelCase for sub-test naming. This is to establish CamelCase for
sub-test naming as the convention for this repo since it keeps the test
names consistent in code and in the output ot `go test`.

- Unexport padding as a type since it is only used internally
The implementation comes from here:
- https://github.com/willscott/carbs

The rationale for copying the code here is to be able to taylor the
indexing API to the needs of CAR v2 in one place.
Remove unused structs, convert error messages to lower case and remove
redundant `return` statements.

Address review comments

- Reoder imports using `gofumpt`.

- Use consistent import alias for `carv1`.

- Rename structs for better readability.

- Add TODO to fix logging, tests, etc.

- Move test related files to `testdata`.
Because `SectionReader` requires the max number of bytes to read, since
it implements `Seek`. We need something like the `SectionReader` to read
the index at the end of a CAR v2 that does not require the user to know
the number of readable bytes. This is because, we do not store the size
of index in CAR v2 header, since it is always added as the last section.

The `OffsetReader` works just like `SectionReader`, except if `n`, the
number of bytes to read, is set to zero it simply carries on reading
until the underlying `io.ReaderAt` returns EOF. Consequently,
`OffsetReader` does not implement `Seek`.
Implement a CAR v2 reader that allows access to each section of the CAR,
i.e. CAR v1 dump and Index, as well as a higher level API that provides
a `blockstore.BlockStore` from a CAR v2 file.

Address Review comments

- Simplify naming of section size constants
- Simplify OffsetReader by removing the dual SectionReader functionality
- Improve read efficiency by reading header in one chunk
- Add TODOs to improve write efficiency in a similar manner
Make the copied carbon code part of the go-car/v2 module, and address
`staticcheck` issues.

Move README content from the original implementation into `doc.go`.
Remove the carbs copied without history
partial

finish interface

license / readme

add cli for hydration

indirection

extensible indexes

fixed int sizes

additional typed int

expose car Roots

fix issues with index restoration

add gob-based hash index

typo in util/hydrate

Create go.yml

Create codeql-analysis.yml

Create dependabot.yml

work on testing round trip

test passes

mmap idx

fix issue in sorted index gets

add progress bar
Bumps [github.com/ipfs/go-ipld-cbor](https://github.com/ipfs/go-ipld-cbor) from 0.0.4 to 0.0.5.
- [Release notes](https://github.com/ipfs/go-ipld-cbor/releases)
- [Commits](ipfs/go-ipld-cbor@v0.0.4...v0.0.5)

Signed-off-by: dependabot[bot] <[email protected]>
Bumps [github.com/cheggaaa/pb/v3](https://github.com/cheggaaa/pb) from 3.0.5 to 3.0.7.
- [Release notes](https://github.com/cheggaaa/pb/releases)
- [Commits](cheggaaa/pb@v3.0.5...v3.0.7)

Signed-off-by: dependabot[bot] <[email protected]>
Bumps [github.com/cheggaaa/pb/v3](https://github.com/cheggaaa/pb) from 3.0.7 to 3.0.8.
- [Release notes](https://github.com/cheggaaa/pb/releases)
- [Commits](cheggaaa/pb@v3.0.7...v3.0.8)

Signed-off-by: dependabot[bot] <[email protected]>
Bumps [github.com/ipld/go-car](https://github.com/ipld/go-car) from 0.2.2 to 0.3.0.
- [Release notes](https://github.com/ipld/go-car/releases)
- [Commits](v0.2.2...v0.3.0)

Signed-off-by: dependabot[bot] <[email protected]>
masih and others added 15 commits July 16, 2021 14:56
Remove dependency to `bufio.Reader` in internal `carv1` package that
seems to be mainly used for peeking a byte to return appropriate error
when stream abruptly ends, relating to #36. This allows simplification
of code across the repo and remove all unnecessary wrappings of
`io.Reader` with `bufio.Reader`. This will also aid simplify the
internal IO utilities which will be done in future PRs. For now we
simply remove dependency to `bufio.Reader`

See:
- #36
Implement the ability to generate index from a CARv1 payload given only
an `io.Reader`, where the previous implementation required
`io.ReadSeeker`. The rationale is to be minimal in what we expect in the
API, since index generation from a CARv1 payload never need to rewind
the reader and only moves forward in the stream.

Refactor utility IO functions that convert between types in one place.
Implement constructor functions that only instantiate wrappers when the
passed argument does not satisfy a required interface.

Fixes:
- #146

Relates to:
- #145
Unexport the CARv2 writer API until it is reimplemented using WriterAt
or WriteSeeker to be more efficient. This API is not used anyway and we
can postpone releasing it until SelectiveCar writer API is figured out.
For now we unexport it to carry on with interface finalization and alpha
release.
All the "New" APIs take IO interfaces, so they don't open any file by
themselves. However, the APIs that take a path to disk and return a
blockstore or a reader need closing, so the prefix "Open" helps clarify
that.

Plus, it makes names more consistent.
Add tests that asserts:
 - when padding options are set the payload is as
expected
 - when finalized, blockstore calls panic
 - when resumed from mismatching padding error is as expected
 - when resumed from non-v2 file error is as expected

Remove redundant TODOs in code
Use both CID v0 and v1 in testing `ReadWrite` blockstore.

Use consistent import package name `merkledag` across tests.
Fix an issue where single width index not finding a given key returns
`0` offset instead of `index.ErrNotFound`.

Reflect the changes in blockstore to return appropriate IPFS blockstore
error.

Add tests that asserts error types both in index and blockstore
packages.

Remove redundant TODOs.

Relates to:
- #158
Remove the duplicate test CARv1 file that existed both in `testdata` and
`blockstore/testdata` in favour of the one in root.

Reflect change in tests.

Duplicity checked using `md5` digest:

```sh
$ md5 testdata/sample-v1.car
MD5 (testdata/sample-v1.car) = 14fcffc271e50bc56cfce744d462a9bd

$ md5 blockstore/testdata/test.car
MD5 (blockstore/testdata/test.car) = 14fcffc271e50bc56cfce744d462a9bd
```
Improve reader type conversion by checking if type satisfies
ReaderAt to avoid unnecessary wrapping.

Move io converters into one place.

Fixes:
- #145
We've agreed to have unified options, since many will be shared between
the root and blockstore packages.

Include docs, and update the tests.
Add examples that show how to read and write an index to/from file.

Test marshalling and unmarshalling index files.

Run `gofumt` on repo.
Remove the memory intensive CARv2 writer implementation since it needs
to be reimplemented anyway and not used. We will release writers if
needed when it comes to SelectiveCar writer implementation. Just now it
is not clear if we want a CARv2 writer that works with the deprecated
`format.NodeGetter`.

Add tests for V1 wrapping functionality.

Reformat code in repo for consistency using `gofumpt`.
Rename terminology to match what the spec uses to describe the inner
CARv1 payload, i.e. Data payload.

Update docs to use CARv1 and CARv2 consistently.

Fix typo in Options API.

See:
- https://ipld.io/specs/transport/car/carv2/
@mvdan
Copy link
Contributor Author

mvdan commented Jul 16, 2021

Please do a merge commit when merging this, because it's a lot of commits and we want the merge to be recorded.

@masih masih requested review from willscott and masih July 16, 2021 13:59
@masih masih added this to the CARv2 v2.0.0 milestone Jul 16, 2021
Copy link
Member

@masih masih left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

If appendage is needed there is already `car.AttachIndex`.
@mvdan
Copy link
Contributor Author

mvdan commented Jul 16, 2021

We are continuing to send PRs to wip-v2 while this is waiting for a final ACK.

Copy link
Member

@willscott willscott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm okay with merging. I believe we also got an implicit acknowledgement/approval from @Stebalien.

@mvdan
Copy link
Contributor Author

mvdan commented Jul 16, 2021

Great, thank you! We have two other PRs in flight targeting wip-v2, so we'll merge those soon and then merge this.

masih and others added 3 commits July 16, 2021 17:47
It's two lines of code extra and we already have read from and write to
APIs for index.

Remove `hydrate` cli since the assumptions it makes about appending
index are no longer true. header needs updating for example. Removing to
be re-added when needed as part of a propper CARv2 CLI.
Implement examples that show how to open a read-only blockstore, and a
read-write blockstore along with resumption from the same file.

The example surfaced a race condition where if the `AllKeysChan` is
partially consumed and file is closed then the gorutie started by chan
population causes the race condition. Locking on Close will make the
closure hang. Better solution is needed but choices are limited due to
`AllKeysChan` signature.

Fixes:
- #124
@mvdan mvdan merged commit 0ae4f9c into master Jul 16, 2021
@mvdan mvdan deleted the wip-v2 branch July 16, 2021 17:19
@aschmahmann aschmahmann mentioned this pull request Dec 1, 2021
80 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants