Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update go-ipld-git to a go-ipld-prime codec #46

Merged
merged 22 commits into from
Aug 12, 2021
Merged

Update go-ipld-git to a go-ipld-prime codec #46

merged 22 commits into from
Aug 12, 2021

Conversation

willscott
Copy link
Contributor

@willscott willscott commented Feb 23, 2021

The existing go-ipld-git codec allows for translation between git file data, and go-ipld-format node representations of the IPLD dag. This work attempts to instead provide a go-ipld-prime representation of that dag.

  • Decoder codec (git -> ipld-prime)
  • Encoder code (ipld-prime -> git)
  • Round-trip tests

fix #45

@willscott willscott self-assigned this Feb 23, 2021
@willscott willscott marked this pull request as ready for review April 1, 2021 00:19
@willscott willscott requested a review from mvdan April 2, 2021 16:30
Copy link
Contributor

@mvdan mvdan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM, though take it with a grain of salt as this is the first time I look into git object formats. I also assume that the existing test suite is extensive enough to give us confidence of no regressions.

unmarshal.go Outdated Show resolved Hide resolved
blob.go Outdated Show resolved Hide resolved
blob.go Outdated Show resolved Hide resolved
personinfo.go Outdated Show resolved Hide resolved
personinfo.go Outdated Show resolved Hide resolved
tree.go Outdated Show resolved Hide resolved
_, ok := t.entries[p]
if !ok {
return nil
t := Type.Tree__Repr.NewBuilder()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I follow why we have a Tree__Repr builder here, and its result is built and assigned into a Tree builder, which is then again built. It seems like the other decode funcs follow the same pattern. Wouldn't a single Build per Decode be enough?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think this came out of tests:
the default is to use a basic.any as the assembler, and if that was directly used to make a list / map you end up with a top level basic node rather than the schema'd type that we have in this package.
When that's true, the decode path can't fast-path, and has to re assign that basic node into the schema type.

Copy link
Member

@warpfork warpfork Jul 30, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I agree with mvdan here -- it's a little odd to have a concrete reference like this in a method that is already accepting a NodeAssembler as a parameter. When there is a NodeAssembler as a parameter, I would usually expect it to just be used directly. (And maybe documentation can state something about "it is recommended to use a Tree__ReprAssembler here for best results".)

This doesn't seem dangerously incorrect as written, but does seem a little baroque. And it is probably doing a little more memory dancing than is necessary or optimal, too.

tree.go Outdated
return err
}
for {
node, err := DecodeTreeEntry(rd)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as before; why not pass the la.AssembleValue() assembler to the decoder and let it assign directly into it?

tree.go Outdated Show resolved Hide resolved

func cidToSha(c cid.Cid) []byte {
h := c.Hash()
return h[len(h)-20:]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, isn't this a bit wasteful if we only end up using part of the result? Is this to strip the multihash prefix? Might be good to document it a bit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right. this just uses the sha1 part of the multihash without the prefix to provide the git-compatible hash.

@willscott willscott mentioned this pull request May 19, 2021
@rvagg
Copy link
Member

rvagg commented Jul 21, 2021

@willscott @mvdan this codec changes all of the struct fields to camel-case, object -> Object, etc. Would you mind if we change it back for consistency with the old codec and the JS one? It's also closer to the native format's idea of naming as well. I can make the change here if you're both OK with that.

@rvagg
Copy link
Member

rvagg commented Jul 21, 2021

Some additional thoughts here where the differences in forms can be seen from a pathing perspective: ipfs/kubo#8287

If you're OK with it, I'd like to try and get the forms here back to the original.

@mvdan
Copy link
Contributor

mvdan commented Jul 21, 2021

That sounds fine to me - I'm not sure if ipld-prime codegen expects capitalized names, for the sake of exported names in the generated Go code.

@willscott
Copy link
Contributor Author

I also have no problem with changing name case back for consistency

@rvagg
Copy link
Member

rvagg commented Jul 22, 2021

0436bc7 now has the field names matching the original (lower-case, and some other minor adjustments). tagType is unfortunate since we can't use type in the codegen. Tree is now a map too so you can look up by directory name as per the original codec, which lets us path the same way back in the sharness test.

I think we need to do something about this strict need to match Go conventions on these names since it's not always going to match what you want your data model form to look like.

@mvdan
Copy link
Contributor

mvdan commented Jul 22, 2021

Out of curiosity, would it be enough to add "repr" mapping of field names so that the repr layer uses lowercase names?

Aside from that, I wonder if it's a bit weird for the schema to have the field names be lowercase, but the type names be uppercase.

@rvagg
Copy link
Member

rvagg commented Jul 22, 2021

Out of curiosity, would it be enough to add "repr" mapping of field names so that the repr layer uses lowercase names?

Sadly, no, because it's pathing at the data model layer that we want to care about, not encoding. Having renames is only going to help if we cared about these objects going into a standard codec (dag-cbor, dag-json), otherwise they're kind of useless here.

@rvagg
Copy link
Member

rvagg commented Jul 22, 2021

I wonder if it's a bit weird for the schema to have the field names be lowercase, but the type names be uppercase.

This seems fine to me, it actually looks pretty reasonable to my eyes (eyes that have been trained on C-family-style and aren't as accustomed to the upper-case first character rule, granted). Have a look at the schemas that Hannah wrote up in https://github.com/ipfs/go-unixfsnode/blob/main/data/gen/main.go (although the schema field names don't match what's being generated, but you can see the same thing going on there).

@mvdan
Copy link
Contributor

mvdan commented Jul 22, 2021

SGTM, thanks for confirming on both accounts. I've replied to ipld/go-ipld-prime#206 :)

rvagg added a commit to ipld/go-ipld-prime that referenced this pull request Jul 26, 2021
Found in ipfs/go-ipld-git#46 integrating into go-ipfs
which wants to load typed nodes when it encounters typed links. It errors
because it's trying to load the typed node into the type of the typed link
rather than the referenced type.
rvagg added a commit to ipld/go-ipld-prime that referenced this pull request Jul 29, 2021
Found in ipfs/go-ipld-git#46 integrating into go-ipfs
which wants to load typed nodes when it encounters typed links. It errors
because it's trying to load the typed node into the type of the typed link
rather than the referenced type.
@rvagg
Copy link
Member

rvagg commented Jul 29, 2021

@mvdan @willscott I've brought this up to what I think is a mergable state:

  • rebased on master - which brought in a bunch of stuff to get it working with the PL actions workflows
  • updated the README
  • done some minor jiggering of the types to match the README, and therefore closer to the original, (one rename of "text" to "message" and a shuffle of some of the struct fields, which mostly shouldn't have any impact on anything).
  • brought back the JSON tests using dag-json
  • updated to latest go-ipld-prime which fixes the typed link bug

and a few other minor things.

The main difference that I'm in two minds about is date handling - we're pulling out date and timezone and keeping them faithfully in their original format so round-trips work properly. The original codec presented them as a nice stringified date but with the original values kept internally.
An approach I've toyed with in the Bitcoin codec (in JS at least and I was thinking of doing the same in Go) was to do both - keep originals and present the nice usable forms. It just raises questions about what happens when/if they get out of sync. Two possible approaches would be: (1) error if they're out of sync when serializing, or (2) ignore the usable forms entirely when serializing.

For now I think it'd be fine to just leave this as is and leave it to the user to reform the stringified version if they want it, we could add back in a utility function to do that if needed, or evolve this code further in the future.

IMO this can be merged now.

Copy link
Contributor Author

@willscott willscott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all of these changes, mostly the lower-casing of fields, looks good to me.

We should get some sort of nod from stewards before merging, i think?

gen/gen.go Outdated
}, schema.SpawnStructRepresentationStringjoin(" ")))
schema.SpawnStructField("date", "String", false, false),
schema.SpawnStructField("timezone", "String", false, false),
schema.SpawnStructField("email", "String", false, false),
schema.SpawnStructField("name", "String", false, false),
}, schema.SpawnStructRepresentationMap(map[string]string{})))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why did you change this to a map rather than a stringjoin?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because it's not being used as a stringjoin in any way? We have a custom PersonInfo#GitString() that's used to encode it in the git format and parsePersonInfo() to parse the input string for decoding. Stringjoin here is only going to mean that when it's encoded with dag-cbor or dag-json that it comes out as <date> <timezone> <email> <name>, which IMO is lossy and probably not what we're wanting out of this. Unless I'm missing something?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

<date> <timezone> <email> <name> is the git format though. I would hope that wouldn't be lossy.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's $name <$email> $maybedate $maybetimezone - it's lossy because the email has a <> around it and we can use the < to delineate the name which itself can contain any number of spaces, so simple stringjoin isn't enough

Copy link
Member

@warpfork warpfork Jul 30, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with Rod's take here: this seems better to do at the codec layer and yield up to data model as a map.

If I was going to transliterate some git data to dag-json, I'd probably want this to become a map in dag-json. So that means the codec should do this parse/transform immediately.

Copy link
Contributor

@mvdan mvdan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

type GpgSig string

type PersonInfo struct {
date String
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since you're not keeping dates in a human-friendly single string format, I wonder if you could make this unix timestamp an Int too.

Copy link

@aschmahmann aschmahmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly looks good to me, left a few comments/questions. I'm sure you folks have much greater context on the go-ipld-prime usage than I do.

It looks like there are some breaking changes to the data output by this codec. We should be dropping the list of breaking changes into the release tag for when we release this so I think it'd be great if we could writeup the list of breaking changes to the codec (at the spec level) for anyone who happens to be affected. For the purposes of review having an explanation of why a breaking change was made would also be appreciated.

I don't know if we have guidelines on how codec names should match to the internal names of the format we're representing with IPLD, but I'd suspect in general we'd try to match the native names.

README.md Show resolved Hide resolved
Comment on lines +64 to +65
"date": "1503667703",
"timezone": "+0200",

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the idea here that this is easier to work with an a single object? It doesn't seem to match how the Git spec handles things internally. Although perhaps it's fair for us to try and make things easier for our users here.

I was talking with @Stebalien about this and as far as we can tell there hasn't been much tooling developed around this codec so making a breaking change that helps people out and seems sane is probably fine.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(delayed reply to this) the main idea here is to keep the data lossless and not do too many tricks to make it palatable to the user.

Encoded form looks something like:

author A U Thor <[email protected]> 1465981137 +0000

But there's even some flexibility in what those final two fields can contain and apparently whether we get two, one or zero of them (I'm not sure about that detail though). So we treat them as strings and present them to the user as they are.

The current version of this codec keeps them internally but only presents a nice human-readable form as a DAG node, which isn't going to work when we're treating DAG traversal as traversal over the data model like we're doing with ipld-prime.

We did discuss (in this thread and other places) some options for making the human-readable form emerge out of the data to be available, but @willscott made the case that since this codec is a bit of a reference codec that we'd want to point others to for how to implement one that we'd better keep it simple for now.

So, lossless data, pure data model, no trickery, user gets to interpret the nodes as they want.

README.md Outdated Show resolved Hide resolved
Copy link
Member

@warpfork warpfork left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Big picture, LGTM.

+1 big happy on the fix to capitizalization (field names being downcase). That makes more sense to me.

@willscott
Copy link
Contributor Author

Making sure we've got all the loose ends covered here:

  • i looked at the at date handling code again. I think i prefer the current splitting out of timezone in the data model, and i believe it actually is backwards compatible - previous codec encodings will have the full date (with timezone) in their date field, and the prime codec will take that encoding and produce the correct git representation of that data.
  • We should see if we can clean up the the tagType field nomenclature per Field name restrictions tied too closely to Go language rules ipld/go-ipld-prime#206 (comment)

@aschmahmann
Copy link

Yes, and a review for other breaking changes. mergeTag seems like a breaking change that seems like something we can just restore. I'm not sure if there are others at the moment.

Posting what the release notes for this release would look like would also be great since as soon as this is merged we're going to cut a release anyway and that release is going to need release notes 😄.

@rvagg
Copy link
Member

rvagg commented Jul 31, 2021

Have fixed up tagType to type thanks to Eric's hint, and mergeTag is now mergetag, plus a few more readability and clarity tweaks.

Original code has a MergeTag as well as a Tag with the main difference that one has a text field and the other a message field but otherwise they are the same. This one just has Tag now with message which I think is more in line with the git model. That'll be a breaking change from a DAG traversal perspective so we should note it in the release notes.

Will put together some basic release notes next week.

* mergeTag -> mergetag
* tagType -> type
@rvagg
Copy link
Member

rvagg commented Aug 2, 2021

Release notes:


BREAKING CHANGES

  • go-ipld-git is now a go-ipld-prime IPLD codec. Use Decode(na ipld.NodeAssembler, r io.Reader) error and Encode(n ipld.Node, w io.Writer) error for direct use if required.
  • There is now only one Tag type, MergeTag has been removed which had a text property. Use Tag's message property instead to retrieve the tag message from a commit's mergetag. i.e. <commit>/mergetag/message instead of <commit>/mergetag/text.
  • PersonInfo no longer exposes the human-readable RFC3339 format date field as a DAG node. The date and timezone fields are kept as their original string forms (to enable precise round-trips) as they exist in encoded Git data. e.g. <commit>/author/date now returns seconds in string form rather than an RFC3339 date string. Use this value and <commit>/author/timezone to reconstruct the original if needed.

@warpfork
Copy link
Member

warpfork commented Aug 3, 2021

Might I suggest we drop that change info into a CHANGELOG.md file right in the diff? PRs disappear from memory quickly and aren't easily indexed over.

(I'm pretty happy with the loose format in go-ipld-prime/CHANGELOG.md#released-changes, if having an outline to riff from greases the wheels. :))

@warpfork
Copy link
Member

warpfork commented Aug 3, 2021

(Informational note, no action required: I found myself nerdsniped by mergetag. Yes, that's a thing. Yes, keeping it alllowercase is correct. Docs here: http://github.com/git/git/blob/master/Documentation/technical/signature-format.txt .)

@rvagg
Copy link
Member

rvagg commented Aug 4, 2021

Have added a changelog and fixed the readme bork.

Normally I don't like changelogs because it's too easy for them to get out of sync, but recently I've adopted the release-on-every-merge practice in JS and it includes both tag + github release + changelog updates and all I need to do is make sure my commit messages are informative and standardised, e.g. https://github.com/rvagg/cborg/blob/master/CHANGELOG.md & https://github.com/rvagg/cborg/releases & https://www.npmjs.com/package/cborg all from pressing the merge button. Can recommend, although Go folks have other things to worry about in the release process I suppose.

@warpfork
Copy link
Member

warpfork commented Aug 4, 2021

Sweet. I think this is officially ready for the mergeparty (later this week? early next week?) where we land a bunch of these similar PRs in other repos at once.

Comment on lines +17 to +20
func init() {
mc.RegisterEncoder(cid.GitRaw, Encode)
mc.RegisterDecoder(cid.GitRaw, Decode)
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this necessary for functioning, or just a convenience function for updating the global registry via module imports?

@warpfork I recall your aversion to global registries. Does encouraging codec creators to do this, but codec users to use custom LinkSystems when needed seem like a reasonable compromise?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i believe this was the compromise we ended up at, yes

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, that's exactly it.

Global registries have downsides, but using package-init-time's natural synchronization is just too valuable. And having ways to opt-out of using the global registry, if some really needs to, seems to mitigate the biggest downsides handily. So: 👍

@rvagg
Copy link
Member

rvagg commented Aug 10, 2021

I have created a draft release & tag for this, v0.1.0: https://github.com/ipfs/go-ipld-git/releases

I'm working on some assumptions here about Go (and PL) release practices that may not be quite correct. So someone correct me in the next day or two before I press merge and create the release & tag. I would prefer to use semver properly and do a v1.0.0 but I don't think we're in the habit of doing that with Go so I'm assuming v0.1.0 is appropriate, a tag is sufficient and a GitHub release is a bonus.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Create IPLD-Prime codec that can surplant the ipld-format functionality in this repo
5 participants