-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
evmnfts, evmtokens: utf-8-clean strings from chain #568
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you!
Worth mentioning why we're implementing this PR:
We found invalid metadata JSON in the wild (in particular, it was invalid UTF-8 in that it contained \x00
, but the Go JSON parser didn't balk at that, but Postgres does).
Hope I got that right :)
analyzer/evmnfts/evm_nfts.go
Outdated
nftData.Name, | ||
nftData.Description, | ||
nftData.Image, | ||
// If SanitizeString can affect JSON validity, we're doomed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
:(
It does seem low risk.
We could have another wrapper, ensureValidJSON()
, that returns the input as-is if it's valid JSON, otherwise logs a warning and returns {}
or null
. Your call on whether that's warranted. I think you already made the call with this comment :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok turns out the round trip through json unmarshal+marshal is enough to filter out bad UTF-8 and NUL bytes. updated the comment. I'll keep the SanitizeStringP here out of principle
e2e regression flaked, http 500 from ipfs gateway |
oh clarification about @mitjat 's explanation: \x00 NUL bytes are valid UTF-8, but they're not allowed in PostgreSQL due to a technical limitation. incidentally, they're also not allowed in JSON, as strings aren't allowed to contain control characters (\uxxxx escapes of them are allowed), and the grammar doesn't allow them anywhere else either. |
values that come from chain: use our SanitizeString, as they may contain postgresql-displeasing nul characters
edit: as well as values that come from servers 🙄