Target fails with chr(0) in issue body #44

Closed
laurentS opened this issue Nov 10, 2021 · 5 comments · Fixed by #46

Comments

@laurentS
Contributor

I've run into an annoying bug with target-postgres (datamill variant) where it crashes because the unicode char \x00 is sent to postgresql, causing it to reject the data.

The reason I'm reporting this here is that the data comes from this tap, and I'm not entirely sure what to do about it.

To reproduce:

  • run tap-github to fetch this issue: kv/kvserver: TestStoreRangeMergeConcurrentRequests failed cockroachdb/cockroach#68703 (you can get to the bug faster by hardcoding the issue number in the IssuesStream url, so as to only get this one issue)
  • The body of this issue contains a weird character (see the screenshot from GitHub below), which curl returns as \u0000 and Python requests as \x00 (this might be the first time I've written an issue report that could itself trigger the bug 🤯)
    [screenshot of the issue body showing the stray character]
  • Through the various encoding/decoding steps, this character ends up being sent to postgresql, which chokes on it and throws psycopg2.errors.CharacterNotInRepertoire: invalid byte sequence for encoding "UTF8": 0x00

On the one hand, I feel like the target should clean the data before sending it to postgres, but the same reasoning can apply to the tap. I'm not sure what the best practice is here. The interface is JSON, and I found no evidence that \x00 is illegal in JSON, as long as it's escaped.
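For what it's worth, a quick standalone check (not code from the tap or the target, just an illustration) confirms the escaped form is valid JSON:

    import json

    # chr(0) is legal inside a JSON string as long as it is escaped
    payload = json.dumps({"body": "raw log\x00with a NUL"})
    print(payload)  # {"body": "raw log\u0000with a NUL"}
    assert json.loads(payload)["body"] == "raw log\x00with a NUL"  # round-trips fine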

Any thoughts?

@aaronsteers
Contributor

aaronsteers commented Nov 10, 2021

@laurentS - Just my two cents here. We had another encoding-based issue come up today in slack.

If we expect data to be loaded into a string field, I think it's probably healthy to coerce values to utf-8 and define some type of escaping for values that can't otherwise be translated to a utf-8 string.

Alternatively, there's a JSON Schema spec for alternative encodings like BASE64, but I don't think that's as generically useful as applying utf-8 upfront. https://json-schema.org/understanding-json-schema/reference/non_json_data.html#id2

The potential downside I can see of coercing to utf-8 is that the validation+escaping process may be computationally expensive. It might make sense to make this an optional config option, similar to some prior conversations around optionally validating each record against the JSON schema as an opt-in behavior at runtime.

If we build this into the SDK, it could potentially be implemented either at the tap level or at the target level. You are correct that json itself will allow any encodings, but the tap may want to assert a specific encoding type, and the target may actually require it in order to serialize the records.
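Roughly what I'm picturing, as a hypothetical sketch (the helper name and where it would hook into the SDK are made up here, not an existing API):

    def coerce_record_strings(record: dict) -> dict:
        """Hypothetical helper: force all string fields to clean UTF-8."""
        for key, value in record.items():
            if isinstance(value, str):
                # drop anything that can't be represented in UTF-8, plus NUL,
                # which some targets (e.g. postgres) reject even when escaped
                cleaned = value.encode("utf-8", "ignore").decode("utf-8")
                record[key] = cleaned.replace("\x00", "")
        return record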

@ericboucher
Contributor

ericboucher commented Nov 10, 2021

@laurentS @aaronsteers In this case, I'm pretty confident that these errors will be super localized and "only" happen in issue_body, issue_comment or repo_readme.

As a fix at the tap-github level, what do you think of doing something simple like this (e.g. for IssuesStream):

    def post_process(self, row: dict, context: Optional[dict] = None) -> dict:
        row["type"] = "pull_request" if "pull_request" in row else "issue"
        # interesting part below:
        row["body"] = row["body"].decode('utf-8', 'ignore')
        return row

@aaronsteers
Contributor

@ericboucher - Looks good from my side. If you can test on the problem row(s) and it works correctly, that looks like it would do the job. 👍

@laurentS
Contributor Author

I was just doing that, but no luck: AttributeError: 'str' object has no attribute 'decode'

On the other hand, row['body'] = row['body'].encode('utf-8') seems to work on the case at hand. The resulting data in postgres has the faulty character removed, but I think this is acceptable. I will let the tap run on a longer list of issues and report back if I see any other problems.
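If the encode call turns out not to catch everything, an explicit replace would be a more targeted fallback (an untested sketch of the same post_process as above, not something I've run against the failing row yet):

    def post_process(self, row: dict, context: Optional[dict] = None) -> dict:
        row["type"] = "pull_request" if "pull_request" in row else "issue"
        # drop the NUL character explicitly; postgres rejects 0x00 in text columns
        if row.get("body"):
            row["body"] = row["body"].replace("\x00", "")
        return row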

@laurentS
Contributor Author

> @laurentS @aaronsteers In this case, I'm pretty confident that these errors will be super localized and "only" happen in issue_body, issue_comment or repo_readme.

Actually, repo_readme is base64 encoded on all of the nearly 1k repos I fetched while testing, so I think that one is safe.
My impression of the cause of the error is users copy/pasting raw error logs into issues, maybe from some source with a different encoding. So I guess we won't be seeing this beyond the issue bodies.
