Target fails with chr(0) in issue body #44

Closed
laurentS opened this issue Nov 10, 2021 · 5 comments · Fixed by #46

Comments

@laurentS
Contributor

I've run into an annoying bug with target-postgres (datamill variant) where it crashes because the unicode char \x00 is sent to postgresql, causing it to reject the data.

The reason I'm reporting this here is that the data comes from this tap, and I'm not entirely sure what to do about it.

To reproduce:

  • run tap-github to fetch this issue: kv/kvserver: TestStoreRangeMergeConcurrentRequests failed cockroachdb/cockroach#68703 (you can get to the bug faster by hardcoding the issue number in the IssuesStream url, so as to only get this one issue)
  • The body of this issue contains a weird character (see the screenshot from GitHub below), which curl returns as \u0000 and Python requests as \x00 (this might be the first time I've written an issue report that could itself trigger the bug 🤯)
    [screenshot of the issue body showing the stray character]
  • Through the various encoding/decoding steps, this character ends up being sent to postgresql, which chokes on it and throws psycopg2.errors.CharacterNotInRepertoire: invalid byte sequence for encoding "UTF8": 0x00

On the one hand, I feel like the target should clean the data before sending it to postgres, but the same reasoning can apply to the tap. I'm not sure what the best practice is here. The interface is JSON, and I found no evidence that \x00 is illegal in JSON, as long as it's escaped.
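For what it's worth, a quick standalone check (not code from the tap or the target, just an illustration) confirms the escaped form is valid JSON:

    import json

    # chr(0) is legal inside a JSON string as long as it is escaped
    payload = json.dumps({"body": "raw log\x00with a NUL"})
    print(payload)  # {"body": "raw log\u0000with a NUL"}
    assert json.loads(payload)["body"] == "raw log\x00with a NUL"  # round-trips fine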

Any thoughts?

@aaronsteers
Contributor

aaronsteers commented Nov 10, 2021

@laurentS - Just my two cents here. We had another encoding-based issue come up today in slack.

If we expect data to be loaded into a string field, I think it's probably healthy to coerce values to utf-8 and define some type of escaping for values that can't otherwise be translated to a utf-8 string.

Alternatively, there's a JSON Schema spec for alternative encodings like BASE64, but I don't think that's as generically useful as applying utf-8 upfront. https://json-schema.org/understanding-json-schema/reference/non_json_data.html#id2

The potential downside I can see of coercing to utf-8 is that the validation+escaping process may be computationally expensive. It might make sense to make this an optional config option, similar to some prior conversations around optionally validating each record against the JSON schema as an opt-in behavior at runtime.

If we build this into the SDK, it could potentially be implemented either at the tap level or at the target level. You are correct that json itself will allow any encodings, but the tap may want to assert a specific encoding type, and the target may actually require it in order to serialize the records.
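Roughly what I'm picturing, as a hypothetical sketch (the helper name and where it would hook into the SDK are made up here, not an existing API):

    def coerce_record_strings(record: dict) -> dict:
        """Hypothetical helper: force all string fields to clean UTF-8."""
        for key, value in record.items():
            if isinstance(value, str):
                # drop anything that can't be represented in UTF-8, plus NUL,
                # which some targets (e.g. postgres) reject even when escaped
                cleaned = value.encode("utf-8", "ignore").decode("utf-8")
                record[key] = cleaned.replace("\x00", "")
        return record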

@ericboucher
Contributor

ericboucher commented Nov 10, 2021

@laurentS @aaronsteers In this case, I'm pretty confident that these errors will be super localized and "only" happen in issue_body, issue_comment or repo_readme.

As a fix at the tap-github level, what do you think of doing something simple like this (e.g. for IssuesStream):

    def post_process(self, row: dict, context: Optional[dict] = None) -> dict:
        row["type"] = "pull_request" if "pull_request" in row else "issue"
        # interesting part below:
        row["body"] = row["body"].decode('utf-8', 'ignore')
        return row

@aaronsteers
Contributor

@ericboucher - Looks good from my side. If you can test on the problem row(s) and it works correctly, that looks like it would do the job. 👍

@laurentS
Contributor Author

I was just doing that, but no luck: AttributeError: 'str' object has no attribute 'decode'

On the other hand, row['body'] = row['body'].encode('utf-8') seems to work on the case at hand. The resulting data in postgres has the faulty character removed, but I think this is acceptable. I will let the tap run on a longer list of issues and report back if I see any other problems.
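If the encode call turns out not to catch everything, an explicit replace would be a more targeted fallback (an untested sketch of the same post_process as above, not something I've run against the failing row yet):

    def post_process(self, row: dict, context: Optional[dict] = None) -> dict:
        row["type"] = "pull_request" if "pull_request" in row else "issue"
        # drop the NUL character explicitly; postgres rejects 0x00 in text columns
        if row.get("body"):
            row["body"] = row["body"].replace("\x00", "")
        return row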

@laurentS
Contributor Author

> @laurentS @aaronsteers In this case, I'm pretty confident that these errors will be super localized and "only" happen in issue_body, issue_comment or repo_readme.

Actually, repo_readme is base64 encoded on all of the nearly 1k repos I fetched while testing, so I think that one is safe.
My impression of the cause of the error is users copy/pasting raw error logs into issues, maybe from some source with a different encoding. So I guess we won't be seeing this beyond the issue bodies.
