-
Notifications
You must be signed in to change notification settings - Fork 34
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Target fails with chr(0) in issue body #44
Comments
@laurentS - Just my two cents here. We had another encoding-based issue come up today in slack. If we expect data to be loaded into a string field, I think it's probably healthy to coerce values to utf-8 and define some type of escaping to utf-8 string if a field may not otherwise be translated. Alternatively, there's a JSON Schema spec for alternative encodings like BASE64, but I don't think that's as generically useful vs applying utf-8 upfront. https://json-schema.org/understanding-json-schema/reference/non_json_data.html#id2 The potential downside I can see of coercing to utf-8 is that the validation+escaping process may be expensive computationally. It might make sense to have this as an optional config option, similar to some prior conversations around optionally validating each record against the json schema as a opt-in behavior at runtime. If we build this into the SDK, it could potentially be implemented either at the tap level or at the target level. You are correct that json itself will allow any encodings, but the tap may want to assert a specific encoding type, and the target may actually require it in order to serialize the records. |
@laurentS @aaronsteers In this case, I have a pretty high confidence that these errors will be super localized and "only "happen in issue_body, issue_comment or repo_readme. As a fix at the tap-github level, what do you think of doing something simple like this (eg. for IssuesStream):
|
@ericboucher - Looks good from my side. If you can test on the problem row(s) and it works correctly, that looks like it would do the job. 👍 |
I was just doing that, but no luck: On the other hand |
Actually, repo_readme is base64 encoded on all of the nearly 1k repos I fetched while testing, so I think that one is safe. |
I've run into an annoying bug with target-postgres (datamill variant) where it crashes because a unicode char \x000 is sent to postgresql, causing it to reject the data.
The reason I'm reporting this here is that the data comes from this tap, and I'm not entirely sure what to do about it.
To reproduce:
tap-github
to fetch this issue: kv/kvserver: TestStoreRangeMergeConcurrentRequests failed cockroachdb/cockroach#68703 (you can get to the bug faster by hardcoding the issue number in theIssuesStream
url, so as to only get this one issue)\u0000
and python requests as\x00
(this might be the first time I write an issue report which might cause the bug to happen again 🤯 )psycopg2.errors.CharacterNotInRepertoire: invalid byte sequence for encoding "UTF8": 0x00
On the one hand, I feel like the target should clean the data before sending it to postgres, but the same reasoning can apply to the tap. I'm not sure what the best practice is here. The interface is JSON, and I found no evidence that
\x00
is illegal in JSON, as long as it's escaped.Any thoughts?
The text was updated successfully, but these errors were encountered: