
Destination S3: Correctly generate int64 values #17564

Closed
edgao opened this issue Oct 4, 2022 · 3 comments · Fixed by #23466
Assignees
Labels
needs-triage team/destinations Destinations team's backlog type/bug Something isn't working

Comments

@edgao
Contributor

edgao commented Oct 4, 2022

As noted in #14362 (comment), dest-s3's avro and parquet output formats currently use int32. This causes problems for users with int64 values. We should have dest-s3 produce int64 values for integer inputs.

We currently use int32 because we want to support union types, which behave oddly in edge cases. Consider this JSON schema:

{
  "oneOf": [
    {"type": "string", "airbyte_type": "timestamp_with_timezone"},
    {"type": "number", "airbyte_type": "integer"}
  ]
}

Ideally, this would resolve to an Avro union:

[
  {"type": "long", "logicalType": "timestamp-micros"},
  "long"
]

The problem is that Avro detects that both types are long under the hood, so this isn't actually a valid union type. Furthermore, the json<>avro converter behaves differently depending on the logical type:

  • {"created_at": 1634982000} can be parsed as "long", but not as {"type": "long", "logicalType": "timestamp-micros"}
  • {"created_at": "1634982000"} (note the quotes around 1634982000) can be parsed as {"type": "long", "logicalType": "timestamp-micros"}, but not as "long".
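To make the union problem concrete, here is a minimal sketch (not Airbyte or Avro library code) of the rule from the Avro spec: a union may not contain two branches with the same underlying type (named types like record/enum/fixed excepted), and a logicalType annotation does not change the underlying type.

```python
def branch_key(schema):
    """Return the base type Avro uses to distinguish union branches."""
    if isinstance(schema, dict):
        return schema["type"]  # logicalType is ignored for union resolution
    return schema

def is_valid_union(branches):
    keys = [branch_key(b) for b in branches]
    return len(keys) == len(set(keys))

proposed = [
    {"type": "long", "logicalType": "timestamp-micros"},
    "long",
]
print(is_valid_union(proposed))            # False: both branches are "long"
print(is_valid_union(["long", "string"]))  # True: distinct base types
```

This is why the "ideal" union above is rejected: both branches collapse to the same base type `long`.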

Options:

  • Decide that the union-typing is too niche a use case (maybe true) and stop supporting it
  • Otherwise, we need to:
    • Decide what the correct output type is (either long or timestamp)
    • Patch the converter library to parse values correctly, given that decision
@edgao edgao added type/bug Something isn't working needs-triage team/destinations Destinations team's backlog labels Oct 4, 2022
@shrodingers
Contributor

Hello! I wanted to open an issue for this but found this ticket. This also causes problems on destinations like Databricks (which is built on S3 Parquet), and it can make the data unsafe when ids are int64, since two distinct ids can then clash (Zendesk Support ids don't fit into 32 bits).
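An illustration of the clash (not Airbyte code; the id values are made up): a Zendesk-sized id does not fit in a signed 32-bit field, and keeping only the low 32 bits makes distinct ids collide.

```python
import struct

id_a = 9_123_456_789     # > 2**31 - 1, typical size for Zendesk ids
id_b = id_a + (1 << 32)  # a different id sharing the same low 32 bits

# Packing into a signed int32 fails outright...
try:
    struct.pack(">i", id_a)  # ">i" = big-endian signed 32-bit int
except struct.error as err:
    print("int32 overflow:", err)

# ...and if a writer silently truncated instead, the two ids would collide:
print((id_a & 0xFFFFFFFF) == (id_b & 0xFFFFFFFF))  # True
```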

Good to know that the issue is already known and being tracked. Good luck with it!

@grishick
Contributor

grishick commented Feb 9, 2023

Notes from grooming:

  • if a source declares a field as oneOf timestamp/integer, then the JSON-to-Avro converter should make it long
  • testing should include all staging destinations (Snowflake, BQ, BQ denormalized, Redshift, Databricks), S3 destination, and GCS destination
  • Update tests for V0 as well as V1 protocols
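The first grooming note can be sketched as follows. This is hypothetical, illustrative code, not Airbyte's actual converter: when every branch of a oneOf maps to the same underlying Avro type (here, long), collapse to that single type instead of emitting an invalid union of two longs.

```python
def option_to_avro(opt):
    """Map one oneOf branch to an Avro type (simplified for this sketch)."""
    if opt.get("airbyte_type") == "timestamp_with_timezone":
        return {"type": "long", "logicalType": "timestamp-micros"}
    if opt.get("airbyte_type") == "integer" or opt.get("type") == "integer":
        return "long"
    return "string"  # fallback for this sketch only

def base_type(avro):
    return avro["type"] if isinstance(avro, dict) else avro

def one_of_to_avro(schema):
    branches = [option_to_avro(o) for o in schema["oneOf"]]
    bases = {base_type(b) for b in branches}
    if len(bases) == 1:
        # e.g. timestamp + integer both resolve to long: emit plain "long"
        return bases.pop()
    return branches  # distinct base types form a legal union

schema = {"oneOf": [
    {"type": "string", "airbyte_type": "timestamp_with_timezone"},
    {"type": "number", "airbyte_type": "integer"},
]}
print(one_of_to_avro(schema))  # -> long
```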

@grishick
Contributor

There is a potential duplicate issue: #10240.
