-
Notifications
You must be signed in to change notification settings - Fork 189
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RecordBatchTransformer: Handle schema migration and column re-ordering in table scans #602
Conversation
I need to go back to the drawing board on this. The current implementation breaks when not all columns in the file are in the list of projected fields. |
OK, I've addressed the problem with projections other than the equiv of |
73c6724
to
bf6d2ef
Compare
@liurenjie1024 and @Xuanwo - ready for review when you get chance |
69e2f95
to
36a00fc
Compare
53a1267
to
ef3a9fc
Compare
@Xuanwo and @liurenjie1024: PTAL, I've rebased and refactored this to better handle the pass-through case. It would be great to get this merged - without it, I'm experiencing some annoying issues, as table scans against tables that have had new columns added where pre-existing data exists without the added column, and then had new data written in the new schema, results in record batches being streamed back that can have different schemas in the same stream, which is very difficult to deal with. |
ef3a9fc
to
e0a1ac8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for working on this!
Waiting for @liurenjie1024 to take another look. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @sdd for this high quality pr, looks great to me! I believe current approach works well without nested types, and we can add support for nested types later. Just one minor point about arrow
dependency, others look great!
… required but columns can remain unmodified
e0a1ac8
to
3c9bbd1
Compare
3c9bbd1
to
61d4bdc
Compare
Hi @liurenjie1024 - I addressed your comment around not importing the whole of Arrow. Additionally, I realised that there was an optimisation that could be made in the case that only the column names differed between the source and target schemas, so I've added code for that use case. Also, the schemas can contain metadata which, where present, would cause the equality comparison to fail, so I changed the equality check to only compare on the aspects of the schema that we're interested in (data type, nullability, and column name). This is now ready for re-review (FAO @Xuanwo for this also). Thanks! :-) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @sdd for the great work! Also thanks @liurenjie1024 for the review. Let's move.
…g in table scans (apache#602) * feat: Add skeleton of RecordBatchEvolutionProcessor * feat: Add initial implementation of RecordBatchEvolutionProcessor * feat: support more column types. Improve error handling. Add more comments * feat(wip): adress issues with reordered / skipped fields * feat: RecordBatchEvolutionProcessor handles skipped fields in projection * chore: add missing license header * chore: remove unneeded comment * refactor: rename to RecordBatchTransformer. Improve passthrough handling * feat: more performant handling of case where only schema transform is required but columns can remain unmodified * refactor: import arrow_cast rather than arrow
Addresses parts 2 and 3 of #405.
Add support for type promotion and default values to the read pipeline.
Float
, and some rows were written. The table's schema was changed so that field "a" is now aDouble
. When a table scan is performed, record batches coming from files written prior to the type promotion will be dynamically converted so that field "a" is of typeDouble
, matching any row batches returned from files written after the schema change.initial-default-value
for the field, then all rows will have that value for the new column instead.projected_field_ids
is provided, the columns in the response will be re-ordered to match the order in the projection.