[HUDI-6825] Use UTF_8 to encode String to byte array in all places #9634

Merged
merged 2 commits into from
Sep 12, 2023

Conversation

yihua
Contributor

@yihua yihua commented Sep 7, 2023

Change Logs

This PR unifies how Java Strings are encoded to byte arrays in Hudi, especially when writing bytes to storage, by always using UTF_8. Several places called String.getBytes() with no argument, which encodes with the platform's default charset; this PR fixes them. The default charset on Windows is typically an ANSI code page, while on Linux it is typically UTF-8, so the same String could produce different bytes on the two platforms. This PR has no impact on Linux systems writing and reading Hudi tables.
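To make the platform dependence concrete, here is a minimal sketch comparing the no-argument String.getBytes() with the explicit UTF_8 overload. The class and method names (DefaultCharsetDemo, platformBytes, utf8Bytes) are made up for illustration and are not part of Hudi.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative only: names here are hypothetical, not Hudi code.
final class DefaultCharsetDemo {
  private DefaultCharsetDemo() {}

  // Platform-dependent: encodes with whatever Charset.defaultCharset() is on this JVM.
  static byte[] platformBytes(String s) {
    return s.getBytes();
  }

  // Platform-independent: always encodes with UTF-8.
  static byte[] utf8Bytes(String s) {
    return s.getBytes(StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    // The \u escape keeps the example independent of the source file encoding.
    String value = "caf\u00e9";
    System.out.println("default charset: " + Charset.defaultCharset());
    System.out.println("getBytes():      " + Arrays.toString(platformBytes(value)));
    System.out.println("getBytes(UTF_8): " + Arrays.toString(utf8Bytes(value)));
  }
}
```

When the default charset is a single-byte code page, the two arrays should differ in the last element(s); when the default is UTF-8 they match. On JDK 18 and later the default charset is UTF-8 on all platforms unless overridden, so the difference would mainly show up on older JVMs.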

The following places previously called String.getBytes() and now use UTF_8 encoding:

MercifulJsonConverter#generateBytesTypeHandler
HoodieJsonPayload#compressData
SimpleBloomFilter#write
HFileBootstrapIndexWriter#writeNextSourceFileMapping
HoodiePartitionMetadata#writeMetafile
HoodieHFileDataBlock#serializeRecords
HoodieLogBlock#getLogMetadataBytes
HoodieDefaultTimeline#setInstants
RocksDBDAO
HoodieAvroHFileReader
HoodieAvroHFileWriter
HoodieAvroOrcWriter
HoodieTableMetadataUtil#convertMetadataToBloomFilterRecords
HoodieTableMetadataUtil#readBloomFilter
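Since the list covers both writers and readers, one way to keep the two sides symmetric is to route every conversion through a single helper. The sketch below is an assumption about how such a helper could look; Utf8Strings, toBytes and fromBytes are hypothetical names and do not claim to match what this PR actually added.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not taken from the Hudi code base.
final class Utf8Strings {
  private Utf8Strings() {}

  // Write side: String -> bytes, always UTF-8.
  static byte[] toBytes(String s) {
    return s.getBytes(StandardCharsets.UTF_8);
  }

  // Read side: bytes -> String, decoded with the same charset used for writing.
  static String fromBytes(byte[] bytes) {
    return new String(bytes, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    String original = "key-\u65f6\u95f4";
    String roundTripped = fromBytes(toBytes(original));
    System.out.println(original.equals(roundTripped));
  }
}
```

Using one charset on both sides is what makes the round trip lossless regardless of the JVM's default charset.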

We don't intend to make these changes backwards compatible on Windows, where existing data can break because it was written with a different default charset.
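The kind of breakage meant here can be sketched as follows. ISO_8859_1 is used as a stand-in for a single-byte Windows code page (an assumption, chosen because it is available on every JVM); LegacyBytesDemo and its methods are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: simulates bytes written under a non-UTF-8 default charset.
final class LegacyBytesDemo {
  private LegacyBytesDemo() {}

  // Stand-in for an old writer whose default charset was a single-byte code page.
  static byte[] legacyWrite(String s) {
    return s.getBytes(StandardCharsets.ISO_8859_1);
  }

  // A reader that always decodes as UTF-8.
  static String utf8Read(byte[] bytes) {
    return new String(bytes, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    String ascii = "2023/09/12";
    String nonAscii = "caf\u00e9";
    // ASCII-only content has identical bytes in both charsets, so it survives.
    System.out.println(ascii.equals(utf8Read(legacyWrite(ascii))));
    // Non-ASCII content does not decode back to the original string.
    System.out.println(nonAscii.equals(utf8Read(legacyWrite(nonAscii))));
  }
}
```

Under these assumptions, only strings containing non-ASCII characters are affected; ASCII-only values read back unchanged.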

Impact

Ensures that the conversion between Java Strings and stored bytes in a Hudi table does not depend on the platform.

Risk level

low

Documentation Update

N/A

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@yihua yihua force-pushed the HUDI-6825-string-encoding branch 4 times, most recently from 0e1ad1e to ac411ef Compare September 7, 2023 01:47
Member

@codope codope left a comment

So this is a breaking change for Windows? Any path for users to safely migrate?

@codope codope force-pushed the HUDI-6825-string-encoding branch from ac411ef to 75b9207 Compare September 12, 2023 04:59
@apache apache deleted a comment from hudi-bot Sep 12, 2023
@hudi-bot

CI report:

@codope codope merged commit fe71659 into apache:master Sep 12, 2023
@yihua
Contributor Author

yihua commented Sep 12, 2023

> So this is a breaking change for Windows? Any path for users to safely migrate?

I've updated the PR description. We don't intend to make these changes backwards compatible on Windows, where existing data can break because it was written with a different default charset.

leosanqing pushed a commit to leosanqing/hudi that referenced this pull request Sep 13, 2023
…pache#9634)

Unify the encoding of Java `String` to byte array in Hudi, 
especially for writing bytes to the storage, 
by using `UTF_8` encoding only.

---------

Co-authored-by: Sagar Sumit <[email protected]>
yihua added a commit that referenced this pull request Feb 27, 2024
…9634)

Unify the encoding of Java `String` to byte array in Hudi,
especially for writing bytes to the storage,
by using `UTF_8` encoding only.

---------

Co-authored-by: Sagar Sumit <[email protected]>