Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Review all Encoding usage for BOM compatibility #1027

Closed
1 task done
paulirwin opened this issue Nov 17, 2024 · 2 comments · Fixed by #1075
Closed
1 task done

Review all Encoding usage for BOM compatibility #1027

paulirwin opened this issue Nov 17, 2024 · 2 comments · Fixed by #1075
Assignees
Labels
is:task A chore to be done pri:normal

Comments

@paulirwin
Copy link
Contributor

Is there an existing issue for this?

  • I have searched the existing issues

Task description

Java's StandardCharsets.UTF_8 does not write a Byte-Order Mark (BOM), while .NET's System.Text.Encoding.UTF8 does include a BOM by default. We have ensured that the IOUtils.CHARSET_UTF_8 does not include a BOM to match Java, and as part of #1018 we've added an internal Support class to allow for using StandardCharsets.UTF_8, but we need to review all usage of System.Text.Encoding.UTF8 to determine if it should be replaced with StandardCharsets.UTF_8 or IOUtils.CHARSET_UTF_8 (whatever best matches the corresponding Java Lucene code) to avoid BOM issues.

@paulirwin paulirwin added the is:task A chore to be done label Nov 17, 2024
@paulirwin paulirwin added this to the 4.8.0-beta00018 milestone Nov 17, 2024
@paulirwin paulirwin mentioned this issue Nov 17, 2024
4 tasks
@paulirwin paulirwin self-assigned this Dec 24, 2024
paulirwin added a commit to paulirwin/lucene.net that referenced this issue Dec 24, 2024
@paulirwin
Copy link
Contributor Author

I have reviewed usages of Encoding (most commonly Encoding.UTF8), and determined that most usages do not need to be changed. The following cases do not result in a BOM being generated:

The following cases ignore a BOM if present, and do not fail if there is not a BOM, and thus do not need to be changed to a BOM-less Encoding:

  • Any TextReader use (such as StreamReader)
  • IOUtils.GetDecodingReader(...)
  • Encoding.UTF8.GetString(byte[])
  • FileStream with FileAccess.Read

So you'll see in the PR that the amount of changes to address BOM issues are not very many; that's because most fall into those buckets above.

@NightOwl888
Copy link
Contributor

Looks like you missed OfflineSorter. The tests specifically failed when it was configured to use a BOM, although I didn't analyze it at a high level to find out why that was the case. No objections if you wish to investigate this, but it definitely makes a difference as far as the tests are concerned.

It has gone through several rounds of refactoring since then, but currently it has a DEFAULT_ENCODING field that we added to ensure the tests pass. So, we have a couple of options:

  1. Remove the DEFAULT_ENCODING field and replace it with IOUtils.CHARSET_UTF_8. Update the OfflineSorter documentation for ByteSequencesReader and ByteSequencesWriter to indicate that constructor overloads that accept BinaryReader and BinaryWriter should use IOUtils.CHARSET_UTF_8.
  2. Initialize the DEFAULT_ENCODING field with the same instance as IOUtils.CHARSET_UTF_8.

Given the fact that we added this field specifically because OfflineSorter requires there to be no BOM (which difers from the .NET default), this could go either way. Given that we recently changed IOUtils.CHARSET_UTF_8 to remove the BOM, using it wasn't an option when the DEFAULT_ENCODING field was added. If it were, it would have been reused in this case and the field wouldn't have been added.

Side note: perhaps we should also rename IOUtils.CHARSET_UTF_8 because it is public and "CharSet" is Java nomenclature. ENCODING_UTF8_NO_BOM would be a better name.

paulirwin added a commit to paulirwin/lucene.net that referenced this issue Dec 31, 2024
paulirwin added a commit that referenced this issue Jan 6, 2025
* SWEEP: Use BOM-less UTF-8 encoding for writes, #1027

* Remove OfflineSorter.DEFAULT_ENCODING field and replace with IOUtils.CHARSET_UTF_8

* Rename IOUtils.CHARSET_UTF_8 to ENCODING_UTF_8_NO_BOM
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
is:task A chore to be done pri:normal
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants