Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve ldiff binary files detection #104

Merged

Conversation

Annih
Copy link
Contributor

@Annih Annih commented Jan 28, 2025

The former code was performing scan on the first 4K of each file to see if one of them has a '\0' char in it and consider it as a binary file.

This commit does not change this heuristic just the implementation. Instead of using the scan method with a regexp, use a simple include?.

This not only fix compatibility issues with UTF8 escape sequences (see #102), but also the performance:

  1. it does not leverage a Regexp system.
  2. it stops at first occurence worst case is O(n).
  3. it does not store much.

Also instead of using empty? which would signal a non-binary file, the call to include? invert the boolean test.
IMHO it is clearer.
Note: this could have been achieved simply by replacing empty? by any? but the other improvements listed above motivated the change.

The former code was performing scan on the first 4K of each file to see
if one of them has a '\0' char in it and consider it as a binary file.

This commit does not change this heuristic just the implementation.
Instead of using the scan method with a regexp, use a simple include?.

This not only fix compatibility issues with UTF8 escape sequences, but
also the performance:
  1. it does not leverage a Regexp system.
  2. it stops at first occurence worst case is O(n).
  3. it does not store much.

Also instead of using .empty? which would signal a non-binary file, the
call to include? invert the boolean test.
IMHO it is clearer.
Note: this could have been achieved simply by replacing .empty by .any?
but the other improvements listed above motivated the change.
Copy link
Owner

@halostatue halostatue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an excellent change, thank you and is much easier to read.

@halostatue halostatue merged commit bc14f1d into halostatue:main Jan 29, 2025
50 of 58 checks passed
@Annih Annih deleted the improve_ldiff_binary_files_detection branch January 29, 2025 09:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ldiff does not behave well with empty files ldiff does not behave well with binary files
2 participants