Improve ldiff binary files detection #104

Annih · 2025-01-28T23:04:38Z

The former code scanned the first 4K of each file for a '\0' character and, if either file contained one, treated the files as binary.

This commit does not change this heuristic, only its implementation: instead of using the scan method with a regexp, it uses a simple include?.

This not only fixes compatibility issues with UTF-8 escape sequences (see #102), but also improves performance:

1. It does not involve the Regexp engine.
2. It stops at the first occurrence, so the worst case is O(n).
3. It stores very little, since no array of matches is built.
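
The points above can be illustrated with a minimal Ruby sketch. The helper names `binary_data?` and `binary_file?` and the 4096-byte default window are assumptions made for illustration (the window follows the "first 4K" wording above); this is not the actual ldiff code.

```ruby
# Hypothetical helper: true when the data contains a NUL byte.
# String#include? does a plain substring search and returns as soon as
# the first "\0" is found, without building an array of matches.
def binary_data?(data)
  data.include?("\0")
end

# Hypothetical helper: apply the heuristic to the start of a file.
# File.binread returns raw bytes; to_s guards against a nil result,
# which can happen when a length is requested from an empty file.
def binary_file?(path, window = 4096)
  binary_data?(File.binread(path, window).to_s)
end
```

With these helpers, two files would be treated as binary when `binary_file?(old_path) || binary_file?(new_path)` is true, which matches the "one of them" wording of the heuristic.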

Also, instead of calling empty?, where a true result signals a non-binary file, the call to include? inverts the boolean test, so a true result now signals a binary file.
IMHO it is clearer.
Note: the same inversion could have been achieved simply by replacing empty? with any?, but the other improvements listed above motivated the change.
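
To make the inverted test concrete, here is a small sketch comparing the three forms mentioned above. The method names are hypothetical and exist only for this comparison; they are not taken from the ldiff source.

```ruby
# Old style: true means "no NUL found", i.e. NOT binary.
def text_by_scan?(data)
  data.scan(/\0/).empty?
end

# Alternative from the note: true means binary, but scan still
# collects every match into an array first.
def binary_by_any?(data)
  data.scan(/\0/).any?
end

# New style: true means binary, and the search stops at the first NUL.
def binary_by_include?(data)
  data.include?("\0")
end

samples = ["plain text\n", "abc\0def".b, ""]
samples.each do |data|
  # All three forms agree once the old one is negated.
  agree = binary_by_include?(data) == binary_by_any?(data) &&
          binary_by_include?(data) == !text_by_scan?(data)
  puts "#{data.inspect}: binary=#{binary_by_include?(data)} agree=#{agree}"
end
```

The three forms give the same answer on these samples; the difference is in readability and in how much work is done before the answer is known.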


halostatue

This is an excellent change and is much easier to read. Thank you.

halostatue approved these changes Jan 29, 2025

halostatue merged commit bc14f1d into halostatue:main Jan 29, 2025
50 of 58 checks passed

halostatue mentioned this pull request Jan 29, 2025

Extract ldiff display logic to reuse it as a lib #103

Merged

Annih deleted the improve_ldiff_binary_files_detection branch January 29, 2025 09:05

This was linked to issues Feb 2, 2025

ldiff does not behave well with binary files #102

Closed

ldiff does not behave well with empty files #100

Open

Annih mentioned this pull request Feb 2, 2025

Improve ldiff binary support #105

Merged
