Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Perf: Skip string normalization when possible #10116

Merged
merged 1 commit into from
Feb 26, 2024

Conversation

MichaReiser
Copy link
Member

@MichaReiser MichaReiser commented Feb 25, 2024

Summary

Most strings don't contain any quotes or other characters that need to be normalized. This PR makes use of this fact and short circuits choose_quotes and normalize for strings that contain no such characters. This is done by having one simple loop that searches for the occurrences of any character that needs normalization. This is much faster than the complicated normalization loop.

This removes the calls to StringNormalizer almost entirely from the flamegraphs (except for users that use \r or \r\n newlines that always need normalisation ;( )

Test Plan

cargo test

@MichaReiser MichaReiser added performance Potential performance improvement formatter Related to the formatter labels Feb 25, 2024
Copy link

codspeed-hq bot commented Feb 25, 2024

CodSpeed Performance Report

Merging #10116 will improve performances by 18.79%

Comparing perf-string-normalization (f7a3f92) with main (15b87ea)

Summary

⚡ 2 improvements
✅ 28 untouched benchmarks

Benchmarks breakdown

Benchmark main perf-string-normalization Change
formatter[numpy/ctypeslib.py] 10.2 ms 9.7 ms +4.62%
formatter[numpy/globals.py] 1,176.1 µs 990.1 µs +18.79%

@MichaReiser MichaReiser force-pushed the perf-string-normalization branch from 1c7e244 to b87fbfb Compare February 25, 2024 10:26
Copy link
Contributor

github-actions bot commented Feb 25, 2024

ruff-ecosystem results

Formatter (stable)

ℹ️ ecosystem check encountered format errors. (no format changes; 1 project error)

openai/openai-cookbook (error)

warning: Detected debug build without --no-cache.
error: Failed to read examples/How_to_handle_rate_limits.ipynb: Expected a Jupyter Notebook, which must be internally stored as JSON, but this file isn't valid JSON: trailing comma at line 47 column 4

Formatter (preview)

ℹ️ ecosystem check encountered format errors. (no format changes; 1 project error)

openai/openai-cookbook (error)

ruff format --preview

warning: Detected debug build without --no-cache.
error: Failed to read examples/How_to_handle_rate_limits.ipynb: Expected a Jupyter Notebook, which must be internally stored as JSON, but this file isn't valid JSON: trailing comma at line 47 column 4

@MichaReiser MichaReiser marked this pull request as ready for review February 25, 2024 10:42
let raw_content = locator.slice(string.content_range());
let first_quote_or_normalized_char_offset = raw_content
.bytes()
.position(|b| matches!(b, b'\\' | b'"' | b'\'' | b'\r' | b'{'));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's not an f-string, you can omit {, I think? Similarly, if it's a raw string, you can omit \\... (In that case, you could actually use memchr3, but perhaps not worth it.)

Copy link
Member Author

@MichaReiser MichaReiser Feb 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I considered special casing but decided against it because I want to avoid the extra complexity and it doesn't have to be perfect, for as long as it avoids the "expensive" normalization for most strings.

Using memchr would be nice... but normalize is already now almost gone from the benchmark profiles. That's why I consider this as "good enough".

@MichaReiser MichaReiser force-pushed the perf-string-normalization branch from b87fbfb to f7a3f92 Compare February 26, 2024 17:28
@MichaReiser MichaReiser enabled auto-merge (squash) February 26, 2024 17:28
@MichaReiser MichaReiser merged commit 8dc22d5 into main Feb 26, 2024
17 checks passed
@MichaReiser MichaReiser deleted the perf-string-normalization branch February 26, 2024 17:35
nkxxll pushed a commit to nkxxll/ruff that referenced this pull request Mar 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
formatter Related to the formatter performance Potential performance improvement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants