Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gh-100227: Move the Dict of Interned Strings to PyInterpreterState #102339

Merged

Conversation

ericsnowcurrently
Copy link
Member

@ericsnowcurrently ericsnowcurrently commented Feb 28, 2023

Copy link
Contributor

@kumaraditya303 kumaraditya303 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO this is wrong, it breaks the invariant that interned strings are global, with this they are per interpreter. When a string is interned, its interned bit is set to 1 and if it is shared across interpreters like if it's immortal then it breaks the invariant that interned bit set means that the string is really interned.

The correct solution for this is to let the interned dict be global and add a fine grained lock around mutating it which IIRC only happens in two places.

@bedevere-bot
Copy link

When you're done making the requested changes, leave the comment: I have made the requested changes; please review again.

@ericsnowcurrently
Copy link
Member Author

Thanks for the feedback. I want to be sure we're in agreement before I proceed on this.

IMO this is wrong, it breaks the invariant that interned strings are global, with this they are per interpreter.

Are you talking about process-global or something else? Process-global is problematic for interpreter isolation, especially under a per-interpreter GIL. Non-immortal str objects must not be shared between interpreters, which means many strings that are currently interned must either be made immortal or not be interned. Making the interned dict per-interpreter solves this.

When a string is interned, its interned bit is set to 1 and if it is shared across interpreters like if it's immortal then it breaks the invariant that interned bit set means that the string is really interned.

All immortal strings are still interned in each interpreter. However, non-immortal string instances are necessarily unique to a single interpreter and consequently interning them to just that interpreter is still correct.

The trade-off here is that we incur an extra CPU cost to startup for subinterpreters (due to the static strings) and an extra memory/CPU cost when the same (non-static) string is interned in additional interpreters. I expect there is a more efficient way to solve this, but that doesn't mean we can't go with a less efficient approach (as I've proposed) in the meantime.

Note that, most importantly, there is no impact to the status quo for the main interpreter in either case. That matters since the vast majority of Python usage does not involve subinterpreters currently.

The correct solution for this is to let the interned dict be global and add a fine grained lock around mutating it which IIRC only happens in two places.

Such a lock must also guard reads since they might race with mutation. That leads to a larger performance impact.

@kumaraditya303
Copy link
Contributor

I'll see it by next weekend.

@ericsnowcurrently
Copy link
Member Author

BTW, benchmarks show no significant change in performance.

@kumaraditya303
Copy link
Contributor

Are you talking about process-global or something else? Process-global is problematic for interpreter isolation, especially under a per-interpreter GIL. Non-immortal str objects must not be shared between interpreters, which means many strings that are currently interned must either be made immortal or not be interned. Making the interned dict per-interpreter solves this.

Hmm, I am thinking of the opposite, make it mandatory that all interned strings are immortal. If we guarantee this then we don't need to make it per interpreter, the main interpreter will clear the interned table at runtime finalization. This has other benefits that it fits out current mental model that an interned string is a process wide global unique string and shared across interpreter as it is currently done.

Note that, most importantly, there is no impact to the status quo for the main interpreter in either case. That matters since the vast majority of Python usage does not involve subinterpreters currently.

Yeah, all of this is for subinterpreters.

This is how with immortal objects, I think this should happen:

  • At runtime startup all static immortal strings are added to the global interned dict. This can be done without holding lock as only one interp exists so no performance loss.
  • PyUnicode_InternInPlace is called, it immortalizes the string and adds it to the interned dict while holding a fine grained lock. Since this string is immortalized, it can be shared across interpreters without any book-keeping. This can be done from any interpreter as the dict mutation is protected via a critical section aka lock.
  • At runtime finalization, after all interpreters are destroyed, the main interpreter clears the interned dict and frees all the interned strings.

To me this seems much simpler to understand and implement. I feel we should table this for now until immortal objects are implemented in main branch.

@ericsnowcurrently
Copy link
Member Author

make it mandatory that all interned strings are immortal.

I see what you're saying. Furthermore, this wouldn't be so big a deal since you can't [normally] intern a string directly from Python code, so it isn't a huge number of strings that get interned. I'll let this stew in my brain for a day or two and get back to this. Thanks for explaining!

@ericsnowcurrently
Copy link
Member Author

make it mandatory that all interned strings are immortal.

This does mean that, under a per-interpreter GIL, we'd need a granular global for any use of the interned dict, but that's okay. It also means that we'd have to ensure resizing is done with the main interpreter's obmalloc state, but that is manageable.

@ericsnowcurrently
Copy link
Member Author

@kumaraditya303, does gh-102858 seem like what you had in mind?

@ericsnowcurrently
Copy link
Member Author

I'm closing this in favor of gh-102925, which keeps the interned strings dict global.

@ericsnowcurrently
Copy link
Member Author

The keep-it-global approach has turned out to be quite complex. We can look into that further, but in the meantime I'm planning on going with the simpler approach here.

@ericsnowcurrently ericsnowcurrently marked this pull request as ready for review March 28, 2023 00:31
@ericsnowcurrently ericsnowcurrently merged commit ba65a06 into python:main Mar 28, 2023
@ericsnowcurrently ericsnowcurrently deleted the isolate-interned-dict branch March 28, 2023 18:52
warsaw pushed a commit to warsaw/cpython that referenced this pull request Apr 11, 2023
…ate (pythongh-102339)

We can revisit the options for keeping it global later, if desired.  For now the approach seems quite complex, so we've gone with the simpler isolation solution in the meantime.

python#100227
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants