-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Iterating over a string builder by index becomes ~exponentially slow for large builders #24395
Comments
You might see different behavior on Core than on Desktop here. |
ToString (or string comparison) is measurably faster on Core, but indexing seems just as slow.
|
Oops, desktop was on 32 bit. Let me try 64 bit. |
Edited to add 64 bit for desktop above. Now Core and Desktop are basically identical. |
The discrepancy is much smaller when the StringBuilder is presized (gist):
Both 1000 and 5000 are approximately 4x, suggesting a more natural linear effect consistent with eg., bounds checks. |
This is presumably because every index has to walk the chain of chunks, checking the offset for each until it finds the one with the indexed location. In the 5000 case above, there are 250K characters so 32 chunks of 8000 characters. On average it has to loop 16 times to find the right chunk - for every index. No doubt this was a conscious compromise when making stringbuilder chunky. @vancem ? |
When we did stuff like this in the past (for chunky sparse bitvectors in a now-defunct compiler) we would cache the result of the last index lookup; subsequent index values tended to be "nearby". Random access would still be slow but this happens less often, but sequential accesses were now good. This might require some invalidation logic elsewhere if the chunk structure can be messed with. |
Another option is to store the chunk offsets in a shared array that could be binary searched. But I like your idea, it is simple and unlikely to hurt other cases. |
The user always works against the StringBuilder that owns the end chunk and that has is a singly linked list of chunks back to the start (so incidentally the smaller the index the more blocks to traverse to find it). In the first idea, there would be two fields in the end StringBuilder: the last accessed index and a pointer to the chunk for it. Whenever the chunk chain is changed (by setting This scheme could also speed up @vancem @jkotas does this seem a reasonable approach and unlikely to measurably affect other perf? Ideally these 2 fields could somehow be put to use in the StringBuilders representing the chunks other than the final one. |
Would it be better to just document that StringBuilder indexer is not fast? StringBuilder is designed for building strings. It is not designed to inspect them - it will never work great for that. And trying to make it work ok for that just hurts the primary purpose what it is meant to be used for. |
Indexing as fast as a string or array would compromise its goals for sure. I just hate to think of people getting O(n^2) indexing when a few field gets and sets could get it to O(n) and potentially help other operations like StartsWith. If not the pit of success then at least the pit of mediocrity. I looked at Azure profiler data and fortunately the usage of get_Chars, set_Chars, StartsWith, FindChunkForIndex are vastly lower than Append, etc. So I'll close this for now... |
This should definitely be fixed. I have some build targets which have 8 minutes of "compile" time due to a large string with 900KB length which is compared in MSBuild properties. The resulting string is always the same but trying to intern it takes much longer than to create it and be done with with it. The actual work is < 40s which has definitely some room for improvement. The MSBuild syntax to batch targets could definitely help to prevent stuffing a large list into a single string but the syntax is rather awkward and only understood by a few. This did lead to less than optimal msbuild targets which were wasting a significant amount of time and computing power at our dev shop. |
@Alois-xx It would be better to file that issue over on MSBuild so that we can fix its interning behavior. Fixing that issue at the StringBuilder level isn't the right change. |
There is already dotnet/msbuild#1593 to add a cut over to To String at say 50K chars. For the long term I have proposed a method on String builder to equate efficiently with String. |
Thanks! You are faster solving issues than I can find them ;-). |
@Alois-xx any interest in offering the PR for that MSBuild issue? Cutover to string with a simple test. Not sure if they want a perf measurement. |
@danmosemsft: Sure. I need something to work with over christmas. |
@danmosemsft |
@danmosemsft
So, I can iterat the list like this:
This is way better than using the Indexer which starts from the first element each time (the last chunk in the case of SB)!.. The MoveTo method allow the user to jumb to the start he wants and continue iteration. |
To optimize the Indexer also, you should do this: Again: this indexer is not suitable for iteration, and foreach must be used instead. |
This is the full LinkedList Class:
|
@danmosemsft @AndyAyersMS @jkotas @Alois-xx @davkean |
@MohammadHamdyGhanem The efficient string builder iterator was proposed and being worked on by @vancem in https://github.com/dotnet/corefx/issues/21248 / dotnet/coreclr#17530. |
@jkotas
This will corrupt the chunk list, and if you try to sync it, it may affect the performance badly. So, as I proposed before, a double linked list maybe the solution, in addition to using MoveToXXCunk methode to traverse the chunks, and MoveToXXChar to traverse the chars. The stringBuilder should keep the value for currentChunkIndex, and currentCharIndex so these two sets of methods can use to work. |
@jkotas |
None of the proposals considered had the characteristics you are describing. The latest version of the proposal is in fact enumerator. Check the proposed implementation in dotnet/coreclr#17530. |
@jkotas |
The first proposal didn't have those characteristics, either. |
This is motivated by this issue originally opened against MSBuild, which internally in its reuseable string builder factory compares string builders by index. It happens to start from the end, but that doesn't seem to matter. The opener observed this can be enormously slower that creating a string first.
Using this benchmark program I get the following results:
64 bit
Clearly indexing is going to have a lot of overhead such as bounds checking. But I would expect that to be proportional to N, and we see a big jump when N gets large. Related to fragmentation of the SB? Can we do better?
The text was updated successfully, but these errors were encountered: