[Performance] Mind the potential negative consequences of String.slice() #2598

gorhill · 2017-05-09T12:23:53Z

Consider these two strings:

quebec
québec

These two strings have the same number of characters; however, the first will occupy half the memory of the second, because quebec is made only of ASCII characters, while québec has at least one non-ASCII character.

This is currently how JavaScript engines in modern browsers store strings internally: if a string has at least one character outside the ASCII range, the engine will use a 2-byte-per-character representation; otherwise it will use a 1-byte-per-character representation. (See "Slimmer and faster JavaScript strings in Firefox".)
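A minimal sketch of the rule as described above, useful for spotting which strings would trigger the wider representation. The helper names `isAsciiOnly` and `estimatedCharBytes` are hypothetical, and the byte count is only an estimate of the character payload: the real in-memory layout is an engine internal and includes header overhead.

```javascript
// Hypothetical helper: true when every code unit is in the ASCII range.
function isAsciiOnly(s) {
    return /^[\x00-\x7f]*$/.test(s);
}

// Rough estimate of the character payload only, following the rule
// described in the text (1 byte per character if all-ASCII, else 2).
// This is an assumption-based approximation, not an engine measurement.
function estimatedCharBytes(s) {
    return s.length * (isAsciiOnly(s) ? 1 : 2);
}

console.log(estimatedCharBytes('quebec'));       // 6
console.log(estimatedCharBytes('qu\u00e9bec'));  // 12
```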

When uBO compiles a filter list, all the hostnames are converted to punycode, so this eliminates a lot of Unicode characters from the resulting compiled list.
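As an illustration of the hostname normalization, the sketch below leans on the standard `URL` parser, which serializes internationalized hostnames in their ASCII (punycode) form. The helper name `hostnameToAscii` is hypothetical, and this is not necessarily how uBO performs the conversion.

```javascript
// Hypothetical helper: returns the ASCII (punycode) form of a hostname
// by letting the standard URL parser do the IDNA conversion.
// Assumes the input is a bare hostname with no scheme, port, or path.
function hostnameToAscii(hostname) {
    return new URL('http://' + hostname + '/').hostname;
}

// A hostname with a non-ASCII label comes back with an "xn--" label.
console.log(hostnameToAscii('qu\u00e9bec.example'));
```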

However, there may still be Unicode characters in other parts of some filters which can't be easily normalized to ASCII. For example, found in EasyList at time of writing:

flashgot.net###head a[target="_blаnk"]
flashgot.net##.content a[rel="nofollow"][target="_blаnk"]
noscript.net##a[target="_blаnk"][href$="?MT"]

These are not normalized to ASCII by uBO because the Unicode characters are found in the CSS selector part of the filters (the а in _blank is actually the Cyrillic A character). CSS selectors could be normalized to ASCII by uBO; however, this would not work for procedural cosmetic filters.
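One possible way such a normalization could look, assuming it is done with standard CSS escape sequences (a backslash, the code point in hex, and a terminating space). The helper name `escapeNonAscii` is hypothetical, and the issue does not say which technique uBO would use.

```javascript
// Hypothetical helper: replaces every non-ASCII code point with a CSS
// escape sequence. The trailing space terminates the escape so that a
// following hex-like character is not absorbed into it.
function escapeNonAscii(selector) {
    return selector.replace(/[^\x00-\x7f]/gu, c =>
        '\\' + c.codePointAt(0).toString(16) + ' '
    );
}

// The Cyrillic "а" (U+0430) in the attribute value becomes "\430 ".
console.log(escapeNonAscii('a[target="_bl\u0430nk"]'));
// a[target="_bl\430 nk"]
```

The escaped selector is pure ASCII yet still matches the same attribute value, because CSS parsers decode the escape back to the original character.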

Now, because of just these three instances of Unicode characters in the whole compiled EasyList file, the memory footprint required to hold the string instance[1] is 6,109,248 bytes (as reported by Chromium).

With a quick (short-term) code change to ensure there are no Unicode characters in the output compiled list, the memory required to hold the string instance is roughly halved, to 3,017,232 bytes.

The gain is of course greater for larger compiled filter lists, such as Fanboy Ultimate.

Since EasyList is selected by default, ensuring that no Unicode character ends up in the compiled form of EasyList would be an easy way to further lower uBO's memory footprint.

[1] The JavaScript engine will hold that single string instance in memory even when it is not used directly, because all the filters hold references to substrings of that one string instance.

lewisje · 2017-05-11T08:55:16Z

Those cases aren't even proper uses of the _blank keyword in the HTML target attribute (the keyword is all-ASCII), so I have reported it: https://forums.lanik.us/viewtopic.php?f=62&t=36726

gorhill · 2017-05-11T12:38:26Z

> I have reported it: https://forums.lanik.us/viewtopic.php?f=62&t=36726

The problem is not EasyList; that is how the elements to hide are crafted on the flashgot.net and noscript.net sites.

gorhill · 2017-05-19T14:19:33Z

Fixed with 0232382.

gorhill · 2017-05-25T12:32:27Z

I am reopening this to investigate a possibly higher-level solution.

Taking a larger view of the issue, the problem is that the large parent string (loaded in memory as a result of loading the compiled filter lists) stays in memory even when it is no longer explicitly referenced anywhere, because all the child substrings still reference it internally.

Ensuring that this huge parent string is made only of ASCII characters certainly helps, since it halves the memory needed to hold that large string (assuming it had Unicode characters). But what if that large parent string could be completely flushed out of memory by forcing all child strings to no longer internally reference the parent string?

Benefits:

- No longer need to worry about whether a compiled filter list holds only ASCII characters.
- Potentially better memory saving than the previous solution, especially in the case where a lot of duplicates are detected by uBO.

Challenge: how can uBO prevent the JavaScript engine from creating substrings with an internal reference to the large parent compiled filter list string? (Rhetorical: I experimented and found a way.)
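For illustration only, here is one commonly cited workaround for this class of problem: build a new string by concatenation and slice the result, so that the kept string no longer points into the original large parent. Whether this actually produces an independent copy is engine-dependent and not guaranteed by the language. The helper name `detachSubstring` is hypothetical, and this is not necessarily the approach taken in the commits referenced in this issue.

```javascript
// Hypothetical helper: returns a string with the same value as `s`.
// The concatenation creates a new string, and slicing that new string
// is intended to avoid keeping a reference to the parent of `s`.
// This relies on engine internals and is an assumption, not a guarantee.
function detachSubstring(s) {
    return (' ' + s).slice(1);
}

// Stand-in for a large compiled filter list.
const compiled = 'filter-one\nfilter-two\nfilter-three';

// Without detaching, `compiled.slice(11, 21)` may internally reference
// the whole `compiled` string for as long as the slice is alive.
const kept = detachSubstring(compiled.slice(11, 21));
console.log(kept); // filter-two
```

The only reliable way to confirm the effect is to compare heap snapshots in the browser's developer tools before and after such a change.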

Background information which has been enlightening:

- "Adventures in the land of substrings and RegExps"
  - Especially, go to the section starting with "Given such a drastic performance improvement".
- "Substring of huge string retains huge string in memory" (referenced in above article)
- An extreme example of the issue at hand: https://github.com/mrdoob/three.js/issues/9679 (not linked to, to prevent noise on their issue tracker).

gorhill closed this as completed May 19, 2017

gorhill reopened this May 25, 2017

gorhill added a commit that referenced this issue May 25, 2017

fix #2598: refactor to address the cause rather than the symptom

faf4b74

gorhill closed this as completed in f3e6057 May 25, 2017

gorhill changed the title ~~[Performance] Ensure a compiled filter list contains only ASCII characters~~ [Performance] Mind the potential negative consequences of String.slice() May 25, 2017
