[Performance] Mind the potential negative consequences of String.slice() #2598

Closed
gorhill opened this issue May 9, 2017 · 4 comments

Comments

@gorhill
Owner

gorhill commented May 9, 2017

Consider those two strings:

  1. quebec
  2. québec

Those two strings have the same number of characters; however, the first will occupy half the memory of the second, because quebec consists only of ASCII characters, while québec contains at least one non-ASCII character.

This is currently how JavaScript engines in modern browsers store strings internally: if a string has at least one character outside the ASCII range, the engine uses a 2-byte-per-character representation; otherwise it uses a 1-byte-per-character representation. (See "Slimmer and faster JavaScript strings in Firefox".)
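As a rough illustration of the rule described above, the following sketch checks whether a string is pure ASCII and estimates its character storage under that rule. The helper names are hypothetical and not part of uBO, and the estimate covers character data only: real engines add per-string overhead and may draw the one-byte boundary differently.

```javascript
// Hypothetical helpers for illustration only; not part of uBO.

// True when every UTF-16 code unit of `s` is in the ASCII range.
function isAsciiOnly(s) {
    return /^[\x00-\x7F]*$/.test(s);
}

// Estimates character storage only, ASSUMING the 1-byte/2-byte rule
// described in the post. Engine headers and padding are ignored.
function estimatedCharBytes(s) {
    return s.length * (isAsciiOnly(s) ? 1 : 2);
}

console.log(estimatedCharBytes('quebec'));       // 6
console.log(estimatedCharBytes('qu\u00e9bec'));  // 12
```

Under this assumption, a single non-ASCII character anywhere in a multi-megabyte string doubles the estimate for the whole string.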

When uBO compiles a filter list, all hostnames are converted to punycode, which eliminates many Unicode characters from the resulting compiled list.
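To show what such a conversion looks like, here is a minimal sketch that relies on the standard URL parser, which converts internationalized hostnames to their ASCII (punycode) form. The helper name and the example hostname are assumptions for illustration; this is not necessarily how uBO performs the conversion.

```javascript
// Hypothetical helper; uBO may use a different punycode implementation.
// The WHATWG URL parser applies the IDNA ToASCII conversion to hostnames.
function hostnameToAscii(hostname) {
    return new URL('http://' + hostname + '/').hostname;
}

const ascii = hostnameToAscii('qu\u00e9bec.example');
console.log(ascii); // an ASCII-only hostname whose first label starts with "xn--"
```

Hostnames that are already ASCII pass through unchanged, so the helper can be applied to every hostname in a list.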

However, there may still be Unicode characters in other parts of some filters which can't be easily normalized to ASCII. For example, found in EasyList at time of writing:

  • flashgot.net###head a[target="_blаnk"]
  • flashgot.net##.content a[rel="nofollow"][target="_blаnk"]
  • noscript.net##a[target="_blаnk"][href$="?MT"]

These are not normalized to ASCII by uBO because the Unicode characters are found in the CSS selector part of the filters (the а in _blаnk is actually the Cyrillic letter а, not the Latin a). CSS selectors could be normalized to ASCII by uBO; however, this would not work for procedural cosmetic filters.
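Look-alike characters such as this one are hard to spot by eye. A small sketch like the following could be used to locate non-ASCII code units in a filter. The function name is hypothetical, and the example assumes the character in question is U+0430 (Cyrillic small letter a).

```javascript
// Hypothetical helper for illustration; not part of uBO.
// Returns the position and code unit of every non-ASCII UTF-16 unit in `s`.
function findNonAscii(s) {
    const found = [];
    for (let i = 0; i < s.length; i++) {
        const code = s.charCodeAt(i);
        if (code > 0x7F) {
            const hex = code.toString(16).toUpperCase().padStart(4, '0');
            found.push({ index: i, codeUnit: 'U+' + hex });
        }
    }
    return found;
}

// ASSUMPTION: the look-alike character is written here as \u0430.
const filter = 'noscript.net##a[target="_bl\u0430nk"][href$="?MT"]';
const hits = findNonAscii(filter);
console.log(hits); // [ { index: 27, codeUnit: 'U+0430' } ]
```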

Because of these mere three instances of Unicode characters in the entire compiled EasyList, the memory footprint required to hold the string instance[1] is 6,109,248 bytes (as reported by Chromium).

With a quick (short-term) code change to ensure that no Unicode character ends up in the compiled list, the memory required to hold the string instance is roughly halved, at 3,017,232 bytes.
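The post does not say what the quick code change was. Purely as an illustration of one way to keep a compiled list ASCII-only, the sketch below encodes every non-ASCII code unit (and the backslash itself) as a `\uXXXX` escape when writing, and decodes it when reading. Both function names are hypothetical.

```javascript
// Hypothetical sketch; NOT the change actually made in uBO.

// Encode backslashes and non-ASCII code units as \uXXXX, so the output
// contains only ASCII characters and every backslash starts an escape.
function escapeNonAscii(s) {
    return s.replace(/[\\\u0080-\uffff]/g, c =>
        '\\u' + c.charCodeAt(0).toString(16).padStart(4, '0'));
}

// Reverse of escapeNonAscii(): turn each \uXXXX escape back into its code unit.
function unescapeNonAscii(s) {
    return s.replace(/\\u([0-9a-f]{4})/g, (_, hex) =>
        String.fromCharCode(parseInt(hex, 16)));
}

const selector = 'a[target="_bl\u0430nk"]';
const stored = escapeNonAscii(selector);
console.log(stored);                                // a[target="_bl\u0430nk"] (ASCII only)
console.log(unescapeNonAscii(stored) === selector); // true
```

The trade-off is that each stored filter must be decoded before use, in exchange for the large string staying in the 1-byte representation.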

The gain is of course greater for larger compiled filter lists, such as Fanboy Ultimate.

Since EasyList is selected by default, ensuring that no Unicode character ends up in the compiled form of EasyList would be an easy way to further lower uBO's memory footprint.

[1] The JavaScript engine will hold that one single string instance in memory even when it is not used directly, because all the filters will hold references to substrings of that one string instance.

@lewisje

lewisje commented May 11, 2017

Those cases aren't even proper uses of the _blank keyword in the HTML target attribute (the keyword is all-ASCII), so I have reported it: https://forums.lanik.us/viewtopic.php?f=62&t=36726

@gorhill
Owner Author

gorhill commented May 11, 2017

> I have reported it: https://forums.lanik.us/viewtopic.php?f=62&t=36726

The problem is not EasyList; that is how the elements to hide are crafted on the flashgot.net and noscript.net sites.

@gorhill
Owner Author

gorhill commented May 19, 2017

Fixed with 0232382.

@gorhill gorhill closed this as completed May 19, 2017
@gorhill gorhill reopened this May 25, 2017
@gorhill
Owner Author

gorhill commented May 25, 2017

I am reopening this to investigate a possibly higher-level solution.

Taking a larger view of the issue, the problem is that the large parent string (loaded in memory as a result of loading the compiled filter lists) stays in memory even when it is no longer explicitly referenced anywhere, because all the child substrings internally reference it.

Ensuring that this huge parent string is made only of ASCII characters certainly helps to halve the memory needed to hold it (assuming it had Unicode characters), but what if that large parent string could be completely flushed out of memory by forcing all child strings to no longer internally reference it?

Benefits:

  • No longer need to worry about whether a compiled filter list holds only ASCII characters.
  • Potentially better memory savings than the previous solution -- especially in the case where a lot of duplicates are detected by uBO.

Challenge: how can uBO prevent the JavaScript engine from creating substrings with an internal reference to the large parent compiled filter list string? (Rhetorical: I experimented and found a way.)
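The post does not describe the approach that was found. As a sketch of the general idea only, the snippet below rebuilds a sliced string through a JSON round trip so that the result is constructed from scratch instead of being cut from the parent. The helper name is hypothetical, and whether the copy is fully detached from the parent depends on the engine and is not guaranteed.

```javascript
// Hypothetical helper; NOT necessarily the technique used in uBO.
// A JSON round trip makes the engine build a new string from its
// serialized form instead of returning a slice of the original.
// ASSUMPTION: the engine does not share storage with `s` afterwards;
// this is engine-dependent and can only be confirmed with a heap snapshot.
function detachedCopy(s) {
    return JSON.parse(JSON.stringify(s));
}

// Stand-in for a large compiled filter list held as one string.
const parent = 'x'.repeat(1000000) + '\n||example.com^\n';

// A plain slice may keep an internal reference to `parent`.
const sliced = parent.slice(1000001, 1000015);

// The copy has the same content but is built independently.
const copy = detachedCopy(sliced);
console.log(copy); // ||example.com^
```

The effect cannot be observed from script itself, so any such change should be checked by comparing heap snapshots before and after.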

Background information which has been enlightening:

@gorhill gorhill changed the title [Performance] Ensure a compiled filter list contains only ASCII characters [Performance] Mind the potential negative consequences of String.slice() May 25, 2017