Avoid object instantiation in lexers #1147

Closed
pyrmont opened this issue Jun 2, 2019 · 23 comments
Labels
enhancement-request A request for an enhancement to be developed. stale-issue There has been no activity for a year.

Comments

@pyrmont
Contributor

pyrmont commented Jun 2, 2019

One way to reduce the memory usage of Rouge is to avoid instantiation of objects in individual lexers. Based on the report generated by rake profile_memory, it would seem the biggest causes of unnecessary object instantiation in a lexer are:

  1. a failure to wrap arrays of keywords in class methods (a fuller example is sketched below) so that rather than this:

    keywords = ['if', 'elsif', 'else']

    you do this:

    def self.keywords
      @keywords ||= ['if', 'elsif', 'else']
    end
  2. a failure to wrap regexes (especially ones involving interpolation) in class methods so that rather than this:

    reserved = /(\b#{keywords.join('|')}\b)/

    you do this:

    def self.reserved
      @reserved ||= /(\b#{keywords.join('|')}\b)/
    end
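
To make the pattern concrete, here is a minimal sketch of how the memoized class-method style looks inside a lexer (the MyLang class, its tag and its keyword list are invented for illustration; real lexers will differ):

    # A hypothetical lexer using memoized class methods. The array and the
    # interpolated regexp are only built on first use and then cached in a
    # class-level instance variable, rather than being allocated as soon as
    # the file is loaded.
    module Rouge
      module Lexers
        class MyLang < RegexLexer
          tag 'mylang'

          def self.keywords
            @keywords ||= %w(if elsif else)
          end

          def self.reserved
            @reserved ||= /\b(#{keywords.join('|')})\b/
          end

          state :root do
            rule MyLang.reserved, Keyword
            rule %r/\s+/, Text
            rule %r/\S+/, Text
          end
        end
      end
    end

The ||= memoization means the objects are still only created once per process, but not until the lexer is actually used.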

Initial testing suggested this can lead to a substantial reduction in memory used by the lexers. The AppleScript lexer (the largest lexer by memory usage) can be reduced from 65KB to 2.22KB by simply putting the regular expressions defined as local variables behind class methods.

Do you have any interest in doing this thankless task, @ashmaroli? (I promise I will thank you!)

@pyrmont pyrmont added the enhancement-request A request for an enhancement to be developed. label Jun 2, 2019
@jneen
Member

jneen commented Jun 2, 2019

Yes! And also, as far as possible, reducing the number of giant regexes we generate by interpolating large arrays would go a long way.

@ashmaroli
Contributor

@pyrmont I have no issues undertaking this task. Optimizing memory will help everyone.
I'll submit PR(s) towards resolving this.

@pyrmont
Contributor Author

pyrmont commented Jun 2, 2019

Thanks @ashmaroli!

I've been playing around with seeing the impact of deleting certain lexers. I reduced memory allocation by almost 1.2MB (!!!) just by deleting all the lexers starting with L.

I haven't looked at which lexer in particular was causing problems but if it was because of unnecessary object instantiation, this could be a huge help :)

@pyrmont
Contributor Author

pyrmont commented Jun 2, 2019

Deleting the lexers beginning with P and S reduces memory allocation by around 1MB per letter as well. In general, I'd estimate it's about 300KB or so per letter.

@ashmaroli
Contributor

I see a lot of inconsistencies in the lexers. Some have keywords =, some have KEYWORDS = and some have def self.keywords.
I think the first step here would be to decide on a consistent design.

@pyrmont
Contributor Author

pyrmont commented Jun 2, 2019

Using a class method was @jneen's suggestion when reviewing the lexer development guide and was the basis for the proposal above. That feels idiomatic to me.

@ashmaroli
Contributor

ashmaroli commented Jun 2, 2019

Great. Here's my plan: convert all keywords = and similar variables that store arrays into class methods, and all id = variables that store simple regexps into CONSTANTS. What say?

@pyrmont
Contributor Author

pyrmont commented Jun 2, 2019

As a first pass, my preferred option would be to put all local variables (be they arrays of special words or single regular expressions) behind class methods. This has the benefit of not requiring any further editing of lexers that were using local variables (although, as you noted, they weren't all doing that so it's not going to be true universally).

Once that's done, further optimisation could involve replacing some of the individual regular expressions with references to commonly used ones that are centrally defined (as you proposed in #1139). I know @jneen expressed concern about unnecessary obfuscation with that approach, though, and I think it would be worth testing how much it improves things.

If the early indications are correct, 'wrapping' variables should drastically reduce memory allocation for most use cases. Users are rarely going to invoke Rouge to highlight more than a handful of languages, so even without centrally defined expressions, the number of regular expressions actually instantiated in practice should be low (I think).
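
For illustration, centrally defined expressions might look something like the following (the module name and patterns here are invented, and this isn't necessarily what #1139 proposes):

    # A hypothetical module of shared, memoized patterns that lexers could
    # reference instead of each building their own copies.
    module Rouge
      module CommonPatterns
        def self.identifier
          @identifier ||= /[A-Za-z_][A-Za-z0-9_]*/
        end

        def self.decimal
          @decimal ||= /\d+(\.\d+)?/
        end
      end
    end

    # Inside a lexer, a rule could then reuse a shared pattern:
    #   rule CommonPatterns.identifier, Name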

@ashmaroli
Contributor

@pyrmont I won't be able to finish this task. The various lexers are structured differently. Some specs fail when changing keywords = to def self.keywords or def keywords. It is time-consuming to keep the spec files in sync with changes to the lexer source.

@pyrmont
Contributor Author

pyrmont commented Jun 2, 2019

No problem at all! Thanks for taking a look at it!

@pyrmont
Contributor Author

pyrmont commented Jun 7, 2019

So I kept playing around and now have this code as a kind of hacky proof of concept.

Approach

The important stuff is all in lib/rouge/lexer.rb. In short, rather than load in the individual lexer files, it:

  1. extracts the class names, tags and aliases from these files;
  2. saves these values as keys in a hash with the associated value being the relative path of the file;
  3. loads the lexer file for <Lexer Class> when the consumer either (a) attempts to access Lexers::<Lexer Class> or (b) attempts to retrieve an instance of <Lexer Class> from Lexer.registry.

All tests pass.
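
For a sense of the shape of the approach, here's a rough sketch of the indexing step (the regexes and method names are invented and alias handling is omitted; the actual branch differs):

    # Scrape each lexer file for its class name and tag without loading it,
    # and only require the file when one of those keys is first looked up.
    module Rouge
      module Lexers
        LEXER_DIR = File.expand_path('lexers', __dir__)

        def self.index
          @index ||= Dir.glob(File.join(LEXER_DIR, '*.rb')).each_with_object({}) do |path, idx|
            source = File.read(path)
            idx[Regexp.last_match(1)] = path if source =~ /class\s+(\w+)\s*<\s*\w*Lexer\b/
            idx[Regexp.last_match(1)] = path if source =~ /tag\s+'([^']+)'/
          end
        end

        # Require the file that defines +key+ (a class name or tag).
        def self.load_lexer!(key)
          path = index[key] or raise ArgumentError, "unknown lexer: #{key}"
          require path
        end
      end
    end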

Results

Here's the stats for memory on master:

Total allocated: 11.5 MB (109823 objects)
Total retained:  4.37 MB (30845 objects)

and on this branch:

Total allocated: 2.34 MB (14721 objects)
Total retained:  480.52 kB (3446 objects)

The memory for just loading the library has been reduced to:

Total allocated: 2.14 MB (12720 objects)
Total retained:  380.64 kB (2804 objects)

Limitations

The information from the lexer files is extracted using a simple regular expression. If a lexer file does not express the information required (class name, tag, aliases) in the format expected, this won't work.

Future Work

A better approach may be to formally split lexers into two files: one that contains the details necessary for selection of the appropriate lexer and the other that contains the lexing logic. This might be something appropriate for version 4.0.

@pyrmont
Contributor Author

pyrmont commented Jun 7, 2019

Rebased for your ease of comparison pleasure.

Memory usage is now:

Total allocated: 2.36 MB (14870 objects)
Total retained:  482.79 kB (3477 objects)

and for vanilla loading:

Total allocated: 2.15 MB (12867 objects)
Total retained:  382.91 kB (2835 objects)

@pyrmont
Contributor Author

pyrmont commented Jun 7, 2019

@ashmaroli Vanilla loading means without actually running any highlighting. I introduced a rake task that just loads the library and that's what I mean by vanilla loading.

@jneen
Member

jneen commented Jun 7, 2019

Another option that does away with that scary regex: keep the lexer keyword, but do not save the block. Instead, re-load the file later with a special thread-variable flag set, such that the lexer actually gets populated.
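
One possible reading of that idea, sketched with entirely invented names:

    # A speculative sketch only: on the first load, the block given to a
    # hypothetical `lexer` keyword is thrown away and just the defining file
    # is remembered; populating a lexer later means re-loading that file with
    # a thread-local flag set so the block is evaluated the second time.
    module Rouge
      def self.lexer(tag, &block)
        if Thread.current[:rouge_populate_lexers]
          block.call                       # second pass: define the lexer
        else
          lexer_files[tag] = caller_locations(1, 1).first.path
        end
      end

      def self.lexer_files
        @lexer_files ||= {}
      end

      def self.populate(tag)
        Thread.current[:rouge_populate_lexers] = true
        load lexer_files.fetch(tag)
      ensure
        Thread.current[:rouge_populate_lexers] = false
      end
    end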

@pyrmont
Contributor Author

pyrmont commented Jun 7, 2019

@jneen I feel like I've just been galaxy brained.

@jneen
Member

jneen commented Jun 7, 2019

lol maybe. it might be a bit much.

also make sure to update RougeVersion in rouge.jneen.net so it can preload everything before removing the global constant (and the files)

@pyrmont
Contributor Author

pyrmont commented Jul 23, 2019

I spent some time doing more experimentation today and I'm sorry to say that the 'don't save the block' approach didn't greatly reduce memory usage. It does reduce the total number of objects retained but only by a little and does not really reduce the number of objects allocated.

Here are the results from wrapping the state methods in the ABAP lexer.

Current Approach

Total allocated: 13.47 MB (131876 objects)
Total retained:  4.87 MB (34521 objects)

Unsaved Block Approach

Total allocated: 13.47 MB (131870 objects)
Total retained:  4.86 MB (34483 objects)

Other things I tried:

  1. putting lexing logic inside an always false if-statement (no real difference);
  2. putting lexing logic inside a method (no real difference);
  3. putting lexing logic inside a method behind an explicit return statement (no real difference).

Short of deferring parsing of the code, I can't see a way to substantially reduce the memory usage of the lexers.

@jneen
Member

jneen commented Jul 23, 2019

The last resort is to use a Rake task to generate an index, then use const_missing to dynamically load classes. In this case I would like to have a require 'rouge/all' option to load them all in, for applications that expect to use them all.
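
A bare-bones sketch of how the const_missing half of that could work (the index file name, its format and the Rake task that writes it are all invented here):

    # Look up unknown constants in a pre-generated index and require the
    # corresponding lexer file on demand.
    require 'yaml'

    module Rouge
      module Lexers
        # lexers_index.yml maps class names to lexer file paths and would be
        # generated ahead of time by a Rake task.
        INDEX = YAML.load_file(File.expand_path('lexers_index.yml', __dir__))

        def self.const_missing(name)
          path = INDEX[name.to_s]
          return super unless path
          require File.expand_path(path, __dir__)
          # Fall back to the usual NameError if the file didn't define it.
          const_defined?(name, false) ? const_get(name, false) : super
        end
      end
    end

A require 'rouge/all' entry point could then simply require every file listed in the index up front for applications that want everything loaded eagerly.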

@stale

stale bot commented Jul 22, 2020

This issue has been automatically marked as stale because it has not had any activity for more than a year. It will be closed if no additional activity occurs within the next 14 days.
If you would like this issue to remain open, please reply and let us know if the issue is still reproducible.

@stale stale bot added the stale-issue There has been no activity for a year. label Jul 22, 2020
@stale stale bot closed this as completed Aug 5, 2020