Avoid object instantiation in lexers #1147

Closed
pyrmont opened this issue Jun 2, 2019 · 23 comments
Labels
enhancement-request A request for an enhancement to be developed. stale-issue There has been no activity for a year.

Comments

@pyrmont
Contributor

pyrmont commented Jun 2, 2019

One way to reduce the memory usage of Rouge is to avoid instantiation of objects in individual lexers. Based on the report generated by rake profile_memory, it would seem the biggest causes of unnecessary object instantiation in a lexer are:

  1. a failure to wrap arrays of keywords in class methods (a fuller example is sketched below) so that rather than this:

    keywords = ['if', 'elsif', 'else']

    you do this:

    def self.keywords
      @keywords ||= ['if', 'elsif', 'else']
    end
  2. a failure to wrap regexes (especially ones involving interpolation) in class methods so that rather than this:

    reserved = /(\b#{keywords.join('|')}\b)/

    you do this:

    def self.reserved
      @reserved ||= /(\b#{keywords.join('|')}\b)/
    end
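
To make the pattern concrete, here is a minimal sketch of how the memoized class-method style looks inside a lexer (the MyLang class, its tag and its keyword list are invented for illustration; real lexers will differ):

    # A hypothetical lexer using memoized class methods. The array and the
    # interpolated regexp are only built on first use and then cached in a
    # class-level instance variable, rather than being allocated as soon as
    # the file is loaded.
    module Rouge
      module Lexers
        class MyLang < RegexLexer
          tag 'mylang'

          def self.keywords
            @keywords ||= %w(if elsif else)
          end

          def self.reserved
            @reserved ||= /\b(#{keywords.join('|')})\b/
          end

          state :root do
            rule MyLang.reserved, Keyword
            rule %r/\s+/, Text
            rule %r/\S+/, Text
          end
        end
      end
    end

The ||= memoization means the objects are still only created once per process, but not until the lexer is actually used.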

Initial testing suggested this can lead to a substantial reduction in memory used by the lexers. The AppleScript lexer (the largest lexer by memory usage) can be reduced from 65KB to 2.22KB by simply putting the regular expressions defined as local variables behind class methods.

Do you have any interest in doing this thankless task, @ashmaroli? (I promise I will thank you!)

@pyrmont pyrmont added the enhancement-request A request for an enhancement to be developed. label Jun 2, 2019
@jneen
Member

jneen commented Jun 2, 2019

Yes! And also, as far as possible, reducing the number of giant regexes we generate by interpolating large arrays would go a long way.

@ashmaroli
Contributor

@pyrmont I have no issues undertaking this task. Optimizing memory will help everyone.
I'll submit PR(s) towards resolving this.

@pyrmont
Contributor Author

pyrmont commented Jun 2, 2019

Thanks @ashmaroli!

I've been playing around with seeing the impact of deleting certain lexers. I reduced memory allocation by almost 1.2MB (!!!) just by deleting all the lexers starting with L.

I haven't looked at which lexer in particular was causing problems but if it was because of unnecessary object instantiation, this could be a huge help :)

@pyrmont
Contributor Author

pyrmont commented Jun 2, 2019

Deleting the lexers beginning with P and S reduces memory allocation by around 1MB per letter as well. In general, I'd estimate it's about 300KB or so per letter.

@ashmaroli
Contributor

I see a lot of inconsistencies in the lexers. Some have keywords =, some have KEYWORDS = and some have def self.keywords.
I think the first step here would be to decide on a consistent design.

@pyrmont
Contributor Author

pyrmont commented Jun 2, 2019

Using a class method was @jneen's suggestion when reviewing the lexer development guide and was the basis for the proposal above. That feels idiomatic to me.

@ashmaroli
Contributor

ashmaroli commented Jun 2, 2019

Great. Here's my plan: convert all keywords = and similar variables that store arrays into class methods, and all id = variables that store simple regexps into CONSTANTS. What say?

@pyrmont
Contributor Author

pyrmont commented Jun 2, 2019

As a first pass, my preferred option would be to put all local variables (be they arrays of special words or single regular expressions) behind class methods. This has the benefit of not requiring any further editing of lexers that were using local variables (although, as you noted, they weren't all doing that so it's not going to be true universally).

Once that's done, further optimisation could involve replacing some of the individual regular expressions with references to commonly used ones that are centrally defined (as you proposed in #1139). I know @jneen expressed concern about unnecessary obfuscation with that approach, though, and I think it would be worth testing how much it improves things.

If the early indications are correct, 'wrapping' variables should drastically reduce memory allocation for most use cases. Users are rarely going to invoke Rouge to highlight more than a handful of languages, so even without centrally defined expressions, the number of regular expressions actually instantiated in practice should be low (I think).
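
For illustration, centrally defined expressions might look something like the following (the module name and patterns here are invented, and this isn't necessarily what #1139 proposes):

    # A hypothetical module of shared, memoized patterns that lexers could
    # reference instead of each building their own copies.
    module Rouge
      module CommonPatterns
        def self.identifier
          @identifier ||= /[A-Za-z_][A-Za-z0-9_]*/
        end

        def self.decimal
          @decimal ||= /\d+(\.\d+)?/
        end
      end
    end

    # Inside a lexer, a rule could then reuse a shared pattern:
    #   rule CommonPatterns.identifier, Name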

@ashmaroli
Contributor

@pyrmont I won't be able to finish this task. The various lexers are structured differently. Some specs fail when changing keywords = to def self.keywords or def keywords. It is time-consuming to keep the spec files in sync with changes to the lexer source.

@pyrmont
Contributor Author

pyrmont commented Jun 2, 2019

No problem at all! Thanks for taking a look at it!

@pyrmont
Contributor Author

pyrmont commented Jun 7, 2019

So I kept playing around and now have this code as a kind of hacky proof of concept.

Approach

The important stuff is all in lib/rouge/lexer.rb. In short, rather than load in the individual lexer files, it:

  1. extracts the class names, tags and aliases from these files;
  2. saves these values as keys in a hash with the associated value being the relative path of the file;
  3. loads the lexer file for <Lexer Class> when the consumer either (a) attempts to access Lexers::<Lexer Class> or (b) attempts to retrieve an instance of <Lexer Class> from Lexer.registry.

All tests pass.
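
For a sense of the shape of the approach, here's a rough sketch of the indexing step (the regexes and method names are invented and alias handling is omitted; the actual branch differs):

    # Scrape each lexer file for its class name and tag without loading it,
    # and only require the file when one of those keys is first looked up.
    module Rouge
      module Lexers
        LEXER_DIR = File.expand_path('lexers', __dir__)

        def self.index
          @index ||= Dir.glob(File.join(LEXER_DIR, '*.rb')).each_with_object({}) do |path, idx|
            source = File.read(path)
            idx[Regexp.last_match(1)] = path if source =~ /class\s+(\w+)\s*<\s*\w*Lexer\b/
            idx[Regexp.last_match(1)] = path if source =~ /tag\s+'([^']+)'/
          end
        end

        # Require the file that defines +key+ (a class name or tag).
        def self.load_lexer!(key)
          path = index[key] or raise ArgumentError, "unknown lexer: #{key}"
          require path
        end
      end
    end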

Results

Here's the stats for memory on master:

Total allocated: 11.5 MB (109823 objects)
Total retained:  4.37 MB (30845 objects)

and on this branch:

Total allocated: 2.34 MB (14721 objects)
Total retained:  480.52 kB (3446 objects)

The memory for just loading the library has been reduced to:

Total allocated: 2.14 MB (12720 objects)
Total retained:  380.64 kB (2804 objects)

Limitations

The information from the lexer files is extracted using a simple regular expression. If a lexer file does not express the information required (class name, tag, aliases) in the format expected, this won't work.

Future Work

A better approach may be to formally split lexers into two files: one that contains the details necessary for selection of the appropriate lexer and the other that contains the lexing logic. This might be something appropriate for version 4.0.

@pyrmont
Contributor Author

pyrmont commented Jun 7, 2019

Rebased for your ease of comparison pleasure.

Memory usage is now:

Total allocated: 2.36 MB (14870 objects)
Total retained:  482.79 kB (3477 objects)

and for vanilla loading:

Total allocated: 2.15 MB (12867 objects)
Total retained:  382.91 kB (2835 objects)

@pyrmont
Contributor Author

pyrmont commented Jun 7, 2019

@ashmaroli Vanilla loading means without actually running any highlighting. I introduced a rake task that just loads the library and that's what I mean by vanilla loading.

@jneen
Member

jneen commented Jun 7, 2019

Another option that does away with that scary regex: keep the lexer keyword, but do not save the block. Instead, re-load the file later with a special thread-variable flag set, such that the lexer actually gets populated.
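
One possible reading of that idea, sketched with entirely invented names:

    # A speculative sketch only: on the first load, the block given to a
    # hypothetical `lexer` keyword is thrown away and just the defining file
    # is remembered; populating a lexer later means re-loading that file with
    # a thread-local flag set so the block is evaluated the second time.
    module Rouge
      def self.lexer(tag, &block)
        if Thread.current[:rouge_populate_lexers]
          block.call                       # second pass: define the lexer
        else
          lexer_files[tag] = caller_locations(1, 1).first.path
        end
      end

      def self.lexer_files
        @lexer_files ||= {}
      end

      def self.populate(tag)
        Thread.current[:rouge_populate_lexers] = true
        load lexer_files.fetch(tag)
      ensure
        Thread.current[:rouge_populate_lexers] = false
      end
    end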

@pyrmont
Contributor Author

pyrmont commented Jun 7, 2019

@jneen I feel like I've just been galaxy brained.

@jneen
Member

jneen commented Jun 7, 2019

lol maybe. it might be a bit much.

also make sure to update RougeVersion in rouge.jneen.net so it can preload everything before removing the global constant (and the files)

@pyrmont
Contributor Author

pyrmont commented Jul 23, 2019

I spent some time doing more experimentation today and I'm sorry to say that the 'don't save the block' approach didn't greatly reduce memory usage. It does reduce the total number of objects retained but only by a little and does not really reduce the number of objects allocated.

Here are the results from wrapping the state methods in the ABAP lexer.

Current Approach

Total allocated: 13.47 MB (131876 objects)
Total retained:  4.87 MB (34521 objects)

Unsaved Block Approach

Total allocated: 13.47 MB (131870 objects)
Total retained:  4.86 MB (34483 objects)

Other things I tried:

  1. putting lexing logic inside an always false if-statement (no real difference);
  2. putting lexing logic inside a method (no real difference);
  3. putting lexing logic inside a method behind an explicit return statement (no real difference).

Short of deferring parsing of the code, I can't see a way to substantially reduce the memory usage of the lexers.

@jneen
Member

jneen commented Jul 23, 2019

The last resort is to use a Rake task to generate an index, then use const_missing to dynamically load classes. In this case I would like to have a require 'rouge/all' option to load them all in, for applications that expect to use them all.
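
A bare-bones sketch of how the const_missing half of that could work (the index file name, its format and the Rake task that writes it are all invented here):

    # Look up unknown constants in a pre-generated index and require the
    # corresponding lexer file on demand.
    require 'yaml'

    module Rouge
      module Lexers
        # lexers_index.yml maps class names to lexer file paths and would be
        # generated ahead of time by a Rake task.
        INDEX = YAML.load_file(File.expand_path('lexers_index.yml', __dir__))

        def self.const_missing(name)
          path = INDEX[name.to_s]
          return super unless path
          require File.expand_path(path, __dir__)
          # Fall back to the usual NameError if the file didn't define it.
          const_defined?(name, false) ? const_get(name, false) : super
        end
      end
    end

A require 'rouge/all' entry point could then simply require every file listed in the index up front for applications that want everything loaded eagerly.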

@stale

stale bot commented Jul 22, 2020

This issue has been automatically marked as stale because it has not had any activity for more than a year. It will be closed if no additional activity occurs within the next 14 days.
If you would like this issue to remain open, please reply and let us know if the issue is still reproducible.

@stale stale bot added the stale-issue There has been no activity for a year. label Jul 22, 2020
@stale stale bot closed this as completed Aug 5, 2020