Is there a way to skip large files while indexing ? #1646

ChristopheBordieu · 2017-06-30T08:45:15Z

Hello,

Not a bug. Just a question.
When indexing, is there a way to skip files larger than X kB (or MB or GB)?

vladak · 2017-06-30T16:15:58Z

There is an enhancement #534 filed to track this. As a workaround, specify the files/pattern as ignored (-i).

ChristopheBordieu · 2017-06-30T21:23:13Z

OK, thanks for the feedback.
I cannot use the -i workaround. I need something completely automatic, not set via configuration.

tarzanek · 2017-07-02T05:24:40Z

@ChristopheBordieu FWIW, the 1.0 release fixes most of the big-file problems: ctags parsing is completely fixed, and for some languages we limit the length of tokens in the parser (https://github.com/OpenGrok/OpenGrok/blob/master/build.xml#L270).
I had hoped to add more languages to the token-size limit if needed, but I need to know which other languages have problems with long tokens (and overflow). If you have a list of problematic files, or can generate some stats (which file extension/analyzer fails), it would help tremendously. Please file a new bug with this list and I will make sure the respective analyzers get fixed ASAP.

ChristopheBordieu · 2017-07-03T19:42:43Z

Hi @tarzanek,
I am following all the issues you manage. Actually, the problem is not OpenGrok but my users :-D ! They put some very big files in their git repos. The use cases could be discussed forever... So, like Bitbucket does when indexing, I would be interested in a way to skip any file, whatever its type, larger than X kB.
Although the homepage display is ugly on our instance (cf. my ticket 1478), I will put 1.1-rc5 in production this week because of the newer Lucene, the file parser fix and plenty of other useful fixes... Then, if I encounter parsing issues in my log files, I will report them for sure.
Thanks!

tarzanek · 2017-07-03T20:09:17Z

So... it could be a good improvement.
We could perhaps add a simple option to the indexer that sets this limit... care to write up a patch?
I am willing to give pointers and guidance.

vladak · 2017-07-04T06:22:39Z

As I wrote in #534 this is not as simple as it looks.

ChristopheBordieu · 2017-07-04T10:38:31Z

I do not know Java... so do not wait for me for a patch :-) And it is not simple!

jhaber · 2020-04-14T14:58:06Z

Just wanted to chime in that we're seeing slow search performance after upgrading from a very old version of OpenGrok. We believe that it's caused by a few large JSON files (5-10MB). For certain terms, we're seeing search take a very long time or time out. However, when we filter out JSON files from the search it returns very quickly. We'll try to find a workaround in the meantime.

EDIT: We're on OpenGrok 1.3.8

vladak · 2020-04-14T17:22:10Z

Incidentally, file processing times could be part of the statistics (#579).

jhaber · 2020-04-14T17:44:59Z

In our case we don't care much about processing times since it's offline (up to a certain point of course 😄 ). However we care a lot about search speed since our engineers rely on interactive searching of OpenGrok and want it to be fast.

It seems like these big JSON files are causing searches to be at least an order of magnitude slower. Using a file path or file type filter to exclude JSON files makes searches snappy again.

We recently upgraded from OpenGrok 0.12.1.5, which didn't seem to have a JsonAnalyzer. So our other hunch is that JsonAnalyzer is now doing more sophisticated indexing of JSON files, which leads to poor search performance on big JSON files. Does that seem plausible?

In the short term we're going to try switching JSON files to use the PlainAnalyzer, and also ignore any JSON file over 100KB when indexing.

idodeclare · 2020-04-14T18:05:00Z

> We recently upgraded from OpenGrok 0.12.1.5, which didn't seem to have a JsonAnalyzer. So our other hunch is that JsonAnalyzer is now doing more sophisticated indexing of JSON files, which leads to poor search performance on big JSON files. Does that seem plausible?
>
> In the short term we're going to try switching JSON files to use the PlainAnalyzer, and also ignore any JSON file over 100KB when indexing.

My guess is the Lucene Unified Highlighter, which forces the full source to be read into memory. That will still be active even if you use PlainAnalyzer for JSON. See my write-up in #3097.

jhaber · 2020-04-15T21:19:20Z

> > We recently upgraded from OpenGrok 0.12.1.5, which didn't seem to have a JsonAnalyzer. So our other hunch is that JsonAnalyzer is now doing more sophisticated indexing of JSON files, which leads to poor search performance on big JSON files. Does that seem plausible?
> > In the short term we're going to try switching JSON files to use the PlainAnalyzer, and also ignore any JSON file over 100KB when indexing.
>
> My guess is the Lucene Unified Highlighter, which forces the full source to be read into memory. That will still be active even if you use PlainAnalyzer for JSON. See my write-up in #3097.

Seems like you were exactly right. Switching to PlainAnalyzer made no discernible improvement, but updating our indexing pipeline to skip JSON files >100KB took searches from 8 seconds to 30 milliseconds.
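
For anyone wanting to try the same kind of pre-indexing step, here is a minimal Python sketch. It is not jhaber's actual pipeline, which the thread does not show. The function names `find_large_files` and `ignore_args`, the 100 KiB threshold and the `.git` exclusion are all assumptions made for illustration. The only OpenGrok-specific part is the `-i` ignore option that vladak mentions above; the sketch passes base names to it, so check how your OpenGrok version matches `-i` patterns before relying on that. Note also that ignoring by base name would skip small files with the same name elsewhere in the tree.

```python
import os
from pathlib import Path
from typing import Iterable, Iterator, List, Optional

# Hypothetical threshold, taken from the ">100KB" figure mentioned in the thread.
MAX_BYTES = 100 * 1024


def find_large_files(root: str, max_bytes: int = MAX_BYTES,
                     suffixes: Optional[Iterable[str]] = None) -> Iterator[Path]:
    """Yield regular files under `root` whose size exceeds `max_bytes`.

    If `suffixes` is given (e.g. [".json"]), only files with one of those
    extensions are considered; otherwise every file type is checked.
    """
    wanted = {s.lower() for s in suffixes} if suffixes else None
    for dirpath, dirnames, filenames in os.walk(root):
        # Assumption: VCS metadata is not worth scanning; prune it in place.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            path = Path(dirpath) / name
            if wanted is not None and path.suffix.lower() not in wanted:
                continue
            try:
                # Skip symlinks and anything that is not a regular file.
                if path.is_symlink() or not path.is_file():
                    continue
                size = path.stat().st_size
            except OSError:
                # File vanished or is unreadable; leave it to the indexer.
                continue
            if size > max_bytes:
                yield path


def ignore_args(paths: Iterable[Path]) -> List[str]:
    """Turn paths into repeated `-i <name>` arguments (base names, deduplicated).

    Assumption: the indexer's `-i` option accepts a plain file name.
    """
    args: List[str] = []
    for name in sorted({p.name for p in paths}):
        args += ["-i", name]
    return args
```

A wrapper script could call `ignore_args(find_large_files("/path/to/src", suffixes=[".json"]))` and append the result to its usual indexer command line, or use the list of paths to exclude those files when staging the source tree. Either way the size check runs automatically on every indexing pass, so nothing has to be maintained by hand in the configuration.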

vladak added the question label Jun 30, 2017

vladak closed this as completed Jun 30, 2017

tarzanek mentioned this issue Apr 15, 2020

Alter handling of huge text files #3097

Open

idodeclare added a commit to idodeclare/OpenGrok that referenced this issue Oct 9, 2020

Fix oracle#534 Fix oracle#1646 Fix oracle#3097 : constrain huge text files

752a7ce

ChristopheBordieu mentioned this issue Mar 12, 2021

When indexing, ignore too large files #3478

Open
