Skip indexing issue attachments #31

colinmollenhour · 2015-10-08T14:32:44Z

Issue attachments are potentially large and often in a format that isn't trivially indexable (like JPG, PDF, etc). There is no reason to store these in Elasticsearch.

leobg · 2015-10-08T16:23:19Z

But attachments could be Excel or Word documents, or other readable formats. PDFs are often readable too - it is priceless to have the search render results based on such file content!

colinmollenhour · 2015-10-09T08:08:00Z

Have you ever looked at an Excel or Word document in ASCII form? It is a binary format, so it is useless to Elasticsearch without some sort of plain-text conversion.

SteveDavis · 2015-10-24T23:31:00Z

In my digging around while trying to resolve the issue where I can't currently get a search result on anything other than a plain text file, it's become apparent that for a file such as a .jpg file the code will place the following into the index:

If you decode the pertinent content from the line:
"file": "dW5zdXBwb3J0ZWQ=\n"

using (for example) https://www.base64decode.org/

and passing in:
dW5zdXBwb3J0ZWQ=

you get back:
unsupported
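
For anyone who would rather check this locally than paste into a website, here is a minimal Ruby sketch of the same decoding step. It assumes only the "file" value quoted above and Ruby's standard Base64 module; the variable names are mine and are not taken from the plugin.

```ruby
require "base64"

# Value copied from the "file" field quoted above, including the trailing
# newline that Base64.encode64 appends to its output.
encoded = "dW5zdXBwb3J0ZWQ=\n"

# decode64 is lenient about the trailing newline.
decoded = Base64.decode64(encoded)
puts decoded
```

Running it the other way round, Base64.encode64("unsupported") gives back the same string that is stored in the index.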

So, the index will contain information about the file - its name, size, author etc - but because it's not one of a supported list of file types, the file content itself won't actually be loaded into the index.

In the plugin file redmine_elasticsearch\app\serializers\attachment_serlializer.rb, the supported list of file extensions is listed as:

SUPPORTED_EXTENSIONS = %w{
.doc .docx .htm .html .json .ods .odt .pdf .ppt .pptx .rb .rtf .sh .sql .txt .xls .xlsx .xml .yaml .yml
}

So, while it will index details about other formats of content (and I believe the 'tika' module is supposed to convert the otherwise unreadable binary content into something which can indeed be indexed, although I'm having problems with that in my environment at the moment), it won't put large volumes of pointless data into the index.

    def file
      content = supported? ? File.read(object.diskfile) : UNSUPPORTED
      Base64.encode64(content)
    end

The above code snippet from the same .rb file is where files which aren't in the list (and also a list of mime types) will have their content replaced with the string 'unsupported' prior to running the encoding.
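
To make the gating logic easier to see in isolation, here is a rough standalone sketch of it. This is not the plugin's code: supported_extension? and encoded_content are hypothetical helper names, the check here looks at the file extension only (the plugin's own supported? apparently also consults a list of mime types), and the case-insensitive comparison is my assumption.

```ruby
require "base64"

# Extension list copied from the plugin file quoted above.
SUPPORTED_EXTENSIONS = %w[
  .doc .docx .htm .html .json .ods .odt .pdf .ppt .pptx
  .rb .rtf .sh .sql .txt .xls .xlsx .xml .yaml .yml
].freeze

# Placeholder stored instead of the real content (matches the decoded value).
UNSUPPORTED = "unsupported"

# Hypothetical helper: extension-only check. Downcasing is an assumption,
# not something confirmed from the plugin source.
def supported_extension?(filename)
  SUPPORTED_EXTENSIONS.include?(File.extname(filename).downcase)
end

# Hypothetical helper: returns what would end up in the "file" field
# for the given filename and raw file bytes.
def encoded_content(filename, raw_bytes)
  content = supported_extension?(filename) ? raw_bytes : UNSUPPORTED
  Base64.encode64(content)
end

puts encoded_content("photo.jpg", "\xFF\xD8".b)
puts encoded_content("notes.txt", "hello")
```

With this sketch, a .jpg of any size produces the same short placeholder, while a .txt file has its actual bytes encoded.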

So, feel free to correct me if I'm wrong (I'm certainly no expert in this area), but it looks to me like this is not really a relevant point: the plugin already works out what is likely to be usable (or not) and doesn't load up large files of binary data into the index.

Assuming I'm correct, I think this item can probably be closed. Please do let me know.

Kind Regards - Steve

colinmollenhour · 2015-10-26T00:04:55Z

OK, so the situation is not as bad as I thought, but I'd still prefer to disable indexing of files completely, as in my case at least there will be no benefit from indexing attachments. Thanks for adding the details.

nodecarter mentioned this issue Oct 16, 2015

Document Parsers #22

Open
