Skip indexing issue attachments #31

Open
colinmollenhour opened this issue Oct 8, 2015 · 4 comments

@colinmollenhour

Issue attachments are potentially large and of a format that isn't trivially indexable (like JPG, PDF, etc). No reason to store these in elasticsearch.

@leobg

leobg commented Oct 8, 2015

But attachments could be Excel or Word documents, or some other readable format. PDFs are often readable too - it is priceless to have the search return results based on such file content!

@colinmollenhour
Author

Have you ever looked at an Excel or Word document in ASCII form? It is a binary format, so it is useless to elasticsearch without some sort of plain-text conversion.

@SteveDavis

While digging into why I currently can't get search results for anything other than plain text files, it became apparent that for a file such as a .jpg, the code will place the following into the index:

[image: screenshot of the indexed attachment document]

If you decode the pertinent content from the line:
"file": "dW5zdXBwb3J0ZWQ=\n"

using (for example) https://www.base64decode.org/

and passing in:
dW5zdXBwb3J0ZWQ=

you get back:
unsupported
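
The same check can be done locally rather than through a web site. A minimal Ruby sketch, assuming only the standard-library Base64 module (the variable names are illustrative and not taken from the plugin):

```ruby
require 'base64'

# Value of the "file" field as it appears in the index, including the
# trailing newline that Base64.encode64 appends.
indexed_value = "dW5zdXBwb3J0ZWQ=\n"

# decode64 ignores the trailing newline.
decoded = Base64.decode64(indexed_value)
puts decoded
```

This prints `unsupported`, the same result as the online decoder.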

So, the index will contain information about the file - its name, size, author etc - but because it's not one of a supported list of file types, the file content itself won't actually be loaded into the index.

In the plugin file redmine_elasticsearch\app\serializers\attachment_serializer.rb, the supported file extensions are listed as:

SUPPORTED_EXTENSIONS = %w{
  .doc .docx .htm .html .json .ods .odt .pdf .ppt .pptx .rb .rtf .sh .sql .txt .xls .xlsx .xml .yaml .yml
}

So, while it will index details about other formats of content (and I believe the 'tika' module is supposed to convert the otherwise unreadable binary content into something which can indeed be indexed - although I'm having problems with that in my environment at the moment), it won't put large volumes of pointless data into the index.

def file
  content = supported? ? File.read(object.diskfile) : UNSUPPORTED
  Base64.encode64(content)
end

The above code snippet, from the same .rb file, is where files which aren't in the list (there is also a list of MIME types) have their content replaced with the string 'unsupported' before the encoding is run.
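
To make that behaviour concrete, here is a rough, self-contained sketch of this kind of check. It is not the plugin's actual `supported?` implementation: the helper name `encoded_file_content`, the `SKETCH_` constants and the shortened extension list are assumptions for illustration only, and the MIME type check is left out.

```ruby
require 'base64'

# Shortened, illustrative list; the plugin's real list is longer.
SKETCH_SUPPORTED_EXTENSIONS = %w[.pdf .txt .docx].freeze
SKETCH_UNSUPPORTED = 'unsupported'

# Hypothetical helper: returns the Base64 payload that would be indexed
# for the file at +path+. Unsupported extensions never have their
# content read from disk.
def encoded_file_content(path)
  supported = SKETCH_SUPPORTED_EXTENSIONS.include?(File.extname(path).downcase)
  content = supported ? File.binread(path) : SKETCH_UNSUPPORTED
  Base64.encode64(content)
end
```

With this sketch, any path ending in `.jpg` yields `"dW5zdXBwb3J0ZWQ=\n"` regardless of the file's size, which matches the value seen in the index.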

So, feel free to correct me if I'm wrong (I'm certainly no expert in this area), but it looks to me like this is not really a relevant point: the plugin already works out what is likely to be usable (or not) and doesn't load large files of binary data into the index.

Assuming I'm correct, I think this item can probably be closed. Please do let me know.

Kind Regards - Steve

@colinmollenhour
Author

OK, so the situation is not as bad as I thought, but I'd still prefer to disable indexing of files completely, since at least in my case there will be no benefit from indexing attachments. Thanks for adding the details.
