Plugins::Attachments: Add an attachments plugin (support parsing various file formats) #92

kimchy · 2010-03-29T10:00:08Z

Using the new plugins system, implement the attachments plugin. It adds a mapping type called attachment, which accepts the binary content of an attachment (base64 encoded) to index.

Installation is simple: download the plugin zip file and place it under the plugins directory within the installation. When building from source, the plugin will be under build/distributions/plugins. Once the plugin is in place, the attachment mapper type is supported automatically.

Using the attachment type is simple: in your mapping JSON, map a JSON element as attachment, for example:

{
    "person" : {
        "properties" : {
            "myAttachment" : { "type" : "attachment" }
        }
    }
}

In this case, the JSON to index can be:

{
    "myAttachment" : "... base64 encoded attachment ..."
}
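
As a minimal client-side sketch (not part of the plugin itself), the document above could be built like this. The helper name build_attachment_doc is hypothetical, and the field name myAttachment is taken from the mapping example; how the resulting JSON is sent to the index is left out.

```python
import base64
import json


def build_attachment_doc(raw_bytes: bytes, field: str = "myAttachment") -> str:
    """Return a JSON document whose single field holds the base64 text of raw_bytes.

    Hypothetical helper: the field name must match the one mapped as
    type "attachment" in the index mapping.
    """
    # base64.b64encode returns bytes; JSON needs a string, so decode to ASCII.
    encoded = base64.b64encode(raw_bytes).decode("ascii")
    return json.dumps({field: encoded})


# Example: encode some in-memory bytes standing in for a real file's content.
doc = build_attachment_doc(b"hello attachment")
```

For a real file, the bytes would be read in binary mode (open(path, "rb")) before being passed to the helper.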

The attachment type not only indexes the content of the document, but also automatically adds metadata about the attachment (when available). The supported metadata fields are: date, title, author, and keywords. They can be queried using the "dot notation", for example: myAttachment.author.

Both the metadata and the actual content are simple core type mappers (string, date, ...), so they can be controlled in the mappings. For example:

{
    "person" : {
        "properties" : {
            "file" : {
                "type" : "attachment",
                "fields" : {
                    "file" : { "index" : "no" },
                    "date" : { "store" : "yes" },
                    "author" : { "analyzer" : "myAnalyzer" }
                }
            }
        }
    }
}

In the above example, the actual content indexed is mapped under the field name file, and we decide not to index it, so it will only be available in the _all field. The other fields map to their respective metadata names, but there is no need to specify the type (like string or date) since it is already known.

The plugin uses Apache Tika (http://lucene.apache.org/tika/) to parse attachments, so many formats are supported; they are listed here: http://lucene.apache.org/tika/0.6/formats.html.

kimchy · 2010-03-29T10:01:42Z

Implemented.

lukas-vlcek · 2010-04-05T21:55:58Z

I'm not a Tika expert, but it seems that Tika supports documents that contain nested documents (at the time of writing this is used when extracting content from archive files: zip, tar, etc.). This could also be customized and used in other use cases (like parsing large mbox files, see http://markmail.org/message/h47lnpxtmdskmest ). Does the ES integration take this into account? Note that when extracting data from archives, individual documents are separated only by DIV tags with a specific class. Looking at the current ES implementation, it seems that all nested documents are simply merged into one output document (parsedContent = tika().parseToString(new FastByteArrayInputStream(content), metadata)). Is there any way to customize this?
What I would love to see is an option to extract the data from the archive first, split it into individual documents, and then parse the individual documents in parallel.

kimchy · 2010-04-05T23:20:32Z

Yea, archives are not really meant to be supported currently. This is for the simple reason that archives are usually very large, and it does not make sense to send them in a single HTTP request.

One option is to do the parsing on the client side, and feed elasticsearch with the documents. Another option is for the plugin to expose a streaming endpoint, that will parse and generate several documents out of the compound stream.
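
One possible reading of the first option, sketched under stated assumptions: the client splits a zip archive into its entries and prepares one document per entry, each still carrying base64 content for the attachment type to parse. The function split_zip_into_docs and the "name" field are hypothetical, and myAttachment is the field name from the mapping example earlier in this issue; sending the documents to elasticsearch is not shown.

```python
import base64
import io
import zipfile


def split_zip_into_docs(archive_bytes: bytes, field: str = "myAttachment") -> list:
    """Split a zip archive into one document per file entry (hypothetical helper).

    Each document keeps the entry name under "name" (an illustrative field,
    not something the plugin defines) and the base64 content under `field`.
    """
    docs = []
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                # Directory entries carry no content to index.
                continue
            payload = archive.read(info)
            docs.append({
                "name": info.filename,
                field: base64.b64encode(payload).decode("ascii"),
            })
    return docs
```

Each returned document is small enough to be indexed in its own request, which avoids sending the whole archive in a single HTTP call.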

bluelu mentioned this issue Dec 11, 2014

Node is not responsive after the end of a big merge for close to 10 minutes #8905

Closed

makeyang mentioned this issue Sep 7, 2015

es 0.90.2 plus jdk6.0_25-b06 crashed on production #13368

Closed

ClaudioMFreitas pushed a commit to ClaudioMFreitas/elasticsearch-1 that referenced this issue Nov 12, 2019

Merge pull request elastic#92 from gingerwizard/master

e62af4a

Gem Fix

billhong-just mentioned this issue Jan 25, 2021

jre crash when org.apache.lucene.codecs.DocValuesConsumer$SortedNumericDocValuesSub.nextDoc() #67882

Closed

cbuescher pushed a commit to cbuescher/elasticsearch that referenced this issue Oct 2, 2023

Common ssh setup for benchmark team

c899089

Commit instructions and setup for ssh access to common servers for servers needed for benchmark work. Relates elastic#92

This was referenced Oct 17, 2023

Deserialization of category context: wrong variant instance elastic/elasticsearch-java#691

Open

Sorting across the whole data set doesn't work when using a point-in-time search with slicing #101096

Open

This issue was closed.
