Plugins::Attachments: Add an attachments plugin (support parsing various file formats) #92

Closed
kimchy opened this issue Mar 29, 2010 · 3 comments

Comments

@kimchy
Member

kimchy commented Mar 29, 2010

Using the new plugins system, implement the attachments plugin. It adds a mapping type called attachment, which accepts the binary content of an attachment (base64 encoded) to index.

Installation is simple: download the plugin zip file and place it under the plugins directory within the installation. When building from source, the plugin will be under build/distributions/plugins. Once it is in place, the attachment mapper type is supported automatically.

Using the attachment type is simple: in your mapping JSON, mark a JSON element as attachment, for example:

{
    "person" : {
        "properties" : {
            "myAttachment" : { "type" : "attachment" }
        }
    }
}

In this case, the JSON to index can be:

{
    "myAttachment" : "... base64 encoded attachment ..."
}
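
As a rough illustration of producing that document on the client, the sketch below base64-encodes raw file bytes and wraps them in the JSON shown above. The helper name `build_attachment_doc` and the sample bytes are hypothetical; only the `myAttachment` field name comes from the mapping above.

```python
import base64
import json


def build_attachment_doc(raw_bytes, field_name="myAttachment"):
    """Return the JSON document to index for one attachment.

    build_attachment_doc is a hypothetical helper, not part of the plugin.
    """
    # base64 output is plain ASCII, so it can be embedded in a JSON string as-is.
    encoded = base64.b64encode(raw_bytes).decode("ascii")
    return json.dumps({field_name: encoded})


# Sample bytes stand in for the contents of a real file (PDF, DOC, ...).
doc = build_attachment_doc(b"hello attachment")
print(doc)
```

In practice the bytes would come from reading the file in binary mode before passing them to the helper.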

The attachment type not only indexes the content of the doc but also automatically adds metadata about the attachment (when available). The supported metadata fields are: date, title, author, and keywords. They can be queried using "dot notation", for example: myAttachment.author.
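
To make the dot notation concrete, here is a minimal sketch that derives the dotted path for each metadata field and builds a query body against one of them. The helper names are hypothetical, and the `{"query": {"term": ...}}` shape is an assumption about the query DSL rather than something stated in this issue; only the dotted field names come from the description above.

```python
import json

# Metadata names listed in the description above.
METADATA_FIELDS = ("date", "title", "author", "keywords")


def metadata_paths(attachment_field):
    """Return the dot-notation path for each metadata field."""
    return ["%s.%s" % (attachment_field, name) for name in METADATA_FIELDS]


def term_query(attachment_field, metadata_name, value):
    """Build a query body targeting one metadata field.

    The {"query": {"term": ...}} shape is an assumption about the query
    DSL; only the dotted field name comes from the description above.
    """
    if metadata_name not in METADATA_FIELDS:
        raise ValueError("unsupported metadata field: %s" % metadata_name)
    path = "%s.%s" % (attachment_field, metadata_name)
    return {"query": {"term": {path: value}}}


print(metadata_paths("myAttachment"))
print(json.dumps(term_query("myAttachment", "author", "kimchy")))
```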

Both the metadata and the actual content are simple core type mappers (string, date, ...), so they can be controlled in the mappings. For example:

{
    "person" : {
        "properties" : {
            "file" : {
                "type" : "attachment",
                "fields" : {
                    "file" : { "index" : "no" },
                    "date" : { "store" : "yes" },
                    "author" : { "analyzer" : "myAnalyzer" }
                }
            }
        }
    }
}

In the above example, the actual content indexed is mapped under the field name file, and we decide not to index it, so it will only be available in the _all field. The other fields map to their respective metadata names, and there is no need to specify their type (such as string or date) since it is already known.

The plugin uses Apache Tika (http://lucene.apache.org/tika/) to parse attachments, so many formats are supported; they are listed here: http://lucene.apache.org/tika/0.6/formats.html.

@kimchy
Member Author

kimchy commented Mar 29, 2010

Implemented.

@lukas-vlcek
Contributor

I'm not a Tika expert, but it seems that Tika has some support for documents that contain nested documents (as of this writing, this is used when extracting content from archive files: zip, tar, etc.). This could also be customized and used in other cases (like parsing large mbox files, see http://markmail.org/message/h47lnpxtmdskmest ). Does the ES integration take this into account? Note that when extracting data from archives, the individual documents are separated only by DIV tags with a specific class. Looking at the current ES implementation, it seems that all nested documents are simply merged into one output document (parsedContent = tika().parseToString(new FastByteArrayInputStream(content), metadata)). Is there any way to customize this?
What I would love to see is an option to extract the data from the archive first, split it into individual documents, and then parse the individual documents in parallel.

@kimchy
Member Author

kimchy commented Apr 5, 2010

Yea, archives are not really meant to be supported currently. This is for the simple reason that archives are usually very large, and it does not make sense to send them in a single HTTP request.

One option is to do the parsing on the client side and feed elasticsearch the resulting documents. Another option is for the plugin to expose a streaming endpoint that parses the compound stream and generates several documents from it.
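
A minimal sketch of one variant of the client-side option: split a zip archive on the client and emit one attachment document per entry, so that each entry can be indexed (and parsed by the attachment type) separately. It does not run Tika on the client and does not send anything over HTTP. The function name `split_archive` and the `myAttachment` field default are hypothetical, and the in-memory archive is sample data.

```python
import base64
import io
import zipfile


def split_archive(archive_bytes, field_name="myAttachment"):
    """Yield (entry_name, document) pairs, one per file in a zip archive."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directories carry no content to index
            encoded = base64.b64encode(archive.read(info)).decode("ascii")
            yield info.filename, {field_name: encoded}


# Build a small in-memory archive so the sketch runs without external files.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("a.txt", "first document")
    archive.writestr("b.txt", "second document")

for name, doc in split_archive(buffer.getvalue()):
    print(name, doc)
```

Because each entry becomes its own small document, the requests stay small and can be sent independently or in parallel by the client.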

This issue was closed.