Plugins::Attachments: Add an attachments plugin (support parsing various file formats) #92
Comments
Implemented.
I'm not a Tika expert, but it seems that Tika supports documents containing nested documents (as of this writing, this is used when extracting content from archive files: zip, tar, etc.). This could also be customized and used in other cases (like parsing large mbox files, see http://markmail.org/message/h47lnpxtmdskmest ). Does the ES integration take this into account? Note that when extracting data from archives, the individual documents are separated only by DIV tags with a specific class. Looking at the current ES implementation, it seems that all nested documents are simply merged into one output document (parsedContent = tika().parseToString(new FastByteArrayInputStream(content), metadata)). Is there any way to customize this?
Yeah, archives are not really meant to be supported currently. This is for the simple reason that archives are usually very large, and it does not make sense to send them in a single HTTP request. One option is to do the parsing on the client side and feed elasticsearch with the documents. Another option is for the plugin to expose a streaming endpoint that will parse and generate several documents out of the compound stream.
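The client-side option could be sketched roughly as follows in Python. This is only an illustration under assumptions: the helper names `archive_to_documents` and `build_demo_archive` are hypothetical, the field name `myAttachment` is borrowed from the issue description, and actually sending each document to elasticsearch is left out.

```python
import base64
import io
import json
import zipfile


def archive_to_documents(archive_bytes, field="myAttachment"):
    """Yield (entry_name, json_document) pairs, one per file in a zip archive.

    Each document carries the entry's bytes as base64 under `field`, so every
    archive member can be indexed as its own attachment instead of being
    merged into a single document.
    """
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directories carry no content to index
            payload = base64.b64encode(archive.read(info)).decode("ascii")
            yield info.filename, json.dumps({field: payload})


def build_demo_archive():
    """Build a small in-memory zip, used only to demonstrate the helper."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("a.txt", "first document")
        archive.writestr("docs/b.txt", "second document")
    return buffer.getvalue()


if __name__ == "__main__":
    for name, document in archive_to_documents(build_demo_archive()):
        print(name, document)
```

Each yielded document would then be indexed with its own request, which avoids sending the whole archive in a single HTTP call.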
Commit instructions and setup for ssh access to common servers for servers needed for benchmark work. Relates elastic#92
Using the new plugins system, implement the `attachments` plugin, which adds a mapping type called `attachment` that accepts a binary input (base64) of an attachment to index.

Installation is simple: just download the plugin zip file and place it under the `plugins` directory within the installation. When building from source, the plugin will be under `build/distributions/plugins`. Once placed in the installation, the `attachment` mapper type will be automatically supported.

Using the `attachment` type is simple: in your mapping JSON, simply map a certain JSON element as `attachment`, for example:

In this case, the JSON to index can be:
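As a rough sketch of both pieces, the Python below builds a mapping that marks one element as `attachment` and a matching document whose value is base64-encoded. The type name `person` and the helper names are hypothetical, and the exact mapping layout is an assumption rather than something the issue text confirms. The field name `myAttachment` is the one used in the description.

```python
import base64
import json


def attachment_mapping(type_name="person", field="myAttachment"):
    """Return a mapping dict that declares `field` as an attachment.

    `type_name` is a hypothetical index type used only for illustration.
    """
    return {type_name: {"properties": {field: {"type": "attachment"}}}}


def attachment_document(raw_bytes, field="myAttachment"):
    """Return a document whose `field` holds the base64 text of `raw_bytes`."""
    return {field: base64.b64encode(raw_bytes).decode("ascii")}


if __name__ == "__main__":
    print(json.dumps(attachment_mapping(), indent=2))
    print(json.dumps(attachment_document(b"hello attachment"), indent=2))
```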
The `attachment` type not only indexes the content of the doc, but also automatically adds metadata on the attachment (when available). The metadata supported are: `date`, `title`, `author`, and `keywords`. They can be queried using the "dot notation", for example: `myAttachment.author`.

Both the metadata and the actual content are simple core type mappers (`string`, `date`, ...), and thus they can be controlled in the mappings. For example:

In the above example, the actual content indexed is mapped under the `fields` name `file`, and we decide not to index it, so it will only be available in the `_all` field. The other `fields` map to their respective metadata names, but there is no need to specify the `type` (like `string` or `date`) since it is already known.

The plugin uses Apache Tika (http://lucene.apache.org/tika/) to parse the attachment, so many formats are supported, listed here: http://lucene.apache.org/tika/0.6/formats.html.
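The field-level control described above (content under `file`, not indexed, with metadata fields configured without a `type`) might look like the following Python sketch. The type name `person`, the `"index": "no"` spelling, and the `store` and `analyzer` settings (including the analyzer name `myAnalyzer`) are illustrative assumptions, not options confirmed by the issue text.

```python
import json


def attachment_fields_mapping(type_name="person", field="myAttachment"):
    """Return a mapping that tunes the sub-fields of an attachment.

    Assumed settings, for illustration only:
    - "file" holds the parsed content and is not indexed on its own
    - "date" and "author" carry hypothetical per-field options
    No "type" is given for the sub-fields since the plugin already knows it.
    """
    return {
        type_name: {
            "properties": {
                field: {
                    "type": "attachment",
                    "fields": {
                        "file": {"index": "no"},
                        "date": {"store": "yes"},
                        "author": {"analyzer": "myAnalyzer"},
                    },
                }
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(attachment_fields_mapping(), indent=2))
```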