
Is it possible to set document_id to avoid duplication of events? #156

Open

honestverkh opened this issue May 7, 2020 · 8 comments

@honestverkh

honestverkh commented May 7, 2020

I would like to deduplicate events coming from Logstash. This can be done by providing a unique id. In the standard Logstash Elasticsearch output this can be achieved by setting the document_id option; see https://www.elastic.co/guide/en/beats/filebeat/current/filebeat-deduplication.html#ls-doc-id

But I see no such option in the logstash-output-amazon_es plugin. Is there a way to pass the id to Elasticsearch somehow, or is it possible to achieve deduplication in another way?
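
For reference, the approach in the linked docs derives a stable id from event content with the fingerprint filter. A minimal sketch against the standard elasticsearch output (the source field and host are placeholders):

filter {
  # Hash the fields that uniquely identify an event into a @metadata field,
  # so the hash itself is not indexed with the document.
  fingerprint {
    source => ["message"]
    target => "[@metadata][fingerprint]"
    method => "SHA256"
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    # Re-sending the same event overwrites the existing document instead of
    # creating a duplicate.
    document_id => "%{[@metadata][fingerprint]}"
  }
}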

@fjlozanoacosta

fjlozanoacosta commented Jul 22, 2020

I'm looking for this exact same feature.

If I use action => "update" I get:

[ERROR][logstash.outputs.amazonelasticsearch][diagnosesFromDb][ec7508cd0a62677ac04a682c8ea1ce1b44c9599a6bdc490509912cb8629a2b4e] Encountered a retryable error. Will Retry with exponential backoff {:code=>400, :url=>"https://xxxxxx.ca-central-1.es.amazonaws.com:443/_bulk"}

I've been trying to set doc_as_upsert to true, but it still duplicates my data.

This is my config:

input {
  jdbc {
    jdbc_driver_library => "/usr/share/logstash/config/postgresql-42.2.14.jar"
    jdbc_driver_class => "org.postgresql.Driver"
    jdbc_connection_string => "jdbc:postgresql://${DB_HOST}:${DB_PORT}/${DB_NAME}"
    jdbc_user => "${DB_USER}"
    jdbc_password => "${DB_PASSWORD}"

    schedule => "* * * * * UTC"
    statement => 'some sql query'
  }
}

output {
  stdout { codec => rubydebug }
  amazon_es {
    hosts => ["XXX"]
    region => "ca-central-1"
    aws_access_key_id => "${AWS_ACCESS_KEY_ID}"
    aws_secret_access_key => "${AWS_SECRET_ACCESS_KEY}"
    index => "${ENV}_diagnoses"
    document_id => "%{diagnosis.id}"
    doc_as_upsert => true
    action => "update"
    max_bulk_bytes => 9999999
  }
}
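
One thing that might be worth double-checking here (an assumption about the data shape, not a confirmed fix): Logstash sprintf references use bracket syntax for nested fields, so %{diagnosis.id} only resolves if the event carries a top-level field literally named "diagnosis.id". If the id is nested under a diagnosis object, it would be:

    # Bracket syntax for a nested field; an unresolved reference is left as
    # the literal string, so every event would get the same document_id.
    document_id => "%{[diagnosis][id]}"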

@wmaroy

wmaroy commented Aug 5, 2020

Any updates on this issue?

@kasrutten

I am also currently blocked by this.

@EvaOyy

EvaOyy commented Aug 16, 2020

I am trying to deduplicate too! My way around this is to delete the index and recreate it :(

@fjlozanoacosta

I ended up not using this plugin; I'm using the elasticsearch output plugin instead, like so:

output {
  elasticsearch {
    hosts => "${ELASTICSEARCH_HOST_PORT}"
    user => "${ELASTIC_USERNAME}"
    password => "${ELASTIC_PASSWORD}"
    index => "index"
    document_id => "%{id}"
    doc_as_upsert => true
    action => "update"
    ssl => true
    ilm_enabled => false
  }
}

@sethcenterbar

I ended up not using this plugin; I'm using the elasticsearch output plugin instead, like so:

output {
  elasticsearch {
    hosts => "${ELASTICSEARCH_HOST_PORT}"
    user => "${ELASTIC_USERNAME}"
    password => "${ELASTIC_PASSWORD}"
    index => "index"
    document_id => "%{id}"
    doc_as_upsert => true
    action => "update"
    ssl => true
    ilm_enabled => false
  }
}

How did you set up the ELASTIC_USERNAME and password?

@sethcenterbar

sethcenterbar commented Sep 18, 2020

I would like to deduplicate events coming from Logstash. This can be done by providing a unique id. In the standard Logstash Elasticsearch output this can be achieved by setting the document_id option; see https://www.elastic.co/guide/en/beats/filebeat/current/filebeat-deduplication.html#ls-doc-id

But I see no such option in the logstash-output-amazon_es plugin. Is there a way to pass the id to Elasticsearch somehow, or is it possible to achieve deduplication in another way?

After being quite confused by this thread (and still questioning whether we were ever doing upserts correctly), I realized that document_id does actually work for de-duplication, at least in my case.

output {
    amazon_es {
        hosts => ["somehost"]
        region => "us-region-number"
        index => "some_index"
        document_id => "%{some_id}"
    }
}
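
For events that don't carry a natural %{some_id}, the fingerprint approach sketched in the issue description should combine with this (a sketch; I haven't verified it against the amazon_es plugin):

output {
    amazon_es {
        hosts => ["somehost"]
        region => "us-region-number"
        index => "some_index"
        # Content-derived id produced by a fingerprint filter; @metadata
        # fields are not indexed with the document.
        document_id => "%{[@metadata][fingerprint]}"
    }
}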

@fjlozanoacosta

fjlozanoacosta commented Sep 22, 2020

Yeah document_id does work for de-duplication.

For the username and password I believe we're using Amazon Cognito Authentication.
