Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Registered domain processor #67611

Merged
merged 14 commits into from
Apr 14, 2021

Conversation

andrewstucki
Copy link

@andrewstucki andrewstucki commented Jan 15, 2021

This PR adds support for a processor that looks up the registered domain, eTLD from the public suffix list, and uses those to split out the subdomain as well. It's essentially a port of the beats processor.

~ curl -H "Content-Type: application/json" -X POST -u elastic:password http://localhost:9200/_ingest/pipeline/_simulate\?verbose --data-binary @- << EOF
{
  "pipeline": {
    "processors": [
      {
        "registered_domain": {
          "field": "url",
          "target_field": "url_registered_domain"
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "url": "www.google.com"
      }
    }
  ]
}

Produces:

...
"_source" : {
  "url_registered_domain" : {
    "subdomain": "www",
    "registered_domain": "google.com",
    "top_level_domain": "com"
  }
  "url": "www.google.com"
}
...

@andrewstucki andrewstucki added >feature :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP v8.0.0 Team:Data Management Meta label for data/management team v7.12.0 labels Jan 15, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-core-features (Team:Core/Features)

@danhermann
Copy link
Contributor

@andrewstucki, I will review this over the next week or so. There is an existing issue open for this processor (#57476) and I've put a non-trivial amount of time into it, so it would have been good to avoid the duplication of effort on this one.

@andrewstucki
Copy link
Author

@danhermann sounds good, I'd be happy to close this in favor of your processor if it's more complete, just figured I'd throw together a Friday afternoon implementation 😄

Just let me know.

@danhermann
Copy link
Contributor

@danhermann sounds good, I'd be happy to close this in favor of your processor if it's more complete, just figured I'd throw together a Friday afternoon implementation 😄

Just let me know.

I'll review it and see what it looks like. There were some complexities that I encountered in the implementation of this processor and you might have avoided those.

@mark-vieira
Copy link
Contributor

@elasticmachine update branch

@danhermann
Copy link
Contributor

@andrewstucki, I've given this a brief look and I think your approach will work. There have been some code organization changes since you submitted your PR so the classes will need to be moved into the ingest.common package. If you can make that change, I will continue with the review.

@P1llus
Copy link
Member

P1llus commented Mar 9, 2021

Just adding in a comment from me that this is a dependency for us to clear some of our tasks on https://github.com/elastic/security-team/issues/791 that is scheduled for 7.13, so it would be great once it has been fixed, if it could get some priority to be merged to 7.13 if possible.

@andrewstucki andrewstucki force-pushed the registered-domain-processor branch from d8b5659 to 05ebe9b Compare March 23, 2021 17:02
@andrewstucki
Copy link
Author

@danhermann we should be good to go on this now.

Copy link
Contributor

@danhermann danhermann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andrewstucki, this looks pretty good. I've left some comments and questions below. Also, I'm no expert on this, but I've seen some mention of "public suffix" being the preferred synonym for "eTLD". Do you know if that's the case?

@andrewstucki
Copy link
Author

@danhermann so, "public suffix" and "eTLD" are synonymous, I'm not really sure if there's a preferred usage or not, but as our ECS docs explicitly refer to "effective top level domain"s, I figured we should opt to keep things consistent.

@danhermann
Copy link
Contributor

@elasticmachine update branch

Copy link
Contributor

@danhermann danhermann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andrewstucki, just two minor changes below and we can get this merged.

@danhermann
Copy link
Contributor

@andrewstucki, thanks for all the work on this. I'm going to merge it and get it backported for the next release.

@danhermann
Copy link
Contributor

@andrewstucki, I'm just now noticing that domain is simply the input value. Is that as intended?

@andrewstucki
Copy link
Author

andrewstucki commented Apr 14, 2021

@danhermann yep, domain is just a copy of the data that was parsed. The general idea was say you had something that was like:

{
  "proprietary": {
    "field": "www.google.com"
  }
}

assuming you configured the processor to look at this field and fill in url, you'd get the following:

{
  "url": {
    "domain": "www.google.com",
    "registered_domain": "google.com",
    ...
  }
}

that way it fills in everything it can, including https://www.elastic.co/guide/en/ecs/current/ecs-url.html#field-url-domain


Edit: also, if you are using say url.domain as the source field, it'll overwrite it with the same value, so it should be fine to use in either case.

@danhermann
Copy link
Contributor

cc: @elastic/es-ui in case auto-complete needs to be updated to accommodate this new processor.

@alisonelizabeth
Copy link
Contributor

Thanks for the heads up @danhermann! I opened elastic/kibana#97465 to add support in Kibana.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP >feature Team:Data Management Meta label for data/management team v7.13.0 v8.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants