Add NOAA benchmark #30

Merged
merged 1 commit into from
Jul 4, 2017

Conversation

martijnvg
Member

This now benchmarks range fields specifically, but it can also be used to benchmark other numeric query/agg operations.

Contributor

jpountz left a comment

I left some comments but I'm glad we are getting a benchmark that has range fields.

"description": "Indexes the whole document corpus using Elasticsearch default settings. We only adjust the number of replicas as we benchmark a single node cluster and Rally will only start the benchmark if the cluster turns green and we want to ensure that we don't use the query cache. Document ids are unique so all index operations are append only. After that a couple of queries are run.",
"default": true,
"index-settings": {
"index.number_of_shards": 1,
Contributor

👍

"ASN00003105",
"ASN00003100",
"ASN00004083"
]
Contributor

can we have a simple term query like the disjunction has? Otherwise if there is a change in performance in that query, it might not be obvious whether it is related to the terms query or to the range?

Contributor

if we want to benchmark both the point and the doc values query, it might also help to have one conjunction with a range that matches most documents and a term query that matches between 0.1 and 1% of the index, and another conjunction where the range matches 2x fewer documents than the term query.
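
To make the suggestion concrete, here is a minimal sketch of the two conjunction shapes as Elasticsearch query bodies built in Python. The field names `temperature_range` and `station_id` and the range bounds are assumptions for illustration, not taken from the track; only the station code appears in this PR.

```python
import json

# Hypothetical field names; the track's real mapping may differ.
RANGE_FIELD = "temperature_range"  # assumed range-typed field
STATION_FIELD = "station_id"       # assumed keyword field


def conjunction_query(station, low, high):
    """Build a bool query whose two filter clauses must both match."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {STATION_FIELD: station}},
                    {"range": {RANGE_FIELD: {"gte": low, "lte": high}}},
                ]
            }
        }
    }


# A deliberately wide range (assumed to match most documents)
# paired with a selective term clause.
wide = conjunction_query("ASN00003105", -100, 100)
# A much narrower range (assumed to match few documents).
narrow = conjunction_query("ASN00003105", 20, 25)

print(json.dumps(wide, indent=2))
```

Whether these bounds actually hit the intended selectivities depends on the data, so they would need to be checked against the index.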

Member Author

can we have a simple term query like the disjunction has?

I think you missed the range_query_range_field_in_conjunction_with_term_query query above this one?

it might also help to have one conjunction with a range that matches most documents and a term query that matches between 0.1 and 1% of the index,

A weather station in this data set has at most 366 documents, which is 0.014% of the total number of documents. So I think the 0.1 and 1% case is covered.

What query could be used for matching most of the docs, that on its own doesn't have a lot of overhead that could interfere with the benchmark? A term range? match_all?

Member Author

A weather station in this data set has at most 366 documents, which is 0.014% of the total number of documents. So I think the 0.1 and 1% case is covered.

Arg, I made a mistake. A simple term query for a weather station matches 0.003% of the documents. The terms query matches 5856 documents, which is 0.05%. So what I'll do is increase the number of terms in the terms query to get to at least 0.1%.
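
The arithmetic behind these percentages can be checked with a short sketch. It assumes the total of 10,914,068 documents quoted elsewhere in this review, and it treats 366 documents per station as an upper bound, so the station count it prints is a lower bound on how many terms are needed.

```python
import math

TOTAL_DOCS = 10_914_068   # document count quoted elsewhere in this review
DOCS_PER_STATION = 366    # upper bound per station, from the comment above
TERMS_QUERY_HITS = 5_856  # hits reported for the current terms query


def percent_of_index(hits, total=TOTAL_DOCS):
    """Share of the index matched by a query, in percent."""
    return 100.0 * hits / total


single_station = percent_of_index(DOCS_PER_STATION)
terms_query = percent_of_index(TERMS_QUERY_HITS)

# Hits required to reach 0.1% of the index (one thousandth), rounded up.
target_hits = math.ceil(TOTAL_DOCS / 1000)
# Lower bound on stations, if every station had the full 366 documents.
stations_needed = math.ceil(target_hits / DOCS_PER_STATION)

print(f"single station: {single_station:.3f}%")  # 0.003%
print(f"terms query:    {terms_query:.2f}%")     # 0.05%
print(f"stations needed for 0.1%: at least {stations_needed}")  # 30
```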

Contributor

I'm a bit worried that the overhead of merging postings of multiple terms will add noise. Maybe we could cross this dataset with stations (ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/ghcnd-stations.txt) in order to be able to index more metadata with all documents such as geo coordinates, state and elevation of the station. Then I believe we could find some states that have significant numbers of records?
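
A rough sketch of such a join, assuming ghcnd-stations.txt is a fixed-width file with the station id in columns 1-11, latitude in 13-20, longitude in 22-30, elevation in 32-37, state in 39-40 and name in 42-71. That layout should be verified against the dataset's readme. The sample record, the `station_id` key and both helper names are made up for illustration.

```python
def parse_station_line(line):
    """Slice one fixed-width record into a metadata dict (assumed layout)."""
    return {
        "id": line[0:11].strip(),
        "latitude": float(line[12:20]),
        "longitude": float(line[21:30]),
        "elevation": float(line[31:37]),
        "state": line[38:40].strip() or None,  # blank for many stations
        "name": line[41:71].strip(),
    }


def enrich(measurement, stations):
    """Return a copy of the measurement with station metadata attached."""
    enriched = dict(measurement)
    meta = stations.get(measurement["station_id"])
    if meta is not None:
        enriched["station"] = meta
    return enriched


# Made-up record, padded to the assumed column widths.
sample = (
    f"{'ASN00003105':<11} {-17.95:8.4f} {122.25:9.4f} "
    f"{9.0:6.1f} {'':2} {'EXAMPLE STATION':<30}"
)
parsed = parse_station_line(sample)
stations = {parsed["id"]: parsed}

doc = enrich({"station_id": "ASN00003105", "date": "20170101"}, stations)
print(doc)
```

With a state or country attached to every document, a single term query on that field could then be picked to match the desired share of the index.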

Member Author

👍 I'll add more metadata to the documents.

++ @martijnvg I can update my python script to do this if you want?

Member Author

@colings86 Thanks, that would be great. Note that for creating this track I did make some modifications to your script, mainly around the fact that the output needs to be converted to a json file. This is what I have now: https://gist.github.com/martijnvg/72a3711cb26fd84f196e9a1c4a41d038


{
"short-description": "Daily weather measurement summaries from around the globe.",
"description": "Indexes 10M+ weather measurement summaries from NOAA.",
Contributor

maybe document where the data was retrieved?

Member Author

I added a link in the README, I think that is sufficient?

Contributor

oh right, I missed it!

Contributor

totally

Member

The comment says 10M+ weather measurements but it's actually only 2.5M.

Member Author

@danielmitterdorfer The doc count is actually 10914068, so I'll just update it to that. I would have expected Rally to fail with an error, because the document-count in track.json was incorrect.

Member

danielmitterdorfer Jul 4, 2017

Oh Rally does not count the documents again but I may add this feature. I've just raised elastic/rally#296.
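
For illustration only, such a check could look like the sketch below. This is not Rally's API or the implementation tracked in the linked issue. It assumes the corpus holds one JSON document per line, and both function names are hypothetical.

```python
import io


def count_documents(lines):
    """Count non-empty lines, assuming one JSON document per line."""
    return sum(1 for line in lines if line.strip())


def check_document_count(lines, declared):
    """Raise if the corpus size differs from the declared document-count."""
    actual = count_documents(lines)
    if actual != declared:
        raise ValueError(
            f"document-count is {declared} but the corpus has {actual} documents"
        )
    return actual


# Tiny in-memory corpus standing in for the real documents file.
corpus = io.StringIO('{"a": 1}\n{"a": 2}\n{"a": 3}\n')
print(check_document_count(corpus, declared=3))
```

For a real corpus, an open file handle can be passed in place of the `StringIO` object, since both iterate line by line.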

Member

danielmitterdorfer left a comment

Thanks for contributing the track! I left a few minor comments.

noaa/README.txt Outdated
Dataset containing daily weather measurement from NOAA:
ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/

The dataset has been processed by: https://gist.github.com/colings86/078e85a1131324471f4f10c73570d678
Member

Can you just compress the zip file from the gist and dump it here so it is self-contained? Also, the gist contains instructions, especially:

Sort files using something like sort --field-separator=',' --key=1,2 -o ~/Downloads/2017-sorted.csv ~/Downloads/2017.csv

And I think you should document how you sorted the files.
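
One way to document the sort step in a self-contained form is a small Python equivalent. This is a sketch that approximates the `sort` command above by ordering rows on their first two comma-separated fields. It compares by code point, while `sort` follows the locale, so results can differ for non-ASCII input. The sample rows are made up, and the assumed column order (station id, then date) should be checked against the real files.

```python
import csv
import io


def sort_rows(rows):
    """Order rows by their first two fields, e.g. station id then date."""
    return sorted(rows, key=lambda row: (row[0], row[1]))


# Made-up sample rows; real files have more columns and far more rows.
raw = io.StringIO(
    "ASN00004083,20170102,TMAX,310\n"
    "ASN00003105,20170102,TMAX,335\n"
    "ASN00003105,20170101,TMIN,250\n"
)
rows = sort_rows(csv.reader(raw))

out = io.StringIO()
csv.writer(out, lineterminator="\n").writerows(rows)
print(out.getvalue(), end="")
```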

{
"operation": "index",
"#COMMENT": "This is an incredibly short warmup time period but it is necessary to get also measurement samples. As this benchmark is rather about search than indexing this is ok.",
"warmup-time-period": 10,
Member

Is this short warmup time period warranted here? I think this is only necessary for percolator (where indexing throughput is not interesting anyway). Ideally we'd have at least 240 seconds here.
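
A check like this could be automated. The sketch below uses a hypothetical helper, not part of Rally, that flags schedule entries whose `warmup-time-period` is below the 240 seconds suggested above. The `search` entry in the sample schedule is invented for the example.

```python
MIN_WARMUP_SECONDS = 240  # threshold suggested in the comment above


def short_warmups(schedule, minimum=MIN_WARMUP_SECONDS):
    """Return operations whose time-based warmup is below the minimum.

    Entries without a warmup-time-period are skipped, since they may
    warm up in some other way.
    """
    return [
        task["operation"]
        for task in schedule
        if "warmup-time-period" in task and task["warmup-time-period"] < minimum
    ]


schedule = [
    {"operation": "index", "warmup-time-period": 10},
    {"operation": "search", "warmup-time-period": 300},  # invented entry
]
print(short_warmups(schedule))  # ['index']
```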

{
"operation": "index",
"#COMMENT": "This is an incredibly short warmup time period but it is necessary to get also measurement samples. As this benchmark is rather about search than indexing this is ok.",
"warmup-time-period": 10,
Member

Same as above for the warmup time period. If possible this should be at least 240 seconds.


"clients": 8
}
]
}
Member

Nit: Missing new line

}
}
}
}
Member

Nit: Missing new line

@martijnvg
Member Author

I've updated the PR.

Member

danielmitterdorfer left a comment

LGTM


@martijnvg martijnvg merged commit bf5b21b into elastic:master Jul 4, 2017
@jpountz
Contributor

jpountz commented Jul 4, 2017

@martijnvg could you map the station code as a keyword so that it does not get the text/keyword dual mapping? In general, I think it'd be better to map all fields explicitly and disable dynamic mappings.
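
As a sketch of what that could look like, here is a mapping body with every field typed explicitly and dynamic mapping turned off. The field names are assumptions rather than the track's actual fields. `"dynamic": "strict"` makes Elasticsearch reject documents with unmapped fields, while `false` would silently ignore them instead.

```python
# Hypothetical field names; only the intent (keyword station code,
# explicit types, no dynamic mapping) comes from the comment above.
mapping = {
    "dynamic": "strict",
    "properties": {
        "station_id": {"type": "keyword"},  # no text/keyword dual mapping
        "date": {"type": "date"},
        "temperature_range": {"type": "float_range"},
    },
}


def untyped_fields(mapping_body):
    """List properties that do not declare an explicit type."""
    return [
        name
        for name, spec in mapping_body["properties"].items()
        if "type" not in spec
    ]


print(untyped_fields(mapping))  # []
```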

@martijnvg
Member Author

@jpountz yes: 659d697
