Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Integrate better with lucene test framework and mockfilesystems #10656

Merged
merged 58 commits into from
Apr 20, 2015

Conversation

rmuir
Copy link
Contributor

@rmuir rmuir commented Apr 19, 2015

A few of the changes:

  • single REPRODUCE WITH: and exception
  • fail the build if you typo the test name from mvn.
  • clean up output and stacktraces to work better with IDE integration and easier to debug.
  • fail tests if they leak file handles.
  • fail tests if they use an extreme number of file handles (> 2k)
  • simulate windows behavior randomly (no deletion of open files) for the filesystem.
  • add extra files to simulate .DS_Store/thumbs.db type behavior.
  • running tests with -Dtests.verbose=true shows all destructive operations to the filesystem and indexwriter streams.
  • parallellized rest tests.
  • more randomization/asserting consistently via lucenetestcase
  • refactor base test classes (including removing a few) for consistency.
  • fix some test reproducibility problems.
  • speed up execution of full test suite (~ 6:45 for full test run versus ~ 9:00 on master for me)
  • mark slower tests with @Slow annotation, default tests.slow to false.

The last change speeds up junit times to ~ 3:15 for a test run, which gives most bang for the buck: still 73% coverage vs 79% with tests.slow=true, but a good deal faster, and you can always turn on slow tests easily/in jenkins.

Things I want to defer to future issues:

  • make sure internaltestcluster randomization is less chaotic / more reasonable perf.
  • use lucenetestcase iwconfig randomization to the extent we can.
  • try to get to 2 minutes 'mvn test' with good coverage. likely requires adding many unit tests.
  • try to narrow the coverage gap more between tests.slow=true/false
  • speed up old backwards index tests (they spend a lot of time merging indexes)
  • try to design a more safe and efficient integration test base class (the suite-scoped one today has a confusing lifecycle).
  • hook in more Asserting* classes from lucene to find bugs in query impls etc.
  • fixing code to work safe with multiple JDK filesystems / allowing them to be used 'for real', anything like that.

Things i investigated and weren't worth the trouble:

  • parallelizing old backwards index test. not faster.
  • plugging in ram filesystem. runs on tmpfs are not really faster.

rmuir and others added 30 commits April 15, 2015 18:23
Conflicts:
	src/main/java/org/elasticsearch/index/translog/Translog.java
	src/main/java/org/elasticsearch/index/translog/fs/FsTranslog.java
…ad. We should probalby ban .getURI()?

Also added a couple nocommits for some issues with tests after mockfs is
working again. But I also re-enabled the mockfs suppression in the base
test case for now.
with nocommits for tests that need to be investigated.
@@ -128,7 +130,8 @@ public NodeEnvironment(Settings settings, Environment environment) throws IOExce
int maxLocalStorageNodes = settings.getAsInt("node.max_local_storage_nodes", 50);
for (int possibleLockId = 0; possibleLockId < maxLocalStorageNodes; possibleLockId++) {
for (int dirIndex = 0; dirIndex < environment.dataWithClusterFiles().length; dirIndex++) {
Path dir = environment.dataWithClusterFiles()[dirIndex].resolve(Paths.get(NODES_FOLDER, Integer.toString(possibleLockId)));
// TODO: wtf with resolve(get())
Path dir = environment.dataWithClusterFiles()[dirIndex].resolve(PathUtils.get(NODES_FOLDER, Integer.toString(possibleLockId)));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to chain resolves, which seems to be the intention?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets do it as a followup, we should try to minimize non-test changes here, it was just a TODO since this was using Paths.get for "concatenation" (which is not recommended)

@@ -44,6 +45,7 @@
/**
*
*/
@Slow
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made this test slower due slow connection setup on the vagrant boxes we were using. I think we can speed it up again. I'll do this as a separate change

@bleskes
Copy link
Contributor

bleskes commented Apr 19, 2015

I like the simplification and removal of test base class. I'm not 100% familiar with the test infra - I think @javanna should have a look as well.

I'm also doubting about the exclusion of slow tests by default. I feels to me that a sanity check run (i.e., run only fast things) is something one should opt in to as opposed to opt out. This is mostly due to the fact that some parts of are system are only tested by slow tests (for example, the discovery runs which add and kill nodes).

Last minor detail : org.elasticsearch.test.ElasticsearchIntegrationTest#afterTestFailed can be removed. It's not used (but also not part of any diff, so I can't comment inline).

@rmuir
Copy link
Contributor Author

rmuir commented Apr 19, 2015

I'm also doubting about the exclusion of slow tests by default. I feels to me that a sanity check run (i.e., run only fast things) is something one should opt in to as opposed to opt out. This is mostly due to the fact that some parts of are system are only tested by slow tests (for example, the discovery runs which add and kill nodes).

Well these are unit test runs. Maybe such tests belong in an integration test suite or somewhere else? Or maybe they should be done as unit test, but they are simply too slow.

Again disabling these non-unit-tests this provides the majority of test coverage (73% vs 79%) with a fraction of the time. Its the right tradeoff.

@rmuir
Copy link
Contributor Author

rmuir commented Apr 19, 2015

Also, we really need to draw a line in the sand (great time is now) where we don't coddle and encourage slow tests. If we aggressively keep the test suite fast, by default, then people will follow suit, because they want their tests to execute and be fast. Furthermore, people will run the tests more often, because it does not take 15-30 minutes to do so. So it ultimately will result in better coverage than just letting things get slower and slower and less reliable.

@s1monw
Copy link
Contributor

s1monw commented Apr 20, 2015

Again disabling these non-unit-tests this provides the majority of test coverage (73% vs 79%) with a fraction of the time. Its the right tradeoff.

+1000 this is the right way to go here. We can still configure CI to run all these things but it should be sufficient to make a commit / push if the non-slowish tests pass.

Also, we really need to draw a line in the sand (great time is now) where we don't coddle and encourage slow tests.

this is the way to go - we have to work towards more coverage as you said in the description with the fast tests. That's a reasonable goal and straight forward IMO, yet it will take time but that's fine. This is actually a step forward while it might seem to be a step backwards in terms of coverage but it encourages since tests are faster and have good coverage. you can still set this as your default if you wanna be on the totally safe end. It's the same discussion as using local transport by default for speed it bypasses the entire network stack but it's the right tradeoff. Please lets move forward here and don't loos ourself in endless discussions. if it turns out to be a problem we can still go back.

@@ -1,7 +1,7 @@
---
setup:
- skip:
version: 0 - 999
version: " - "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we sure this doesn't break the clients builds? @clintongormley WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@javanna it does but we are on master we can do this!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think " - " is an improvement over "0 - 999". I'd rather see some meaningful value, like all?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another question is if we want to have a "pending" test like this in the suite at all? Again, if yes, I'd like to have something semantic, like a pending: true flag, not a skip which needs to be deciphered?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@karmi If you want a special all that is doable. I would rather have that then yet another flag like pending.

@javanna
Copy link
Member

javanna commented Apr 20, 2015

left a couple of comments around REST tests and restarting the suite cluster after a failure. Note that if we go ahead and remove this mechanism there is more code that can be removed (as Boaz pointed out): ElasticsearchIntegrationTest#afterTestFailed, ElasticsearchRestTestCase#afterTestFailed, and I think the rest client creation in RestTestExecutionContext can be simplified (we might be able to make the rest client final).

@s1monw
Copy link
Contributor

s1monw commented Apr 20, 2015

+1 on removing more stuff if possible as luca said but we should not be blocked by ANY project downstream that use some of our test. We are on master and we can literally do whatever we want as long as there is an upgrade path. Yet, this is test only so we can move forward even if we break client builds.

this change LGTM and I think we should not waste time and letting it go out of date. this is a massive improvement in our infra. +1 to push and take it from here iterating further.

rmuir added a commit that referenced this pull request Apr 20, 2015
Integrate better with lucene test framework and mockfilesystems
@rmuir rmuir merged commit db096b4 into master Apr 20, 2015
karmi added a commit to elastic/elasticsearch-ruby that referenced this pull request Apr 21, 2015
…ranges

Due to the change in elastic/elasticsearch@68f75ea, the normalized minimum and maximum versions
no longer matched the assumptions about the version definition in the `skip` statement.

The condition has been updated to check for empty `min` and `max` (non-normalized) versions.

Related: elastic/elasticsearch#10656
@clintongormley
Copy link
Contributor

@javanna sorry for the late response, been away. For the record, i'm happy with breaking the clients version parsing - it's a tiny fix.

rjernst added a commit to rjernst/elasticsearch that referenced this pull request Apr 21, 2015
This was suggested on elastic#10656 as cleaner than " - " to indicate all
versions should be skipped.
@s1monw s1monw deleted the mockfilesystem branch April 21, 2015 20:21
rjernst added a commit to rjernst/elasticsearch that referenced this pull request Apr 22, 2015
This was suggested on elastic#10656 as cleaner than " - " to indicate all
versions should be skipped.

closes elastic#10702
dadoonet added a commit to elastic/elasticsearch-cloud-azure that referenced this pull request Apr 29, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>test Issues or PRs that are addressing/adding tests v2.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants