[SPARK-8422] [BUILD] [PROJECT INFRA] Add a module abstraction to dev/run-tests #6866
Conversation
/cc @brennonyork
Test build #35071 has finished for PR 6866 at commit
Test build #35079 has finished for PR 6866 at commit
Force-pushed from c108d09 to 7092d3e.
Test build #35100 has finished for PR 6866 at commit
/cc @tdas, do you think that I should define additional modules for the individual external streaming projects, such as the Kafka + Flume components? I'd be glad to do that here as part of this PR, since I'm going to have to push some additional documentation commits anyways. TBH, I think the biggest wins are going to come from adding conditional test running logic to the Python tests, since those take 20-30 minutes to run, but I'm going to defer that to a separate PR / someone else, since we'll have to refactor
"streaming-mqtt/test", | ||
"streaming-twitter/test", | ||
"streaming-zeromq/test", | ||
] |
This should also run the pyspark tests; the same goes for sql and mllib.
The intent was to have this be covered via transitive dependencies: the pyspark module (on line 162) depends on mllib, sql, and streaming, so changes in any of those modules will cause the Python tests to be run.
Oh, I see, that makes sense.
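The transitive-dependency behavior described in this exchange could be implemented roughly as follows. This is a minimal sketch rather than the PR's actual code; it only assumes Module objects with a dependencies list, as in the fragments quoted in this review:

```python
# Sketch: start from the modules whose files changed and keep adding any
# module that depends on an already-selected module, so that e.g. a change
# in sql or mllib also causes the pyspark tests to run.
def modules_to_test(changed_modules, all_modules):
    to_test = set(changed_modules)
    added = True
    while added:
        added = False
        for module in all_modules:
            if module not in to_test and any(dep in to_test for dep in module.dependencies):
                to_test.add(module)
                added = True
    return to_test
```

With pyspark depending on mllib, sql, and streaming, a change under mllib would select mllib and the loop would then pull in pyspark, so the Python tests run without having to list them in every module's test goals.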
@JoshRosen Defining additional modules is a good idea, but it's still optional. Bundling them in the streaming module is fine, since a streaming module that runs only the streaming + external tests and nothing else is already going to be a huge win. That said, if you do want to separate the external projects into their own modules, you also have to consider doing that for Python. We have to be careful here: if it's not done right, we could easily miss running the Python Kafka tests when the Scala Kafka code is updated.
mllib = Module(
    name="mllib",
    dependencies=[sql],
I guess mllib also depends on streaming, but let me double-check.
Yep, adding this.
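For concreteness, a hypothetical version of the mllib module definition after this change might look like the following. The streaming dependency reflects the comment above; the source_file_regexes value is illustrative, not the merged one:

```python
mllib = Module(
    name="mllib",
    dependencies=[streaming, sql],     # streaming added per the review comment
    source_file_regexes=["mllib/"],    # illustrative path pattern
    sbt_test_goals=["mllib/test"],
)
```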
Test build #35236 has finished for PR 6866 at commit
Test build #35239 has finished for PR 6866 at commit
It just occurred to me that
    dependencies=[],
    source_file_regexes=[
        "external/",
        "extras/java8-tests/",
This shouldn't be here.
    ],
    sbt_test_goals=[
        "mllib/test",
        "examples/test",
Is this necessary? We don't do this for other modules.
Not necessary; removing it now.
Looks great!
The one part of the test code that I still find somewhat confusing is the logic for choosing which SBT profiles to enable in order to run the tests. If we want to run the Hive tests, for instance, we need to run the tests with the Hive build profile enabled. I'm going to look into pulling this logic into the module definitions themselves.
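One way to fold profile selection into the module definitions is sketched below, using a hypothetical build_profile_flags field and an illustrative Hive module; the flag and path values are assumptions, not the script's actual contents:

```python
# Sketch: each module declares the SBT profile flags its tests require,
# and the runner takes the union of flags across the selected modules.
hive = Module(
    name="hive",
    dependencies=[sql],
    source_file_regexes=["sql/hive/"],   # illustrative path pattern
    build_profile_flags=["-Phive"],      # assumed flag needed for Hive tests
    sbt_test_goals=["hive/test"],
)

def build_profiles_for(modules_to_test):
    """Return the deduplicated profile flags needed by the selected modules."""
    flags = []
    for module in modules_to_test:
        for flag in getattr(module, "build_profile_flags", []):
            if flag not in flags:
                flags.append(flag)
    return flags
```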
Test build #35334 has finished for PR 6866 at commit
Test build #35355 has finished for PR 6866 at commit
Yay, this still passes tests! I've updated this to move the test build profile selection logic into the module system, which I think makes things a bit easier to understand. I've also done a bit of work to add modules for some of the external streaming subprojects. I have lots of other planned enhancements for this script, but I'm going to defer them to future work. In the meantime, if folks think that this looks okay then I'd like to merge it into master ASAP.
pyspark = Module(
    name="pyspark",
    dependencies=[mllib, streaming, sql],
There are Python tests for streaming_kafka.
LGTM, only one small thing.
@JoshRosen I fixed it and merged it into master!
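For illustration, a fix along the lines of the streaming_kafka comment above might have looked roughly like this. It is a sketch only, assuming a separate streaming_kafka module exists, and not necessarily what was merged:

```python
pyspark = Module(
    name="pyspark",
    # streaming_kafka included so that Scala Kafka changes also trigger the
    # Python Kafka tests, per the concern raised earlier in the thread
    dependencies=[mllib, streaming, streaming_kafka, sql],
    source_file_regexes=["python/"],   # illustrative path pattern
)
```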
This was introduced in apache#6866.
This patch builds upon #5694 to add a 'module' abstraction to the dev/run-tests script which groups together the per-module test logic, including the mapping from file paths to modules, the mapping from modules to test goals and build profiles, and the dependencies / relationships between modules.

This refactoring makes it much easier to increase the granularity of test modules, which will let us skip even more tests. It's also a prerequisite for other changes that will reduce test time, such as running subsets of the Python tests based on which files / modules have changed.
This patch also adds doctests for the new graph traversal / change mapping code.
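Putting the pieces from this conversation together, the abstraction could be sketched roughly as follows. The field names mirror the fragments quoted above, while contains_file, changed_modules, and the catch-all root module are illustrative rather than the script's actual API:

```python
import re


class Module(object):
    """Groups one project's test logic: the file paths that belong to it,
    the SBT test goals and build profiles its tests need, and the other
    modules it depends on."""

    def __init__(self, name, dependencies, source_file_regexes,
                 sbt_test_goals=(), build_profile_flags=()):
        self.name = name
        self.dependencies = list(dependencies)
        self.source_file_patterns = [re.compile(r) for r in source_file_regexes]
        self.sbt_test_goals = list(sbt_test_goals)
        self.build_profile_flags = list(build_profile_flags)

    def contains_file(self, filename):
        return any(p.match(filename) for p in self.source_file_patterns)


def changed_modules(all_modules, changed_files, root_module):
    """Map changed file paths to the modules that own them; files matched by
    no module fall back to a catch-all root module (i.e. run everything)."""
    changed = set()
    for filename in changed_files:
        owners = [m for m in all_modules if m.contains_file(filename)]
        changed.update(owners if owners else [root_module])
    return changed
```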