
Optimize CI runs with Bazel [Blocked: #59] #1861

Open · BenHenning opened this issue Sep 18, 2020 · 15 comments

Labels: enhancement · Impact: Medium · Issue: Needs Clarification · Priority: Nice-to-have · Z-ibt

Comments

@BenHenning (Member)

There are a few things we can do to improve CI runs once Bazel becomes the primary build system for the project, in particular:

  1. We could have a multi-stage workflow where the first set of jobs runs the linters, creates the emulator image & related dependencies for #1844 (the Espresso emulator test CI setup), and uses Bazel to build the APK and all tests. The outputs from these jobs would be uploaded as artifacts. The second stage would then download those artifacts and run the tests in several parallel jobs (likely one job for Robolectric and N jobs for the emulator tests). A rough sketch of this two-stage structure follows this list.

  2. We could create a feedback loop: on subsequent pushes, stage 1 re-downloads the Bazel artifacts from the previous run into a fresh copy of the codebase. Done correctly, this yields an incremental Bazel build containing only the latest changes, and the dependencies & emulator artifacts don't need to be recreated. Stage 1 then becomes very quick for subsequent commits, leading to a significantly shorter overall test time.

  3. For push-based workflow initiations, we should use Bazel to determine exactly which test targets are actually affected by the changes in the PR and only run those tests (we can still build everything--this is just an optimization so that stage 2 runs faster).
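
As a rough illustration of items 1 & 2, a two-stage workflow might be laid out like the sketch below. The job names, runner image, artifact name, and tag filter are assumptions for illustration, not the project's actual configuration:

```yaml
# Hypothetical two-stage CI sketch; names and targets are illustrative.
name: CI (sketch)
on: [pull_request]

jobs:
  # Stage 1: lint, build the APK & all test targets once, and publish the outputs.
  prepare:
    runs-on: ubuntu-20.04
    steps:
      - uses: actions/checkout@v2
      - name: Build binary and all test targets
        run: bazel build //... --keep_going
      - name: Upload Bazel outputs for the test stage
        uses: actions/upload-artifact@v2
        with:
          name: bazel-outputs
          path: bazel-bin/

  # Stage 2: download the stage-1 outputs and run tests in parallel jobs
  # (likely one Robolectric job plus N emulator jobs).
  robolectric_tests:
    needs: prepare
    runs-on: ubuntu-20.04
    steps:
      - uses: actions/checkout@v2
      - name: Download stage-1 outputs
        uses: actions/download-artifact@v2
        with:
          name: bazel-outputs
          path: bazel-bin/
      # NOTE: whether Bazel can actually reuse these outputs (instead of rebuilding)
      # depends on restoring its cache state across jobs--that's the crux of item 2.
      - name: Run Robolectric tests
        run: bazel test //... --test_tag_filters=-emulator  # tag filter is an assumption
```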

The result of this work, if achievable, would be a slightly slower workflow run for the initial commit but significantly faster subsequent runs up until PR submission. Pushes to develop & release branches should still run the entire test suite from scratch.

Some caveats to consider:

  1. Timestamp changes between CI machine images could cause Bazel to rebuild everything--some investigation will be needed into Bazel build stability.

  2. We currently use Homebrew to install ffmpeg & its dependencies for capturing videos during emulator tests--there's no clear way to upload/download Homebrew artifacts without just reinstalling them (nor is it clear that doing so would actually improve performance).

  3. We need to handle the case of many quick, successive commits (especially after the initial PR commit, and before that commit's slower preparation workflows have completed); one possible mitigation is sketched after this list.
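
For caveat 3, one possible mitigation (assuming we're comfortable cancelling superseded runs) is GitHub Actions' concurrency groups, which cancel the in-progress run for a branch whenever a newer commit arrives:

```yaml
# Cancel any in-progress run of this workflow for the same branch when a new commit is pushed.
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true
```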

BenHenning added this to the Backlog milestone on Sep 18, 2020
BenHenning added the Priority: Nice-to-have label on Sep 18, 2020
@BenHenning (Member Author)

@vinitamurthi any thoughts?

@BenHenning (Member Author)

/cc @miaboloix in case you have any thoughts

@vinitamurthi (Contributor)

Overall this looks good to me. I think the emulator image specifically will be cached by the GitHub cache action itself. I'm not so sure whether we can get incremental Bazel builds with the way the cache works today.

@vinitamurthi (Contributor)

I also like the idea of step 3, but as far as I understand we'd have to write a script to do that. We could even use that script in the pre-push hooks to run tests just for the affected files.

@BenHenning (Member Author)

Thanks. To be clear, I'm assuming we need to use artifacts as a cache to achieve the goals of this issue, rather than the existing cache action, since the latter doesn't seem to share state across job instances.

@BenHenning (Member Author)

bazelbuild/bazel#8521 & https://groups.google.com/g/bazel-discuss/c/I9udqWIcEdI/m/iczVgWLOBQAJ describe approaches for "dirty target" detection, but it's a bit tricky to do. #1904 is showing that remote caching alone is not sufficient:

  1. The remote cache sometimes inexplicably produces few cache hits.
  2. Remote caching must be made inaccessible to forks.
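
For reference, here's a minimal sketch of the rdeps-style approach from those threads, written as a workflow step. The file-to-label resolution, base branch handling, and output file are assumptions--this is not the project's actual compute_affected_tests.sh:

```yaml
- name: Compute affected test targets (sketch)
  run: |
    # Files changed relative to the develop branch, joined onto one line.
    CHANGED_FILES="$(git diff --name-only origin/develop...HEAD | tr '\n' ' ')"
    # Resolve changed files to Bazel source-file labels; files outside any
    # package won't resolve, hence --keep_going and the fallback.
    CHANGED_LABELS="$(bazel query --keep_going "set($CHANGED_FILES)" | tr '\n' ' ' || true)"
    # Every test target that transitively depends on one of the changed labels.
    bazel query --keep_going \
      "kind(test, rdeps(//..., set($CHANGED_LABELS)))" > affected_test_targets.txt
```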

@BenHenning (Member Author)

Also, per #1904 we're attempting to use remote caching as a simpler way to share dependencies across builds. However, it only seems to serve ~20% of the total actions from the cache, so there's still a lot of local computation happening (and it seems rather slow, especially for resource processing & data-binding). I think there are fundamentally slow parts of the Android build pipeline whose cost we pay multiplicatively across all of our individual test targets, but only once in Gradle builds. This seems to be a significant enough chunk of the overall build time that it's leading to a 3-4x slower test runtime on Bazel compared to Gradle. This might just be something we have to live with in the beginning until we can:

  1. Fragment the codebase and use affected-target computations to limit the number of tests each PR runs (though we still need to consider cases when all tests must run anyway, e.g. due to changes in the root parts of the build graph).
  2. Figure out ways to improve caching (from my early analysis, a few resource/info actions done as part of data-binding are not deterministic, and this ruins caching for critical parts of the Android build pipeline).
  3. Find ways to enable remote caching for forks, since otherwise their builds are going to suffer drastically--maybe by sharing caches across build nodes as proposed above (though caches are multiple gigabytes and tens of thousands of files, so this may be infeasible with out-of-the-box actions options). A sketch of gating the cache to non-fork PRs follows this list.
  4. Look into ways to make data-binding faster, or find faster data-binding alternatives (maybe Jetpack Compose?). This needs more investigation.
  5. If all else fails, introduce remote build execution. This is a last resort since it's expected to be much more expensive than the remote caching we currently have enabled.
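
To illustrate item 3's constraint, the remote-cache flags could be gated so they only apply when the secret-backed credentials are actually available. The secret name, credentials path, and the use of --google_credentials are assumptions here:

```yaml
- name: Build with remote cache (main-repo PRs only; sketch)
  # Forked PRs don't receive repository secrets, so only attach the remote
  # cache when the PR branch lives in the main repository.
  if: ${{ github.event.pull_request.head.repo.fork == false }}
  env:
    BAZEL_REMOTE_CACHE_URL: ${{ secrets.BAZEL_REMOTE_CACHE_URL }}  # hypothetical secret
  run: |
    bazel build //... \
      --remote_cache="$BAZEL_REMOTE_CACHE_URL" \
      --google_credentials="$HOME/.bazel_cache_key.json"  # hypothetical key location

- name: Build without remote cache (forked PRs; sketch)
  if: ${{ github.event.pull_request.head.repo.fork == true }}
  run: bazel build //...
```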

@BenHenning (Member Author)

So, digging into the build reproducibility issue a bit more, it seems like the following two artifacts are causing issues:

  • class-info.zip
  • layout-info.zip

Both of these are generated for databinding purposes. They both appear to be either empty or to contain some basic JSON or processed XML files. I'm not totally sure if those files themselves are reproducible, but they use structured data and don't seem to contain anything time-based or random, so they can probably be made reproducible if they aren't already. The key problem seems to be in the zip files themselves: creating consecutive zip files will never result in the same file (see https://tanzu.vmware.com/content/blog/barriers-to-deterministic-reproducible-zip-files for context). This doesn't seem solvable as-is, so I may need to convince folks to use an alternate structure or not use compression at all (at least for Bazel). A quick way to check this is sketched below.
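
One quick way to check this (sketched as a workflow step; the target label and output path are hypothetical) is to build the same databinding output twice from a clean state and compare hashes:

```yaml
- name: Check databinding zip reproducibility (diagnostic sketch)
  run: |
    # Hypothetical target & output path; the point is to build twice from a
    # clean state and compare the archives byte-for-byte.
    bazel build //app:layout_info_zip
    sha256sum bazel-bin/app/layout-info.zip > first.sha256
    bazel clean
    bazel build //app:layout_info_zip
    sha256sum bazel-bin/app/layout-info.zip > second.sha256
    # Identical contents/sizes but differing hashes point at zip metadata
    # (timestamps, entry ordering, compression), not the data itself.
    diff first.sha256 second.sha256 || echo "layout-info.zip is not byte-reproducible"
```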

@BenHenning (Member Author)

Aha: digging into layout-info.zip a bit more, I'm finding that the main difference is that the generated layout info XML files include absolute references to the original (copied) layout files. Unfortunately, the path itself contains a PID or other per-build identifier that changes for each build, ruining reproducibility.

@BenHenning (Member Author)

The class-info.zip differences seem to come purely from the zip container itself changing (the contents are otherwise the same, as is the file size).

@BenHenning (Member Author)

I'm reaching out to the databinding team to better understand what our options are here before I delve into the Bazel side to figure out how to improve build reproducibility (and thus, caching).

@BenHenning (Member Author)

I haven't followed up, but it seems this is just something that's been missed in the Android build pipeline for Bazel & something we can probably fix ourselves once we have time. Hopefully later in the year. :)

BenHenning added a commit that referenced this issue Feb 12, 2021
…Oppia tests (#1904)

* Update Bazel CI workflow

Build all Oppia targets in Bazel rather than just the binary target.

* Update main.yml

Reset the workflow name so that it doesn't need to be renamed in GitHub settings.

* Introduce remote caching in Bazel.

This uses a remote storage service with a local file encrypted using
git-secret to act as the authentication key for Bazel to read & write
artifacts to the service for caching purposes.

* Add debug line.

* Disable workflows + fix debug line.

* More debugging.

* More debugging.

* Work around GitHub hiding secret since we're debugging.

* Use base64 to properly encode newlines in GPG keys.

* Remove debug lines before changing back to correct GPG key.

* Switch to production key.

* Fix env variable reference. Lock-down actions workflows via codeowners.

* Install git-secret to default location.

* Add details. Debug $PATH.

* Fix pathing for git-secret.

* Dummy commit to re-trigger actions.

* Undo commit to see if this one shows up.

* Fix git-secret pathing try 2.

* New commit to re-trigger action.

* Path debugging.

* Workaround to get GitHub to show changes.

* Update runner to use Bash.

Reference:
https://github.community/t/github-path-does-not-add-to-the-path/143992.

* Restore binary-only build, other builds, and introduce new workflow for
building all Bazel targets.

* Remove debug lines.

* Rename & remove keep_going.

* Compute matrix containing all test targets.

This will allow us to distribute parallelization responsibilities partly
to GitHub actions to hopefully have slightly better throughput. See
https://github.blog/changelog/2020-04-15-github-actions-new-workflow-features/
for reference on how this mechanism works.

* Simplify, fix some issues, and debug instead of run.

* Turn on actual testing.

* Lower parallelization since GitHub started cancelling tasks.

* Try 15 parallel jobs instead.

* Turn off fail-fast instead of limiting parallelization.

Fail fast seems to be the reason why the tests aren't completing, not
quota (since if too many jobs are started, the extras should just be
queued until resources open up).

* Simplify workflow & allow it to be required.

Also, introduce bazelrc file to simplify the CI CLIs interacting with
Bazel.

* Add test change to investigate computing affected targets.

* Another test change to compute affected targets.

* Update workflow to use future script compute_affected_tests.sh.

Also, ignore Bazel directories in Git (to ease local development).

* Add script to compute affected targets.

This script is primarily meant to be used for CI.

* Execute tests in parallel to build.

This creates a new job to compute affected targets alongside the build.
This may result in the initial build being a bit slower, but subsequent
commits should be fast if remote caching is enabled. This will also
result in better performance for forks that can't use remote caching.

* Script clean-ups.

Also, re-trigger actions.

* Try to ensure develop branch is available for change analysis.

* Add automatic workflow cancellation.

Also, add support to explicitly run all tests if '[RunAllTests]' is in
the PR title.

* Attempt to make conditional matrix computation work.

* Remove join since it was used incorrectly.

* Add support for testing when Bazel, BUILD, and WORKSPACE files change.

One consequence is the current Bazel file structure is very tight, so
any changes will likely run the whole suite. This will get better over
time.

Also, show test logs for all test runs.

* Fix broken tests by adding missing dep library.

* Finalize PR.

1. Expand codeowners to include all workflow files.
2. Remove test comments in Kotlin files.
3. Re-enable all workflows.
4. Attempt to fix tests broken on actions but not locally by adding more
thread synchronization points.

* Lint fix.

* Fix timing issues and JDK 9-specific regression.

See comment thread in #1904 for more context.

* Restore workflow names + introduce test check.

The new test check workflow will be a static job that will be required
to pass. One failing test is being introduced to verify that this check
fails as expected.

The original workflow names are being restored so that they don't need
to be renamed in GitHub settings (since that would introduce a
discontinuity in CI service & require multiple migration PRs to fix).

* Update StateFragmentLocalTest.kt

Remove fail-on-purpose test since it verified what it needed to: the new test status check job is working as expected.

* Address reviewer comments.

* Fix most tests broken in Bazel after syncing.

* Gitignore + fix broken test.

The test failure here only happens when using JDK9+ (which doesn't
happen in the Gradle world, or on CI).

The .gitignore change is needed since we can't yet specify a .bazelproject in a
shareable way.

* Lint fixes.

* Post-merge clean-up.

* Fix broken post-merge test.

* Remove unnecessary codeowners per earlier codeowners setup.

* Fix
ItemSelectionInputDoesNotContainAtLeastOneOfRuleClassifierProviderTest.

* Disable image region selection tests.

These tests can't correctly pass on Robolectric until #1611 is fixed, so
disabling them for the time being to avoid the image loading flake
happening on CI (but not locally). Note that chances are a fix will
still be needed for the flake, but that can be addressed later.

* Disable 2 previously missed tests.

* Post-merge lint fix.

* Add missing dependency.

Verified all tests build & pass both for JDK 9 & 11. Hopefully they will
work as expected on CI.

* Add missing codeowners for new files.

* Post-merge fixes.

This fixes some tests that were broken after recent PRs, and fixes a
visibility error introduced in #2663.

* Move Bazel tests to new workflow.

This will make it easier to restart failures without having to also
restart unrelated tests.
@BenHenning (Member Author) commented Mar 31, 2021

While we still need to improve caching for Android actions, per actions/cache#239 it seems like we might get good mileage out of the GitHub actions cache by sharing a cache across all Bazel builds and benefiting from Bazel's incremental building functionality. The cache will grow until it reaches an established threshold, at which point we'll reset it (which effectively means that, on occasion, a run will trigger a rebuild of the cache and take a bit longer to fill in the cache pieces).

Due to Bazel being a hermetic build system, the natural data race in trying to backfill the cache should also be fine (e.g. if multiple actions detect a cache reset and each start from scratch then the last one to upload wins, and all later actions benefit from it).
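
A minimal sketch of how that shared cache might be wired up (the cache directory, key names, and action versions are assumptions, not what's actually checked into the workflows):

```yaml
- name: Restore shared Bazel disk cache (sketch)
  uses: actions/cache@v2
  with:
    path: ~/.bazel_disk_cache            # hypothetical cache directory
    # Per-commit key with a shared restore prefix: each run restores the most
    # recent cache and saves an updated copy, so the cache grows incrementally
    # until the prefix is deliberately bumped to reset it.
    key: bazel-disk-cache-${{ github.sha }}
    restore-keys: |
      bazel-disk-cache-

- name: Build using the shared disk cache
  run: bazel build //... --disk_cache="$HOME/.bazel_disk_cache"
```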

From local testing, I'm guessing we'll see a 3x-10x performance improvement depending on how expensive uploading/downloading the cache is (which for a ~4GB cache might actually take a while). Speeding up the builds may also let us reduce sharding by running blocks of tests in each job rather than having one job per test (roughly sketched below). That way we share the one-time cost of downloading/uploading the cache, and potentially reduce the number of distinct phases from ~32 (160 tests / 5 concurrent tests) to ~2-3, which may bring test runs down to 10-40 minutes instead of 8-16 hours.
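
Block-based sharding could look roughly like the sketch below, using a dynamic matrix where each job runs a chunk of test targets instead of a single test (the block size, job names, and chunking command are assumptions):

```yaml
jobs:
  compute_test_blocks:
    runs-on: ubuntu-20.04
    outputs:
      blocks: ${{ steps.blocks.outputs.blocks }}
    steps:
      - uses: actions/checkout@v2
      - id: blocks
        # Chunk the test list into blocks of 50 targets and emit a JSON array
        # for the matrix below to consume.
        run: |
          BLOCKS="$(bazel query 'kind(test, //...)' | xargs -n 50 | jq -R . | jq -cs .)"
          echo "::set-output name=blocks::$BLOCKS"

  run_test_blocks:
    needs: compute_test_blocks
    runs-on: ubuntu-20.04
    strategy:
      fail-fast: false
      matrix:
        test-block: ${{ fromJson(needs.compute_test_blocks.outputs.blocks) }}
    steps:
      - uses: actions/checkout@v2
      # Each job pays the one-time cache download/upload cost for a whole block of tests.
      - run: bazel test ${{ matrix.test-block }}
```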

Note that if we go with this approach we probably will also want to block the test runs on the Bazel build since that will act as an effective "prime cache" mechanism to further reduce potential OOMs when building & running large numbers of tests.

BenHenning added a commit that referenced this issue Apr 6, 2021
…zel CI workflows (#3035)

* Enable caching for Bazel CI actions.

This attempts to use actions/cache to persist a shared disk cache across
Bazel CI actions to leverage Bazel's hermetic & incremental build
performance benefits to hopefully substantially decrease Bazel CI build
times.

This only enables the cache for the binary build for now; tests will be
added in a subsequent commit once this is proven to work.

Other expensive workflows have been disabled until the workflow changes
are stable.

* Fix cache directory relative reference.

* Fix cache param name.

* Fix cache size check.

* Fix Oppia bazel artifact file name.

* Enable caching for Bazel unit tests.

Also, re-enables Bazel unit tests workflow.

* Introduce more granular caching.

This commit introduces buckets of caching for tests, test-level cache
selection, build-level cache selection, and priority orders to try and
leverage multiple simultaneous caches that can be shared across tests &
binary builds.

* Migrate deprecated set-env to new env files.

* Re-enable Gradle tests.

Also, this is triggering a full incremental run of the Bazel unit tests
to see if they can run any faster than the previous ~2 hour run.
@BenHenning (Member Author)

Per #3513 we should also consider reverting the SDK pinning introduced there, or finding a way to cache the SDK to avoid the ~1 minute cost of downloading it (caching is probably not viable since the SDK is quite large).

@BenHenning (Member Author)

Another potential improvement on the above: we could modify the SDK already included in the CI environment by removing the preinstalled build tools & downloading just the ones we need (roughly sketched below). That would be much faster, but it puts us in a weird semi-hermetic SDK environment, which doesn't seem ideal.
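
A rough sketch of that trimming step (the build-tools version is illustrative and would need to match what the Bazel build actually expects; sdkmanager is assumed to be on PATH as on GitHub-hosted runners):

```yaml
- name: Trim preinstalled Android build-tools (sketch)
  run: |
    # Drop every preinstalled build-tools version, then reinstall only the one
    # the build needs.
    rm -rf "$ANDROID_HOME"/build-tools/*
    yes | sdkmanager "build-tools;29.0.2"  # version is an assumption
```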

Broppia added the dev_team and Impact: Low labels on Jul 29, 2022
BenHenning added the Impact: Medium, Issue: Needs Clarification, and Z-ibt labels and removed the Impact: Low label on Sep 16, 2022
BenHenning removed this from the Backlog milestone on Sep 16, 2022
seanlip added the bug label and removed the issue_type_infrastructure label on Mar 28, 2023
seanlip added the enhancement label and removed the Type: Improvement and bug labels on Jun 16, 2023