
Optimize CI runs with Bazel [Blocked: #59] #1861

Open · BenHenning opened this issue Sep 18, 2020 · 15 comments

Labels: enhancement · Impact: Medium · Issue: Needs Clarification · Priority: Nice-to-have · Z-ibt

Comments

@BenHenning (Member)

There are a few things we can do to improve CI runs once Bazel becomes the primary build system for the project, in particular:

  1. We could have a multi-stage workflow where the first set of jobs runs the linters, creates the emulator image & related dependencies for #1844 (the Espresso emulator test CI setup), and uses Bazel to build the APK and all tests. The outputs from these jobs would be uploaded as artifacts. The second stage would then download those artifacts and run the tests in several parallel jobs (likely one job for Robolectric and N jobs for the emulator tests). A rough sketch of this two-stage structure follows this list.

  2. We could create a feedback loop: on subsequent pushes, stage 1 re-downloads the Bazel artifacts from the previous run into a fresh copy of the codebase. Done correctly, this yields an incremental Bazel build containing only the latest changes, and the dependencies & emulator artifacts don't need to be recreated. Stage 1 then becomes very quick for subsequent commits, leading to a significantly shorter overall test time.

  3. For push-based workflow initiations, we should use Bazel to determine exactly which test targets are actually affected by the changes in the PR and only run those tests (we can still build everything--this is just an optimization so that stage 2 runs faster).
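
As a rough illustration of items 1 & 2, a two-stage workflow might be laid out like the sketch below. The job names, runner image, artifact name, and tag filter are assumptions for illustration, not the project's actual configuration:

```yaml
# Hypothetical two-stage CI sketch; names and targets are illustrative.
name: CI (sketch)
on: [pull_request]

jobs:
  # Stage 1: lint, build the APK & all test targets once, and publish the outputs.
  prepare:
    runs-on: ubuntu-20.04
    steps:
      - uses: actions/checkout@v2
      - name: Build binary and all test targets
        run: bazel build //... --keep_going
      - name: Upload Bazel outputs for the test stage
        uses: actions/upload-artifact@v2
        with:
          name: bazel-outputs
          path: bazel-bin/

  # Stage 2: download the stage-1 outputs and run tests in parallel jobs
  # (likely one Robolectric job plus N emulator jobs).
  robolectric_tests:
    needs: prepare
    runs-on: ubuntu-20.04
    steps:
      - uses: actions/checkout@v2
      - name: Download stage-1 outputs
        uses: actions/download-artifact@v2
        with:
          name: bazel-outputs
          path: bazel-bin/
      # NOTE: whether Bazel can actually reuse these outputs (instead of rebuilding)
      # depends on restoring its cache state across jobs--that's the crux of item 2.
      - name: Run Robolectric tests
        run: bazel test //... --test_tag_filters=-emulator  # tag filter is an assumption
```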

The result of this work, if achievable, would be a slightly slower workflow run for the initial commit but significantly faster subsequent runs up until PR submission. Pushes to develop & release branches should still run the entire test suite from scratch.

Some caveats to consider:

  1. Timestamp changes between CI machine images could cause Bazel to rebuild everything--some investigation will be needed into Bazel build stability.

  2. We currently use Homebrew to install ffmpeg & its dependencies for capturing videos during emulator tests--there's no clear way to upload/download Homebrew artifacts without just reinstalling them (nor is it clear that doing so would actually improve performance).

  3. We need to handle the case of many quick, successive commits (especially after the initial PR commit, and before that commit's slower preparation workflows have completed); one possible mitigation is sketched after this list.
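
For caveat 3, one possible mitigation (assuming we're comfortable cancelling superseded runs) is GitHub Actions' concurrency groups, which cancel the in-progress run for a branch whenever a newer commit arrives:

```yaml
# Cancel any in-progress run of this workflow for the same branch when a new commit is pushed.
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true
```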

BenHenning added this to the Backlog milestone on Sep 18, 2020
BenHenning added the Priority: Nice-to-have label on Sep 18, 2020
@BenHenning (Member Author)

@vinitamurthi any thoughts?

@BenHenning (Member Author)

/cc @miaboloix in case you have any thoughts

@vinitamurthi (Contributor)

Overall this looks good to me. I think the emulator image specifically will be cached by the GitHub cache action itself. I'm not so sure whether we can get incremental Bazel builds with the way the cache works today.

@vinitamurthi (Contributor)

I also like the idea of step 3, but as far as I understand we'd have to write a script to do that. We could even use that script in the pre-push hooks to run tests just for the affected files.

@BenHenning (Member Author)

Thanks. To be clear, I'm assuming we need to use artifacts as a cache to achieve the goals of this issue, rather than the existing cache action, since the latter doesn't seem to share state across job instances.

@BenHenning (Member Author)

bazelbuild/bazel#8521 & https://groups.google.com/g/bazel-discuss/c/I9udqWIcEdI/m/iczVgWLOBQAJ describe approaches for "dirty target" detection, but it's a bit tricky to do. #1904 is showing that remote caching alone is not sufficient:

  1. The remote cache sometimes inexplicably produces few cache hits.
  2. Remote caching must be made inaccessible to forks.
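
For reference, here's a minimal sketch of the rdeps-style approach from those threads, written as a workflow step. The file-to-label resolution, base branch handling, and output file are assumptions--this is not the project's actual compute_affected_tests.sh:

```yaml
- name: Compute affected test targets (sketch)
  run: |
    # Files changed relative to the develop branch, joined onto one line.
    CHANGED_FILES="$(git diff --name-only origin/develop...HEAD | tr '\n' ' ')"
    # Resolve changed files to Bazel source-file labels; files outside any
    # package won't resolve, hence --keep_going and the fallback.
    CHANGED_LABELS="$(bazel query --keep_going "set($CHANGED_FILES)" | tr '\n' ' ' || true)"
    # Every test target that transitively depends on one of the changed labels.
    bazel query --keep_going \
      "kind(test, rdeps(//..., set($CHANGED_LABELS)))" > affected_test_targets.txt
```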

@BenHenning (Member Author)

Also, per #1904 we're attempting to use remote caching as a simpler way to share dependencies across builds. However, it only seems to serve ~20% of the total actions from the cache, so there's still a lot of local computation happening (and it seems rather slow, especially for resource processing & data-binding). I think there are fundamentally slow parts of the Android build pipeline whose cost we pay multiplicatively across all of our individual test targets, but only once in Gradle builds. This seems to be a significant enough chunk of the overall build time that it's leading to a 3-4x slower test runtime on Bazel compared to Gradle. This might just be something we have to live with in the beginning until we can:

  1. Fragment the codebase and use affected-target computations to limit the number of tests each PR runs (though we still need to consider cases when all tests must run anyway, e.g. due to changes in the root parts of the build graph).
  2. Figure out ways to improve caching (from my early analysis, a few resource/info actions done as part of data-binding are not deterministic, and this ruins caching for critical parts of the Android build pipeline).
  3. Find ways to enable remote caching for forks, since otherwise their builds are going to suffer drastically--maybe by sharing caches across build nodes as proposed above (though caches are multiple gigabytes and tens of thousands of files, so this may be infeasible with out-of-the-box actions options). A sketch of gating the cache to non-fork PRs follows this list.
  4. Look into ways to make data-binding faster, or find faster data-binding alternatives (maybe Jetpack Compose?). This needs more investigation.
  5. If all else fails, introduce remote build execution. This is a last resort since it's expected to be much more expensive than the remote caching we currently have enabled.
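
To illustrate item 3's constraint, the remote-cache flags could be gated so they only apply when the secret-backed credentials are actually available. The secret name, credentials path, and the use of --google_credentials are assumptions here:

```yaml
- name: Build with remote cache (main-repo PRs only; sketch)
  # Forked PRs don't receive repository secrets, so only attach the remote
  # cache when the PR branch lives in the main repository.
  if: ${{ github.event.pull_request.head.repo.fork == false }}
  env:
    BAZEL_REMOTE_CACHE_URL: ${{ secrets.BAZEL_REMOTE_CACHE_URL }}  # hypothetical secret
  run: |
    bazel build //... \
      --remote_cache="$BAZEL_REMOTE_CACHE_URL" \
      --google_credentials="$HOME/.bazel_cache_key.json"  # hypothetical key location

- name: Build without remote cache (forked PRs; sketch)
  if: ${{ github.event.pull_request.head.repo.fork == true }}
  run: bazel build //...
```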

@BenHenning (Member Author)

So, digging into the build reproducibility issue a bit more, it seems like the following two artifacts are causing issues:

  • class-info.zip
  • layout-info.zip

Both of these are generated for databinding purposes. They both appear to be either empty or to contain some basic JSON or processed XML files. I'm not totally sure if those files themselves are reproducible, but they use structured data and don't seem to contain anything time-based or random, so they can probably be made reproducible if they aren't already. The key problem seems to be in the zip files themselves: creating consecutive zip files will never result in the same file (see https://tanzu.vmware.com/content/blog/barriers-to-deterministic-reproducible-zip-files for context). This doesn't seem solvable as-is, so I may need to convince folks to use an alternate structure or not use compression at all (at least for Bazel). A quick way to check this is sketched below.
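
One quick way to check this (sketched as a workflow step; the target label and output path are hypothetical) is to build the same databinding output twice from a clean state and compare hashes:

```yaml
- name: Check databinding zip reproducibility (diagnostic sketch)
  run: |
    # Hypothetical target & output path; the point is to build twice from a
    # clean state and compare the archives byte-for-byte.
    bazel build //app:layout_info_zip
    sha256sum bazel-bin/app/layout-info.zip > first.sha256
    bazel clean
    bazel build //app:layout_info_zip
    sha256sum bazel-bin/app/layout-info.zip > second.sha256
    # Identical contents/sizes but differing hashes point at zip metadata
    # (timestamps, entry ordering, compression), not the data itself.
    diff first.sha256 second.sha256 || echo "layout-info.zip is not byte-reproducible"
```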

@BenHenning (Member Author)

Aha: digging into layout-info.zip a bit more, I'm finding that the main difference is that the generated layout info XML files include absolute references to the original (copied) layout files. Unfortunately, the path itself contains a PID or other per-build identifier that changes for each build, ruining reproducibility.

@BenHenning (Member Author)

The class-info.zip differences seem to come purely from the zip container itself changing (the contents are otherwise the same, as is the file size).

@BenHenning (Member Author)

I'm reaching out to the databinding team to better understand what our options are here before I delve into the Bazel side to figure out how to improve build reproducibility (and thus, caching).

@BenHenning (Member Author)

I haven't followed up, but it seems this is just something that's been missed in the Android build pipeline for Bazel & something we can probably fix ourselves once we have time. Hopefully later in the year. :)

BenHenning added a commit that referenced this issue Feb 12, 2021
…Oppia tests (#1904)

* Update Bazel CI workflow

Build all Oppia targets in Bazel rather than just the binary target.

* Update main.yml

Reset the workflow name so that it doesn't need to be renamed in GitHub settings.

* Introduce remote caching in Bazel.

This uses a remote storage service with a local file encrypted using
git-secret to act as the authentication key for Bazel to read & write
artifacts to the service for caching purposes.

* Add debug line.

* Disable workflows + fix debug line.

* More debugging.

* More debugging.

* Work around GitHub hiding secret since we're debugging.

* Use base64 to properly encode newlines in GPG keys.

* Remove debug lines before changing back to correct GPG key.

* Switch to production key.

* Fix env variable reference. Lock-down actions workflows via codeowners.

* Install git-secret to default location.

* Add details. Debug $PATH.

* Fix pathing for git-secret.

* Dummy commit to re-trigger actions.

* Undo commit to see if this one shows up.

* Fix git-secret pathing try 2.

* New commit to re-trigger action.

* Path debugging.

* Workaround to get GitHub to show changes.

* Update runner to use Bash.

Reference:
https://github.community/t/github-path-does-not-add-to-the-path/143992.

* Restore binary-only build, other builds, and introduce new workflow for
building all Bazel targets.

* Remove debug lines.

* Rename & remove keep_going.

* Compute matrix containing all test targets.

This will allow us to distribute parallelization responsibilities partly
to GitHub actions to hopefully have slightly better throughput. See
https://github.blog/changelog/2020-04-15-github-actions-new-workflow-features/
for reference on how this mechanism works.

* Simplify, fix some issues, and debug instead of run.

* Turn on actual testing.

* Lower parallelization since GitHub started cancelling tasks.

* Try 15 parallel jobs instead.

* Turn off fail-fast instead of limiting parallelization.

Fail fast seems to be the reason why the tests aren't completing, not
quota (since if too many jobs are started, the extras should just be
queued until resources open up).

* Simplify workflow & allow it to be required.

Also, introduce bazelrc file to simplify the CI CLIs interacting with
Bazel.

* Add test change to investigate computing affected targets.

* Another test change to compute affected targets.

* Update workflow to use future script compute_affected_tests.sh.

Also, ignore Bazel directories in Git (to ease local development).

* Add script to compute affected targets.

This script is primarily meant to be used for CI.

* Execute tests in parallel to build.

This creates a new job to compute affected targets alongside the build.
This may result in the initial build being a bit slower, but subsequent
commits should be fast if remote caching is enabled. This will also
result in better performance for forks that can't use remote caching.

* Script clean-ups.

Also, re-trigger actions.

* Try to ensure develop branch is available for change analysis.

* Add automatic workflow cancellation.

Also, add support to explicitly run all tests if '[RunAllTests]' is in
the PR title.

* Attempt to make conditional matrix computation work.

* Remove join since it was used incorrectly.

* Add support for testing when Bazel, BUILD, and WORKSPACE files change.

One consequence is the current Bazel file structure is very tight, so
any changes will likely run the whole suite. This will get better over
time.

Also, show test logs for all test runs.

* Fix broken tests by adding missing dep library.

* Finalize PR.

1. Expand codeowners to include all workflow files.
2. Remove test comments in Kotlin files.
3. Re-enable all workflows.
4. Attempt to fix tests broken on actions but not locally by adding more
thread synchronization points.

* Lint fix.

* Fix timing issues and JDK 9-specific regression.

See comment thread in #1904 for more context.

* Restore workflow names + introduce test check.

The new test check workflow will be a static job that will be required
to pass. One failing test is being introduced to verify that this check
fails as expected.

The original workflow names are being restored so that they don't need
to be renamed in GitHub settings (since that would introduce a
discontinuity in CI service & require multiple migration PRs to fix).

* Update StateFragmentLocalTest.kt

Remove fail-on-purpose test since it verified what it needed to: the new test status check job is working as expected.

* Address reviewer comments.

* Fix most tests broken in Bazel after syncing.

* Gitignore + fix broken test.

The test failure here only happens when using JDK9+ (which doesn't
happen in the Gradle world, or on CI).

The .gitignore change is needed since we can't yet specify a .bazelproject in a
shareable way.

* Lint fixes.

* Post-merge clean-up.

* Fix broken post-merge test.

* Remove unnecessary codeowners per earlier codeowners setup.

* Fix
ItemSelectionInputDoesNotContainAtLeastOneOfRuleClassifierProviderTest.

* Disable image region selection tests.

These tests can't correctly pass on Robolectric until #1611 is fixed, so
disabling them for the time being to avoid the image loading flake
happening on CI (but not locally). Note that chances are a fix will
still be needed for the flake, but that can be addressed later.

* Disable 2 previously missed tests.

* Post-merge lint fix.

* Add missing dependency.

Verified all tests build & pass both for JDK 9 & 11. Hopefully they will
work as expected on CI.

* Add missing codeowners for new files.

* Post-merge fixes.

This fixes some tests that were broken after recent PRs, and fixes a
visibility error introduced in #2663.

* Move Bazel tests to new workflow.

This will make it easier to restart failures without having to also
restart unrelated tests.
@BenHenning (Member Author) commented Mar 31, 2021

While we still need to improve caching for Android actions, per actions/cache#239 it seems like we might get good mileage out of the GitHub actions cache by sharing a cache across all Bazel builds and benefiting from Bazel's incremental building functionality. The cache will grow until it reaches an established threshold, at which point we'll reset it (which effectively means that, on occasion, a run will trigger a rebuild of the cache and take a bit longer to fill in the cache pieces).

Due to Bazel being a hermetic build system, the natural data race in trying to backfill the cache should also be fine (e.g. if multiple actions detect a cache reset and each start from scratch then the last one to upload wins, and all later actions benefit from it).
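
A minimal sketch of how that shared cache might be wired up (the cache directory, key names, and action versions are assumptions, not what's actually checked into the workflows):

```yaml
- name: Restore shared Bazel disk cache (sketch)
  uses: actions/cache@v2
  with:
    path: ~/.bazel_disk_cache            # hypothetical cache directory
    # Per-commit key with a shared restore prefix: each run restores the most
    # recent cache and saves an updated copy, so the cache grows incrementally
    # until the prefix is deliberately bumped to reset it.
    key: bazel-disk-cache-${{ github.sha }}
    restore-keys: |
      bazel-disk-cache-

- name: Build using the shared disk cache
  run: bazel build //... --disk_cache="$HOME/.bazel_disk_cache"
```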

From local testing, I'm guessing we'll see a 3x-10x performance improvement depending on how expensive uploading/downloading the cache is (which for a ~4GB cache might actually take a while). Speeding up the builds may also let us reduce sharding by running blocks of tests in each job rather than having one job per test (roughly sketched below). That way we share the one-time cost of downloading/uploading the cache, and potentially reduce the number of distinct phases from ~32 (160 tests / 5 concurrent tests) to ~2-3, which may bring test runs down to 10-40 minutes instead of 8-16 hours.
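
Block-based sharding could look roughly like the sketch below, using a dynamic matrix where each job runs a chunk of test targets instead of a single test (the block size, job names, and chunking command are assumptions):

```yaml
jobs:
  compute_test_blocks:
    runs-on: ubuntu-20.04
    outputs:
      blocks: ${{ steps.blocks.outputs.blocks }}
    steps:
      - uses: actions/checkout@v2
      - id: blocks
        # Chunk the test list into blocks of 50 targets and emit a JSON array
        # for the matrix below to consume.
        run: |
          BLOCKS="$(bazel query 'kind(test, //...)' | xargs -n 50 | jq -R . | jq -cs .)"
          echo "::set-output name=blocks::$BLOCKS"

  run_test_blocks:
    needs: compute_test_blocks
    runs-on: ubuntu-20.04
    strategy:
      fail-fast: false
      matrix:
        test-block: ${{ fromJson(needs.compute_test_blocks.outputs.blocks) }}
    steps:
      - uses: actions/checkout@v2
      # Each job pays the one-time cache download/upload cost for a whole block of tests.
      - run: bazel test ${{ matrix.test-block }}
```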

Note that if we go with this approach we probably will also want to block the test runs on the Bazel build since that will act as an effective "prime cache" mechanism to further reduce potential OOMs when building & running large numbers of tests.

BenHenning added a commit that referenced this issue Apr 6, 2021
…zel CI workflows (#3035)

* Enable caching for Bazel CI actions.

This attempts to use actions/cache to persist a shared disk cache across
Bazel CI actions to leverage Bazel's hermetic & incremental build
performance benefits to hopefully substantially decrease Bazel CI build
times.

This only enables the cache for the binary build for now; tests will be
added in a subsequent commit once this is proven to work.

Other expensive workflows have been disabled until the workflow changes
are stable.

* Fix cache directory relative reference.

* Fix cache param name.

* Fix cache size check.

* Fix Oppia bazel artifact file name.

* Enable caching for Bazel unit tests.

Also, re-enables Bazel unit tests workflow.

* Introduce more granular caching.

This commit introduces buckets of caching for tests, test-level cache
selection, build-level cache selection, and priority orders to try and
leverage multiple simultaneous caches that can be shared across tests &
binary builds.

* Migrate deprecated set-env to new env files.

* Re-enable Gradle tests.

Also, this is triggering a full incremental run of the Bazel unit tests
to see if they can run any faster than the previous ~2 hour run.
@BenHenning (Member Author)

Per #3513 we should also consider reverting the SDK pinning introduced there, or finding a way to cache the SDK to avoid the ~1 minute cost of downloading it (caching is probably not viable since the SDK is quite large).

@BenHenning (Member Author)

Another potential improvement on the above: we could modify the SDK already included in the CI environment by removing the preinstalled build tools & downloading just the ones we need (roughly sketched below). That would be much faster, but it puts us in a weird semi-hermetic SDK environment, which doesn't seem ideal.
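
A rough sketch of that trimming step (the build-tools version is illustrative and would need to match what the Bazel build actually expects; sdkmanager is assumed to be on PATH as on GitHub-hosted runners):

```yaml
- name: Trim preinstalled Android build-tools (sketch)
  run: |
    # Drop every preinstalled build-tools version, then reinstall only the one
    # the build needs.
    rm -rf "$ANDROID_HOME"/build-tools/*
    yes | sdkmanager "build-tools;29.0.2"  # version is an assumption
```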

Broppia added the dev_team and Impact: Low labels on Jul 29, 2022
BenHenning added the Impact: Medium, Issue: Needs Clarification, and Z-ibt labels and removed the Impact: Low label on Sep 16, 2022
BenHenning removed this from the Backlog milestone on Sep 16, 2022
seanlip added the bug label and removed the issue_type_infrastructure label on Mar 28, 2023
seanlip added the enhancement label and removed the Type: Improvement and bug labels on Jun 16, 2023