Roadmap for CI Setup Improvements #5136

Open
jdrueckert opened this issue Sep 3, 2023 · 4 comments
Labels
  • Category: Build/CI (Requests, Issues and Changes targeting gradle, groovy, Jenkins, etc.)
  • Category: Doc (Requests, Issues and Changes targeting javadoc and module documentation)
  • Size: L (Very big effort likely requiring a lot of research and work in many areas across the codebase)
  • Status: Needs Discussion (Requires help discussing a reported issue or provided PR)
  • Topic: Stabilization (Requests, Issues and Changes related to improving stability and reducing flakiness)
  • Type: Improvement (Request for or addition/enhancement of a feature)

Comments

@jdrueckert
Member

Motivation

Problems with our current/previous CI setup include high cost, complexity, artifactory downtimes, and lack of reproducibility.

Cost

Currently, CI workers cannot be scaled down below two, even though contributor activity is so low that on most days of the week we don't need even a single one.

The resources available for CI runs should be small by default to save cost as long as we don't have a lot of activity.
In times of high activity (e.g. during peak times or bigger efforts spanning a lot of repos / PRs), CI should scale up automatically, or privileged contributors should at least be able to enable a higher scale of available resources.
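
The scaling behaviour described above could be sketched roughly as follows. This is only an illustration of the intended policy, not how the Jenkins agents are configured today: the `ScalingPolicy` class, its limits, and the `boosted` flag (standing in for a privileged contributor enabling a higher scale) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingPolicy:
    """Hypothetical policy; the limits are illustrative, not taken from the current setup."""
    min_agents: int = 0          # allow scaling to zero while nothing is queued
    max_agents: int = 2          # ceiling for automatic scaling
    boosted_max_agents: int = 6  # ceiling once a privileged contributor enables a boost
    builds_per_agent: int = 2    # queued builds one agent is expected to work through


def desired_agents(queued_builds: int, policy: ScalingPolicy, boosted: bool = False) -> int:
    """Return how many agents should be running for the given queue length."""
    if queued_builds <= 0:
        return policy.min_agents
    ceiling = policy.boosted_max_agents if boosted else policy.max_agents
    # Ceiling division: 3 queued builds at 2 builds per agent need 2 agents.
    needed = -(-queued_builds // policy.builds_per_agent)
    return max(policy.min_agents, min(needed, ceiling))
```

With the default limits assumed here, an empty queue yields zero agents, and a long queue is capped at two agents unless the boost is enabled.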

Complexity

The different CI jobs are unnecessarily entwined and complex; for example, they copy the build harness and other artifacts around instead of publishing them to and consuming them from artifactory. Even for long-time contributors the CI setup is hard to understand, debug, and fix. Often we need to wait for @Cervator to find time to resolve an issue, update configuration, etc.

Individual CI jobs should be independent of each other and use artifactory as the source of truth. Aside from test results, which don't need to be persisted long-term, any (build) artifacts should be published to and consumed from artifactory. Job contracts (required inputs, expected outputs) should be documented and supported with architecture and data flow diagrams to make the CI setup easier to understand and maintain for everyone. This will allow us to distribute the work better and react faster in case of issues.
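
One way to make such job contracts checkable, rather than only documented, is to write them down as data. The following is a minimal sketch under that assumption; `JobContract` and `unsatisfied_inputs` are hypothetical names, and the artifact names used with them would need to match whatever is actually published to artifactory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobContract:
    """Hypothetical description of one CI job's contract."""
    name: str
    inputs: frozenset   # artifacts the job consumes from artifactory
    outputs: frozenset  # artifacts the job publishes to artifactory


def unsatisfied_inputs(contracts):
    """Map each job name to the inputs that no documented job publishes."""
    published = set()
    for contract in contracts:
        published |= contract.outputs
    missing = {}
    for contract in contracts:
        gap = contract.inputs - published
        if gap:
            missing[contract.name] = sorted(gap)
    return missing
```

A non-empty result would point at an input that still reaches a job through some other channel, such as files copied between jobs.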

Artifactory Downtimes

In the past, artifactory went down every few weeks/months (depending on activity), IIRC mostly due to out-of-space or out-of-memory issues. This hurts active contributors building and testing locally as well as CI runs that consume artifacts from or publish artifacts to artifactory. While "old" contributors can often rely to a degree on cached information, new contributors cannot and are hit hard by artifactory downtimes.

Artifactory as the source of truth should be as stable and highly available as possible to avoid blocking contributors old or new. Space issues should be mitigated by adding more capacity or by archiving or rotating artifacts. Periodic health checks should verify that artifactory is available and at least attempt to restart it if it is not.
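
A health check with a restart attempt could look roughly like the sketch below. It probes Artifactory's `api/system/ping` REST endpoint, which answers with `OK` when the instance is healthy. The base URL and the `restart` callback are assumptions: how a restart is actually triggered (for example via kubernetes) depends on the infra setup and is deliberately left to the caller.

```python
import time
import urllib.error
import urllib.request


def artifactory_is_up(base_url: str, timeout: float = 10.0) -> bool:
    """Return True if the ping endpoint answers with HTTP 200 and body "OK"."""
    url = base_url.rstrip("/") + "/api/system/ping"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200 and response.read().decode().strip() == "OK"
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False


def check_and_restart(is_up, restart, attempts=3, delay_seconds=30.0, sleep=time.sleep):
    """Probe up to `attempts` times; call `restart` once if every probe fails.

    Returns True if a probe succeeded, False if a restart was triggered.
    """
    for attempt in range(attempts):
        if is_up():
            return True
        if attempt < attempts - 1:
            sleep(delay_seconds)  # give a slow instance time before the next probe
    restart()
    return False
```

A periodic job could then call something like `check_and_restart(lambda: artifactory_is_up("https://artifactory.example.org/artifactory"), restart=trigger_restart)`, where both the URL and `trigger_restart` are placeholders.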

Lack of Reproducibility

Due to the complexity of our CI setup, especially interdependencies between jobs, as well as custom logic in Jenkinsfiles, CI runs are currently hard to reproduce for developers. In particular, a release cannot currently be rebuilt locally because the release tag does not record the state of the included modules (their last commit at release time). In addition, the lack of proper (read: non-SNAPSHOT) versioning and BoM information across omega makes releases irreproducible.

In addition to reducing complexity, logic that currently lives in the Jenkinsfiles should be maintained as gradle tasks where possible, so that developers can easily reproduce it locally. Furthermore, a workspace pinning mechanism similar to @skaldarnar's NodeGooey would already make it easier to reproduce issues of other contributors or (omega) releases by providing a kind of BoM.

Proposal

On a high level, what will the effort entail? (there's space for a fine-granular task breakdown farther down)

  • stabilize artifactory
  • make artifactory source of truth
  • remove interdependencies between CI jobs
  • improve CI setup and maintenance documentation incl. diagrams
  • support workspace pinning

Which areas of Terasology will the effort affect?

  • infra/CI incl. jenkins and artifactory
  • engine and module repo pushval and post-merge CI runs
  • CI means for release and other maintenance activities

What is the "Definition of Done" for this effort?

  • all jobs publish artifacts solely to and consume artifacts solely from artifactory (except test results)
  • artifactory health check with restart functionality if unavailable
  • if possible artifact archiving and/or rotating configured
  • optional: if necessary revised resource capacities for artifactory
  • runbooks for common maintenance activities and debugging common issues
  • architecture diagram for CI setup incl. jenkins, artifactory, github, their sub-entities (e.g. jobs/repos) and their communication channels
  • sequence/activity/flow diagram at least for (omega) release process
  • workspace pinning mechanism + runbook for pinning workspace as part of release process
  • optional: automated workspace pin creation during release
  • automatic down-scaling of agents, manual upscaling of agents (by default, one agent should always be available to get started, but it should not run 24/7)
  • optional: automatic scale-up of agents in case of high demand
  • if possible build logic formerly located in Jenkinsfiles transferred into new gradle tasks

Concerns

Is there specific expertise that will be needed for this effort?

  • access to infra agents (kubernetes, jenkins, artifactory) and IaC repo
  • expertise in kubernetes, jenkinsfiles, gradle, etc.
  • familiarity with existing setup

Do you expect this effort to conflict with any other efforts?

  • unavailability of jenkins or artifactory as well as failures of jenkins jobs or artifactory usage (publishing / consuming) unrelated to code changes can block development and discourage old and new contributors

What are potential drawbacks of the effort?

  • effort might leave CI setup in worse state than before if it's not completed
  • might introduce new issues

What are maintenance or continuous efforts that will persist beyond the completion of this effort?

  • software updates for infra agents, plugins, and dependencies
  • debugging and fixing future issues

Task Breakdown

tbd

  • What are the individual tasks that need to be done to complete the effort?
  • Can you roughly estimate how hard the individual tasks would be for a software developer with 2 years of on-the-job Java development expertise, but no in-depth expertise in special areas such as rendering or AI?
  • Which tasks are inter-dependent?
  • Which tasks can be done in parallel?

Additional notes

Current CI Setup

image

Desired CI Setup

image

@jdrueckert jdrueckert added the Category: Doc, Category: Build/CI, Status: Needs Discussion, Size: L, Type: Improvement, and Topic: Stabilization labels Sep 3, 2023
@skaldarnar
Member

builds should also be reproducible for CI (e.g. when Artifactory was down, there was a power outage, …)

@skaldarnar
Member

skaldarnar commented Sep 3, 2023

optional: automated workspace pin creation during release

Hm, I thought this would be the other way around, that we pin which versions we want to release… 🤔

but that would require much more development and tooling regarding versioning, wouldn't it?

Not necessarily. If we try to make all the versions nice and clean and follow semver, probably yes, but if we can pin based on a commit hash, that would still be reproducible while not engaging in versioning hell.

I imagine this release to be one of the core maintainers checking out the whole workspace in the latest commits (in most cases), testing it locally, and then somehow pinning this state (based on commit hashes)
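
Such a commit-hash-based pin could be sketched as follows. This is only an illustration of the idea, not an existing tool: the function names and the JSON layout are hypothetical, and the sketch assumes the module repos are checked out as git repositories in one directory of the workspace.

```python
import json
import subprocess
from pathlib import Path


def git_head(repo: Path) -> str:
    """Return the commit hash currently checked out in `repo`."""
    result = subprocess.run(
        ["git", "-C", str(repo), "rev-parse", "HEAD"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def pin_workspace(engine_dir: Path, modules_dir: Path, head=git_head) -> dict:
    """Record the checked-out commit of the engine and of every module repo."""
    pins = {"engine": head(engine_dir), "modules": {}}
    # Only directories that contain a .git entry are treated as module checkouts.
    for repo in sorted(p for p in modules_dir.iterdir() if (p / ".git").exists()):
        pins["modules"][repo.name] = head(repo)
    return pins


def write_pin(pins: dict, target: Path) -> None:
    """Write the pin as sorted JSON so that diffs between pins stay readable."""
    target.write_text(json.dumps(pins, indent=2, sort_keys=True) + "\n")
```

The resulting file could be committed or attached to the release, and restoring a pinned workspace would amount to checking out the recorded hash in each repo.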

@jdrueckert jdrueckert added this to the 2023 Revive - Milestone 3 milestone Nov 18, 2023
@soloturn
Contributor

soloturn commented Apr 7, 2024

@jdrueckert, what is the build harness which is passed around, and what is the index repo? am i right if:

@jdrueckert
Member Author

@jdrueckert, what is the build harness which is passed around, and what is the index repo?

Index repo: https://github.com/Terasology/index
Build harness is basically a collection of files that are required to build modules "stand-alone" (see here)

am i right if:
* engine repo: https://github.com/MovingBlocks/Terasology/

yes, correct, that's our engine repo

* module repo: e.g. https://github.com/Terasology/JoshariasSurvival

yes, https://github.com/Terasology/JoshariasSurvival is one of our module repos. basically any repo in https://github.com/Terasology is a module repo except for the index repo (which IMHO shouldn't be in there).

but i am wondering where this is built

Modules are built by Jenkins like the engine is: https://jenkins.terasology.io/job/Terasology/job/Modules/
For CI purposes you have the normal branch/PR jobs and the develop job which runs after a PR is merged.
When we build a new Terasology release, we have a dedicated omega job which fetches the Terasology.zip built by the engine job and the module jars built by the module jobs from the artifactory to create the TerasologyOmega.zip which we attach to our releases (e.g. see https://github.com/MovingBlocks/Terasology/releases/tag/v5.3.0) and which is consumed by the launcher.
In the current CI setup diagram that's this part
image
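
Conceptually, the assembly step of the omega job amounts to something like the sketch below. This is not the actual job logic: `assemble_omega` is a hypothetical function, the inputs are assumed to have been downloaded from artifactory already, and placing the module jars under a `modules/` folder inside the zip is an assumption about the layout.

```python
import zipfile
from pathlib import Path


def assemble_omega(engine_zip: Path, module_jars: list, target: Path,
                   modules_prefix: str = "modules/") -> None:
    """Copy every entry of the engine zip into `target` and add the module jars."""
    with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as omega:
        with zipfile.ZipFile(engine_zip) as engine:
            for entry in engine.infolist():
                # Passing the ZipInfo keeps the entry name and metadata as they were.
                omega.writestr(entry, engine.read(entry))
        for jar in sorted(module_jars):
            omega.write(jar, arcname=modules_prefix + jar.name)
```

Because the function only needs the engine zip and the module jars as inputs, the same step could in principle be run locally once those artifacts are available.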
