
Add Compliance Test Suite to automated build #353

Closed
planetf1 opened this issue Nov 6, 2018 · 28 comments
Labels
build-improvement Build improvements - maven, gradle, GitHub actions conformance-testing Egeria conformance testing consumability Makes the software easier to use or understand. Includes docs, error messages, logs, API definitions pinned Keep open (do not time out) testing testing - including automation

Comments

@planetf1
Member

planetf1 commented Nov 6, 2018

I suggest we add running the compliance test suite against the in-memory repository as part of
our once-daily build

At a later point in time we can consider if this could be run against other metadata repositories.

@planetf1 planetf1 added enhancement New feature or request consumability Makes the software easier to use or understand. Includes docs, error messages, logs, API definitions labels Nov 6, 2018
@planetf1
Member Author

I presume this is still a requirement @mandy-chessell @cmgrote @grahamwallis

@planetf1 planetf1 self-assigned this Jul 19, 2019
@mandy-chessell
Contributor

Yes please

@cmgrote
Member

cmgrote commented Jul 23, 2019

At a later point in time we can consider if this could be run against other metadata repositories.

There's an initial version of this already as part of our Helm charts for other metadata repositories... Just set the following in the values.yaml of the vdc chart (by default it's set to false), and it should create and configure a CTS instance for each repository being deployed as part of the chart:

# Egeria Conformance Test Suite - sets up to run against all Egeria repositories (if enabled=true)
cts:
  enabled: true

(I think this is probably our best option, since it will require such an external repository to first exist -- probably not something that will ever be part of our automated Jenkins builds, particularly for proprietary / licensed repositories?)

@planetf1 planetf1 added the cicd label Oct 18, 2019
@mandy-chessell mandy-chessell added this to the 2019.11 (1.2) milestone Oct 25, 2019
@planetf1 planetf1 added the testing testing - including automation label Nov 1, 2019
@planetf1 planetf1 added the build-improvement Build improvements - maven, gradle, GitHub actions label Nov 11, 2019
@planetf1
Member Author

In addition to running the CTS, the results should be shared in some way - for example through some kind of build artifact - so that a consuming organisation could refer back to the CTS results for a shipped release, and developers can see CTS results from each build run.

These could then be linked to from release notes/top level readme

@planetf1 planetf1 added functionality-call To discuss and agree during weekly functionality call(s) and removed enhancement New feature or request labels Nov 11, 2019
@planetf1
Member Author

For 1.2 I plan to execute the CTS & post the results at, or with a link from, the GitHub releases page. Lessons learned will be used to help refine the requirements for 1.3, where automation will be targeted.

@planetf1 planetf1 modified the milestones: 2019.11 (1.2), 2019.12 (1.3) Nov 29, 2019
@planetf1
Member Author

CTS is now looking good in 1.2, but for this release the run is done semi-manually (i.e. via a notebook). Automation will follow in a subsequent release.

@planetf1
Member Author

I am starting to prototype some CI/CD definitions for automated Helm chart deployment.
Initially this targets Azure with a very limited subscription as a POC, and with a basic notebook deployment (fewer moving parts), but I will a) move to a fuller subscription and b) add the CTS once some initial proof points are complete.

I'll link the necessary PRs for the CI/CD definitions here. Some of the changes are in base egeria, others are done directly through azure pipelines.

planetf1 referenced this issue in planetf1/egeria Dec 16, 2019
planetf1 referenced this issue in planetf1/egeria Dec 16, 2019
planetf1 referenced this issue in planetf1/egeria Dec 17, 2019
@planetf1 planetf1 modified the milestones: 2019.12 (1.3), 2020.01 (1.4) Dec 20, 2019
@planetf1 planetf1 modified the milestones: 2020.01 (1.4), 2020.02 (1.5) Jan 23, 2020
@cmgrote cmgrote removed the functionality-call To discuss and agree during weekly functionality call(s) label Feb 10, 2020
@github-actions

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 20 days if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the no-issue-activity Issues automatically marked as stale because they have not had recent activity. label Apr 24, 2021
@planetf1 planetf1 removed the no-issue-activity Issues automatically marked as stale because they have not had recent activity. label Apr 28, 2021
@github-actions

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 20 days if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the no-issue-activity Issues automatically marked as stale because they have not had recent activity. label Jun 29, 2021
@mandy-chessell mandy-chessell removed the no-issue-activity Issues automatically marked as stale because they have not had recent activity. label Jun 30, 2021
@planetf1 planetf1 removed the cicd label Aug 3, 2021
@github-actions

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 20 days if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the no-issue-activity Issues automatically marked as stale because they have not had recent activity. label Oct 20, 2021
@planetf1 planetf1 added pinned Keep open (do not time out) and removed no-issue-activity Issues automatically marked as stale because they have not had recent activity. labels Oct 27, 2021
@planetf1
Member Author

planetf1 commented Apr 1, 2022

The charts worked well for release testing on graph & inmem (after updates for inmem & OpenShift).

Next step would be to look at running this automatically within a CI/CD workflow, probably triggered off a schedule.

We could take something like the KinD workflow from https://github.com/odpi/egeria-database-connectors/blob/main/.github/workflows/connector-test.yml - obviously there's more to do.

Also referenced lately during the 3.7 release #6341

@planetf1
Member Author

A few simple things we could check

  • number of tests failed is 0
  • number of tests successful is > ???? (current number? just big? could be checked in)
  • no exceptions in audit log
  • profile results compare to baseline (which could be checked in)
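
The first two checks above could be sketched as a tiny gate. A minimal sketch in Python, assuming the assertion counts have already been extracted from the CTS results (the function name `cts_gate` and the baseline threshold are hypothetical, not part of the CTS):

```python
def cts_gate(failed: int, successful: int, min_successful: int) -> bool:
    """Simple pass/fail gate for a CTS run: zero failed assertions and
    at least a baseline count of successful ones (the baseline could be
    checked in alongside the workflow)."""
    return failed == 0 and successful >= min_successful

# Example using the counts from a later run in this thread (246942 / 32)
print("PASS" if cts_gate(failed=32, successful=246942, min_successful=200000) else "FAIL")
```

In CI the boolean would be mapped to an exit code so the job fails when the gate does.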

Better would be to properly map the tests to test suites, i.e. much more fine-grained, but this likely requires substantial development.

CTS failures can take a while to investigate, so automation could pick up issues a lot quicker - for example by running daily.

One concern is whether the 7GB agent running KinD would have enough resource to complete the tests

@planetf1
Member Author

planetf1 commented Dec 14, 2022

An initial version of this is now being tested at https://github.com/planetf1/cts
Still debugging, but expected:

The github action will

  • run 2 parallel jobs - for inmem & graph
  • Install k8s within a container (KinD)
  • setup & install our cts chart
  • wait for results
  • capture and post the results as an attachment

Caveats

  • Manual trigger only (for testing)
  • 'name' of job is based on connector which is a long name - needs parsing to something simpler
  • Personal repo - just to get started quickly (need to discuss in team where this belongs)
  • Need to consider scheduling - daily? Triggers? dependencies?
  • How to report / ensure results looked at - slack?
  • Need some simple analysis of the results for pass/fail (ie # tests, exceptions etc) (maybe split out from test)
  • Hardcoded to 3.14-SNAPSHOT (may benefit from a tag for latest release)

cc: @cmgrote @mandy-chessell @lpalashevski

@planetf1 planetf1 mentioned this issue Dec 14, 2022
1 task
@planetf1
Member Author

planetf1 commented Dec 14, 2022

After 4.5 hours, the CTS is still running - even at the minimum size (5 vs 1), and even for the in-memory repository (graph takes longer)
-> https://github.com/planetf1/cts/actions/runs/3695246683/jobs/6257391400

I set the job timeout to 5 hours (max for github is 6 - then job gets killed)

We have 2 CPUs and 7GB RAM, but it may be that we cannot 'fit' the CTS into this resource.

If not we need additional infrastructure - one of

  • An enterprise github account (can use larger github hosted runners)
  • External runners (need to deploy on our own infrastructure, then install the GitHub client code). Could be k8s but needs resource/funding
  • Skip GitHub Actions and use external resources directly - as above

Or we figure out how to speed up Egeria/CTS significantly.

I'll check the config further & try to debug via SSH just in case of any errors, and extend the timeout closer to 6h.

@cmgrote
Member

cmgrote commented Dec 14, 2022

All worth looking at -- when I run locally these days it's against 20GB memory and 3 cores, and at a size of 2. I think it finishes within 3 hours or less (for XTDB).

So my first hunch would be that 7GB and 2 cores is probably too small (7GB may be the main culprit -- could it just be hitting a non-stop swapping scenario?)

@planetf1
Member Author

I usually run on a 3-6 x 16GB cluster ... though often with multiple instances in parallel (all the charts).

I have run locally in around 6-8GB, but indeed this config may sadly be too small.

I'm going to take a closer look if I can get an SSH session set up.

@planetf1
Member Author

planetf1 commented Dec 15, 2022

Two projects to set up GitHub Actions runners on a k8s cluster:

https://github.com/evryfs/github-actions-runner-operator
https://github.com/actions-runner-controller/actions-runner-controller

The latter is being taken over by GitHub for native support: actions/actions-runner-controller#2072

@planetf1
Member Author

Investigated external runners -- but hit issues with KinD. Commented in the actions-runner-controller discussion.

Reverted to debugging github runners. The following fragment assisted with debugging (see https://github.com/lhotari/action-upterm for more info):

      # === debug
      - name: Setup upterm session
        uses: lhotari/action-upterm@v1
        with:
          ## limits ssh access and adds the ssh public key for the user which triggered the workflow
          limit-access-to-actor: true

The issue turned out to be that the strimzi operator pod was not starting due to failing to meet CPU constraints. This defaulted to '500m' (0.500 CPU units), which should have been OK. However, even '1m' failed to schedule. This looks like a KinD issue, but overriding the minimum CPU to '0m' allowed the pods to schedule. This was needed for our own pods too.

Added additional checks. For example:

          until kubectl get pod -l app.kubernetes.io/name=kafka -o go-template='{{.items | len}}' | grep -qxF 1; do
            echo "Waiting for pod"
            sleep 1
          done

This fragment simply loops until the pod matching expression exists. (kubectl rollout status may also be useful)

Then we can do

          kubectl wait pods --selector=app.kubernetes.io/name=kafka --for condition=Ready --timeout=10m

This will immediately return if the pod matching expression doesn't exist, which is why the above check is needed first.

None of these checks help run the CTS as such; rather, they help report the current stage in the GitHub Actions job log.

If CTS works we can revisit better approaches, custom actions etc.

@planetf1
Member Author

SUCCESSFUL test run -> https://github.com/planetf1/cts/actions/runs/3708502295 - i.e. tasks completed as successful.

Results are attached to the job.

Will elaborate the job to do some basic checks of the results.

@planetf1
Member Author

planetf1 commented Dec 16, 2022

Example output I'm experimenting with

This is based on the positive/negative evidence counts in the detailed CTS results, i.e.:

➜  graph ./cts-analyze.py
              Metadata sharing MANDATORY_PROFILE   CONFORMANT_FULL_SUPPORT [  71657 /      0 ]
              Reference copies  OPTIONAL_PROFILE            NOT_CONFORMANT [   8496 /     32 ]
          Metadata maintenance  OPTIONAL_PROFILE   CONFORMANT_FULL_SUPPORT [  14126 /      0 ]
                 Dynamic types  OPTIONAL_PROFILE            UNKNOWN_STATUS [      0 /      0 ]
                 Graph queries  OPTIONAL_PROFILE   CONFORMANT_FULL_SUPPORT [    528 /      0 ]
             Historical search  OPTIONAL_PROFILE     CONFORMANT_NO_SUPPORT [    530 /      0 ]
                Entity proxies  OPTIONAL_PROFILE   CONFORMANT_FULL_SUPPORT [   2759 /      0 ]
       Soft-delete and restore  OPTIONAL_PROFILE   CONFORMANT_FULL_SUPPORT [   2592 /      0 ]
                Undo an update  OPTIONAL_PROFILE     CONFORMANT_NO_SUPPORT [    406 /      0 ]
           Reidentify instance  OPTIONAL_PROFILE   CONFORMANT_FULL_SUPPORT [   2650 /      0 ]
               Retype instance  OPTIONAL_PROFILE   CONFORMANT_FULL_SUPPORT [  16365 /      0 ]
               Rehome instance  OPTIONAL_PROFILE   CONFORMANT_FULL_SUPPORT [   1590 /      0 ]
                 Entity search  OPTIONAL_PROFILE   CONFORMANT_FULL_SUPPORT [  62878 /      0 ]
           Relationship search  OPTIONAL_PROFILE   CONFORMANT_FULL_SUPPORT [   8253 /      0 ]
        Entity advanced search  OPTIONAL_PROFILE   CONFORMANT_FULL_SUPPORT [  44800 /      0 ]
  Relationship advanced search  OPTIONAL_PROFILE   CONFORMANT_FULL_SUPPORT [   9312 /      0 ]

FAIL [246942/32]
➜  graph echo $?         
1

This returns a simple pass/fail - based on whether any assertions have failed.
It does not (yet?) compare to a baseline.

There are many other interpretations we could apply to the data - formatting the evidence, checking for other exceptions in the log, and so on.
Having experimented, a refactoring could make this a lot neater.
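
A stripped-down sketch of that pass/fail logic (the real cts-analyze.py parses the detailed CTS results; the profile tuples below are illustrative, taken from the output above):

```python
# (profile name, requirement, status, positive evidence, negative evidence)
profiles = [
    ("Metadata sharing", "MANDATORY_PROFILE", "CONFORMANT_FULL_SUPPORT", 71657, 0),
    ("Reference copies", "OPTIONAL_PROFILE", "NOT_CONFORMANT", 8496, 32),
]

def analyze(profiles) -> int:
    """Print a summary line and return a shell-style exit code:
    0 if no assertions failed across any profile, 1 otherwise."""
    positive = sum(p[3] for p in profiles)
    negative = sum(p[4] for p in profiles)
    print(f"{'PASS' if negative == 0 else 'FAIL'} [{positive}/{negative}]")
    return 0 if negative == 0 else 1

rc = analyze(profiles)
```

The returned code maps directly onto the `echo $?` check shown above; a baseline comparison would slot in alongside the `negative == 0` test.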

@planetf1
Member Author

Added checks into the latest pipeline.
Set the default container to 'latest'.
Added a daily schedule.

@planetf1
Member Author

I have reverted the doubling of the retry count used during the CTS after seeing run-times on the CTS automation exceed 6 hours. Analysis of the CTS execution is needed, but perhaps we were hitting many more of these time limits than I'd expected, even during successful execution.

See #7314 -- need to test it through ci/cd to get an exact comparison

@planetf1
Member Author

I'm proposing to move my repo under odpi. Whilst we can no doubt make improvements and refactor, it's a starting point, and moving it will make it easier for others to use it, review test results, improve the CTS, and improve our test infrastructure.

@planetf1
Member Author

Having backed off the timer increase, the CTS is now running in 4-4.5 hours. Will leave it like this.

@planetf1 planetf1 removed their assignment May 15, 2023
@mandy-chessell
Contributor

The development work for this is complete.
