[KOGITO-9940] Add E2E test cases for platform configured with Job Service and Data Index in a combination of scenarios with ephemeral and postgreSQL persistence in dev and production profiles #337
Conversation
Really nice! Just a few minor comments. Thank you!
    corev1.ResourceCPU:    resource.MustParse("100m"),
    corev1.ResourceMemory: resource.MustParse("256Mi"),
    corev1.ResourceCPU:    resource.MustParse("500m"),
    corev1.ResourceMemory: resource.MustParse("1Gi"),
},
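For reference, here is a minimal sketch (not the actual operator code) of how quantities like these are typically grouped in a corev1.ResourceRequirements value; the fragment above only shows the individual values, so treating the first CPU/memory pair as requests and the second as limits is an assumption:

package main

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// Assumed grouping: first CPU/memory pair as requests, second pair as limits.
var dataIndexResources = corev1.ResourceRequirements{
	Requests: corev1.ResourceList{
		corev1.ResourceCPU:    resource.MustParse("100m"),
		corev1.ResourceMemory: resource.MustParse("256Mi"),
	},
	Limits: corev1.ResourceList{
		corev1.ResourceCPU:    resource.MustParse("500m"),
		corev1.ResourceMemory: resource.MustParse("1Gi"),
	},
}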
Is there any justification for this change? Have you run a benchmark? @wmedvede do you have any idea regarding DI/JS resource consumption? Can we have a follow-up task to get a more accurate number? I feel like this can be too much for a default setup, mainly when running it locally.
I got an OOMKill with 256Mi for the DI. I increased it to 512Mi, along with the CPU limits, to try to speed up the deployment, since the container takes 2 minutes 40 seconds to reach ready status, which is significantly more than the Job Service (90 seconds).
It didn't help much on either count. I can reduce it to 512Mi and the CPU to 100m for the limits if you think that aligns with your expectations.
Has there been any previous testing of the resource limits for the DI container, or were these values selected on a best-effort basis?
Don't worry about changing these numbers now; we can do a benchmark later and get closer numbers and a good approximation for users depending on their environment.
Ack. I'll set the memory request and limit to 512Mi to avoid random OOMKills. I wonder if the startup time is caused by the JVM resizing its memory capacity as it runs the code....
I can confirm that on an OCP cluster the Data Index service failed to start with the default values set by the operator, and I had to set the resource limits explicitly in the platform CR to make it start:
services:
  dataIndex:
    enabled: true
    podTemplate:
      container:
        image: "quay.io/kiegroup/kogito-data-index-postgresql-nightly:latest" # To be removed when stable version is released
        resources:
          limits:
            cpu: 500m
            memory: 512Mi
    persistence:
Please don't merge until @domhanak's review while I'm on PTO.
@jordigilh were you able to run the tests locally? It seems that we have a build problem.
Yes, but with Go 1.20. I see one of the functions I used is not supported in 1.19. I'll fix that.
@ricardozanini Running the e2e test suite locally I'm getting an error for the existing e2e tests:
Seems like the greetings container has been rebuilt for a newer version of Java.
LGTM
Just one comment from my side: I rebased this locally with main and executed the e2e tests, and got this error. Looks like one test is not passing, but maybe it's my local environment.
SonataFlow Operator Validate that Platform services and flows are running successfully when creating a simple workflow [It] with both Job Service and Data Index and postgreSQL persistence and the workflow in a production profile
/home/wmedvede/development/projects/kogito/kogito-serverless-operator/test/e2e/workflow_test.go:357
[FAILED] Timed out after 300.002s.
Expected success, but got an error:
<*errors.errorString | 0xc00037cab0>:
kubectl wait pod -n test-459 -l sonataflow.org/workflow-app --for condition=Ready --timeout=30s failed with error: (exit status 1) error: no matching resources found
{
s: "kubectl wait pod -n test-459 -l sonataflow.org/workflow-app --for condition=Ready --timeout=30s failed with error: (exit status 1) error: no matching resources found\n",
}
In [It] at: /home/wmedvede/development/go/go1.20.4/src/reflect/value.go:586 @ 01/12/24 16:20:22.891
Summarizing 1 Failure:
[FAIL] SonataFlow Operator Validate that Platform services and flows are running successfully when creating a simple workflow [It] with both Job Service and Data Index and postgreSQL persistence and the workflow in a production profile
/home/wmedvede/development/go/go1.20.4/src/reflect/value.go:586
Ran 7 of 7 Specs in 1476.462 seconds
FAIL! -- 6 Passed | 1 Failed | 0 Pending | 0 Skipped
--- FAIL: TestE2E (1476.46s)
FAIL
FAIL command-line-arguments 1476.476s
FAIL
make: *** [Makefile:349: test-e2e] Error 1
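For context on how this failure mode arises, here is a rough sketch (not the actual test helper) of the pattern the log suggests: a short kubectl wait shelled out repeatedly inside a Gomega Eventually with an overall five-minute budget, matching the "Timed out after 300.002s" message. The helper name, namespace argument, and timeouts are assumptions based on the log above.

package e2e_test

import (
	"fmt"
	"os/exec"

	. "github.com/onsi/gomega"
)

// waitForWorkflowPod is a hypothetical helper mirroring the command in the log:
// it shells out to kubectl and returns an error while no matching pod exists
// or the pod is not Ready within the short per-attempt timeout.
func waitForWorkflowPod(namespace string) error {
	cmd := exec.Command("kubectl", "wait", "pod",
		"-n", namespace,
		"-l", "sonataflow.org/workflow-app",
		"--for", "condition=Ready",
		"--timeout=30s")
	if out, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("kubectl wait failed: %w: %s", err, string(out))
	}
	return nil
}

// Retry the short wait until the overall budget is exhausted.
func assertWorkflowPodReady(namespace string) {
	Eventually(func() error {
		return waitForWorkflowPod(namespace)
	}, "5m", "10s").Should(Succeed())
}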
The PR check also complains about missing headers in some files.
Fixed 😄
Re-ran them one more time with success...
I can only guess that the problem you found was that the data index or job service pods failed to deploy. If you see this again, please capture the logs from the operator and the platform CR so I can troubleshoot it.
So it looks like there is a consistent failure on the PR check - 3 out of 3 reruns:
I am currently not sure why this is happening; locally it passes. It should be investigated to keep the CI stable.
LGTM, thank you @jordigilh. Please rebase for the CI to execute these after the kind migration.
It's because the e2e job has set
I noticed this while troubleshooting #322. I removed the env variable in the job and the problem disappeared. If that variable is required for other reasons, I can resort to setting it to
We should remove this
@jordigilh just a last generation check and we should be good!
…vice and Data Index in a combination of scenarios with ephemeral and postgreSQL persistence in dev and production profiles Signed-off-by: Jordi Gil <[email protected]>
…ted in golang 1.19 Signed-off-by: Jordi Gil <[email protected]>
Signed-off-by: Jordi Gil <[email protected]>
Signed-off-by: Jordi Gil <[email protected]>
Signed-off-by: Jordi Gil <[email protected]>
Signed-off-by: Jordi Gil <[email protected]>
Signed-off-by: Jordi Gil <[email protected]>
…hen running the ephemeral postgres Signed-off-by: Jordi Gil <[email protected]>
…lth status since some finish quicker than the time it takes for the logic to evaluate the health endpoint and causes a test failure Signed-off-by: Jordi Gil <[email protected]>
@ricardozanini can we merge this PR? It's green
…vice and Data Index in a combination of scenarios with ephemeral and postgreSQL persistence in dev and production profiles (apache#337)
Extends the E2E tests to include coverage for the scenarios where Job Service and Data Index are deployed:
Each test case takes between 4 and 5 minutes to run, so I have limited the coverage to these 4 cases.
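Purely as an illustration of how four such combinations could be enumerated, here is a hedged sketch using a Ginkgo table spec. It assumes the cases are the persistence (ephemeral vs. PostgreSQL) by profile (dev vs. production) combinations suggested by the spec names above; runWorkflowScenario and the entry labels are hypothetical and do not necessarily match the real test layout.

package e2e_test

import (
	. "github.com/onsi/ginkgo/v2"
)

// runWorkflowScenario is a hypothetical helper standing in for the real test
// body: deploy a platform with Data Index and Job Service, deploy the workflow,
// and assert everything reaches ready status.
func runWorkflowScenario(profile, persistence string) {
	// ... apply platform CR, apply workflow, wait for readiness ...
}

var _ = Describe("Platform services and flows", func() {
	DescribeTable("runs a simple workflow with Job Service and Data Index",
		func(profile, persistence string) {
			runWorkflowScenario(profile, persistence)
		},
		Entry("dev profile with ephemeral persistence", "dev", "ephemeral"),
		Entry("dev profile with PostgreSQL persistence", "dev", "postgresql"),
		Entry("production profile with ephemeral persistence", "prod", "ephemeral"),
		Entry("production profile with PostgreSQL persistence", "prod", "postgresql"),
	)
})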