How to scale from 1 to 2 jobs in parallel #1186
-
I have this configuration now:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: my-scaled-job
  namespace: simulations
spec:
  pollingInterval: 5             # Optional. Default: 30 seconds
  successfulJobsHistoryLimit: 50 # Optional. Default: 100. How many completed jobs should be kept.
  failedJobsHistoryLimit: 50     # Optional. Default: 100. How many failed jobs should be kept.
  maxReplicaCount: 100           # Optional. Default: 100
  jobTargetRef:
    parallelism: 1               # Max number of desired pods (https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/#controlling-parallelism)
    completions: 1               # Desired number of successfully finished pods (same link as above)
    activeDeadlineSeconds: 3600  # Duration in seconds, relative to startTime, that the job may be active before the system tries to terminate it; must be a positive integer
    backoffLimit: 6              # Number of retries before marking this job failed. Defaults to 6
    template:
      # Describes the job template (https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/)
      metadata:
        labels:
          jobgroup: my-sim
      spec:
        containers:
          - name: longrunningsimulation
            image: mycontainerregistry.azurecr.io/mycompanyname/simulationservice:{imagetag}
            command: ['node', 'heavy-simulation-code.js']
            env:
              - name: SIMULATION_JOB_QUEUE_NAME
                value: simulation-job-queue
              - name: STORAGE_ACCOUNT_CONNECTION_STRING
                valueFrom:
                  secretKeyRef:
                    name: my-secrets
                    key: STORAGE_ACCOUNT_CONNECTION_STRING
        restartPolicy: Never
        terminationGracePeriodSeconds: 3600
  triggers:
    - type: azure-queue
      metadata:
        queueName: simulation-job-queue
        queueLength: '1'         # Optional. Queue length target for HPA. Default: 5 messages
        connectionFromEnv: STORAGE_ACCOUNT_CONNECTION_STRING
```

The docker image referred to in this job is a Node.js application doing the following:
It seems to work fine most of the time, but I have encountered the following issues:

A lot of different cases seem to work: adding 10 messages "at the same time" makes all of them start at once and complete OK. But gradually adding messages does not work the way I want. I would really like KEDA to spin up new jobs as soon as possible after a new message arrives, as long as there are resources available etc. But now, the last item added is not processed until the second-to-last item is done. What am I doing wrong? I guess I would like to set the …

I also tried setting a long lease period on my message and deleting it at the end, in case the scaling metric looked at the sum of visible and invisible messages in the queue, but it seems like it scales solely on the number of visible queue items (at least that is my hunch). So compared to the tables from https://keda.sh/docs/2.0/concepts/scaling-jobs/#details, I think our case is this:
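To make the behaviour concrete, here is a rough sketch of the "default" vs. "accurate" scaling strategies, based on the formulas in the scaling-jobs docs linked above. Function and variable names are mine, and this simplifies what the scaler actually computes; treat it as an illustration, not the real implementation.

```python
import math


def max_scale(queue_length: int, target: int, max_replicas: int) -> int:
    """Jobs the visible queue length alone would justify, capped at maxReplicaCount."""
    return min(math.ceil(queue_length / target), max_replicas)


def default_strategy(queue_length: int, target: int, max_replicas: int,
                     running_jobs: int) -> int:
    # "default": subtract already-running jobs from the queue-derived scale.
    return max(max_scale(queue_length, target, max_replicas) - running_jobs, 0)


def accurate_strategy(queue_length: int, target: int, max_replicas: int,
                      running_jobs: int, pending_jobs: int = 0) -> int:
    # "accurate": running jobs are assumed to have already consumed their
    # messages, so visible messages map to new jobs (minus not-yet-running pods).
    scale = max_scale(queue_length, target, max_replicas)
    if scale + running_jobs > max_replicas:
        return max_replicas - running_jobs
    return scale - pending_jobs


# The case from this thread: one new visible message while one
# long-running job is still busy (queueLength target = 1, max = 100).
print(default_strategy(1, 1, 100, running_jobs=1))   # -> 0 (no new job until the running one finishes)
print(accurate_strategy(1, 1, 100, running_jobs=1))  # -> 1 (one new job right away)
```

This matches the symptom described above: with the default strategy the running job "eats" the scale target, so the new message waits until it finishes, while "accurate" spins up a job per visible message.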
-
Maybe it is better if I move this into an issue instead? I can't really understand how this is supposed to be the desired behaviour for anyone scaling jobs (though I am open to more context/examples on how the scale calculation works or is supposed to be used, in case my mental model is just wrong). But if it is not an obvious config mistake on my part, I think I either have to get it resolved or worked around, or, worst case, rip KEDA out of our systems again. There are just so few alternatives solving the job-scaling approach, so I was very excited when I saw the job-scaling feature coming into KEDA. I really hope this can serve my use case for long-running simulations in Kubernetes, but right now it just doesn't.
-
Just tested the following combinations:
-
I am also facing the same issue; please find my job.yaml. If I gradually add messages it is not scaling: only once an existing running pod completes does it scale. For example, if I have 6 messages in the queue and 4 pods are already running, then when any one of the 4 pods completes it picks another message from the queue and scales, while the remaining 3 messages are still waiting in the queue. I would like it to scale one pod as soon as a message arrives in the queue. Can you please help me fix this? Thanks.

```yaml
apiVersion: keda.sh/v1alpha1
kind: Service
```
-
This is not working for me as described here for RabbitMQ messages. What scalingStrategy should I use to get this behaviour? I'm using "accurate" as recommended above and still facing the same issue: when a job is currently running and a new message is received in the queue, it doesn't scale up a new job. I would like to spin up a new job as long as there is a message in the queue, regardless of whether a job is already running.
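For reference, the strategy is set under `spec.scalingStrategy` on the ScaledJob. A minimal fragment (field names from the KEDA ScaledJob spec; everything else elided, so this is not a complete manifest):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: my-scaled-job
spec:
  scalingStrategy:
    strategy: accurate   # one of "default", "custom", "accurate"
  # jobTargetRef, triggers, etc. as in the examples above
```

Worth double-checking that the field actually landed under `spec` and not under `jobTargetRef`; a misplaced `scalingStrategy` is silently ignored and the default strategy applies.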
-
@audunsol Thank you for your reply. It's the latest version of KEDA, 2.9. I provided the … I don't think this has anything to do with acknowledgement of messages. My problem is when there is already a long-running job processing a message that had already been pulled from the queue/acknowledged ---> so the queue is now empty, but a job is running, processing it ---> then a new single message is received (1 message in the queue) ---> no new job gets created to process this new message.
@audunsol
Here is the job.yaml referenced above:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: extract-feature-scaledjob
  namespace: test
spec:
  jobTargetRef:
    parallelism: 1 # max number of desired pods
    com…
```