[3.x] Trim job IDs from failed job tags #743

sebdesign · 2020-01-13T22:37:49Z

In light of #715 and #740, and after reading the insights from @travisaustin, I think I found where the memory hog comes from.

Everytime a job fails, the ID is added to each tag. If there is no other failure in the next 48 hours, the tags expires. But otherwise, the StoreTagsForFailedJob keeps adding job IDs and updates the expiration date, but the previous IDs were never removed, which probably causes memory issues.

In this commit we'll remove the job IDs for that are older than 48 hours for each tag, in respect of the TTL in StoreTagsForFailedJob.

For example:

On 2020-01-01 00:00:00 a job with ID of 1 and tagged App\User:1 fails. The tag failed:App\User:1 is added with the job ID 1.
So after 2880 minutes, i.e. on 2020-01-03 00:00:00 the failed:User:1 tag should expire.
Meanwhile, on 2020-01-02 00:00:00 another job with ID of 2 and tagged App\User:1 fails. The ID 2 is added to the failed:App\User:1 tag, and the expiration of the tag is delayed for 2880 minutes, i.e. 2020-01-04 00:00:00.
If there are more failures in the next 2880 minutes, the new IDs will keep getting appended, and the previous IDs are never removed.

Now with this PR:

If there is a failure on 2020-01-03 00:00:00 for a job with ID of 3, it will be added as usual to the failed:App\User:1 tag, along with the existing IDs 1 and 2, but now the score is the precise timestamp that it was added, like in RedisJobRepository@storeJobReference.
Then, it will trim all the job IDs that were added 2880 minutes ago or earlier from this tag, like in RedisJobRepository@trimFailedJobs. So the id 1 will be removed.

If this is correct, then we could revert #741 and trim the job IDs from the recent: tags each time a job is added, like with the failed: tags.

Everytime a job fails, the ID is added to each tag. If there is no other failure in the next 48 hours, the tags expires. But otherwise, the `StoreTagsForFailedJob` keeps adding job IDs and updates the expiration date, but the previous IDs were **never** removed, which probably causes memory issues. In this commit we'll remove the job IDs for that are older than 48 hours for each tag, in respect of the TTL in `StoreTagsForFailedJob`.

travisaustin · 2020-01-14T00:22:45Z

I think you are mostly correct, but I think you are only solving half the issue.

For example, in my production environment, the Redis keys that are filling up are:

horizon:failed:<MODEL CLASS NAME>:<MODEL ID>
horizon:recent:<MODEL CLASS NAME>:<MODEL ID>

In my case, each of these keys has a TTL of 15 minutes. The problem is that I create a job of each ID of each model every 2 minutes (our workflow polls a remote system for each model every 2 minutes). So the TTL starts at 15 minutes, and when it hits ~13 minutes, a new job is dispatched for the model, which adds a new Job ID to that key, and the TTL is reset to 15 minutes. Since it never times out, my keys have 50,000 or more job IDs each even though Horizon only shows about 1,000 recent jobs.

I have not tested your PR, but, based on your writeup, you are only removing keys from horizon:failed. I would like to see horizon:recent trimmed, as well.

Thanks for the PR. I am excited to have it!!

sebdesign · 2020-01-14T00:35:56Z

@travisaustin the horizon:recent tags feature was removed in #741, that's why i'ts not implemented in this PR.

How did you set the TTL to 15 minutes? The TTL for the failed tags is hardcoded to 2880 seconds here.

Regarding your example, if this PR gets merged, you won't accumulate 50,000 IDs per key, because each time a job ID is added to the queue, the old IDs (now() - 2880 min) will be removed. Redis will know which IDs to remove because they are now added with a score that matches the timestamp they were added. I hope that makes sense to you!

driesvints · 2020-01-14T12:49:49Z

src/Contracts/TagRepository.php

+     * @param  array  $tags
+     * @return void
+     */
+    public function trim(array $tags);


We can't add this in a minor release because it'll break implementations who don't have this method defined.

That means when trim is called it will also break those implementations, since they don't define this method?

It'll break as soon as they install this package because the implementation won't adhere to the contract.

I mean if we didn't add it to the contract, the trim method would fail during runtime because it's probably not defined on the implementation, when it's called on the listener.

taylorotwell · 2020-01-14T14:39:18Z

I think I will just hold off on this for now.

driesvints changed the title ~~Trim job IDs from failed job tags~~ [3.x] Trim job IDs from failed job tags Jan 14, 2020

driesvints reviewed Jan 14, 2020

View reviewed changes

taylorotwell closed this Jan 14, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[3.x] Trim job IDs from failed job tags #743

[3.x] Trim job IDs from failed job tags #743

sebdesign commented Jan 13, 2020 •

edited

Loading

travisaustin commented Jan 14, 2020 •

edited

Loading

sebdesign commented Jan 14, 2020

driesvints Jan 14, 2020

sebdesign Jan 14, 2020

driesvints Jan 14, 2020 •

edited

Loading

sebdesign Jan 14, 2020

taylorotwell commented Jan 14, 2020

[3.x] Trim job IDs from failed job tags #743

[3.x] Trim job IDs from failed job tags #743

Conversation

sebdesign commented Jan 13, 2020 • edited Loading

travisaustin commented Jan 14, 2020 • edited Loading

sebdesign commented Jan 14, 2020

driesvints Jan 14, 2020

Choose a reason for hiding this comment

sebdesign Jan 14, 2020

Choose a reason for hiding this comment

driesvints Jan 14, 2020 • edited Loading

Choose a reason for hiding this comment

sebdesign Jan 14, 2020

Choose a reason for hiding this comment

taylorotwell commented Jan 14, 2020

sebdesign commented Jan 13, 2020 •

edited

Loading

travisaustin commented Jan 14, 2020 •

edited

Loading

driesvints Jan 14, 2020 •

edited

Loading