Rare one off jobs & dynamic scheduled jobs #17

Closed
ErisDS opened this issue Jul 29, 2020 · 41 comments
Comments

@ErisDS

ErisDS commented Jul 29, 2020

I was really excited to see this library pop up - it's awesome to see something using native worker threads and not requiring redis/mongo/some other store. But was then a bit confused by the configuration method when it came to trying it out.

I have use cases for different types of jobs:

  1. recurring jobs that operate like a cron
  2. one off, super rare, but long-running tasks that might never be run in the lifetime of the process
  3. jobs that are generated dynamically and need to be run at a specific time
  • 1 seems to be the main use case for Bree, but in my case the use case is smallest
  • 2 is my main use case - I can sort of see how I might manage it by not calling bree.start() and only calling bree.run('task') if the long running task is needed, but that feels like I'm not using the tool properly
  • 3 is a nice to have - already have code doing this, but unless I'm missing something there's no way to achieve it with Bree - except approximation with a cron running every minute to check

Given the rareness of the one-off jobs, it's a shame to have to declare them upfront, rather than being able to add them if and when they show up - otherwise there's overhead for no good reason.

With type 3 specifically, I find it odd that Bree has support for setting an exact date when a job should run, but that can only be set on instantiation?

I guess I'm looking for bree.add({jobConfig}) or bree.run({jobConfig}) - or am I massively missing something?

ErisDS changed the title from "Rare / one off jobs" to "Rare one off jobs & dynamic scheduled jobs" on Jul 29, 2020
@niftylettuce
Contributor

niftylettuce commented Aug 1, 2020

Hi @ErisDS and thanks for writing in! 👋 🎉

I really want to see these cases be handled by Bree too (I have them for myself on some projects), and I'm still debating on the best approach for this. One approach would be to add IPC, pub/sub, or have a Bree job that queries for these one off jobs on a particular interval if real-time is not required, and in turn you could have Bree inside of Bree basically (though you may just want to use the modules that Bree internally uses to some degree if you don't want the overhead of Bree itself - which isn't much though).

One of the main reasons why I wrote Bree was because of the bad practices (other libraries) led developers towards. I think abstracting jobs as needed into barebones approach where you are in control of everything, and each job is completely independent and as lightweight as possible - is the best approach.

For (2) and (3), you could achieve these with pub/sub inside a long running Bree task or some other approach that isn't real time (like you said, a query every so often, e.g. one minute). You could then spawn Bree inside of this to fire off these jobs as needed.

I am totally open to your thoughts on what an API for this might look like with Bree, or if you do find an approach that works, if you'd like to share it and we can add it to the README (or an abstraction of it, e.g. if it is confidential). Very open to contributors, PR's, and help from the community!

@niftylettuce
Contributor

One other thing I didn't mention is that Bree exposes all of its config, workers, etc. For example, someone added a new job to a Bree config after it had already started by adding to bree.config.jobs, where const bree = new Bree(...) (e.g. #10), or they listened for a worker and then communicated over Comlink once it had started (e.g. https://github.com/uvcat/uvcat/blob/fb4139b40ceced5c1ac4219588d78ddee2f3fa2b/packages/%40uvcat/plugin-worker/index.js#L29-L31).

@ErisDS
Author

ErisDS commented Aug 4, 2020

As per #19 using the config object would skip validation etc, and feels very much like a hack 😬

Where I am at is assessing various libraries looking for the right fit. I'm super excited by what this library promises and wanted to share my use cases. I'd love to say I've got time to commit to helping move the library along as I realise it's brand new, but realistically I can't.

There is also the lack of polyfill for threads for < Node.js 12 that makes this a no-go for us until April next year, but perhaps in that time the library will take off, mature and become the perfect fit 😄

In the meantime, let me just share that the env I'm working in is one of multiple services (inside a single app), each of which would manage its own jobs. So in this architecture having a single location for job files doesn't fit (I realise that is configurable) and similarly, I want each service to be able to "register" its jobs as and when it sees fit.

Maybe that's some useful food for thought? Maybe it's just more evidence that this library isn't for me and that's totally OK too.

Happy to discuss further, but also happy to close this for now & not clog up your nice tidy repo!

@niftylettuce
Contributor

niftylettuce commented Aug 4, 2020

Would these three changes suffice for you @ErisDS to be able to use this?

  1. Make a new addJob method that re-uses the exact same validation logic for adding a job as when one initializes a new Bree instance. You would use this instead of the Passing data #19 hack.
  2. Polyfill for threads could simply be to use child_process's spawn method to spawn a child (albeit you would lose worker data communication). We could document how to do graceful reloads and listen for SIGINT etc (as we do with @ladjs/graceful)
  3. An example in the README for using sockets, or redis pubsub to communicate with Bree to add new jobs (?)

@niftylettuce
Contributor

Actually, regarding (2) in my previous message, perhaps we could use https://github.com/chjj/bthreads as a polyfill, unless you knew of a better one?

@shadowgate15
Member

shadowgate15 commented Aug 4, 2020

@niftylettuce I do wonder if adding an addJob functionality would be a good feature. thoughts?

@niftylettuce
Contributor

@shadowgate15 I think we should 100% add it, since it is a common use case and the hack is not the best approach. There could also be a .remove method too (we should probably call it .add to keep the API simple and similar to run).

@niftylettuce
Contributor

v1.1.24 released and now has support for add and remove methods (examples have been added to the README too), thanks @shadowgate15 for all your hard work here.

Next I think we just need to polyfill worker threads, and add some IPC/pubsub/socket examples.

@dilizarov

@niftylettuce thanks for the speedy updates. My question:

Let's say I need to dynamically run a job whenever I get a webhook from a 3rd party service with specific data.

Is it fine to call bree.add(jobInWebhookWithCustomDataInWorkerData) every time the webhook happens?

This job would only run once. Or is it possible to rerun a job with new worker data? I fear bree.add makes me feel like if the webhook gets triggered 10 times, then I'll have created 10 bree jobs that only run once. Would I have to clean each of these jobs myself?

My idea would be to have a singleton Bree instance that houses all of my jobs and even runs my dynamic one-off jobs.

The scheduled, recurring jobs without custom parameters per recurrence are wonderfully documented in this library.

If we had an example of the best way to run dynamic one-off jobs, that'd be fantastic. Right now, I imagine the answer is to run the job via bree.add, but then I need to make sure that the job runs ASAP and gets removed once complete.

@niftylettuce
Contributor

If you just give it a unique name, it should be fine, and also you will need to add bree.run(someJobObjectWithUniqueName). Make sure to pass a path each time (this can be static/the same, it's just the name that needs to be unique, and could be an ID or something, whatever you choose).
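
A minimal sketch of the unique-name approach described above, with cleanup once the job has finished. It assumes `bree` is a Bree instance exposing the add, run and remove methods mentioned in this thread, and that it emits a 'worker deleted' event carrying the job name when a worker exits (an assumption to check against the README for your version). `runOneOff`, `baseName` and the `workerData` payload are hypothetical names used only for illustration.

```javascript
const { randomUUID } = require('crypto');

// Queue a one-off job under a unique name and remove it after its worker exits.
// `bree` is passed in, so any object with add/run/remove/on works here.
async function runOneOff(bree, baseName, path, workerData) {
  const name = `${baseName}-${randomUUID()}`;

  // Assumption: 'worker deleted' fires with the job name once the worker is gone.
  const onDeleted = (deletedName) => {
    if (deletedName !== name) return;
    bree.removeListener('worker deleted', onDeleted);
    // remove() may be sync or async depending on the version; handle both.
    Promise.resolve(bree.remove(name)).catch((err) => console.error(err));
  };
  bree.on('worker deleted', onDeleted);

  // `await` is harmless if add/run are synchronous in your version.
  await bree.add({ name, path, worker: { workerData } });
  await bree.run(name);
  return name;
}
```

Called from a webhook handler, this could look like `await runOneOff(bree, 'webhook', '/path/to/job.js', { payload })`, so ten webhooks create ten short-lived jobs that are each removed after they finish.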

@niftylettuce
Contributor

Either myself or @shadowgate15 will add a section to README for Dynamic jobs.

@dilizarov

dilizarov commented Aug 5, 2020

Ah, so for every dynamic event, I'll want it unique?

Something like the following in my webhook:

const uniqueJobName = `jobName-${uuid.v4()}`
bree.add(jobConfigWithUniqueJobName)
bree.run(uniqueJobName)

And I shouldn't run into any problems with having a ton of dynamic jobs spun up and in the Bree instance without them being cleared after they are run?

I look forward to the section in the README!

Super excited for this library :)

@niftylettuce
Contributor

v1.1.25 is now released with support for adding single jobs (instead of having to add an Array when you call bree.add).

https://github.com/breejs/bree/releases/tag/v1.1.25

@niftylettuce
Contributor

Also I just wanted to share, it is generally against best practice to use IPC/websockets/pubsub to queue dynamic jobs in real-time. Your process could be interrupted at any time, even due to reasons out of your control, and the data would be lost. I would recommend keeping dynamic jobs actually stored in a queue, with a persistent database, and then have a job that runs every so often to flush the queue (with limited concurrency). We will still find time to document these examples and also answer your questions soon, along with adding polyfill for workers for you @ErisDS.

@dilizarov

@niftylettuce Are you suggesting that for dynamic jobs we want to run immediately, we should place them in a persistent DB and simply have Bree poll for the job like... every few milliseconds or something? I understand using the persistent DB for dynamic jobs that are allowed to be sent down the line, but for sending notifications to users for instance, I want them notified ASAP, but I want to leverage jobs to get it done.

@niftylettuce
Contributor

@dilizarov Yes I am suggesting that. You could just have an interval that polls and locks from the queue every second.

@dilizarov

@niftylettuce Got it. I'm assuming another good thing about using a persistent DB is it helps architecturally as far as building a history of notifications goes (for UX purposes), but also I don't have to pass any data to the worker for Bree as ideally the data I persist into the DB should suffice.

I will say though... the idea of polling a Postgres DB every second gets me a little antsy, BUT I'm also kinda new to this stuff so I guess maybe that's the norm and DBs were made for these reasons anyways 🤷‍♂️

@niftylettuce
Contributor

I mean you could poll every 2 seconds, people really won't know the difference, and chances are there are going to be other delays that are out of your control. Just keep it simple, don't stress yourself out, and do things that don't scale.

@dilizarov

@niftylettuce Just for anyone else who ends up on this thread, one could also leverage hooks in their ORM. For instance, with Sequelize ... when you create your model, you can leverage the afterCreate hook to kick off a job that'll process that newly persisted record. Obviously, if things go to shit, I imagine you can get retries going with that job.

@niftylettuce
Contributor

v3.0.0 of Bree is released with support for Node v10+ and browsers.

See the updated README at https://github.com/breejs/bree#readme.

https://github.com/breejs/bree/releases/tag/v3.0.0

@niftylettuce
Contributor

Also @ErisDS note that we have a bree.add and bree.remove method. The bree.add method accepts a string or object as you requested.

@ghost

ghost commented Aug 31, 2020

Is there an example for the discussion on dynamic jobs? I want to have a polling job that checks the db for some data and, if that data meets a condition, triggers another job that can be long-running to process that data. The polling job would need to know that the long-running job is still processing the data. Once that job exits, the polling job would continue to poll the db for new data and again trigger that processing job if the data meets that condition.

@niftylettuce
Contributor

Just make it so you set values in your database when the persistent job needs to queue another job, and then have another job looking for that. I think you're overcomplicating it though. 99% of jobs I've seen don't require such complexity, unless you're building rockets.

https://github.com/breejs/express-example

@ErisDS
Author

ErisDS commented Aug 31, 2020

TBH I have the same questions & I'm definitely not building rockets.

I'm still struggling to understand how to use Bree to fit my usecase. It feels like I have to make my use cases fit Bree. More examples would be fantastic.

Here's a hopefully clear, not rockets use case: A user uploads a huge file to an API, and it needs background processing.

@niftylettuce
Contributor

@ErisDS the file needs to be processed in the background after the upload completes? Is it written to disk somewhere? Stored on S3 bucket? Tmp dir? Can you just write its location to a database "BackgroundQueue" and then once it's complete, remove it? Write a job that polls this db once a minute, or every second, depending on how frequently these are uploaded and how fast they need to be processed, and lock the specific files. If it takes X seconds long, or if it fails, you can implement your own retry logic. Did you need to update the front-end too once these background processes are finished? Socket.io? A simple XHR polling client-side to check against an API endpoint once the job is finished?

@niftylettuce
Contributor

To be clear: For dynamic jobs, all you really need (at least to me, from what you've all shared), is to create a persistent database table with Mongo or SQL (your choice), store some info about what is dynamic about the job, and then have a job polling against this database and locking these jobs (only query for jobs that are not yet locked of course). You could also implement logic to not fetch a job if there's a count > 0 of a job already locked. You can have full fine-grain control with this approach. Jobs run faster this way, less overhead with some broken queue mechanisms like you'd find in Bull or Agenda, and way less complexity.
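
The poll-and-lock idea above can be sketched roughly as follows. This is an illustration only: a plain in-memory array stands in for the persistent "BackgroundQueue" table, and the names `claimJobs`, `poll` and `complete` are hypothetical. With a real database the claim step would need to be a single atomic update so that two pollers cannot lock the same row.

```javascript
// Claim up to `limit` unlocked rows and mark them locked so later polls skip them.
function claimJobs(table, limit) {
  const claimed = [];
  for (const row of table) {
    if (claimed.length >= limit) break;
    if (!row.locked) {
      row.locked = true;
      claimed.push(row);
    }
  }
  return claimed;
}

// One polling pass: fetch nothing while previously claimed work is still locked.
function poll(table, limit) {
  const inFlight = table.filter((row) => row.locked).length;
  if (inFlight > 0) return [];
  return claimJobs(table, limit);
}

// Remove a row once its work is finished, as suggested earlier in the thread.
function complete(table, id) {
  const index = table.findIndex((row) => row.id === id);
  if (index !== -1) table.splice(index, 1);
}

// Example pass over a stand-in table.
const queue = [
  { id: 1, file: 'import-a.csv', locked: false },
  { id: 2, file: 'import-b.csv', locked: false },
  { id: 3, file: 'import-c.csv', locked: false }
];
const batch = poll(queue, 2);
console.log(batch.map((row) => row.id)); // [ 1, 2 ]
```

A scheduled job running every second or every minute would call `poll`, process the claimed rows, and call `complete` for each one. Retry logic could reset `locked` to false on failure instead of removing the row.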

@niftylettuce
Contributor

If either of you gives me a very specific example, or provides more detail (e.g. you're writing to tmp dir and then you need to do XYZ with the job, e.g. compress the SVG or whatever) let me know. I'd be happy to help write your job for you so you have a clearer understanding. I'd also need to know if you're using SQL, Mongoose, Postgres, whatever - e.g. Bookshelf, Knex, etc.

@niftylettuce
Contributor

Do the job examples here help at all? https://github.com/forwardemail/forwardemail.net/tree/master/jobs -- specifically this one shows how to do concurrency @ https://github.com/forwardemail/forwardemail.net/blob/master/jobs/check-domains.js

@ErisDS
Author

ErisDS commented Aug 31, 2020

My use case is a data importer. A file is uploaded and stored to disk & processed later to import into the DB (but the job is written, it's calling it that's the issue). The importer will be used 0-10 times at the very start of the application's lifespan and then likely never again. So it doesn't make any sense to me to have code that polls for the rest of the application's life - which is hopefully years / til the end of the internet - for something that will almost certainly never happen.

I'm looking for true one-off jobs.

@ErisDS
Author

ErisDS commented Aug 31, 2020

A second use case is having job 1 that handles sending huge bulk emails in batches. And then job 2 that polls for resulting delivery events and processes those. There's no point running job 1 unless the application is configured to send emails, which it may not be, and there's no point running job 2 until an email has been sent.

All of this comes from being a decentralised app with 100s of 1000s of installs - not a single centralised application.

I'm also interested in strategies for handling batch jobs, where there may be 100 jobs and then something extra has to happen when they all complete. In some job libraries they have specific handling for this e.g. sidekiq where the last child triggers an event/callback, and in other systems there's a parent job that monitors the batch jobs.
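
The "last child triggers a callback" strategy mentioned above can be sketched as a small tracker. This is not a Bree feature; `createBatch` and its methods are hypothetical names, and the caller is assumed to report each job's completion (for example from whatever event or message signals that a worker has finished).

```javascript
// Track a fixed set of job names and fire a callback when the last one completes.
function createBatch(jobNames, onAllComplete) {
  const pending = new Set(jobNames);
  const results = {};

  return {
    // Record one job's completion; returns false for unknown or repeated names.
    complete(name, result) {
      if (!pending.delete(name)) return false;
      results[name] = result;
      if (pending.size === 0) onAllComplete(results);
      return true;
    },
    // Number of jobs that have not reported completion yet.
    remaining() {
      return pending.size;
    }
  };
}

// Example: three batch jobs, with a summary step once all have reported in.
const emailBatch = createBatch(['batch-1', 'batch-2', 'batch-3'], (results) => {
  console.log('all batches finished', results);
});
emailBatch.complete('batch-1', { sent: 100 });
emailBatch.complete('batch-2', { sent: 100 });
emailBatch.complete('batch-3', { sent: 42 });
```

The tracker lives in memory, so it shares the durability concern raised earlier in the thread. Persisting the pending set would be needed if the process can restart mid-batch.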

I'll check out those examples in more detail shortly.

@NicolasGorga

Hello, I've just started using Bree and integrated it with a room booking request app for events that I'm developing for my university final project. I have two use cases: one I had no trouble solving with Bree, and another that I cannot see how to solve with it.

Use Case 1: Send push notifications to users of my app every day, at a fixed hour of the day. I could solve it with no problem with Bree: when I initialize Bree I create a job that is configured to run at a specific hour of the day, based on a variable in my .env

Use Case 2: Users make room booking requests for events they are going to have. These requests can be accepted or declined after a prioritization algorithm is run, 5 minutes before the start of the event, and a push notification is sent to the device informing the user of the resolution. My idea was to add a job when the user creates the booking request and configure it to run in the future, 5 minutes before the start of the event, so that it would execute the algorithm and inform the user.

Example: if the event that a user creates a booking request for is starting in 1 hour, then I would configure the job to run 55 minutes from the current time. The problem is that, after reading the docs, it is stated that after using bree.add({name: myJob, date: eventStart - 5 minutes}), I have to use bree.run(), but I don't want to run it when the user makes the request, as stated previously, but exactly at the moment I configured in the date property.

Is there any workaround? The project is due in a few days and this is the main problem I'm struggling with now. Please help, I'd appreciate it!

@shadowgate15
Member

'bree.run()' only starts the job. Putting 'date' in the config will set the job to run at that date.

@NicolasGorga

Hey @shadowgate15 thanks for your quick response!

When adding the job, I'm setting the 'date', but the thing is it doesn't execute it. I've just tested it, by executing this code when a specific endpoint is hit for testing purposes:

bree.add({
  name: 'testAdd',
  date: new Date(2022, 2, 7, 21, 30, 0)
});

I hit the endpoint at 21:28 and waited two minutes as I'm writing this, but nothing happened. Am I doing something wrong or is this not supported? It would be a shame if it weren't :(

@shadowgate15
Member

date is interpreted by default as local time. So if your time and server time are not the same that could be why it didn't fire. Maybe try that?
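
To illustrate the local-time point with plain JavaScript (independent of Bree): the multi-argument Date constructor uses the machine's local time zone and a zero-based month, so the call in the snippet above means 7 March 2022 at 21:30 local time. The comparison below is a small sketch; which form is appropriate depends on where the server runs.

```javascript
// Month index 2 is March (months are zero-based); fields are read as local time.
const localDate = new Date(2022, 2, 7, 21, 30, 0);

// The same wall-clock fields interpreted as UTC instead.
const utcDate = new Date(Date.UTC(2022, 2, 7, 21, 30, 0));

console.log(localDate.getMonth());  // 2 (March)
console.log(utcDate.toISOString()); // 2022-03-07T21:30:00.000Z

// The two instants differ by the local UTC offset (0 only when the server runs in UTC).
const differenceMinutes = (localDate.getTime() - utcDate.getTime()) / 60000;
console.log(differenceMinutes === localDate.getTimezoneOffset()); // true
```

Logging `localDate.toISOString()` on the server is a quick way to see which instant a given constructor call actually produces there.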

@shadowgate15
Member

Also you should use bree.start(JOB_NAME) instead of bree.run that will actually schedule it to run.

Also if the date is in the past it will not run.
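
Putting the two notes above together, a hedged sketch of the booking use case might look like this. It assumes `bree` exposes add and start as described in this thread; `scheduleBeforeEvent`, the job name and the path are hypothetical. The function rejects dates that are already past, since such a job would not run.

```javascript
const FIVE_MINUTES_MS = 5 * 60 * 1000;

// Add a job dated five minutes before `eventStart`, then schedule it with start().
// `now` is injectable so the behaviour can be checked without waiting.
async function scheduleBeforeEvent(bree, name, path, eventStart, now = new Date()) {
  const runAt = new Date(eventStart.getTime() - FIVE_MINUTES_MS);

  // A date in the past will never fire, so fail loudly instead of silently.
  if (runAt.getTime() <= now.getTime()) {
    throw new RangeError(`run time ${runAt.toISOString()} is not in the future`);
  }

  // `await` is harmless if add/start are synchronous in your version.
  await bree.add({ name, path, date: runAt });
  await bree.start(name); // schedules the job for `date`; run() would start it immediately
  return runAt;
}
```

For an event starting in one hour, this schedules the job 55 minutes from now. Because the schedule lives in the running process, it would need to be re-created from stored bookings after a restart.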

@NicolasGorga

@shadowgate15 you were right about the time, I already fixed that and added a console log to confirm it is now right!

The date is in the future, a couple of minutes, but it still isn't executing for some reason, neither when I use add and then start, nor when I skip add and pass it directly on initialization of Bree. Have you had problems with the date property?

@shadowgate15
Member

Try running it with NODE_ENV=* and see what the debug log says.

@NicolasGorga

Hey @shadowgate15 I ended up changing it to cron and it works!

I'm now having a different issue, where the worker seems to be exiting right away after doing a console log. I will keep running some tests and tomorrow I'll hit you up if I can't solve it by myself.

I really appreciate your help, since my work is due next Monday!

@NicolasGorga

Hey @shadowgate15 how are you?

Sorry to bother you, but I couldn't resolve the second issue I was having, where I told you that the worker didn't seem to be waiting for the resolution of an async operation I await.

In the worker I make a call to a Service I created that calls a DAO which returns a list of push tokens. These operations are asynchronous so I await them in my worker. This works perfectly when I don't use the worker to execute it, but with the worker it enters the function, but skips the code after and exits.

To simplify things for the question i changed the worker to look like this:

(async () => {
  console.log("worker started");

  const wait = (ms: number) => new Promise(res => setTimeout(res, ms))

  console.log('before async task');
  await wait(5000);
  console.log('5 seconds after');

  process.exit(0);
})()

What isn't executing is the last console log, reflecting what happens in my real worker, where all the code after the await keyword gets skipped. I've searched lots of things on stackoverflow and blogs, but found nothing that lets me understand this situation.

Please could you give me some clarity as to what i'm doing wrong?

@shadowgate15
Member

That should be an async function

@NicolasGorga

@shadowgate15 I tried this in the dummy function I sent you and it worked, although I don't understand the difference from what I wrote, since I'm using the await keyword for the async operation and all the other operations are synchronous.

Now applying this approach to my real job doesn't work, it fails just as before. My real job:

(async () => {
  console.log("In worker");

  const pushTokens = await userService.getPushTokens();

  //nothing executes after the previous await
  console.log("push tokens", pushTokens);

  if (pushTokens.length > 0) {
    console.log("need to send push");

    const messages: ExpoPushMessage[] = pushTokens.map((pt) => ({
      to: pt,
      title: "December Rooms",
      body: "Vas a ir a la oficina hoy?",
      data: {
        screenName: "ConfirmAssist",
      },
    }));

    console.log("sending push", messages);

    sendNotifications(messages);

    console.log("al push sent");
  } else {
    console.log("no push to send");
  }

  process.exit(0);
})();

As you said with the dummy, I tried wrapping everything apart from the process.exit(0) in an async function, awaiting it and then doing process.exit(), but it didn't work. This is confusing.
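
The thread does not resolve this question, so the following is only a sketch of things that may be worth trying, not a diagnosis. In the job above, sendNotifications(messages) is not awaited, so process.exit(0) can run before it finishes. Output logged inside a worker thread is also forwarded to the main thread asynchronously, so lines printed just before process.exit may not appear. The sketch awaits every step, catches errors so they are visible, and signals completion by posting a 'done' message to the parent instead of exiting; that Bree treats this message as completion is an assumption to verify against its README. `runJob`, `signalDone` and the injected `getPushTokens` and `sendNotifications` functions are hypothetical stand-ins for the real services.

```javascript
const { parentPort } = require('worker_threads');

// Fetch tokens and send notifications, awaiting each asynchronous step.
async function runJob({ getPushTokens, sendNotifications }) {
  const pushTokens = await getPushTokens();
  if (pushTokens.length === 0) return 0;

  const messages = pushTokens.map((to) => ({ to, title: 'December Rooms' }));
  await sendNotifications(messages); // wait for delivery before finishing
  return messages.length;
}

// Tell the parent we are finished rather than exiting abruptly.
// Assumption: the scheduler treats a 'done' message as job completion.
function signalDone(port) {
  if (port) port.postMessage('done');
  else process.exitCode = 0; // not in a worker thread: let the process end naturally
}

// Entry point: surface errors instead of letting the worker exit silently.
async function main(deps, port = parentPort) {
  try {
    const sent = await runJob(deps);
    console.log('notifications sent:', sent);
    signalDone(port);
  } catch (err) {
    console.error('job failed:', err);
    process.exitCode = 1;
  }
}
```

In the real worker file this would be invoked as `main({ getPushTokens: () => userService.getPushTokens(), sendNotifications })`. If the catch block logs an error, that points at the service call rather than at the scheduler.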


5 participants