nsqd: data format and/or import #373

tj · 2014-06-13T22:12:15Z

So I have a bit of a weird use-case: we have very large jobs that fetch 700k+ files from S3, unpack them, and PUB hundreds of thousands of events per file, and we need each job to be somewhat atomic.

Right now the prototype I have put together works fine but obviously if that job fails at any point and gets requeued we have a ton of duplicate work to do, and half of it has already been queued.

My plan is/was to write NSQD's dat files somewhere on disk and then copy those over for nsq to soak up once the job is complete. So that leads to my questions of:

a) would you be interested in some sort of import-from-file functionality?
b) is there anything I should watch out for when doing this?
c) any documentation on the binary format? (I'll dig around)

cheers

tj · 2014-06-14T00:37:40Z

actually feel free to close this if you want, I might go with persisting the progress somewhere else and just continuing the job from where it left off, that'll be cheaper for us anyway since s3 drops connections pretty frequently

dudleycarr · 2014-06-14T03:31:05Z

One possible solution would be to create a topic for a given file on an "ephemeral" nsqd. Once you've successfully pushed all of the messages for that file to that nsqd, you could start nsq_to_nsq to stream the messages to the production nsqd, where they'd be consumed by your regular set of workers. I haven't tried this approach myself, but I imagine @mreiferson would be able to say if that's sane or not.
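A rough sketch of the staging step, assuming the "ephemeral" nsqd exposes its HTTP API at a hypothetical address such as `127.0.0.1:4151` and that event bodies contain no newlines (the default `/mpub` body is newline-delimited). The function names here are illustrative, not part of NSQ:

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_mpub_request(nsqd_http_addr, topic, bodies):
    """Build a POST to nsqd's /mpub endpoint for a batch of message bodies."""
    for body in bodies:
        if b"\n" in body:
            # Newline-delimited /mpub would split this body into two messages.
            raise ValueError("message body contains a newline")
    url = "http://%s/mpub?%s" % (nsqd_http_addr, urlencode({"topic": topic}))
    return Request(url, data=b"\n".join(bodies), method="POST")


def stage_file_events(nsqd_http_addr, topic, bodies):
    """Publish one file's events to the staging nsqd; True on HTTP 200."""
    with urlopen(build_mpub_request(nsqd_http_addr, topic, bodies)) as resp:
        return resp.status == 200
```

Only after every batch for a file has been staged would you start forwarding (or let consumers discover the topic), so a failed job leaves nothing on the production nsqd.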

mreiferson · 2014-06-14T05:24:58Z

@dudleycarr's suggestion is interesting. You wouldn't even need nsq_to_nsq if the topic you published to on the "ephemeral" nsqd was one that was already being actively consumed. It would be discovered and consumed, and then you could decommission it.

I think it probably makes sense to track some state for these jobs somewhere. First, it probably makes sense to divide the high-level operation into various sub-tasks. Each sub-task state could be tracked individually. If the various phases of the operation are prone to failure for any number of reasons, it seems like you need to keep track of intermediate progress. NSQ can continue to serve as a transport and work dispatch, but the workers would benefit from this domain specific state to determine whether to redo a given sub-task.
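As an illustration of tracking sub-task state outside NSQ, here is a minimal sketch that uses SQLite as the state store. The `SubtaskLedger` class, the table layout, and the choice of SQLite are all assumptions made for the example; any durable store would do:

```python
import sqlite3


class SubtaskLedger:
    """Records which sub-tasks of a job have completed."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS subtasks ("
            "job TEXT, subtask TEXT, state TEXT, PRIMARY KEY (job, subtask))"
        )
        self.db.commit()

    def is_done(self, job, subtask):
        row = self.db.execute(
            "SELECT state FROM subtasks WHERE job = ? AND subtask = ?",
            (job, subtask),
        ).fetchone()
        return row is not None and row[0] == "done"

    def mark_done(self, job, subtask):
        self.db.execute(
            "INSERT OR REPLACE INTO subtasks VALUES (?, ?, 'done')",
            (job, subtask),
        )
        self.db.commit()


def run_job(ledger, job, subtasks, handler):
    """Run only the sub-tasks not yet marked done; return the ones executed."""
    ran = []
    for subtask in subtasks:
        if ledger.is_done(job, subtask):
            continue  # finished in an earlier attempt, skip on requeue
        handler(subtask)
        ledger.mark_done(job, subtask)
        ran.append(subtask)
    return ran
```

A requeued job then calls `run_job` again with the same job name and resumes at the first sub-task that was not recorded as done. A sub-task that crashed after its side effects but before `mark_done` will still be repeated, so handlers should tolerate that.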

Since you asked, it wouldn't be too hard to produce nsqd data files, the format mostly mirrors the wire format. It would probably be easiest to build them for an "empty" nsqd because you also need to provide a metadata file which "points" to the data files.
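For reference, a sketch of encoding a single message in the layout documented for the wire protocol: an 8-byte big-endian nanosecond timestamp, a 2-byte big-endian attempts counter, a 16-byte message ID, then the body. The 4-byte length prefix in `frame_record` is an assumption about how records are delimited on disk, and the metadata file is not covered here; check both against the nsqd source for the version you run before relying on them:

```python
import struct
import time

HEADER = struct.Struct(">qH")  # int64 timestamp (ns) + uint16 attempts
ID_LENGTH = 16


def encode_message(msg_id, body, timestamp_ns=None, attempts=0):
    """Encode one message as timestamp + attempts + 16-byte ID + body."""
    if len(msg_id) != ID_LENGTH:
        raise ValueError("message ID must be exactly 16 bytes")
    if timestamp_ns is None:
        timestamp_ns = time.time_ns()
    return HEADER.pack(timestamp_ns, attempts) + msg_id + body


def decode_message(data):
    """Inverse of encode_message: returns (timestamp_ns, attempts, id, body)."""
    timestamp_ns, attempts = HEADER.unpack(data[:HEADER.size])
    id_end = HEADER.size + ID_LENGTH
    return timestamp_ns, attempts, data[HEADER.size:id_end], data[id_end:]


def frame_record(payload):
    """ASSUMPTION: prefix each on-disk record with a 4-byte big-endian length."""
    return struct.pack(">i", len(payload)) + payload
```

Round-tripping through `decode_message` is a cheap sanity check before handing any generated file to an nsqd instance.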

Relatedly, #304 was interested in file_to_nsq, so if you do go down this path, perhaps consider those requirements?

tj · 2014-06-17T20:46:20Z

ended up just deduping in redis for now so it can requeue without any problems. not a huge deal for this use-case so I'll close, thanks guys!
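The thread does not show how the dedupe was implemented. One common shape, sketched here under assumptions, is to claim a key per event with Redis `SET ... NX` and publish only when the claim succeeds. The `client` argument is assumed to behave like a redis-py client (`set(name, value, nx=True, ex=...)` returning a truthy value only when the key was newly set); the key scheme and the `publish` callable are hypothetical:

```python
def claim_once(client, key, ttl_seconds=86400):
    """Return True only for the first caller to claim `key`."""
    # With nx=True the key is written only if it does not already exist.
    return bool(client.set(key, "1", nx=True, ex=ttl_seconds))


def publish_deduped(client, publish, file_key, events):
    """Publish each event of a file at most once across job retries."""
    sent = 0
    for index, event in enumerate(events):
        # Hypothetical key scheme: one key per (file, event position).
        if claim_once(client, "dedupe:%s:%d" % (file_key, index)):
            publish(event)
            sent += 1
    return sent
```

Claiming before publishing means a crash between the two steps drops that event on retry; claiming after publishing would instead allow an occasional duplicate. Which trade-off is acceptable depends on the workload.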

mreiferson · 2014-06-17T21:01:26Z

pragmatic 👍

mreiferson added the question label Jun 14, 2014

tj closed this as completed Jun 17, 2014
