nsqd: data format and/or import #373

Closed
tj opened this issue Jun 13, 2014 · 5 comments

@tj
Contributor

tj commented Jun 13, 2014

So I have a bit of a weird use-case: we have very large jobs that fetch 700k+ files from S3, unpack them, and PUB hundreds of thousands of events per file, and we need each job to be somewhat atomic.

Right now the prototype I have put together works fine, but if a job fails at any point and gets requeued we have a ton of duplicate work to do, and half of it has already been queued.

My plan is/was to write nsqd's .dat files somewhere on disk and then copy them over for nsqd to soak up once the job is complete. That leads to my questions:

a) would you be interested in some sort of import-from-file functionality?
b) is there anything I should watch out for when doing this?
c) any documentation on the binary format? (I'll dig around)

cheers

@tj
Contributor Author

tj commented Jun 14, 2014

Actually, feel free to close this if you want. I might go with persisting the progress somewhere else and just continuing the job from where it left off; that'll be cheaper for us anyway since S3 drops connections pretty frequently.

@dudleycarr
Contributor

One possible solution would be to create a topic for a given file on an "ephemeral" nsqd. Once you've successfully pushed all of the messages for that file to that nsqd, you could start nsq_to_nsq to stream the messages to the production nsqd, where they'd be consumed by your regular set of workers. I haven't tried this approach myself, but I imagine @mreiferson would be able to say if that's sane or not.
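
A rough sketch of the staging step, not something I've run against a real cluster. It assumes the staging nsqd exposes the standard HTTP `/mpub` endpoint (newline-delimited bodies); the host name, topic name, batch size, and helper names are all made up for illustration.

```python
import urllib.parse
import urllib.request


def batches(messages, size):
    """Yield consecutive lists of at most `size` messages."""
    batch = []
    for msg in messages:
        batch.append(msg)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def build_mpub_request(nsqd_http, topic, batch):
    """Build a POST to /mpub; bodies are joined with newlines,
    so an individual body must not contain one."""
    for body in batch:
        if b"\n" in body:
            raise ValueError("message body contains a newline")
    query = urllib.parse.urlencode({"topic": topic})
    url = "%s/mpub?%s" % (nsqd_http.rstrip("/"), query)
    return urllib.request.Request(url, data=b"\n".join(batch), method="POST")


def http_send(request):
    """Send the request; urlopen raises HTTPError on non-2xx responses."""
    with urllib.request.urlopen(request, timeout=10) as resp:
        resp.read()


def stage_file(nsqd_http, topic, messages, batch_size=1000, send=http_send):
    """Publish every message for one file to the staging nsqd.

    Returns the number of messages published. Any exception from `send`
    propagates, so the caller can throw the staging topic away and retry.
    """
    count = 0
    for batch in batches(messages, batch_size):
        send(build_mpub_request(nsqd_http, topic, batch))
        count += len(batch)
    return count
```

Only after `stage_file` returns for a whole file would you start forwarding from the staging nsqd to production, so a failed job never leaks a partial file to the regular workers.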

@mreiferson
Member

@dudleycarr's suggestion is interesting. You wouldn't even need to use nsq_to_nsq if the topic you published to on the "ephemeral" nsqd was one that was already being actively consumed. The ephemeral nsqd would be discovered and consumed, and then you could decommission it.

I think it probably makes sense to track some state for these jobs somewhere. First, it probably makes sense to divide the high-level operation into various sub-tasks. Each sub-task state could be tracked individually. If the various phases of the operation are prone to failure for any number of reasons, it seems like you need to keep track of intermediate progress. NSQ can continue to serve as a transport and work dispatch, but the workers would benefit from this domain specific state to determine whether to redo a given sub-task.
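
To make the sub-task idea concrete, here is a minimal sketch using SQLite as the state store. The table layout, the `SubtaskState` and `run_job` names, and the "pending"/"done" statuses are all hypothetical; any durable store would do.

```python
import sqlite3


class SubtaskState:
    """Tracks per-sub-task status so a requeued job can skip finished work."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS subtasks ("
            " job_id TEXT NOT NULL,"
            " subtask TEXT NOT NULL,"
            " status TEXT NOT NULL,"
            " PRIMARY KEY (job_id, subtask))"
        )

    def status(self, job_id, subtask):
        """Return the recorded status, or "pending" if nothing is recorded."""
        row = self.db.execute(
            "SELECT status FROM subtasks WHERE job_id = ? AND subtask = ?",
            (job_id, subtask),
        ).fetchone()
        return row[0] if row else "pending"

    def mark(self, job_id, subtask, status):
        with self.db:  # commits on success, rolls back on error
            self.db.execute(
                "INSERT OR REPLACE INTO subtasks (job_id, subtask, status)"
                " VALUES (?, ?, ?)",
                (job_id, subtask, status),
            )


def run_job(state, job_id, subtasks, handlers):
    """Run sub-tasks in order, skipping any already marked done.

    If a handler raises, its sub-task is not marked done, so the next
    attempt resumes from that sub-task. Returns the names actually run.
    """
    ran = []
    for name in subtasks:
        if state.status(job_id, name) == "done":
            continue
        handlers[name]()
        state.mark(job_id, name, "done")
        ran.append(name)
    return ran
```

A worker receiving a requeued message would call `run_job` with the same `job_id`, and only the sub-tasks that never completed would be repeated. A sub-task that fails halfway is still redone in full, so the finer the split, the less duplicate work.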

Since you asked, it wouldn't be too hard to produce nsqd data files; the format mostly mirrors the wire format. It would probably be easiest to build them for an "empty" nsqd because you also need to provide a metadata file which "points" to the data files.

Relatedly, #304 was interested in file_to_nsq, so if you do go down this path, perhaps consider those requirements?
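
For a feel of what that involves, here is a sketch that encodes messages using the message layout from the NSQ protocol spec (8-byte nanosecond timestamp, 2-byte attempts, 16-byte message ID, then the body, all big-endian). The 4-byte length prefix per on-disk record is an assumption on my part, and the metadata file is not covered at all; check both against the nsqd source for your version before relying on this.

```python
import struct
import time

MSG_ID_LEN = 16


def encode_message(msg_id, body, timestamp_ns=None, attempts=0):
    """Encode one message: timestamp (int64), attempts (uint16),
    16-byte ID, then the raw body."""
    if len(msg_id) != MSG_ID_LEN:
        raise ValueError("message ID must be exactly 16 bytes")
    if timestamp_ns is None:
        timestamp_ns = time.time_ns()
    return struct.pack(">qH", timestamp_ns, attempts) + msg_id + body


def encode_record(message):
    """ASSUMPTION: an on-disk record is a 4-byte big-endian length
    followed by the encoded message."""
    return struct.pack(">i", len(message)) + message


def decode_records(data):
    """Inverse of the two functions above, for sanity-checking output.

    Yields (timestamp_ns, attempts, msg_id, body) tuples.
    """
    offset = 0
    while offset < len(data):
        (size,) = struct.unpack_from(">i", data, offset)
        offset += 4
        msg = data[offset:offset + size]
        if len(msg) != size:
            raise ValueError("truncated record")
        offset += size
        timestamp_ns, attempts = struct.unpack_from(">qH", msg, 0)
        yield timestamp_ns, attempts, msg[10:26], msg[26:]
```

Round-tripping generated data through `decode_records` only proves the sketch is self-consistent; it says nothing about whether nsqd will accept the file.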

@tj
Contributor Author

tj commented Jun 17, 2014

Ended up just deduping in Redis for now so the job can requeue without any problems. Not a huge deal for this use-case, so I'll close. Thanks guys!
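
For anyone landing here later, a minimal sketch of what deduping on requeue could look like. This is not the code from the job above; the key scheme (file name plus event index), the TTL, and the function names are assumptions. It relies on Redis `SET` with `NX` and `EX`, which redis-py exposes as `client.set(key, value, nx=True, ex=ttl)`. The `InMemoryClient` class is only a stand-in so the sketch runs without a Redis server.

```python
def dedupe_key(file_name, index, prefix="dedupe:"):
    """Key an event by its source file and position in that file, so a
    requeued job derives the same key for the same event."""
    return "%s%s:%d" % (prefix, file_name, index)


def first_time_seen(client, key, ttl_seconds=86400):
    """Return True only for the first caller to claim `key`.

    SET ... NX succeeds only if the key is absent; EX lets old keys expire.
    """
    return bool(client.set(key, b"1", nx=True, ex=ttl_seconds))


class InMemoryClient:
    """Stand-in for a Redis client, for local experiments only.
    It ignores the expiry argument."""

    def __init__(self):
        self.keys = {}

    def set(self, key, value, nx=False, ex=None):
        if nx and key in self.keys:
            return None
        self.keys[key] = value
        return True


def publish_once(client, publish, file_name, events):
    """Publish each event at most once across retries of the same file.

    Returns the number of events actually published on this attempt.
    """
    published = 0
    for index, body in enumerate(events):
        if first_time_seen(client, dedupe_key(file_name, index)):
            publish(body)
            published += 1
    return published
```

One caveat with this ordering: the key is claimed before `publish` runs, so a publish that fails after the claim would be skipped on retry unless the key is deleted in the error path.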

@tj tj closed this as completed Jun 17, 2014
@mreiferson
Member

pragmatic 👍
