-
Notifications
You must be signed in to change notification settings - Fork 2.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
nsq_to_file: support for separate working dir for in-progress files #1110
Conversation
👍 in general (haven't looked at the code yet of course). I've run into these same pain points to varying degrees. A separate working directory would make it easy to monitor for any old files in the working directory and manually recover them (remove corrupt gzip blocks with data that will end up in other files) |
Yep, that's exactly what we're thinking! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 one minor detail
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 squash away. Anything else you want to do here before this lands?
Ahhhhh dang I just realized this change is unsafe as-is and should not be merged! (I'd dismiss the approval if I could.) If an I'm not sure if that explanation makes sense as written, but here's a sort of illustration: /tmp/nsqout
└── foo
├── foo.2018-12-05_22-41+0000.rig-000000.log.gz
├── foo.2018-12-05_22-42+0000.rig-000000.log.gz
├── foo.2018-12-05_22-43+0000.rig-000000.log.gz
└── foo.2018-12-05_22-44+0000.rig-000000.log.gz # <-- moved to output dir on clean restart
/tmp/nsqwork
└── foo
└── foo.2018-12-05_22-44+0000.rig-000000.log.gz # <-- new working file w/ same name, will overwrite file above Note: The current minute is included in these filenames to speed up the testing process; full command line is:
|
Good catch, feel like we should perform a similar check at rename time as we do here (and cycle through revision numbers)? |
Great idea, and great catch. I don't think we can use the exact same filename conflict logic, since this bit is not opening a new file to write to, it's moving to a new name: if f.gzipEnabled {
openFlag |= os.O_EXCL
} else {
openFlag |= os.O_APPEND
}
f.out, err = os.OpenFile(absFilename, openFlag, 0666)
if err != nil {
if os.IsExist(err) {
continue // to next rev name I think there's no |
Heh, I was literally just talking to 🐍 about this offline. I was initially afraid that, because It turns out there's a cool trick for this that was not obvious to me: make a hard link to the destination first, and only if that succeeds delete the source. If the hard link fails, the destination already exists and we need to pick a new file name. |
ah, very cool 👍 |
I think i dislike the "rename on move" approach (or at least relying on that in entirety). I think we should pick initial name based on unused names between both directories. (even though it introduces some race condition issues if you were to run more than one nsq_to_file against the same directory, it should be safe outside of that). I like the property that a filename is "unique" and constant once it's opened. That means that if you kill -9 nsq_to_file you still have a uniquely named file in the working directory to recover and don't have to do extra work to figure out if it was a duplicate file or not where that filename might already appear elsewhere. In practice I think this means a |
After thinking this through a bit, I agree that optimistically checking both dirs is a useful addition. However, I'd like The question is then, what should WDYT @mccutchen @jehiah? |
If we check both directories to pick a unique filename up front in the normal case, i'm ok with the possibility of a rename on move to guarantee a path to resolve conflicts at that time. |
👍 cool, I'll take a pass at this |
Okay, with the latest commits, I think I've implemented the approach described above. Please take another look! There's a bit of awkwardness/duplication in the code (e.g. safely munging file paths, duplicated log messages, not-quite-the-same overlap in functionality when iterating through revs) so all suggestions for improvements are welcome. I'm gonna bang on this locally a bit more to try to ensure I'm not missing any edge cases. |
Meh, sorry for the noise, but I have spotted a bug when choosing a new file name when there's a conflict in the output dir. Note that the minute changed from
|
Okay, I think this change is functionally complete and working as expected, though the code itself is not very pretty. Please take another look when y'all get a chance @jehiah @mreiferson @ploxiln. |
Also, to give myself a bit more confidence in the new behavior here, I spent the weekend over-engineering this jepsen-lite test program to exercise these changes: An example run:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great. Made note of two little things, sort of conversationally, no real need to change them.
apps/nsq_to_file/file_logger.go
Outdated
makeOutputDir(outputFileName) | ||
|
||
_, err := os.Stat(outputFileName) | ||
if err == nil || os.IsExist(err) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would guess that you should never get an error here (including an Exist error). But this seems very safe, and clear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, during normal operation this conditional should never be true.
The os.IsExist
case covers an unlikely but possible scenario (I think!) where some entirely other process adds a file to the final output dir that has the same name as an in-progress file in the working dir, in a sequence of events along these lines:
- Two (or more) nsq_to_file processes are sharing the same working and output dirs
- One process is
kill -9
'd, orphaning a file in the working dir - A separate cleanup process scans the working dir for orphaned files and moves the file into a tmp dir for cleanup/processing
- The killed nsq_to_file process starts back up and picks the same file name as the orphaned file (because it's no longer present in either work dir or output dir)
- The cleanup process finishes processing and naively (or buggily) sticks the orphaned file into the output dir under the same name
- The restarted nsq_to_file process now has to safely pick a new filename at the point where it's trying to move the file from working dir to output dir
I think the sequence above is quite unlikely, but, as you said, I think guarding against it isn't too onerous!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just mean, if the file exists, I suspect the stat will never return an error, it will be the err == nil
case.
} | ||
} | ||
|
||
func atomicRename(src, dst string) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might have called this exclusiveRename()
because the file does exist in two places for a moment, and because it emulates the open(..., O_EXCL)
, but this is also fine.
Hahaha oh, yeah, that sounds about right. 🤦♂️
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, let's squash???????
If -work-dir is given, nsq_to_file will create new files under that directory and only move them to -output-dir upon close, ensuring that only fully finished and synced files appear in the output dir. Work dir will match output dir by default in order to preserve existing behavior out of the box.
Squashed and rebased, but the latter step seems to have angered the code coverage tests. Clicking through to the failing build, I'm not sure the failure has anything to do with these changes. |
it can be noisy, nbd thanks! |
We're finally making moves to use this feature. @mreiferson could we cut a new nsq release to get this included sometime soon? I think the latest release is 1.1.0 on Aug 19, 2018 |
You’re saying you want me to actually touch a computer? |
Will send brisket in exchange. Or the frozen turkeys @mccutchen loves so much. |
The problem
There are a couple of edge cases we currently need to guard against when shipping the archive files generated by
nsq_to_file
to remote storage (e.g. S3).nsq_to_file -gzip
and an unclean exit (e.g.kill -9
). We don't currently have a good workaround for this issue, which causes issues in downstream systems and requires manual intervention to remediate.Proposed change
If
nsq_to_file
used a separate "working dir" for the files to which it is actively writing and only moved files into the final output dir once they were cleanly finished/closed, the problems above would be much easier to address:Implementation details
If
-work-dir
is given,nsq_to_file
will create new files under that directory and only move them to-output-dir
upon close, ensuring that only fully finished and synced files appear in the output dir.By default, the work dir will match the output dir in order to preserve existing behavior out of the box.