-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
importccl: add experimental_save_rejected option to CSV IMPORT #42112
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 8 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dt, @miretskiy, and @spaskob)
pkg/ccl/importccl/import_processor_test.go, line 140 at r1 (raw file):
defer close(kvCh) if err := conv.readFiles(ctx, testCase.inputs, converterSpec.Format, externalStorage); err != nil { t.Fatalf("Conversion failed: %v", err)
If only this worked! Sadly you can't t.Fatal
across goroutines -- you need to use something to pass the error back to the main goroutine and Fatal
it there, either the way it was using a group before or with a channel (or var if you can ensure ordered access otherwise). I suspect the group (the way it was) is the easiest.
pkg/ccl/importccl/import_stmt.go, line 239 at r1 (raw file):
format.Format = roachpb.IOFileFormat_CSV // Set the default CSV separator for the cases when it is not overwritten. format.Csv.Comma = ','
curious if something specific motivated this? I don't recall exactly why but for some reason I thought we liked using the zero value for "let the decoder pick the default". That said, I don't feel strongly either way.
pkg/ccl/importccl/read_import_base.go, line 167 at r1 (raw file):
var rejected chan string if (format.Format == roachpb.IOFileFormat_CSV && format.Csv.SaveRejected) ||
I'd rather add a format.SaveRejected
field, that is independent of the file format (like format.Compression), so we don't need to keep adding cases here.
pkg/ccl/importccl/read_import_csv.go, line 73 at r1 (raw file):
func (c *csvInputReader) start(group ctxgroup.Group) { group.GoCtx(func(ctx context.Context) error {
weak preference: I'd move these changes, to the worker pool setup/mgmt, to a standalone commit before the rejected support stuff, since I think that'd make it much easier, below, to see the logic changes in loop for the rejected diff.
pkg/ccl/importccl/read_import_csv.go, line 234 at r1 (raw file):
log.Error(ctx, err) batch.rejected <- strings.Join(record, string(c.opts.Comma)) + "\n" continue OUTER
nit: on first read I thought this was skipping the remainder of the rows in the batch, since I read "for.. record :=" as "record" being a synonym for "row", when it is actually, I think, "field". renaming the loop var and/or commenting at the label about what it means might help.
pkg/roachpb/io-formats.proto, line 58 at r1 (raw file):
optional bool strict_quotes = 5 [(gogoproto.nullable) = false]; // If true, don't abort on failures but instead save the offending row and keep on. optional bool save_rejected = 6 [(gogoproto.nullable) = false];
see above: now that we're supporting multiple formats, I really think this should move out a level and be a common flag.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dt, @miretskiy, and @spaskob)
pkg/ccl/importccl/import_processor_test.go, line 140 at r1 (raw file):
Previously, dt (David Taylor) wrote…
If only this worked! Sadly you can't
t.Fatal
across goroutines -- you need to use something to pass the error back to the main goroutine andFatal
it there, either the way it was using a group before or with a channel (or var if you can ensure ordered access otherwise). I suspect the group (the way it was) is the easiest.
Will do
pkg/ccl/importccl/import_stmt.go, line 239 at r1 (raw file):
Previously, dt (David Taylor) wrote…
curious if something specific motivated this? I don't recall exactly why but for some reason I thought we liked using the zero value for "let the decoder pick the default". That said, I don't feel strongly either way.
Will do
pkg/ccl/importccl/read_import_base.go, line 167 at r1 (raw file):
Previously, dt (David Taylor) wrote…
I'd rather add a
format.SaveRejected
field, that is independent of the file format (like format.Compression), so we don't need to keep adding cases here.
I agree, this would require a fairly substantial refactor to do right across a bunch of shared options across modes. And also re-thinking the UX - do we need different names in DELIMITED and CSV for the fields separators?
Since this is a backport PR, should we leave this for a future PR?
pkg/ccl/importccl/read_import_csv.go, line 73 at r1 (raw file):
Previously, dt (David Taylor) wrote…
weak preference: I'd move these changes, to the worker pool setup/mgmt, to a standalone commit before the rejected support stuff, since I think that'd make it much easier, below, to see the logic changes in loop for the rejected diff.
Will do
pkg/ccl/importccl/read_import_csv.go, line 234 at r1 (raw file):
Previously, dt (David Taylor) wrote…
nit: on first read I thought this was skipping the remainder of the rows in the batch, since I read "for.. record :=" as "record" being a synonym for "row", when it is actually, I think, "field". renaming the loop var and/or commenting at the label about what it means might help.
Will do
pkg/roachpb/io-formats.proto, line 58 at r1 (raw file):
Previously, dt (David Taylor) wrote…
see above: now that we're supporting multiple formats, I really think this should move out a level and be a common flag.
ditto
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dt, @miretskiy, and @spaskob)
pkg/ccl/importccl/read_import_base.go, line 167 at r1 (raw file):
Previously, spaskob (Spas Bojanov) wrote…
I agree, this would require a fairly substantial refactor to do right across a bunch of shared options across modes. And also re-thinking the UX - do we need different names in DELIMITED and CSV for the fields separators?
Since this is a backport PR, should we leave this for a future PR?
we're adding a new field to the photo in this change though, so I think we can add it where we want it to live in the future, even if for now the only write to that field is still just the one you have above in the CSV case in import_stmt.go -- I don't think that significantly changes the scope of the changes for back ports, but it does future-proof the proto for when we generalize later, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dt, @miretskiy, and @spaskob)
pkg/ccl/importccl/read_import_base.go, line 167 at r1 (raw file):
Previously, dt (David Taylor) wrote…
we're adding a new field to the photo in this change though, so I think we can add it where we want it to live in the future, even if for now the only write to that field is still just the one you have above in the CSV case in import_stmt.go -- I don't think that significantly changes the scope of the changes for back ports, but it does future-proof the proto for when we generalize later, no?
that is true in read_import_base.go however each individual reader gets passed only its options, see https://github.com/cockroachdb/cockroach/blob/master/pkg/ccl/importccl/read_import_csv.go#L35
so adding global option will require refactoring of the structs that are passed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dt, @miretskiy, and @spaskob)
pkg/ccl/importccl/read_import_base.go, line 167 at r1 (raw file):
Previously, spaskob (Spas Bojanov) wrote…
that is true in read_import_base.go however each individual reader gets passed only its options, see https://github.com/cockroachdb/cockroach/blob/master/pkg/ccl/importccl/read_import_csv.go#L35
so adding global option will require refactoring of the structs that are passed.
That's true, though I think until we do broader refactoring, we could just condition on rejected != nil
so the only place we need to check the options is in base, when deciding to initialize the chan, and then the chan pointer arg is already passed into the read func, so the read func doesn't really need to go back to the options at that point?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dt, @miretskiy, and @spaskob)
pkg/ccl/importccl/import_processor_test.go, line 140 at r1 (raw file):
Previously, spaskob (Spas Bojanov) wrote…
Will do
Done
pkg/ccl/importccl/import_stmt.go, line 239 at r1 (raw file):
Previously, spaskob (Spas Bojanov) wrote…
Will do
The problem is that we need to recreate the the line for rejected and for that we are using the separator.
pkg/ccl/importccl/read_import_base.go, line 167 at r1 (raw file):
Previously, dt (David Taylor) wrote…
That's true, though I think until we do broader refactoring, we could just condition on
rejected != nil
so the only place we need to check the options is in base, when deciding to initialize the chan, and then the chan pointer arg is already passed into the read func, so the read func doesn't really need to go back to the options at that point?
alright
pkg/ccl/importccl/read_import_csv.go, line 73 at r1 (raw file):
Previously, spaskob (Spas Bojanov) wrote…
Will do
I'd leave them if you don't mind
pkg/ccl/importccl/read_import_csv.go, line 234 at r1 (raw file):
Previously, spaskob (Spas Bojanov) wrote…
Will do
done
pkg/roachpb/io-formats.proto, line 58 at r1 (raw file):
Previously, spaskob (Spas Bojanov) wrote…
ditto
Done.
PTAL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dt, @miretskiy, and @spaskob)
pkg/ccl/importccl/read_import_csv.go, line 73 at r1 (raw file):
Previously, spaskob (Spas Bojanov) wrote…
I'd leave them if you don't mind
👍
pkg/ccl/importccl/read_import_csv.go, line 111 at r2 (raw file):
cr.Comment = c.opts.Comment c.recordCh = make(chan csvRecord)
nit: I'd have made this a local in this func instead of modifying state on c
pkg/ccl/importccl/read_import_csv.go, line 125 at r2 (raw file):
defer tracing.FinishSpan(span) return ctxgroup.GroupWorkers(ctx, c.parallelism, func(ctx context.Context) error { return c.convertRecordWorker(ctx)
IMO it'd be slightly cleaner to have convertRecordWroker
take recordCh
and rejected
as arguments than to put them on c
/in the batches, but I'm fine with it this way particular for now if it minimizes scope of patch for backports.
bors r+ |
This was promised to a client and will be backported to 19.2.1. The feature should stay undocumented for now since the semantics and UX are still not well understood. To make the change work, we had to remove the parsing workers from `conv.start` and move them to `readFile` which means that for each file a separate set of workers will be brought up and down. Also the tally for all total number of rejected rows was moved to `read_import_base.go`. Release note (sql change): add undocumented experimental_save_rejected option to CSV IMPORT.
bors r+ |
Build failed |
bors r+ |
42112: importccl: add experimental_save_rejected option to CSV IMPORT r=spaskob a=spaskob This was promised to a client and will be backported to 19.2.1. The feature should stay undocumented for now since the semantics and UX are still not well understood. To make the change work, we had to remove the parsing workers from `conv.start` and move them to `readFile` which means that for each file a separate set of workers will be brought up and down. Also the tally for all total number of rejected rows was moved to `read_import_base.go`. Touches #40374. Release note (sql change): add undocumented experimental_save_rejected option to CSV IMPORT. Co-authored-by: Spas Bojanov <[email protected]>
Build succeeded |
This was promised to a client and will be backported to 19.2.1.
The feature should stay undocumented for now since the semantics
and UX are still not well understood.
To make the change work, we had to remove the parsing workers from
conv.start
and move them toreadFile
which means that for eachfile a separate set of workers will be brought up and down. Also the
tally for all total number of rejected rows was moved to
read_import_base.go
.Touches #40374.
Release note (sql change): add undocumented experimental_save_rejected
option to CSV IMPORT.