Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

importccl: add experimental_save_rejected option to CSV IMPORT #42112

Merged
merged 1 commit into from
Nov 5, 2019

Conversation

spaskob
Copy link
Contributor

@spaskob spaskob commented Nov 1, 2019

This was promised to a client and will be backported to 19.2.1.
The feature should stay undocumented for now since the semantics
and UX are still not well understood.

To make the change work, we had to remove the parsing workers from
conv.start and move them to readFile which means that for each
file a separate set of workers will be brought up and down. Also the
tally for all total number of rejected rows was moved to read_import_base.go.

Touches #40374.

Release note (sql change): add undocumented experimental_save_rejected
option to CSV IMPORT.

@spaskob spaskob requested review from dt and miretskiy November 1, 2019 14:48
@spaskob spaskob requested a review from a team as a code owner November 1, 2019 14:48
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@spaskob spaskob changed the title Csv rejected importccl: add experimental_save_rejected option to CSV IMPORT Nov 1, 2019
Copy link
Member

@dt dt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 8 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt, @miretskiy, and @spaskob)


pkg/ccl/importccl/import_processor_test.go, line 140 at r1 (raw file):

					defer close(kvCh)
					if err := conv.readFiles(ctx, testCase.inputs, converterSpec.Format, externalStorage); err != nil {
						t.Fatalf("Conversion failed: %v", err)

If only this worked! Sadly you can't t.Fatal across goroutines -- you need to use something to pass the error back to the main goroutine and Fatal it there, either the way it was using a group before or with a channel (or var if you can ensure ordered access otherwise). I suspect the group (the way it was) is the easiest.


pkg/ccl/importccl/import_stmt.go, line 239 at r1 (raw file):

			format.Format = roachpb.IOFileFormat_CSV
			// Set the default CSV separator for the cases when it is not overwritten.
			format.Csv.Comma = ','

curious if something specific motivated this? I don't recall exactly why but for some reason I thought we liked using the zero value for "let the decoder pick the default". That said, I don't feel strongly either way.


pkg/ccl/importccl/read_import_base.go, line 167 at r1 (raw file):

			var rejected chan string
			if (format.Format == roachpb.IOFileFormat_CSV && format.Csv.SaveRejected) ||

I'd rather add a format.SaveRejected field, that is independent of the file format (like format.Compression), so we don't need to keep adding cases here.


pkg/ccl/importccl/read_import_csv.go, line 73 at r1 (raw file):

func (c *csvInputReader) start(group ctxgroup.Group) {
	group.GoCtx(func(ctx context.Context) error {

weak preference: I'd move these changes, to the worker pool setup/mgmt, to a standalone commit before the rejected support stuff, since I think that'd make it much easier, below, to see the logic changes in loop for the rejected diff.


pkg/ccl/importccl/read_import_csv.go, line 234 at r1 (raw file):

							log.Error(ctx, err)
							batch.rejected <- strings.Join(record, string(c.opts.Comma)) + "\n"
							continue OUTER

nit: on first read I thought this was skipping the remainder of the rows in the batch, since I read "for.. record :=" as "record" being a synonym for "row", when it is actually, I think, "field". renaming the loop var and/or commenting at the label about what it means might help.


pkg/roachpb/io-formats.proto, line 58 at r1 (raw file):

  optional bool strict_quotes = 5 [(gogoproto.nullable) = false];
  // If true, don't abort on failures but instead save the offending row and keep on.
  optional bool save_rejected = 6 [(gogoproto.nullable) = false];

see above: now that we're supporting multiple formats, I really think this should move out a level and be a common flag.

Copy link
Contributor Author

@spaskob spaskob left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt, @miretskiy, and @spaskob)


pkg/ccl/importccl/import_processor_test.go, line 140 at r1 (raw file):

Previously, dt (David Taylor) wrote…

If only this worked! Sadly you can't t.Fatal across goroutines -- you need to use something to pass the error back to the main goroutine and Fatal it there, either the way it was using a group before or with a channel (or var if you can ensure ordered access otherwise). I suspect the group (the way it was) is the easiest.

Will do


pkg/ccl/importccl/import_stmt.go, line 239 at r1 (raw file):

Previously, dt (David Taylor) wrote…

curious if something specific motivated this? I don't recall exactly why but for some reason I thought we liked using the zero value for "let the decoder pick the default". That said, I don't feel strongly either way.

Will do


pkg/ccl/importccl/read_import_base.go, line 167 at r1 (raw file):

Previously, dt (David Taylor) wrote…

I'd rather add a format.SaveRejected field, that is independent of the file format (like format.Compression), so we don't need to keep adding cases here.

I agree, this would require a fairly substantial refactor to do right across a bunch of shared options across modes. And also re-thinking the UX - do we need different names in DELIMITED and CSV for the fields separators?
Since this is a backport PR, should we leave this for a future PR?


pkg/ccl/importccl/read_import_csv.go, line 73 at r1 (raw file):

Previously, dt (David Taylor) wrote…

weak preference: I'd move these changes, to the worker pool setup/mgmt, to a standalone commit before the rejected support stuff, since I think that'd make it much easier, below, to see the logic changes in loop for the rejected diff.

Will do


pkg/ccl/importccl/read_import_csv.go, line 234 at r1 (raw file):

Previously, dt (David Taylor) wrote…

nit: on first read I thought this was skipping the remainder of the rows in the batch, since I read "for.. record :=" as "record" being a synonym for "row", when it is actually, I think, "field". renaming the loop var and/or commenting at the label about what it means might help.

Will do


pkg/roachpb/io-formats.proto, line 58 at r1 (raw file):

Previously, dt (David Taylor) wrote…

see above: now that we're supporting multiple formats, I really think this should move out a level and be a common flag.

ditto

Copy link
Member

@dt dt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt, @miretskiy, and @spaskob)


pkg/ccl/importccl/read_import_base.go, line 167 at r1 (raw file):

Previously, spaskob (Spas Bojanov) wrote…

I agree, this would require a fairly substantial refactor to do right across a bunch of shared options across modes. And also re-thinking the UX - do we need different names in DELIMITED and CSV for the fields separators?
Since this is a backport PR, should we leave this for a future PR?

we're adding a new field to the photo in this change though, so I think we can add it where we want it to live in the future, even if for now the only write to that field is still just the one you have above in the CSV case in import_stmt.go -- I don't think that significantly changes the scope of the changes for back ports, but it does future-proof the proto for when we generalize later, no?

Copy link
Contributor Author

@spaskob spaskob left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt, @miretskiy, and @spaskob)


pkg/ccl/importccl/read_import_base.go, line 167 at r1 (raw file):

Previously, dt (David Taylor) wrote…

we're adding a new field to the photo in this change though, so I think we can add it where we want it to live in the future, even if for now the only write to that field is still just the one you have above in the CSV case in import_stmt.go -- I don't think that significantly changes the scope of the changes for back ports, but it does future-proof the proto for when we generalize later, no?

that is true in read_import_base.go however each individual reader gets passed only its options, see https://github.com/cockroachdb/cockroach/blob/master/pkg/ccl/importccl/read_import_csv.go#L35

so adding global option will require refactoring of the structs that are passed.

Copy link
Member

@dt dt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt, @miretskiy, and @spaskob)


pkg/ccl/importccl/read_import_base.go, line 167 at r1 (raw file):

Previously, spaskob (Spas Bojanov) wrote…

that is true in read_import_base.go however each individual reader gets passed only its options, see https://github.com/cockroachdb/cockroach/blob/master/pkg/ccl/importccl/read_import_csv.go#L35

so adding global option will require refactoring of the structs that are passed.

That's true, though I think until we do broader refactoring, we could just condition on rejected != nil so the only place we need to check the options is in base, when deciding to initialize the chan, and then the chan pointer arg is already passed into the read func, so the read func doesn't really need to go back to the options at that point?

Copy link
Contributor Author

@spaskob spaskob left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt, @miretskiy, and @spaskob)


pkg/ccl/importccl/import_processor_test.go, line 140 at r1 (raw file):

Previously, spaskob (Spas Bojanov) wrote…

Will do

Done


pkg/ccl/importccl/import_stmt.go, line 239 at r1 (raw file):

Previously, spaskob (Spas Bojanov) wrote…

Will do

The problem is that we need to recreate the the line for rejected and for that we are using the separator.


pkg/ccl/importccl/read_import_base.go, line 167 at r1 (raw file):

Previously, dt (David Taylor) wrote…

That's true, though I think until we do broader refactoring, we could just condition on rejected != nil so the only place we need to check the options is in base, when deciding to initialize the chan, and then the chan pointer arg is already passed into the read func, so the read func doesn't really need to go back to the options at that point?

alright


pkg/ccl/importccl/read_import_csv.go, line 73 at r1 (raw file):

Previously, spaskob (Spas Bojanov) wrote…

Will do

I'd leave them if you don't mind


pkg/ccl/importccl/read_import_csv.go, line 234 at r1 (raw file):

Previously, spaskob (Spas Bojanov) wrote…

Will do

done


pkg/roachpb/io-formats.proto, line 58 at r1 (raw file):

Previously, spaskob (Spas Bojanov) wrote…

ditto

Done.

@spaskob
Copy link
Contributor Author

spaskob commented Nov 4, 2019

PTAL

Copy link
Member

@dt dt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt, @miretskiy, and @spaskob)


pkg/ccl/importccl/read_import_csv.go, line 73 at r1 (raw file):

Previously, spaskob (Spas Bojanov) wrote…

I'd leave them if you don't mind

👍


pkg/ccl/importccl/read_import_csv.go, line 111 at r2 (raw file):

	cr.Comment = c.opts.Comment

	c.recordCh = make(chan csvRecord)

nit: I'd have made this a local in this func instead of modifying state on c


pkg/ccl/importccl/read_import_csv.go, line 125 at r2 (raw file):

		defer tracing.FinishSpan(span)
		return ctxgroup.GroupWorkers(ctx, c.parallelism, func(ctx context.Context) error {
			return c.convertRecordWorker(ctx)

IMO it'd be slightly cleaner to have convertRecordWroker take recordCh and rejected as arguments than to put them on c/in the batches, but I'm fine with it this way particular for now if it minimizes scope of patch for backports.

@spaskob
Copy link
Contributor Author

spaskob commented Nov 4, 2019

bors r+

This was promised to a client and will be backported to 19.2.1.
The feature should stay undocumented for now since the semantics
and UX are still not well understood.

To make the change work, we had to remove the parsing workers from
`conv.start` and move them to `readFile` which means that for each
file a separate set of workers will be brought up and down. Also the
tally for all total number of rejected rows was moved to `read_import_base.go`.

Release note (sql change): add undocumented experimental_save_rejected
option to CSV IMPORT.
@spaskob
Copy link
Contributor Author

spaskob commented Nov 5, 2019

bors r+

@craig
Copy link
Contributor

craig bot commented Nov 5, 2019

Build failed

@spaskob
Copy link
Contributor Author

spaskob commented Nov 5, 2019

bors r+

craig bot pushed a commit that referenced this pull request Nov 5, 2019
42112: importccl: add experimental_save_rejected option to CSV IMPORT  r=spaskob a=spaskob

This was promised to a client and will be backported to 19.2.1.
The feature should stay undocumented for now since the semantics
and UX are still not well understood.

To make the change work, we had to remove the parsing workers from
`conv.start` and move them to `readFile` which means that for each
file a separate set of workers will be brought up and down. Also the
tally for all total number of rejected rows was moved to `read_import_base.go`.

Touches #40374.

Release note (sql change): add undocumented experimental_save_rejected
option to CSV IMPORT.

Co-authored-by: Spas Bojanov <[email protected]>
@craig
Copy link
Contributor

craig bot commented Nov 5, 2019

Build succeeded

@craig craig bot merged commit 8b2a1fd into cockroachdb:master Nov 5, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants