Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

save the offending data import is rejecting #40374

Closed
robert-s-lee opened this issue Aug 30, 2019 · 4 comments
Closed

save the offending data import is rejecting #40374

robert-s-lee opened this issue Aug 30, 2019 · 4 comments
Assignees
Labels
A-disaster-recovery C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)

Comments

@robert-s-lee
Copy link
Contributor

Is your feature request related to a problem? Please describe.

the import command stops immediately when the data does not match import specifications.

pq: nodelocal:///quotetestdata.csv: row 4: expected 3 fields, got 2

it takes extra steps to look at the row and find the offending lines and find each and every line. This becomes iterative process if there are many errors, especially if the offending data is at the end of the large data files that takes hours to process.

Describe the solution you'd like

save the offending data to another destination so that the file can be re-processed without having to reprocess the entire file again only to find one error at a time.

Describe alternatives you've considered

Additional context

@awoods187 awoods187 added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Aug 30, 2019
@rolandcrosby rolandcrosby assigned spaskob and unassigned rolandcrosby Sep 19, 2019
@rolandcrosby
Copy link

To support this, we can add additional IMPORT syntax to allow the user to pass an additional cloud storage URI, to which we will write any malformed input lines. The code to parse cloud storage URIs and write to cloud storage is in export_storage.go. Statements that use this functionality include BACKUP, EXPORT, and CREATE CHANGEFEED.

As this is being added rather late in the release, I'd hesitate to implement this via a SQL grammar change at this point. Instead, we could implement this via an experimental WITH option to start, and consider reifying it in the grammar in 20.1 after getting additional feedback from the SQL team.

I'm thinking the syntax could be something like:

import table foo (a int primary key, b string, c string) delimited data ('s3://path/to/infile.csv') with experimental_malformed_input_uri='s3://place/to/store/malformed/input'

We should try to implement this for the delimited and csv import formats to start. I think it should be possible to make this work for the mysqldump and pg_dump formats in the future as well, but there are other open UX questions we'll need to answer if we want to do that.

Open question: Should the cloud storage URI be a file name or directory name? Since IMPORT runs in parallel across multiple nodes, if we want to write all the malformed lines to a single file, we'll need to buffer all the malformed lines on a single coordinator node and have that node write the malformed lines to a file all at once. Alternatively, if we allow the path to be a directory, we could allow each node to accumulate the malformed lines separately and then write the data as they wish.

spaskob pushed a commit to spaskob/cockroach that referenced this issue Sep 23, 2019
… fields

Previously DELIMITED import mode will complain about importing field
`"This a nested single quote " inside quoted field"`
when importing using option `WITH fields_enclosed_by '"'`.
MySql implementation handles this case correctly.
For more info and examples see cockroachdb#40959.

Touches cockroachdb#40374.
Fixes cockroachdb#40959.

Release note (enterprise change): Fixes incorrect behavior for enclosed fields.

Release justification: This is low risk since only business logic of
import has changed. Moreover this functionality has been requested
from a potential client.
@robert-s-lee
Copy link
Contributor Author

@rolandcrosby we can ask if single or multiple files are preferred with AWM.

spaskob pushed a commit to spaskob/cockroach that referenced this issue Sep 24, 2019
… fields

Previously DELIMITED import mode will complain about importing field
`"This a nested single quote " inside quoted field"`
when importing using option `WITH fields_enclosed_by '"'`.
MySql implementation handles this case correctly.
For more info and examples see cockroachdb#40959.

Touches cockroachdb#40374.
Fixes cockroachdb#40959.

Release note (enterprise change): Fixes incorrect behavior for enclosed fields.

Release justification: This is low risk since only business logic of
import has changed. Moreover this functionality has been requested
from a potential client.
craig bot pushed a commit that referenced this issue Sep 24, 2019
40960: importccl: DELIMITED should handle delimiters only at end of fields r=spaskob a=spaskob

Previously DELIMITED import mode will complain about importing field
`"This a nested single quote " inside quoted field"`
when importing using option `WITH fields_enclosed_by '"'`.
Mysql implementation handles this case correctly.
For more info and examples see #40959.

Touches #40374.
Fixes #40959.

Release note (enterprise change): Fixes incorrect behavior for enclosed fields.

Release justification: This is low risk since only business logic of
import has changed. Moreover this functionality has been requested
from a potential client.

Co-authored-by: spaskob <[email protected]>
craig bot pushed a commit that referenced this issue Oct 14, 2019
41430: importccl: enabling saving of bad csv rows to a side file r=spaskob a=spaskob

importccl: enabling saving of bad csv rows to a side file

When importing from large csv files it's common to have a few offending rows.
Currently `IMPORT` will abort at the first error which makes for a tedious
task of manually fixing ech problem and re-running. Instead users can specify
new option `WITH experimental_save_rejected` which will not stop on bad rows
but save them in a side file called `<original_csv_file>.rejected` and
continue. The user then can re-run the import command using `IMPORT INTO` using
the rejected file after fixing the problems in it.

Touches #40374.

Release note (sql change): enable skipping of faulty rows in IMPORT.

Co-authored-by: Spas Bojanov <[email protected]>
spaskob added a commit to spaskob/cockroach that referenced this issue Oct 30, 2019
This is part of an ongoing refactor to simplify the IMPORT code base.
Particularly here we remove calls to inputFinished which is supposed to be
called after all input files are ingested to close the cannel kvCh on which
the KVs are sent to the routine that drains this channel and sends them to KV.
Instead the creation and close of the channel is moved closer to where it is used.
inputFinished was really only used for a special case in CSV where we have a fan out to
a set of workers that forward the KVs to kvCh and so the closing logic needs to be called
after these workers are done. Now instead the reading of the files and the workers are grouped
so that we can wait for all routines from the group to finish and then close the channel.
This will simplify how we save rejected rows. See the issue below.

Touches: cockroachdb#40374.

Release note: none.
craig bot pushed a commit that referenced this issue Oct 30, 2019
41972: importccl: remove unnecessary interface function r=spaskob a=spaskob

importccl: remove unnecessary interface function
    
This is part of an ongoing refactor to simplify the IMPORT code base.
 Particularly here we remove calls to inputFinished which is supposed to be
 called after all input files are ingested to close the cannel kvCh on which
 the KVs are sent to the routine that drains this channel and sends them to KV.
 Instead the creation and close of the channel is moved closer to where it is used.
 inputFinished was really only used for a special case in CSV where we have a fan out to
 a set of workers that forward the KVs to kvCh and so the closing logic needs to be called
 after these workers are done. Now instead the reading of the files and the workers are grouped
 so that we can wait for all routines from the group to finish and then close the channel.
 This will simplify how we save rejected rows. See the issue below.
 
 Touches: #40374.
    
 Release note: none.

Co-authored-by: Spas Bojanov <[email protected]>
craig bot pushed a commit that referenced this issue Nov 5, 2019
42112: importccl: add experimental_save_rejected option to CSV IMPORT  r=spaskob a=spaskob

This was promised to a client and will be backported to 19.2.1.
The feature should stay undocumented for now since the semantics
and UX are still not well understood.

To make the change work, we had to remove the parsing workers from
`conv.start` and move them to `readFile` which means that for each
file a separate set of workers will be brought up and down. Also the
tally for all total number of rejected rows was moved to `read_import_base.go`.

Touches #40374.

Release note (sql change): add undocumented experimental_save_rejected
option to CSV IMPORT.

Co-authored-by: Spas Bojanov <[email protected]>
spaskob added a commit to spaskob/cockroach that referenced this issue Nov 12, 2019
This is part of an ongoing refactor to simplify the IMPORT code base.
Particularly here we remove calls to inputFinished which is supposed to be
called after all input files are ingested to close the cannel kvCh on which
the KVs are sent to the routine that drains this channel and sends them to KV.
Instead the creation and close of the channel is moved closer to where it is used.
inputFinished was really only used for a special case in CSV where we have a fan out to
a set of workers that forward the KVs to kvCh and so the closing logic needs to be called
after these workers are done. Now instead the reading of the files and the workers are grouped
so that we can wait for all routines from the group to finish and then close the channel.
This will simplify how we save rejected rows. See the issue below.

Touches: cockroachdb#40374.

Release note: none.
craig bot pushed a commit that referenced this issue Mar 2, 2020
45269: importccl: Parallelize avro import r=miretskiy a=miretskiy

Parallelize avro importer to improve its throughput (2.8x improvement).

Touches #40374.
Fixes #45097.

Release notes (performance): Faster avro import

45482: storage: integrate Concurrency Manager into Replica request path r=nvanbenschoten a=nvanbenschoten

Related to #41720.
Related to #44976.

This commit integrates the new concurrency package into the storage package. Each Replica is given a concurrency manager, which replaces its existing latch manager and txn wait queue. The change also uses the concurrency manager to simplify the role of the intent resolver. The intent resolver no longer directly handles WriteIntentErrors. As a result, we are able to delete the contention queue entirely.

With this change, all requests are now sequenced through the concurrency manager. When sequencing, latches are acquired and conflicting locks are detected. If any locks are found, the requests wait in lock wait-queues for the locks to be resolved. This is a major deviation from how things currently work because today, even with the contention queue, requests end up waiting for conflicting transactions to commit/abort in the txnWaitQueue after at least one RPC. Now, requests wait directly next to the intents/locks that they are waiting on and react directly to the resolution of these intents/locks.

Once requests are sequenced by the concurrency manager, they are theoretically fully isolated from all conflicting requests. However, this is not strictly true today because we have not yet pulled all replicated locks into the concurrency manager's lock table. We will do so in a future change. Until then, the concurrency manager maintains a notion of "intent discovery", which is integrated into the Replica-level concurrency retry loop.

Performance numbers will be published shortly. This will be followed by performance numbers using the SELECT FOR UPDATE locking (#40205) improvements that this change enables.

45484: sql: simplify connection state machine - stop tracking retry intent r=andreimatei a=andreimatei

Before this patch, the SQL connection state machine had an optimization:
if a transaction that hadn't used "SAVEPOINT cockroach_restart"
encountered a retriable error that we can't auto-retry, then we'd
release the txn's locks eagerly and enter the Aborted state. As opposed
to transactions that had used the "SAVEPOINT cockroach_restart", which
go to RestartWait.
This optimization is a significant complication for the state machine,
so this patch is removing it. All transactions now go to RestartWait,
and wait for a ROLLBACK to release the locks.

On the flip side, doing "RELEASE SAVEPOINT cockroach_restart" and
"ROLLBACK SAVEPOINT cockroach_restart" now works even for transactions
that haven't explicitly declared that savepoint, which is nice. Although
I don't promise I'll keep it working.

Release note: None

Co-authored-by: Yevgeniy Miretskiy <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Andrei Matei <[email protected]>
@spaskob spaskob assigned miretskiy and unassigned spaskob Mar 3, 2020
@miretskiy
Copy link
Contributor

@spaskob can you describe what else do you see happing on this issue?

@spaskob
Copy link
Contributor

spaskob commented Mar 4, 2020

Nothing in particular - feel free to close it after your PR merges.
It makes sense for you to own it since I am no longer actively working on IMPORT.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-disaster-recovery C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
Projects
None yet
Development

No branches or pull requests

5 participants