
CSV validation behaviour is surprising and buggy #401

Open
zachary-povey opened this issue Aug 30, 2024 · 1 comment


zachary-povey commented Aug 30, 2024

Overview

Hi, so I've come to datacontract-cli hoping to replace our current piecemeal data import validation and to formally define our team's expectations regarding the data we receive (I realise that, in theory, contracts should be owned by data producers, but that's not the reality at my org).

One of our use cases, which the docs seem to suggest is supported, doesn't really work with this CLI as far as I can tell: validating an example CSV file against the model schema.

I understand that this is probably because, in reality, CSV files have no proper types, so validating them against the schema is not straightforward. However, this kind of validation is a valid use case* and, from the docs, it looks like something the CLI would do: they say CSV is supported and that the schema is validated as part of the test command.

* Something we currently do using frictionless.
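
For context, a minimal sketch of the kind of value-level check frictionless gives us today (the field names and file name here are hypothetical, not the attached example; frictionless v5 API):

from frictionless import Resource, Schema

# Hypothetical schema mirroring a simple datacontract model.
schema = Schema.from_descriptor({
    "fields": [
        {"name": "string_field", "type": "string"},
        {"name": "integer_field", "type": "integer"},
    ]
})

# validate() checks both structure (columns present) and values: a string
# in integer_field is reported as a row-level type error.
report = Resource("example.csv", schema=schema).validate()
print(report.valid)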

Detailed Examples

I've attached a zip file containing a datacontract YAML and some example files to demonstrate my point, including an equivalent example with the same data in JSON format. It highlights how the behaviour differs between the two formats, which is surprising given that the difference is not documented.

datacontract_csv_validation_github_issue.zip

Bad Data Type

In this example I have simply set the integer_field to be equal to a string value. Running the test against the bad_data_type_json server gives exactly the result we would expect:
[screenshot: test output showing the expected type-check failure]

However when we run test for bad_data_type_csv we get this:
[screenshot: test output with no failure for the bad value]

On inspection, it appears that for CSV files the CLI basically just checks that all the fields are present during a test, but does not actually check the values.
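
To make that concrete with a hypothetical stand-in (not the actual files from the zip): for a model declaring integer_field as type integer, a CSV like

string_field,integer_field
hello,not_a_number

passes the test on the CSV server, while the equivalent JSON record is correctly rejected.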

Missing Field

Given the above result, I thought I'd check whether the validation would at least pick up missing fields in my CSV, so I set up servers for JSON/CSV files where integer_field was missing completely.

Again, when running against missing_field_json I got what I was expecting:
[screenshot: expected missing-field error]

But when running against missing_field_csv, I got what appears to be an unhandled error:
[screenshot: unhandled exception in the test output]

(I think there is also a slight subtlety here with nulls vs actually missing keys/columns but I'll avoid getting into that for brevity).

Summary

My basic question is: is schema validation of CSV data ever likely to be in scope for this tool? If not, I can understand to some extent taking the line that CSV validation against a tech-agnostic schema can't really go much further than checking that columns exist without getting messy, but that does limit its usefulness as a tool, and I think it needs making more explicit in the documentation.


ctrabold commented Jan 8, 2025

I probably face the same issue as @zachary-povey but see much more verbose output (see below).

Any pointers on how to tackle this would be appreciated, as it basically renders the CSV functionality of datacontract-cli useless. Isn't the whole point of the contract test to catch missing columns or wrongly typed columns (as in: if a column contains unquoted numbers, declaring it type: text should fail the test)? Or am I missing a point here? 🤔

To reproduce

The version I run:

datacontract --version
0.10.16

My test data:

cat test_data.csv
A,B
1,2

The contract test (redacted for brevity):

cat datacontract.yaml
servers:
  local:
    type: local
    path: test_data.csv
    format: csv
...
models:
  my_model:
    description: Test Model
    type: table
    fields:
      A:
        type: text
        description: A Test
        required: true
      B:
        type: text
        description: B Test
        required: true
      C:
        type: text
        description: C Test
        required: true

Running the test fails loudly with exceptions:

datacontract test --server local datacontract.yaml
Testing datacontract.yaml
ERROR:root:Exception occurred
Traceback (most recent call last):
  File ".venv/lib/python3.13/site-packages/datacontract/data_contract.py", line 202, in test
    check_soda_execute(run, data_contract, server, self._spark, tmp_dir)
    ~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".venv/lib/python3.13/site-packages/datacontract/engines/soda/check_soda_execute.py", line 28, in check_soda_execute
    con = get_duckdb_connection(data_contract, server, run)
  File ".venv/lib/python3.13/site-packages/datacontract/engines/soda/connections/duckdb.py", line 50, in get_duckdb_connection
    con.sql(
    ~~~~~~~^
        f"""CREATE VIEW "{model_name}" AS SELECT * FROM read_csv('{model_path}', hive_partitioning=1, columns={columns});"""
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
duckdb.duckdb.InvalidInputException: Invalid Input Error: Error when sniffing file "test_data.csv".
It was not possible to automatically detect the CSV Parsing dialect/types
The search space used was:
Delimiter Candidates: ',', '|', ';', ' '
Quote/Escape Candidates: ['"','"'],['"','\0'],['"','''],['"','\'],[''','\'],['\0','\0']
Comment Candidates: '#', '\0'
Possible fixes:

* Set delimiter (e.g., delim=',')
* Set quote (e.g., quote='"')
* Set escape (e.g., escape='"')
* Set comment (e.g., comment='#')
* Set skip (skip=${n}) to skip ${n} lines at the top of the file
* Enable ignore errors (ignore_errors=true) to ignore potential errors
* Enable null padding (null_padding=true) to pad missing columns with NULL values
* Check you are using the correct file compression, otherwise set it (e.g., compression = 'zstd')

ERROR:root:Invalid Input Error: Error when sniffing file "test_data.csv".
It was not possible to automatically detect the CSV Parsing dialect/types
The search space used was:
Delimiter Candidates: ',', '|', ';', ' '
Quote/Escape Candidates: ['"','"'],['"','\0'],['"','''],['"','\'],[''','\'],['\0','\0']
Comment Candidates: '#', '\0'
Possible fixes:

* Set delimiter (e.g., delim=',')
* Set quote (e.g., quote='"')
* Set escape (e.g., escape='"')
* Set comment (e.g., comment='#')
* Set skip (skip=${n}) to skip ${n} lines at the top of the file
* Enable ignore errors (ignore_errors=true) to ignore potential errors
* Enable null padding (null_padding=true) to pad missing columns with NULL values
* Check you are using the correct file compression, otherwise set it (e.g., compression = 'zstd')

╭────────┬────────────────────┬───────┬────────────────────────────────────────────────────╮
│ Result │ Check              │ Field │ Details                                            │
├────────┼────────────────────┼───────┼────────────────────────────────────────────────────┤
│ error  │ Test Data Contract │       │ Invalid Input Error: Error when sniffing file      │
│        │                    │       │ "test_data.csv".                                   │
│        │                    │       │ It was not possible to automatically detect the    │
│        │                    │       │ CSV Parsing dialect/types                          │
│        │                    │       │ The search space used was:                         │
│        │                    │       │ Delimiter Candidates: ',', '|', ';', ' '           │
│        │                    │       │ Quote/Escape Candidates:                           │
│        │                    │       │ ['"','"'],['"','\0'],['"','''],['"','\'],[''','\'… │
│        │                    │       │ Comment Candidates: '#', '\0'                      │
│        │                    │       │ Possible fixes:                                    │
│        │                    │       │ * Set delimiter (e.g., delim=',')                  │
│        │                    │       │ * Set quote (e.g., quote='"')                      │
│        │                    │       │ * Set escape (e.g., escape='"')                    │
│        │                    │       │ * Set comment (e.g., comment='#')                  │
│        │                    │       │ * Set skip (skip=${n}) to skip ${n} lines at the   │
│        │                    │       │ top of the file                                    │
│        │                    │       │ * Enable ignore errors (ignore_errors=true) to     │
│        │                    │       │ ignore potential errors                            │
│        │                    │       │ * Enable null padding (null_padding=true) to pad   │
│        │                    │       │ missing columns with NULL values                   │
│        │                    │       │ * Check you are using the correct file             │
│        │                    │       │ compression, otherwise set it (e.g., compression = │
│        │                    │       │ 'zstd')                                            │
╰────────┴────────────────────┴───────┴────────────────────────────────────────────────────╯
🔴 data contract is invalid, found the following errors:

1) Test Data Contract: Invalid Input Error: Error when sniffing file "test_data.csv".
It was not possible to automatically detect the CSV Parsing dialect/types
The search space used was:
Delimiter Candidates: ',', '|', ';', '  '
Quote/Escape Candidates: ['"','"'],['"','\0'],['"','''],['"','\'],[''','\'],['\0','\0']
Comment Candidates: '#', '\0'
Possible fixes:

* Set delimiter (e.g., delim=',')
* Set quote (e.g., quote='"')
* Set escape (e.g., escape='"')
* Set comment (e.g., comment='#')
* Set skip (skip=${n}) to skip ${n} lines at the top of the file
* Enable ignore errors (ignore_errors=true) to ignore potential errors
* Enable null padding (null_padding=true) to pad missing columns with NULL values
* Check you are using the correct file compression, otherwise set it (e.g., compression = 'zstd')
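
As far as I can tell, the same error can be reproduced outside the CLI with plain DuckDB, and DuckDB's own suggested null_padding fix looks applicable here. A sketch (my reading of the DuckDB docs, not a verified fix for datacontract-cli):

import duckdb

# Same shape as the read_csv call the CLI generates (see traceback above).
columns = {"A": "VARCHAR", "B": "VARCHAR", "C": "VARCHAR"}

# This should fail with the identical sniffing error: no dialect yields
# three columns from a two-column file.
# duckdb.sql(f"SELECT * FROM read_csv('test_data.csv', columns={columns})")

# With null_padding=true (one of the suggested fixes), the missing column C
# should come back as NULL instead of aborting the sniff.
duckdb.sql(
    f"SELECT * FROM read_csv('test_data.csv', null_padding=true, columns={columns})"
).show()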

Expected behaviour

I was expecting something like this:

Testing datacontract.yaml
╭────────┬────────────────────────────────────────────────┬───────┬─────────╮
│ Result │ Check                                          │ Field │ Details │
├────────┼────────────────────────────────────────────────┼───────┼─────────┤
│ passed │ Check that field A is present                  │       │         │
│ passed │ Check that field B is present                  │       │         │
│ error  │ Check that field C is present                  │       │         │
│ passed │ Check that required field A has no null values │ A     │         │
│ passed │ Check that required field B has no null values │ B     │         │
╰────────┴────────────────────────────────────────────────┴───────┴─────────╯
🔴 data contract is invalid, found the following errors:
Field C is missing

The same happens when the CSV file contains columns that are not covered by the contract.

To me it looks like DuckDB is rejecting the data before the data hits the contract tests, or am I wrong?

Any pointers on where this can be tackled in the code would be appreciated (I'm very new to the project). Thanks!
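
In case it helps whoever picks this up: the failing call from the traceback lives in get_duckdb_connection in datacontract/engines/soda/connections/duckdb.py. One possible direction (an untested sketch that just transplants DuckDB's own suggestion) would be to pad missing columns there, so an absent column surfaces as NULLs and fails the required checks instead of crashing the sniffer:

con.sql(
    f"""CREATE VIEW "{model_name}" AS
    SELECT * FROM read_csv('{model_path}', hive_partitioning=1,
                           null_padding=true, columns={columns});"""
)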
