read_csv cannot correctly read a specific text due to "remaining bytes non-empty" #20062

Open
2 tasks done
hanjinliu opened this issue Nov 29, 2024 · 1 comment
Labels
bug Something isn't working needs repro Bug does not yet have a reproducible example python Related to Python Polars

Comments

@hanjinliu

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

I encountered this problem while analyzing my experimental data with Polars.
I tried to reduce it to a minimal example; the following text is the simplest one I found that still reproduces the error.

import polars as pl
from io import StringIO
buf = StringIO("""A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,T,U,V
,"B",,,,,,,,,A,,,,,,,,
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,0.0,1.0,2.0,3.0
""")
pl.read_csv(buf)
Output
ComputeError                              Traceback (most recent call last)
Cell In[45], line 5
      1 buf = StringIO("""A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,T,U,V
      2 ,"B",,,,,,,,,A,,,,,,,,
      3 a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,0.0,1.0,2.0,3.0
      4 """)
----> 5 pl.read_csv(buf)

File ~\mambaforge\envs\mt\Lib\site-packages\polars\_utils\deprecation.py:92, in deprecate_renamed_parameter.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
     87 @wraps(function)
     88 def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
     89     _rename_keyword_argument(
     90         old_name, new_name, kwargs, function.__qualname__, version
     91     )
---> 92     return function(*args, **kwargs)

File ~\mambaforge\envs\mt\Lib\site-packages\polars\_utils\deprecation.py:92, in deprecate_renamed_parameter.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
     87 @wraps(function)
     88 def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
     89     _rename_keyword_argument(
     90         old_name, new_name, kwargs, function.__qualname__, version
     91     )
---> 92     return function(*args, **kwargs)

File ~\mambaforge\envs\mt\Lib\site-packages\polars\_utils\deprecation.py:92, in deprecate_renamed_parameter.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
     87 @wraps(function)
     88 def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
     89     _rename_keyword_argument(
     90         old_name, new_name, kwargs, function.__qualname__, version
     91     )
---> 92     return function(*args, **kwargs)

File ~\mambaforge\envs\mt\Lib\site-packages\polars\io\csv\functions.py:527, in read_csv(source, has_header, columns, new_columns, separator, comment_prefix, quote_char, skip_rows, schema, schema_overrides, null_values, missing_utf8_is_empty_string, ignore_errors, try_parse_dates, n_threads, infer_schema, infer_schema_length, batch_size, n_rows, encoding, low_memory, rechunk, use_pyarrow, storage_options, skip_rows_after_header, row_index_name, row_index_offset, sample_size, eol_char, raise_if_empty, truncate_ragged_lines, decimal_comma, glob)
    519 else:
    520     with prepare_file_arg(
    521         source,
    522         encoding=encoding,
   (...)
    525         storage_options=storage_options,
    526     ) as data:
--> 527         df = _read_csv_impl(
    528             data,
    529             has_header=has_header,
    530             columns=columns if columns else projection,
    531             separator=separator,
    532             comment_prefix=comment_prefix,
    533             quote_char=quote_char,
    534             skip_rows=skip_rows,
    535             schema_overrides=schema_overrides,
    536             schema=schema,
    537             null_values=null_values,
    538             missing_utf8_is_empty_string=missing_utf8_is_empty_string,
    539             ignore_errors=ignore_errors,
    540             try_parse_dates=try_parse_dates,
    541             n_threads=n_threads,
    542             infer_schema_length=infer_schema_length,
    543             batch_size=batch_size,
    544             n_rows=n_rows,
    545             encoding=encoding if encoding == "utf8-lossy" else "utf8",
    546             low_memory=low_memory,
    547             rechunk=rechunk,
    548             skip_rows_after_header=skip_rows_after_header,
    549             row_index_name=row_index_name,
    550             row_index_offset=row_index_offset,
    551             eol_char=eol_char,
    552             raise_if_empty=raise_if_empty,
    553             truncate_ragged_lines=truncate_ragged_lines,
    554             decimal_comma=decimal_comma,
    555             glob=glob,
    556         )
    558 if new_columns:
    559     return _update_columns(df, new_columns)

File ~\mambaforge\envs\mt\Lib\site-packages\polars\io\csv\functions.py:672, in _read_csv_impl(source, has_header, columns, separator, comment_prefix, quote_char, skip_rows, schema, schema_overrides, null_values, missing_utf8_is_empty_string, ignore_errors, try_parse_dates, n_threads, infer_schema_length, batch_size, n_rows, encoding, low_memory, rechunk, skip_rows_after_header, row_index_name, row_index_offset, sample_size, eol_char, raise_if_empty, truncate_ragged_lines, decimal_comma, glob)
    668         raise ValueError(msg)
    670 projection, columns = parse_columns_arg(columns)
--> 672 pydf = PyDataFrame.read_csv(
    673     source,
    674     infer_schema_length,
    675     batch_size,
    676     has_header,
    677     ignore_errors,
    678     n_rows,
    679     skip_rows,
    680     projection,
    681     separator,
    682     rechunk,
    683     columns,
    684     encoding,
    685     n_threads,
    686     path,
    687     dtype_list,
    688     dtype_slice,
    689     low_memory,
    690     comment_prefix,
    691     quote_char,
    692     processed_null_values,
    693     missing_utf8_is_empty_string,
    694     try_parse_dates,
    695     skip_rows_after_header,
    696     parse_row_index_args(row_index_name, row_index_offset),
    697     eol_char=eol_char,
    698     raise_if_empty=raise_if_empty,
    699     truncate_ragged_lines=truncate_ragged_lines,
    700     decimal_comma=decimal_comma,
    701     schema=schema,
    702 )
    703 return wrap_df(pydf)

ComputeError: could not parse `a` as dtype `f64` at column 'T' (column number 20)

The current offset in the file is 23 bytes.

You might want to try:
- increasing `infer_schema_length` (e.g. `infer_schema_length=10000`),
- specifying correct dtype with the `schema_overrides` argument
- setting `ignore_errors` to `True`,
- adding `a` to the `null_values` list.

Original error: ```remaining bytes non-empty```

When I read the same text with additional keyword arguments, the resulting DataFrame was truncated to a single row.

buf = StringIO("""A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,T,U,V
,"B",,,,,,,,,A,,,,,,,,
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,0.0,1.0,2.0,3.0
""")
pl.read_csv(buf, ignore_errors=True, truncate_ragged_lines=True)

Output:

shape: (1, 22)
┌──────┬─────┬──────┬──────┬───┬──────┬──────┬──────┬──────┐
│ A    ┆ B   ┆ C    ┆ D    ┆ … ┆ S    ┆ T    ┆ U    ┆ V    │
│ ---  ┆ --- ┆ ---  ┆ ---  ┆   ┆ ---  ┆ ---  ┆ ---  ┆ ---  │
│ str  ┆ str ┆ str  ┆ str  ┆   ┆ str  ┆ f64  ┆ f64  ┆ f64  │
╞══════╪═════╪══════╪══════╪═══╪══════╪══════╪══════╪══════╡
│ null ┆ B   ┆ null ┆ null ┆ … ┆ null ┆ null ┆ null ┆ null │
└──────┴─────┴──────┴──────┴───┴──────┴──────┴──────┴──────┘
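To check independently of Polars whether the rows in this text have the same number of fields as the header, one could count fields per line with Python's standard `csv` module. This is only a diagnostic sketch: the `csv` module's quoting rules are assumed to be close enough to those of `read_csv` for this input, and the variable names are illustrative.

```python
import csv
from io import StringIO

# Same text as in the reproducible example above.
text = """A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,T,U,V
,"B",,,,,,,,,A,,,,,,,,
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,0.0,1.0,2.0,3.0
"""

# Parse with the stdlib reader (default comma separator, double-quote quoting).
rows = list(csv.reader(StringIO(text)))

# Number of fields on each line; the first entry is the header.
field_counts = [len(row) for row in rows]

for line_no, n_fields in enumerate(field_counts, start=1):
    print(f"line {line_no}: {n_fields} fields")
```

If the counts differ from the header's, the lines are ragged, which may be relevant to how `truncate_ragged_lines=True` behaves here.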

Log output

No response

Issue description

The same kind of data could be read with an older version of Polars, but I do not remember which version it was.

Expected behavior

read_csv should be able to read this kind of text.

Installed versions

--------Version info---------
Polars:              1.15.0
Index type:          UInt32
Platform:            Windows-10-10.0.22631-SP0
Python:              3.11.9 | packaged by conda-forge | (main, Apr 19 2024, 18:27:10) [MSC v.1938 64 bit (AMD64)]
LTS CPU:             False

----Optional dependencies----
adbc_driver_manager  <not installed>
altair               <not installed>
boto3                <not installed>
cloudpickle          3.0.0
connectorx           <not installed>
deltalake            <not installed>
fastexcel            0.10.4
fsspec               2024.5.0
gevent               <not installed>
google.auth          <not installed>
great_tables         <not installed>
matplotlib           3.9.2
nest_asyncio         1.6.0
numpy                2.1.3
openpyxl             3.1.5
pandas               2.2.3
pyarrow              16.1.0
pydantic             2.9.2
pyiceberg            <not installed>
sqlalchemy           <not installed>
torch                2.5.1+cpu
xlsx2csv             <not installed>
xlsxwriter           <not installed>
@hanjinliu hanjinliu added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Nov 29, 2024
@ritchie46
Member

I cannot reproduce. I added a test: #20182

@ritchie46 ritchie46 added needs repro Bug does not yet have a reproducible example and removed needs triage Awaiting prioritization by a maintainer labels Dec 6, 2024