Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle non-unicode encoded datafiles #421

Closed
sp35 opened this issue Jul 17, 2021 · 5 comments · Fixed by #688
Closed

Handle non-unicode encoded datafiles #421

sp35 opened this issue Jul 17, 2021 · 5 comments · Fixed by #688
Assignees
Labels
work: backend Related to Python, Django, and simple SQL

Comments

@sp35
Copy link

sp35 commented Jul 17, 2021

Description

Upload CSV feature fails for non utf-8 encoded CSVs. The exception is unhandled so the server returns 500.
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xae in position 265: invalid start byte

Expected behavior

We can ask for the encoding of the CSV file in the UI and handle the errors if any.

To Reproduce

  1. Upload non utf-8 encoded CSV (e.g.- unknown-8bit: https://sample-videos.com/csv/Sample-Spreadsheet-10-rows.csv)
  2. Observe error (Internal Server Error)

Environment

Docker

@sp35 sp35 added the type: bug label Jul 17, 2021
@sp35 sp35 changed the title Handle non unicode encoded CSVs datafiles Handle non-unicode encoded datafiles Jul 17, 2021
@kgodey
Copy link
Contributor

kgodey commented Jul 18, 2021

Thanks for reporting this bug, @sp35!

If possible, I'd like to avoid asking users for the encoding since we're aiming to be friendly to non-technical users. I think we should investigate a solution that tries different encoding options automatically.

@kgodey kgodey added help wanted Community contributors can implement this ready Ready for implementation work: backend Related to Python, Django, and simple SQL good first issue Everything in "Help wanted", PLUS being relatively easy and straightforward to implement. labels Jul 18, 2021
@kgodey kgodey added this to the 05. Tables from File Import milestone Jul 18, 2021
@asharonbaltazar
Copy link
Contributor

I've used a small 2kb sample file I found on the web and it failed with this error:

Unable to tabulate data

Here's the file I used:

grades.csv

@eito-fis
Copy link
Contributor

eito-fis commented Aug 17, 2021

The grades file seems to cause the Unable to tabulate data error due to a missing comma on line 10. However, after fixing that, we still get an error - it seems like having quotes in the column name also breaks things. I'll open a new issue for this.

Edit: #572

@kgodey kgodey modified the milestones: 2021-08 improvements, 06. 2021-09 Stability, 2021-09 improvements Aug 30, 2021
@andyharley-cci andyharley-cci removed their assignment Sep 13, 2021
@silentninja
Copy link
Contributor

Proposal


Use encoding detection libraries like Charset Normalizer or cChardet.

Issues


Blocking request - Encoding detection is not instant and must be dealt asynchronously. This is also the case with creating data from csv(will be blocking for huge csv files) and solving it should also solve this problem.
Return null instead of throwing an exception

Remarks


I am using charset-normalizer and haven't had issues so far, so I am leaning towards using that library.
I can send a PR if this sounds good.

@kgodey
Copy link
Contributor

kgodey commented Sep 27, 2021

@silentninja That sounds good to me. If you're still interested, please feel free to send a PR.

We've put off handling operations asynchronously for now in order to avoid adding additional dependencies such as a task queue, etc. for Mathesar. Let's handle this synchronously for now and create a separate issue for handling it asynchronously.

silentninja added a commit to silentninja/mathesar that referenced this issue Sep 30, 2021
@kgodey kgodey moved this from Ready to Review in [NO LONGER USED] Mathesar work tracker Oct 8, 2021
@kgodey kgodey removed good first issue Everything in "Help wanted", PLUS being relatively easy and straightforward to implement. help wanted Community contributors can implement this ready Ready for implementation labels Oct 12, 2021
@kgodey kgodey added pr-status: review A PR awaiting review status: started and removed pr-status: review A PR awaiting review labels Oct 12, 2021
@kgodey kgodey moved this from Review to Started in [NO LONGER USED] Mathesar work tracker Oct 12, 2021
Repository owner moved this from Started to Done in [NO LONGER USED] Mathesar work tracker Nov 10, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
work: backend Related to Python, Django, and simple SQL
Projects
No open projects
Development

Successfully merging a pull request may close this issue.

6 participants