Investigate solutions for handling encoding errors during file upload and parsing #2854

ADPennington · 2024-02-16T22:51:46Z

Background

An aggregate file was rejected in TDP due to an encoding issue, but the error messages returned were misleading and pointed to a different cause.

Further investigation showed that the file failed because of its encoding. Once the file was re-encoded as UTF-8, it was processed successfully.

This issue raises questions about how to catch and handle encoding problems early in the process, particularly at the file upload stage, and whether more helpful error messages can be provided. Additionally, there is the possibility that accepting UTF-8 with BOM could bypass the issue, which needs to be explored further.

The purpose of this spike is to explore potential solutions to better handle encoding errors, whether at the file upload stage, pre-parsing stage, or through on-the-fly encoding adjustments.

Tasks

Investigate whether encoding issues can be detected at the file upload stage (e.g., in FileUpload.jsx) and if more meaningful error messages can be provided to the user.
Research the feasibility of converting file encoding on-the-fly during the upload or parsing process, ensuring no data loss occurs.
Investigate if converting to UTF-8 or UTF-8 with BOM could be a viable solution to prevent encoding-related errors.
Create a set of tests (or integration tests) that simulate different file encodings to ensure that the system can handle various encoding formats without errors.
Document findings and potential solutions
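
The first task, detecting an encoding problem before the file is submitted, could be approached roughly as follows. This is a minimal sketch, not the project's implementation: `detectBom`, `isValidUtf8`, and `classifyEncoding` are hypothetical names, and it relies only on the standard `TextDecoder` API in strict (`fatal: true`) mode. It does not attempt to guess legacy encodings such as Latin-1, and it does not distinguish UTF-32 from UTF-16.

```javascript
// Known byte-order marks (BOMs), checked in order.
const BOMS = [
  { name: 'utf-8', bytes: [0xef, 0xbb, 0xbf] },
  { name: 'utf-16le', bytes: [0xff, 0xfe] },
  { name: 'utf-16be', bytes: [0xfe, 0xff] },
];

// Returns the name of the BOM the buffer starts with, or null if none.
function detectBom(bytes) {
  for (const bom of BOMS) {
    if (bom.bytes.every((b, i) => bytes[i] === b)) return bom.name;
  }
  return null;
}

// True when the bytes decode as UTF-8 with no malformed sequence.
function isValidUtf8(bytes) {
  try {
    new TextDecoder('utf-8', { fatal: true }).decode(bytes);
    return true;
  } catch {
    return false;
  }
}

// Returns 'utf-8', 'utf-8-bom', 'utf-16le', 'utf-16be', or 'unknown'.
function classifyEncoding(bytes) {
  const bom = detectBom(bytes);
  if (bom === 'utf-8') return isValidUtf8(bytes) ? 'utf-8-bom' : 'unknown';
  if (bom) return bom;
  return isValidUtf8(bytes) ? 'utf-8' : 'unknown';
}
```

In a browser, the input could come from `new Uint8Array(await file.arrayBuffer())` on the selected `File`; anything other than `'utf-8'` would then drive the error or warning shown to the user.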

Acceptance Criteria

Research findings on blocking file uploads with unsupported encodings (e.g., using nginx or other tools).
Evaluation of on-the-fly encoding conversion, ensuring no data loss and that files are processed correctly.
Test cases for different file encodings and UTF-8 with BOM, verifying that files are handled appropriately.
Documentation summarizing research findings, proposed solutions, and recommendations.
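
The test cases called for above might start from fixtures like the ones sketched here: the same sample text encoded several ways, each paired with the result a strict UTF-8 decode should give. All names (`SAMPLE`, `encodeFixture`, `decodeStrictUtf8`, `CASES`, `runCases`) and the sample text are illustrative assumptions, not taken from the TDP test suite.

```javascript
// Sample text with one non-ASCII character so encodings actually differ.
const SAMPLE = 'HEADER café';

// Build the bytes of `text` in one of a few encodings, for use as fixtures.
function encodeFixture(text, encoding) {
  const utf8 = new TextEncoder().encode(text);
  switch (encoding) {
    case 'utf-8':
      return utf8;
    case 'utf-8-bom':
      return Uint8Array.from([0xef, 0xbb, 0xbf, ...utf8]);
    case 'utf-16le-bom': {
      const out = [0xff, 0xfe];
      for (let i = 0; i < text.length; i += 1) {
        const unit = text.charCodeAt(i);
        out.push(unit & 0xff, unit >> 8); // low byte first
      }
      return Uint8Array.from(out);
    }
    case 'latin-1':
      return Uint8Array.from(text, (ch) => {
        const code = ch.codePointAt(0);
        if (code > 0xff) throw new RangeError(`not representable in Latin-1: ${ch}`);
        return code;
      });
    default:
      throw new Error(`unsupported fixture encoding: ${encoding}`);
  }
}

// Returns the decoded string, or null if the bytes are not valid UTF-8.
function decodeStrictUtf8(bytes) {
  try {
    return new TextDecoder('utf-8', { fatal: true }).decode(bytes);
  } catch {
    return null;
  }
}

// Expected outcome of a strict UTF-8 decode for each fixture.
const CASES = [
  { encoding: 'utf-8', expected: SAMPLE },
  { encoding: 'utf-8-bom', expected: SAMPLE }, // BOM is stripped by default
  { encoding: 'utf-16le-bom', expected: null },
  { encoding: 'latin-1', expected: null },
];

function runCases() {
  return CASES.map(({ encoding, expected }) => ({
    encoding,
    passed: decodeStrictUtf8(encodeFixture(SAMPLE, encoding)) === expected,
  }));
}
```

The same table could be fed to whichever test runner the project uses, with one test per row.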

ADPennington · 2024-05-15T13:59:36Z

This issue came up again for a tribal file submitted this week. An audit of rejected files needs to be conducted. Suggested criteria for the audit:

files submitted for FY24Q1 (last quarter)
file status == Rejected
error message includes Header length is 24...

cc: @lfrohlich @ysong001 @ttran-hub
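
As a rough illustration, the three audit criteria could be expressed as a filter like the one below. The record shape (`fiscalYear`, `quarter`, `status`, `errorMessages`) is a hypothetical stand-in, not the actual TDP data model, and the error text is matched only on the fragment quoted above.

```javascript
// Audit criteria from the comment above. Field names are assumptions.
const AUDIT_CRITERIA = {
  fiscalYear: 2024,
  quarter: 'Q1',
  status: 'Rejected',
  errorFragment: 'Header length is 24',
};

// True when a (hypothetical) file record meets every audit criterion.
function matchesAudit(file, criteria = AUDIT_CRITERIA) {
  return (
    file.fiscalYear === criteria.fiscalYear &&
    file.quarter === criteria.quarter &&
    file.status === criteria.status &&
    file.errorMessages.some((msg) => msg.includes(criteria.errorFragment))
  );
}

// Select the records worth auditing from a list of file records.
function selectForAudit(files, criteria = AUDIT_CRITERIA) {
  return files.filter((file) => matchesAudit(file, criteria));
}
```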

lhuxraft · 2024-11-15T15:54:53Z

We may want to explore accepting UTF-8 with BOM, which could potentially bypass the issue.

elipe17 · 2025-01-01T15:11:43Z

@ADPennington There are two paths I see this going.

First, if we detect that the file is not plain old UTF-8, we show the user an error, the same way we do for file extension errors, and require them to correct it. We could also provide a KC link on how to change file encoding.

Second, if we detect that the file is not UTF-8, we could warn the user that the file is not encoded as UTF-8 and that we will re-encode it as UTF-8 before submitting. If we go this route, it would make sense to install another frontend dependency like jschardet. This would also help us cover the case where the file is not UTF-8 and cannot be safely converted to UTF-8 without data loss or corruption. In that case, we would inform the user that they have some work to do to fix their file.

Both of these options can be done strictly in the frontend and would not require the backend. Let me know what you think. cc. @reitermb
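
The second path could look something like the sketch below. It is deliberately narrower than what is proposed above: instead of jschardet, it only recognizes encodings that announce themselves with a byte-order mark, and it uses the standard `TextDecoder` in strict mode so that any byte sequence that cannot be converted cleanly is rejected rather than silently replaced. `bomLabel` and `normalizeToUtf8` are hypothetical names, and the message strings are placeholders.

```javascript
// Map a leading byte-order mark to a TextDecoder label, or null if none.
function bomLabel(bytes) {
  if (bytes[0] === 0xef && bytes[1] === 0xbb && bytes[2] === 0xbf) return 'utf-8';
  if (bytes[0] === 0xff && bytes[1] === 0xfe) return 'utf-16le';
  if (bytes[0] === 0xfe && bytes[1] === 0xff) return 'utf-16be';
  return null;
}

// Try to produce BOM-free UTF-8 bytes from an uploaded file's bytes.
// Returns { ok: true, bytes, warning } or { ok: false, bytes: null, error }.
function normalizeToUtf8(bytes) {
  const label = bomLabel(bytes);
  try {
    // fatal: true makes malformed input throw instead of inserting U+FFFD,
    // so a successful decode means nothing was lost. The BOM is dropped by
    // the decoder because ignoreBOM defaults to false.
    const text = new TextDecoder(label || 'utf-8', { fatal: true }).decode(bytes);
    return {
      ok: true,
      bytes: new TextEncoder().encode(text),
      warning: label
        ? `File started with a ${label} byte-order mark and was re-encoded as plain UTF-8.`
        : null,
    };
  } catch {
    return {
      ok: false,
      bytes: null,
      error: 'File is not UTF-8 and could not be converted safely. Please re-save it as UTF-8.',
    };
  }
}
```

A result with `ok: false` corresponds to the "user has some work to do" case; a detection library would mainly widen the set of inputs that end up on the `ok: true` branch.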

ADPennington · 2025-01-09T23:58:17Z

If feasible, I like the 2nd option. In most cases I've come across, the data submitter does not know anything about encoding or how to fix it. So warning about this, fixing it if possible, and informing when it can't be fixed would be great.

ADPennington added dev spike labels Feb 16, 2024

jtimpe added the Refined Ticket has been refined at the backlog refinement label Feb 27, 2024

robgendron added Old and removed Old labels Apr 16, 2024

andrew-jameson added the triage Needs to be triaged label Oct 10, 2024

lhuxraft removed the triage Needs to be triaged label Nov 15, 2024

lhuxraft removed the Refined Ticket has been refined at the backlog refinement label Nov 25, 2024

lhuxraft changed the title ~~Spike - catch file encoding error?~~ Investigate solutions for handling encoding errors during file upload and parsing Dec 11, 2024

lhuxraft added Refined Ticket has been refined at the backlog refinement P2 Needed – Can wait indefinitely and removed P2 Needed – Can wait indefinitely labels Dec 11, 2024

lhuxraft mentioned this issue Dec 23, 2024

Data Validation & Integrity Improvements #3366

Open

12 tasks

elipe17 self-assigned this Jan 1, 2025

elipe17 mentioned this issue Jan 9, 2025

Catch File Encoding Errors with Package #3414

Closed

28 tasks

elipe17 linked a pull request Jan 21, 2025 that will close this issue

Updated File Upload to Auto Encode to UTF-8 #3438

Open

28 tasks

lhuxraft mentioned this issue Jan 22, 2025

Implement FRA report selection and upload interface #3398

Open

9 tasks
