Investigate solutions for handling encoding errors during file upload and parsing #2854

Open
9 tasks
ADPennington opened this issue Feb 16, 2024 · 4 comments · May be fixed by #3438
Assignees
Labels
dev Refined Ticket has been refined at the backlog refinement spike

Comments

@ADPennington
Collaborator

ADPennington commented Feb 16, 2024

Background

An aggregate file was rejected in TDP due to an encoding issue, but the error messages returned were misleading, indicating a different cause:
[Screenshot: encodingerror]

After further investigation, it was discovered that the file failed because of its encoding format. When the file was re-encoded to UTF-8, the file was processed successfully:
[Screenshot: encoding]

This issue raises questions about how to catch and handle encoding problems early in the process, particularly at the file upload stage, and whether more helpful error messages can be provided. Additionally, there is the possibility that accepting UTF-8 with BOM could bypass the issue, which needs to be explored further.

The purpose of this spike is to explore potential solutions to better handle encoding errors, whether at the file upload stage, pre-parsing stage, or through on-the-fly encoding adjustments.

Tasks

  • Investigate whether encoding issues can be detected at the file upload stage (e.g., in FileUpload.jsx) and if more meaningful error messages can be provided to the user.
  • Research the feasibility of converting file encoding on-the-fly during the upload or parsing process, ensuring no data loss occurs.
  • Investigate if converting to UTF-8 or UTF-8 with BOM could be a viable solution to prevent encoding-related errors.
  • Create a set of tests (or integration tests) that simulate different file encodings to ensure that the system can handle various encoding formats without errors.
  • Document findings and potential solutions
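
For the first task, a minimal sketch of what an upload-time check might look like, assuming the raw bytes of the selected file are available in the browser. `inspectEncoding` is a hypothetical name and is not taken from FileUpload.jsx; it relies only on the standard `TextDecoder` API.

```javascript
// Hypothetical helper; not taken from the TDP codebase.
// Reports whether the bytes are valid UTF-8 and whether they start with a BOM.
function inspectEncoding(bytes) {
  // UTF-8 byte order mark: EF BB BF.
  const hasUtf8Bom =
    bytes.length >= 3 &&
    bytes[0] === 0xef &&
    bytes[1] === 0xbb &&
    bytes[2] === 0xbf;

  let isValidUtf8 = true;
  try {
    // With fatal: true, decode() throws a TypeError on malformed UTF-8
    // instead of silently substituting U+FFFD.
    new TextDecoder("utf-8", { fatal: true }).decode(bytes);
  } catch (err) {
    isValidUtf8 = false;
  }
  return { isValidUtf8, hasUtf8Bom };
}

// In an upload handler the bytes could come from the File object, e.g.
// new Uint8Array(await file.arrayBuffer()).
```

This check has a blind spot: bytes that happen to be valid UTF-8 are not guaranteed to have been written as UTF-8. A Latin-1 file containing only ASCII characters passes, for example, so the check catches malformed input rather than proving the encoding.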

Acceptance Criteria

  • Research findings on blocking file uploads with unsupported encodings (e.g., using nginx or other tools).
  • Evaluation of on-the-fly encoding conversion, ensuring no data loss and that files are processed correctly.
  • Test cases for different file encodings and UTF-8 with BOM, verifying that files are handled appropriately.
  • Documentation summarizing research findings, proposed solutions, and recommendations.
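
For the test-case criterion, one possible way to produce fixtures is to encode the same text several ways in the test itself rather than checking binary files into the repo. The helper below is an illustrative sketch; `encodeFixture` and its encoding labels are hypothetical names chosen for this example.

```javascript
// Hypothetical fixture builder for encoding tests; names are illustrative.
function encodeFixture(text, encoding) {
  const utf8 = new TextEncoder().encode(text);
  switch (encoding) {
    case "utf-8":
      return utf8;
    case "utf-8-bom": {
      // Prepend the UTF-8 byte order mark (EF BB BF).
      const out = new Uint8Array(utf8.length + 3);
      out.set([0xef, 0xbb, 0xbf], 0);
      out.set(utf8, 3);
      return out;
    }
    case "utf-16le": {
      // BOM (FF FE), then each UTF-16 code unit as two little-endian bytes.
      const out = new Uint8Array(2 + text.length * 2);
      out.set([0xff, 0xfe], 0);
      for (let i = 0; i < text.length; i++) {
        const unit = text.charCodeAt(i);
        out[2 + i * 2] = unit & 0xff;
        out[3 + i * 2] = unit >> 8;
      }
      return out;
    }
    case "latin-1": {
      // One byte per character; anything above U+00FF cannot be represented.
      const out = new Uint8Array(text.length);
      for (let i = 0; i < text.length; i++) {
        const unit = text.charCodeAt(i);
        if (unit > 0xff) {
          throw new RangeError("Character not representable in Latin-1");
        }
        out[i] = unit;
      }
      return out;
    }
    default:
      throw new Error("Unsupported fixture encoding: " + encoding);
  }
}
```

Each fixture can then be passed through whatever upload or parsing path is under test, with the expected outcome (accepted, converted, or rejected with an encoding-specific message) asserted per encoding.
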
@jtimpe jtimpe added the Refined Ticket has been refined at the backlog refinement label Feb 27, 2024
@robgendron robgendron added Old and removed Old labels Apr 16, 2024
@ADPennington
Collaborator Author

ADPennington commented May 15, 2024

This issue came up again for a tribal file submitted this week. An audit of rejected files needs to be conducted. Suggested criteria for the audit:

  • files submitted for FY24Q1 (last quarter)
  • file status == Rejected
  • error message includes Header length is 24...
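
The three criteria amount to a simple filter. The sketch below assumes a hypothetical record shape (`quarter`, `status`, `errorMessages`); the real submission data model is not described in this issue, and the actual audit would more likely be a query against the backend.

```javascript
// Hypothetical record shape: { quarter, status, errorMessages: string[] }.
// Field names are assumptions for illustration only.
function selectFilesForAudit(files) {
  return files.filter(
    (f) =>
      f.quarter === "FY24Q1" &&
      f.status === "Rejected" &&
      // Substring match on the start of the error text quoted above.
      f.errorMessages.some((msg) => msg.includes("Header length is 24"))
  );
}
```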

cc: @lfrohlich @ysong001 @ttran-hub

@andrew-jameson andrew-jameson added the triage Needs to be triaged label Oct 10, 2024
@lhuxraft
Collaborator

We may want to explore accepting UTF-8 with BOM, which could potentially bypass the issue.

@lhuxraft lhuxraft removed the triage Needs to be triaged label Nov 15, 2024
@lhuxraft lhuxraft removed the Refined Ticket has been refined at the backlog refinement label Nov 25, 2024
@lhuxraft lhuxraft changed the title from "Spike - catch file encoding error?" to "Investigate solutions for handling encoding errors during file upload and parsing" Dec 11, 2024
@lhuxraft lhuxraft added Refined Ticket has been refined at the backlog refinement P2 Needed – Can wait indefinitely and removed P2 Needed – Can wait indefinitely labels Dec 11, 2024
@elipe17

elipe17 commented Jan 1, 2025

@ADPennington I see two paths this could take.

First, if we detect that the file is not plain old UTF-8 we provide the user an error in the same way we let them know about file extension errors and force them to correct it. We could also provide a KC link on how to change file encoding.

Second, if we detect that the file is not UTF-8 we could give the user a warning that the file is not encoded with UTF-8 and we will be encoding it as UTF-8 before submitting. If we go this route, it would make sense to install another frontend dependency like jschardet. This would help us also cover the case where the file is not encoded as UTF-8 and could not be safely converted to UTF-8 without data loss or corruption. Thus, we would inform the user they have some work to do to fix their file.
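
A rough sketch of the conversion step in this second option, assuming the source encoding has already been guessed by a detector such as jschardet (detection itself is not shown). `convertToUtf8` is a hypothetical name; the code uses only the standard `TextDecoder` and `TextEncoder` APIs.

```javascript
// Hypothetical helper. sourceEncoding is assumed to come from a detector
// such as jschardet; it must be a label that TextDecoder recognizes.
function convertToUtf8(bytes, sourceEncoding) {
  let text;
  try {
    // fatal: true makes decode() throw on malformed input rather than
    // inserting U+FFFD. By default a leading BOM is dropped while decoding.
    text = new TextDecoder(sourceEncoding, { fatal: true }).decode(bytes);
  } catch (err) {
    // RangeError: unsupported encoding label; TypeError: malformed input.
    return { ok: false, reason: err.message };
  }
  // TextEncoder always produces UTF-8 and does not add a BOM.
  return { ok: true, bytes: new TextEncoder().encode(text) };
}
```

A failed result would map to the "you have some work to do" message. One limitation: single-byte encodings such as windows-1252 accept every byte sequence, so a wrong guess there does not throw and would need a separate confidence check from the detector.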

Both of these options can be done strictly in the frontend and would not require the backend. Let me know what you think. cc. @reitermb

@elipe17 elipe17 self-assigned this Jan 1, 2025
@ADPennington
Collaborator Author


If feasible, I like the 2nd option. In most cases I've come across, the data submitter does not know anything about encoding or how to fix it. So warning about this, fixing it if possible, and informing when it can't be fixed would be great.
