xlsx with new Lines (CR+LF) : Parsing error #392

Closed
olivbrau opened this issue Feb 8, 2024 · 5 comments

Comments

@olivbrau

olivbrau commented Feb 8, 2024

Hi, I'm trying to open an xlsx file.
I get this error: Current state not START_ELEMENT, END_ELEMENT or ENTITY_REFERENCE
The error comes from OPCPackage.extractFormat():
In the while loop, the instruction if ("numFmt".equals(reader.getLocalName())) {
fails because the reader's current event type is CHARACTERS, and getLocalName() throws an exception in that state.
This happens because my xlsx file has new lines (CR+LF) in styles.xml. In the loop that runs after the <cellXfs> tag has been found, insideCellXfs is true, but getLocalName() can't be called while the reader is on a CHARACTERS event (the new line).
I hope this is clear despite my poor English.
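
To illustrate the failure mode described above, here is a minimal StAX sketch that only calls getLocalName() on START_ELEMENT events, so whitespace between tags is skipped. It is not fastexcel's actual code: the class name WhitespaceTolerantStylesScan and the method numFmtCodes are hypothetical, and the loop is a simplified stand-in for what extractFormat() does.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of fastexcel.
class WhitespaceTolerantStylesScan {

    /** Collects the formatCode attribute of every numFmt element. */
    static List<String> numFmtCodes(String stylesXml) {
        List<String> codes = new ArrayList<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(stylesXml));
            try {
                while (reader.hasNext()) {
                    int event = reader.next();
                    // getLocalName() is only legal on START_ELEMENT, END_ELEMENT
                    // and ENTITY_REFERENCE. A CR+LF between two tags arrives as a
                    // CHARACTERS (or SPACE) event, so skip anything that is not
                    // a start tag before asking for the name.
                    if (event != XMLStreamConstants.START_ELEMENT) {
                        continue;
                    }
                    if ("numFmt".equals(reader.getLocalName())) {
                        codes.add(reader.getAttributeValue(null, "formatCode"));
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Malformed styles XML", e);
        }
        return codes;
    }

    public static void main(String[] args) {
        // A styles.xml fragment with CR+LF after each tag, as in the report.
        String xml = "<styleSheet>\r\n<numFmts count=\"1\">\r\n"
                + "<numFmt numFmtId=\"164\" formatCode=\"0.00\"/>\r\n</numFmts>\r\n"
                + "<cellXfs count=\"1\">\r\n<xf numFmtId=\"164\"/>\r\n</cellXfs>\r\n"
                + "</styleSheet>";
        System.out.println(numFmtCodes(xml));
    }
}
```

Checking the event type first, rather than catching the exception, keeps the loop independent of how the producing tool formats the XML.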

To reproduce the error, take a valid xlsx file and put CR+LF after each tag in styles.xml.
Excel doesn't put CRLF in the XML it creates, but my xlsx file comes from another tool that adds these CRLF. I think the xlsx format doesn't forbid this, so fastexcel should handle this case.

Another side effect of adding CRLF after each tag shows up when fastexcel reads sharedStrings.xml:
instead of keeping only the text in the <t> tags, it retrieves all the text between the <si> tags, including all the CRLF. As a consequence, the strings returned are wrong (but this doesn't crash the read, unlike the styles.xml problem explained above).

@meiMingle
Collaborator

I cannot reproduce this problem locally. Can you upload a copy of the xlsx file that caused the error?

@olivbrau
Author

olivbrau commented Apr 9, 2024

ExcelFileWithCRLF.xlsx
Sure.
Can you try with this one?
(Don't open and save the file with Excel, because Excel automatically deletes all the CRLF.)

meiMingle added a commit to meiMingle/fastexcel that referenced this issue Apr 9, 2024
ochedru added a commit that referenced this issue Apr 10, 2024
@olivbrau
Author

olivbrau commented Apr 16, 2024

Hello,
Thanks a lot for the fix.
However, I'm wondering whether it also fixes the second bug I mentioned: in sharedStrings.xml, if there is a CRLF after the <si> tag, the shared string that is read is wrong (no exception is thrown, however). The string contains CRLF because readUpTo() reads every character between <si> and </si>, including the CRLF before <t> and after </t>.

For example, if the string is written like this (as in the Excel file I uploaded):
<si>CRLF <t>MyString</t>CRLF </si>
then readUpTo() should read only what is between <t> and </t>.
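
The expected behaviour can be sketched with plain StAX: collect text only from <t> elements inside each <si>, and ignore any characters that sit directly between the other tags. This is an illustration under assumptions, not fastexcel's readUpTo() or the actual fix: the class SharedStringsSketch and the method readSharedStrings are hypothetical names, and phonetic runs (<rPh>) are not treated specially.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of fastexcel.
class SharedStringsSketch {

    /** Returns one string per si element, built only from its t children. */
    static List<String> readSharedStrings(String sharedStringsXml) {
        List<String> result = new ArrayList<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(sharedStringsXml));
            try {
                StringBuilder current = null; // non-null while inside an <si>
                while (reader.hasNext()) {
                    int event = reader.next();
                    if (event == XMLStreamConstants.START_ELEMENT) {
                        String name = reader.getLocalName();
                        if ("si".equals(name)) {
                            current = new StringBuilder();
                        } else if ("t".equals(name) && current != null) {
                            // getElementText() returns the text of this <t> only
                            // and leaves the reader on its closing tag.
                            // Simplification: <t> inside <rPh> is appended too.
                            current.append(reader.getElementText());
                        }
                    } else if (event == XMLStreamConstants.END_ELEMENT
                            && current != null
                            && "si".equals(reader.getLocalName())) {
                        result.add(current.toString());
                        current = null;
                    }
                    // CHARACTERS events outside <t> (e.g. CR+LF) are ignored.
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Malformed shared strings XML", e);
        }
        return result;
    }

    public static void main(String[] args) {
        String xml = "<sst>\r\n<si>\r\n <t>MyString</t>\r\n </si>\r\n</sst>";
        System.out.println(readSharedStrings(xml));
    }
}
```

Rich-text entries, where one <si> holds several <r><t>…</t></r> runs, are concatenated by the same loop.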

@meiMingle
Collaborator

I may not have noticed the second issue you mentioned; I will take the time to look into it. But probably not soon, because I currently have some things to do outside of the coding world.

meiMingle added a commit to meiMingle/fastexcel that referenced this issue Apr 24, 2024
meiMingle added a commit to meiMingle/fastexcel that referenced this issue Apr 24, 2024
@meiMingle
Collaborator

This should be fully fixed now that #419 has been merged, don't you think? @olivbrau
