Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Failed to load Worksheets on some .xlsx-files with namespaces #1482

Closed
lanort opened this issue May 21, 2020 · 6 comments · Fixed by #3137
Closed

Failed to load Worksheets on some .xlsx-files with namespaces #1482

lanort opened this issue May 21, 2020 · 6 comments · Fixed by #3137

Comments

@lanort
Copy link

lanort commented May 21, 2020

This is:

- [X] a bug report
- [ ] a feature request
- [X] **not** a usage question (ask them on https://stackoverflow.com/questions/tagged/phpspreadsheet or https://gitter.im/PHPOffice/PhpSpreadsheet)

What is the expected behavior?

List all the worksheets / Load them

What is the current behavior?

Worksheets are empty (printing the following in the below example:

Array
(
)

What are the steps to reproduce?

<?php

require __DIR__ . '/vendor/autoload.php';

$reader = new \PhpOffice\PhpSpreadsheet\Reader\Xlsx();
print_r($reader->listWorksheetNames($filename));

This, however, relies on an .xlsx-file with namespaces. I know you asked for code without that - but that is not possible in this case.

I know this file was generated by a system - not Excel itself. It should however be valid.

I can't post the original file, but the content of xl\workbook.xml in the .xlsx-Archive is (originally without line breaks:

<?xml version="1.0" encoding="utf-8"?>
<x:workbook xmlns:x="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
    <x:sheets>
        <x:sheet name="Daten" sheetId="1" r:id="rId1" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" />
        <x:sheet name="Legende" sheetId="2" r:id="rId2" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" />
    </x:sheets>
</x:workbook> 

If I open the file in Excel and save it again it becomes (again originally without formatting)

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<workbook xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" mc:Ignorable="x15" xmlns:x15="http://schemas.microsoft.com/office/spreadsheetml/2010/11/main">
    <fileVersion appName="xl" lastEdited="6" lowestEdited="6" rupBuild="14420"/>
    <workbookPr defaultThemeVersion="153222"/>
    <mc:AlternateContent xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006">
        <mc:Choice Requires="x15">
            <x15ac:absPath url="---- path on my computer ----" xmlns:x15ac="http://schemas.microsoft.com/office/spreadsheetml/2010/11/ac"/>
        </mc:Choice>
    </mc:AlternateContent>
    <bookViews>
        <workbookView xWindow="0" yWindow="0" windowWidth="28800" windowHeight="12435"/>
    </bookViews>
    <sheets>
        <sheet name="Daten" sheetId="1" r:id="rId1"/>
        <sheet name="Legende" sheetId="2" r:id="rId2"/>
    </sheets>
    <calcPr calcId="0"/>
</workbook>

This again is readable by PhpSpreadsheet. However, it does not make sense to do this a 1000 times ;)

What I found is, that in the first namespaced example the following code produces a false and so the worksheets are empty: if ($xmlWorkbook->sheets) { in vendor\phpoffice\phpspreadsheet\src\PhpSpreadsheet\Reader\Xlsx.php on line 600). The corresponding loading procedure can be found in line 574.

I think one needs to just swap the simplexml_load_string with a namespace-aware method - but I don't know how.

Which versions of PhpSpreadsheet and PHP are affected?

I am using the current PhpSpreadsheet 1.12.0 and PHP 7.4.5.

@stale
Copy link

stale bot commented Jul 25, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
If this is still an issue for you, please try to help by debugging it further and sharing your results.
Thank you for your contributions.

@stale stale bot added the stale label Jul 25, 2020
@stale stale bot closed this as completed Aug 1, 2020
@JKoblitz
Copy link

Just in case smebody else is stumbling over this issue as I did: in my case it helped to change Line 740 of Reader/Xlsx.php to:

$xmlSheet = $this->loadZip("$dir/$fileWorksheet", $mainNS);

Note the change from loadZipNoNamespace to loadZip.

@oleibman
Copy link
Collaborator

Can you upload a file which PhpSpreadsheet handles incorrectly without your suggested change, but correctly with it?

@JKoblitz
Copy link

JKoblitz commented Oct 20, 2022

Sure. The appended file is exported from a thrid party software. The textual content itself is parsed correctly but you cannot access e.g. the image inside of this file (which I urgently needed). $xmlSheet->drawing is empty.

An example file: 2022-09-05 15-02-45+02-00-02_1.xlsx

Btw, this might be related to #274.

Edit: one additional remark: when you open the file with excel, save it (without changes) and try again, PhpSpreadsheet will recognize the image since Excel removed all the namespaces.

@oleibman
Copy link
Collaborator

Thank you for the sample. Problem confirmed. There are several problems with how the code handles this particular spreadsheet. Namespacing doesn't appear to be one of the problems, but the use of absolute paths in the zip file where relative paths are expected turns up in several places. I will work on it. In the meantime, can I ask you to open a new issue to which I can ultimately tie the solution, rather than using this old and closed issue.

@oleibman
Copy link
Collaborator

I have now identified some namespacing problems as well (which is why your workaround sort-of worked), affecting things like row height as well as reading the image on your spreadsheet.

oleibman added a commit to oleibman/PhpSpreadsheet that referenced this issue Oct 23, 2022
…tes, MergeCells

Fix PHPOffice#1482 (actually fix a problem recently attached to that ticket long after it closed). There were problems processing a spreadsheet generated by third party software. That spreadsheet used unexpected namespacing, and absolute paths within the zip file where relative paths were expected.

Xlsx Reader handles most, but not all, of its processing in a namespace-aware manner. Two versions of a worksheet's xml are available - `$xmlSheet` is not namespace aware and `$xmlSheetNS` is aware. This was necessary in order to add namespace support in an incremental manner. The primary reason to continue to use the unaware version is the absence of test cases. In particular, drawings, row and column attributes, and merge cells continue to use the unaware version; this PR changes those to use the aware version.

As noted in the summary above, a couple of new places in the handling of the those items were expecting file locations to be specified as relative paths in the zip file, but the file used absolute paths instead. Those unexpected usages are now addressed.

The user reporting the new problem tried a change which effectively made all uses of `$xmlSheet` namespace aware, and that seemed helpful. It may be time eliminate its usage altogether, whether or not we have appropriate examples of unexpected namespaces to test with. I will not do that with this change, but I may add a new PR to do so after this one is merged. Remaining areas which still use the unaware version include conditional formatting (internal or external), sheet view options, sheet protection, auto filters, unparsed loaded data, data validation (internal or external), alternate content, and header/footer images.

There is an interesting anomaly with the new test file. When I load it and save it, the appearance of the output file does not quite match the input. Oddly, the output file seems much better than the input - the picture no longer covers any data, for example. This is because, in particular, the output file row heights and column widths seem to match the xml, but the input file does not. For example, the xml in both files seems to indicate that row 5 should have a height of 234, which it does in the output file, but the height of that row when the input file is opened is 156. It appears that all row heights and column widths when the input file is opened are very close to 2/3 of what is expected. I will continue to research that anomaly for a few days, but I will not let it prevent me from moving forward with this PR if I don't find the explanation. Whatever that problem is, it seems distinct from the namespacing/pathing problems which the PR addresses.
oleibman added a commit that referenced this issue Nov 1, 2022
…ls (#3137)

* Xlsx Reader Namespace Aware Handling of Drawings, RowAndColumnAttributes, MergeCells

Fix #1482 (actually fix a problem recently attached to that ticket long after it closed). There were problems processing a spreadsheet generated by third party software. That spreadsheet used unexpected namespacing, and absolute paths within the zip file where relative paths were expected.

Xlsx Reader handles most, but not all, of its processing in a namespace-aware manner. Two versions of a worksheet's xml are available - `$xmlSheet` is not namespace aware and `$xmlSheetNS` is aware. This was necessary in order to add namespace support in an incremental manner. The primary reason to continue to use the unaware version is the absence of test cases. In particular, drawings, row and column attributes, and merge cells continue to use the unaware version; this PR changes those to use the aware version.

As noted in the summary above, a couple of new places in the handling of the those items were expecting file locations to be specified as relative paths in the zip file, but the file used absolute paths instead. Those unexpected usages are now addressed.

The user reporting the new problem tried a change which effectively made all uses of `$xmlSheet` namespace aware, and that seemed helpful. It may be time eliminate its usage altogether, whether or not we have appropriate examples of unexpected namespaces to test with. I will not do that with this change, but I may add a new PR to do so after this one is merged. Remaining areas which still use the unaware version include conditional formatting (internal or external), sheet view options, sheet protection, auto filters, unparsed loaded data, data validation (internal or external), alternate content, and header/footer images.

There is an interesting anomaly with the new test file. When I load it and save it, the appearance of the output file does not quite match the input. Oddly, the output file seems much better than the input - the picture no longer covers any data, for example. This is because, in particular, the output file row heights and column widths seem to match the xml, but the input file does not. For example, the xml in both files seems to indicate that row 5 should have a height of 234, which it does in the output file, but the height of that row when the input file is opened is 156. It appears that all row heights and column widths when the input file is opened are very close to 2/3 of what is expected. I will continue to research that anomaly for a few days, but I will not let it prevent me from moving forward with this PR if I don't find the explanation. Whatever that problem is, it seems distinct from the namespacing/pathing problems which the PR addresses.

* Scrutinizer New False Positives

Eliminate them with annotations.
@oleibman oleibman removed the stale label Jul 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging a pull request may close this issue.

3 participants