Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Load file xlsx is very slow #3917

Open
nguyenthao1988 opened this issue Feb 22, 2024 · 31 comments
Open

Load file xlsx is very slow #3917

nguyenthao1988 opened this issue Feb 22, 2024 · 31 comments

Comments

@nguyenthao1988
Copy link

nguyenthao1988 commented Feb 22, 2024

`<?php

require 'vendor/autoload.php';

use PhpOffice\PhpSpreadsheet\IOFactory;
$filename = 'data/danhsach.xlsx';
//$filename = 'data/nhanvien.xlsx';
$spreadsheet = IOFactory::load($filename);`

I use this code load file "danhsach.xlsx" with 16k record, file size excel: 7MB.
+> this code load finish taken time: 32s

how to optimize this, i load full column and row, not limit.

I used:
PHP: 8.2.4
PhpSpreadsheet: 2.0

Thanks bro,

@tvillafane
Copy link

I have an XLSX that is 1.4MB and my code just hangs when I use the same method, doesn't even finish.

@nguyenthao1988
Copy link
Author

@tvillafane please use https://github.com/aVadim483/fast-excel-reader is better faster

@henno
Copy link

henno commented Apr 18, 2024

Same problem.

@nineday90
Copy link

相同问题,2.2mb的文件,执行IOFactory::load()竟然需要四五十秒,有优化的方案吗?

@hetao29
Copy link

hetao29 commented Aug 28, 2024

specialize $reader->setReadDataOnly(true) if you only read the file.

$reader = \PhpOffice\PhpSpreadsheet\IOFactory::createReader(\PhpOffice\PhpSpreadsheet\IOFactory::identify($filename));
$reader->setReadDataOnly(true);
$spreadsheet = $reader->load($filename);

@konovalovk
Copy link

konovalovk commented Aug 30, 2024

My issue appeared for >2.1.0 versions.
$t = microtime(true); $this->workbook = PhpSpreadsheet\IOFactory::load($path); echo microtime(true) - $t;

2.1.0 – 0.29 sec
2.2.2 – 8.28 sec

I started to delete sheets from the workbook and left for you to judge the only one. The opening in 2.2.2 takes 2.71 sec, in 2.1.0 takes 0.08 sec.
ru.xlsx

@DanielRuf
Copy link

According to git bisect, the following commit might have introduced this issue:

image

e404389

@oleibman could this be the cause?

@oleibman
Copy link
Collaborator

@DanielRuf Thank you for the information. Yes, it appears that might indeed be the cause. I'm flabbergasted. Our unit test suite reads tons of Xlsx files, and there didn't seem to be any significant change in run-time after that PR. I will see if I can improve it.

@sharat-rhinofish
Copy link

Thanks for looking into this. My composer which upgraded to this version also indicated that version 2.2.2 seems to have caused a problem.

@DanielRuf
Copy link

DanielRuf commented Aug 30, 2024

@henno @tvillafane @nguyenthao1988 can you provide test files regarding your exact issue? Then I can check that.

The commit identified by git bisect is only relevant for the issue by @konovalovk.

My test setup:

test.php

<?php

namespace PhpOffice;

require 'vendor/autoload.php';

$filename = 'ru.xlsx';
$spreadsheet = \PhpOffice\PhpSpreadsheet\IOFactory::load($filename);

test command: time php test.php

test steps:

git checkout master
git bisect start
git bisect bad
git bisect good 2.1.0
time php test.php
#  when result of "time php test.php" is high, run "git bisect bad"
git bisect bad
#  when result of "time php test.php" is low, run "git bisect good"
git bisect good
# test again after each "git bisect good" / "git bisect bad"
time php test.php
# repeat "git bisect bad" / "git bisect good" + "time php test.php" until you see "... is the first bad commit"

@oleibman
Copy link
Collaborator

Please test against PR #4153 if possible. That should help a lot for the issue reported by bisect.

@oleibman oleibman mentioned this issue Aug 30, 2024
11 tasks
@DanielRuf
Copy link

Performance-wise it looks good after applying the changes from the linked PR:

wget https://github.com/PHPOffice/PhpSpreadsheet/pull/4153.patch
git apply -v --index 4153.patch
# Checking patch src/PhpSpreadsheet/Worksheet/Worksheet.php...
# Hunk #1 succeeded at 3730 (offset 6 lines).
# Checking patch src/PhpSpreadsheet/Cell/Cell.php...
# Checking patch tests/PhpSpreadsheetTests/Worksheet/ApplyStylesTest.php...
# Checking patch tests/PhpSpreadsheetTests/Worksheet/ApplyStylesTest.php...
# Applied patch src/PhpSpreadsheet/Worksheet/Worksheet.php cleanly.
# Applied patch src/PhpSpreadsheet/Cell/Cell.php cleanly.
# Applied patch tests/PhpSpreadsheetTests/Worksheet/ApplyStylesTest.php cleanly.
# Applied patch tests/PhpSpreadsheetTests/Worksheet/ApplyStylesTest.php cleanly.
time php8.2 test.php
# real    0m0.303s
# user    0m0.126s
# sys     0m0.055s

@oleibman
Copy link
Collaborator

@DanielRuf Thank you for confirming. I will wait a few days before merging the PR in case somebody else supplies a spreadsheet that needs to be checked.

@DanielRuf
Copy link

DanielRuf commented Aug 30, 2024

@oleibman I probably have another spreadsheet which results in very long execution times, if no one else provides one. But it would be better if the other participants of this issue can provide a corresponding file.

@sharat-rhinofish
Copy link

@oleibman , here is the file I used which was causing long delays.

Altair_Design_Inventory_Update_08.30.24_.xlsx

@DanielRuf
Copy link

@sharat-rhinofish on 2.1.0 I get these times with the timecommand:

real    0m5.463s
user    0m5.211s
sys     0m0.192s

On 2.2.0 and master I had to cancel:

real    0m48.314s
user    0m47.656s
sys     0m0.209s

Wit the patch from the PR I get these times (which is better but not yet ideal):

real    0m13.610s
user    0m12.470s
sys     0m0.349s

On 1.29.0 I get these times:

real    0m3.323s
user    0m3.091s
sys     0m0.184s

So there seem to be 2 changes affecting the performance: 1.29.0 => 2.1.0 an 2.1.0 => 2.2.0 and the latter is only partly improved by the PR.

The 1.29.0 => 2.1.0 performance change seems to come from this:
Bildschirmfoto 2024-08-31 um 10 07 34

Commit: 49a04bd

And the other performance change came from #3917 (comment), which is a bit improved by the PR but still has ~3 times the execution time compared to previous versions.

@oleibman
Copy link
Collaborator

@sharat-rhinofish Something interesting happens when I open your file in Excel.
Screenshot 2024-08-31 010616
When I let Excel correct it:
Screenshot 2024-08-31 010650
And I get much better performance when I process the results of Excel's correction.
issue.3917b2.xlsx
Which doesn't, of course, mean that I won't take a look at what @DanielRuf has reported on your original file, but I will do that over the weekend (it's very late here).

@oleibman
Copy link
Collaborator

@DanielRuf I don't really see any potential performance problem in the commit you've newly identified. Not having a good sense of how you're arriving at your results, can you tell if there might have been a difference due to PR #3497?

@sharat-rhinofish
Copy link

@DanielRuf @oleibman , thanks for sharing the findings. Interesting.

FYI, this excel file comes from our vendors who are sending back order information. It is likely that they have unoptimized excel data ravaged by use over a period of time. Ideally, I want to be able to use it directly without having to necessarily optimize it.

If we can get to 2.1.0 timings, I would be happy. With PR, the times are still 2.5 times higher than version 2.1.0, which is a little bit of a concern.

@oleibman
Copy link
Collaborator

It seems than any performance problems associated with PR #3497 should have been fixed by PR #3810.

@DanielRuf
Copy link

With setReadDataOnly from #3917 (comment) and on master without the patch I get these times:

real    0m25.964s
user    0m25.418s
sys     0m0.282s

Without setReadDataOnly on master and without the patch I get these times:

real    0m57.842s
user    0m57.135s
sys     0m0.201s

So the following factors seem to contribute to the performance changes:

We did not yet check our affected excel file regarding #3917 (comment).

@oleibman in our case we moved from 1.26 to 1.27 due to Structured References in some excel file, which threw some error during processing. With that we also had much slower processing, but this is probably a different issue.

1.29.0 times for Altair_Design_Inventory_Update_08.30.24_.xlsx:

real    0m3.234s
user    0m3.025s
sys     0m0.189s

1.27.0 times for Altair_Design_Inventory_Update_08.30.24_.xlsx:

real    0m3.274s
user    0m3.055s
sys     0m0.188s

1.26.0 times for Altair_Design_Inventory_Update_08.30.24_.xlsx:

real    0m3.180s
user    0m2.949s
sys     0m0.188s

So to me it seems PR #3497 had no effect on Altair_Design_Inventory_Update_08.30.24_.xlsx. I'll try to skip this commit via git bisect to see if any other change might lead to this.

@oleibman
Copy link
Collaborator

FWIW, I see no significant time difference between 2.1.0 and the new PR. I do see a big difference between 2.2.0 and 2.1.0, and between 1.29.0 and 2.1.0.

@oleibman
Copy link
Collaborator

@DanielRuf I found a small optimization I could make. Can you please re-test against the new PR.

@DanielRuf
Copy link

I've skipped one step during git bisect at 06c4aff. With 06c4aff I get 3.3s, with the commit after that (which is 49a04bd) I get 5.3s. The test was done with the original file from #3917 (comment).

@DanielRuf
Copy link

DanielRuf commented Aug 31, 2024

@DanielRuf I found a small optimization I could make. Can you please re-test against the new PR.

Without the latest PR changes:

time php8.2 test.php
# real    0m57.377s
# user    0m56.828s
# sys     0m0.201s

# with \PhpOffice\PhpSpreadsheet\Reader\IReader::IGNORE_EMPTY_CELLS
time php8.2 test.php
# real    0m2.106s
# user    0m1.942s
# sys     0m0.144s

# with \PhpOffice\PhpSpreadsheet\Reader\IReader::READ_DATA_ONLY
time php8.2 test.php
# real    0m25.482s
# user    0m24.983s
# sys     0m0.260s

# with \PhpOffice\PhpSpreadsheet\Reader\IReader::IGNORE_EMPTY_CELLS | \PhpOffice\PhpSpreadsheet\Reader\IReader::READ_DATA_ONLY
time php8.2 test.php
# real    0m1.621s
# user    0m1.404s
# sys     0m0.160s

With the latest PR changes:

wget https://github.com/PHPOffice/PhpSpreadsheet/pull/4153.patch
git apply -v --index 4153.patch
# Checking patch src/PhpSpreadsheet/Worksheet/Worksheet.php...
# Hunk #1 succeeded at 3730 (offset 6 lines).
# Checking patch src/PhpSpreadsheet/Cell/Cell.php...
# Checking patch tests/PhpSpreadsheetTests/Worksheet/ApplyStylesTest.php...
# Checking patch tests/PhpSpreadsheetTests/Worksheet/ApplyStylesTest.php...
# Checking patch src/PhpSpreadsheet/Cell/Cell.php...
# Checking patch src/PhpSpreadsheet/Reader/Xlsx.php...
# Hunk #1 succeeded at 950 (offset 3 lines).
# Applied patch src/PhpSpreadsheet/Worksheet/Worksheet.php cleanly.
# Applied patch src/PhpSpreadsheet/Cell/Cell.php cleanly.
# Applied patch tests/PhpSpreadsheetTests/Worksheet/ApplyStylesTest.php cleanly.
# Applied patch tests/PhpSpreadsheetTests/Worksheet/ApplyStylesTest.php cleanly.
# Applied patch src/PhpSpreadsheet/Cell/Cell.php cleanly.
# Applied patch src/PhpSpreadsheet/Reader/Xlsx.php cleanly.
time php8.2 test.php
# real    0m9.529s
# user    0m9.264s
# sys     0m0.184s

# with \PhpOffice\PhpSpreadsheet\Reader\IReader::IGNORE_EMPTY_CELLS
time php8.2 test.php
# real    0m1.306s
# user    0m1.137s
# sys     0m0.168s

# with \PhpOffice\PhpSpreadsheet\Reader\IReader::READ_DATA_ONLY
time php8.2 test.php
# real    0m8.441s
# user    0m8.124s
# sys     0m0.216s

# with \PhpOffice\PhpSpreadsheet\Reader\IReader::IGNORE_EMPTY_CELLS | \PhpOffice\PhpSpreadsheet\Reader\IReader::READ_DATA_ONLY
time php8.2 test.php
# real    0m1.233s
# user    0m1.105s
# sys     0m0.125s

If anyone needs the patched version (master + #4153): src-patched-4153-202408311248.zip

@oleibman
Copy link
Collaborator

@sharat-rhinofish Your Excel file has tons of empty cells defined in rows 56 through 76. This is probably what Excel was complaining about above. You can boost your performance by a huge amount through the use of IGNORE_EMPTY_CELLS:

$spreadsheet = IOFactory::load($path, IReader::IGNORE_EMPTY_CELLS);
// or
$reader = new Reader\Xlsx();
$reader->setReadEmptyCells(false);
$spreadsheet = $reader->load($path);

And now I really am going to bed.

@sharat-rhinofish
Copy link

@oleibman , Thanks. This helps and does speed things up. Thanks for the useful tip. I have added this to and am back on version 2.2.2

@oleibman
Copy link
Collaborator

Based on what Excel did with that spreadsheet, it could make sense for PhpSpreadsheet Xlsx Reader to ignore all cells with only attribute r and no children, regardless of the "ignore empty cells" option. As that might be a breaking change, I am not prepared to do that with the current PR. But I will consider it for future.

@oleibman
Copy link
Collaborator

oleibman commented Sep 1, 2024

Oh well, no need to consider it for future after all. That change made it through our test suite with only one failure, but that failure unfortunately would also apply to the "bad" spreadsheet. (Row specifies a style, cell does not - cell should therefore get default style rather than row style; ironically, the row style is not applied to any cell in the row in the spreadsheet, but determining that will give back all our performance gains.) We will just have to rely on user specifying "ignore empty" when appropriate.

@DanielRuf
Copy link

@oleibman will this be available with 2.2.3 and is there some ETA?

Also we need a new release of PhpPresentation: PHPOffice/PHPPresentation#812

So maybe there is some coordination needed between both projects (not sure about that).

@oleibman
Copy link
Collaborator

I hope to release 2.3.0 (without dynamic tables) and 3.0.0 (with) within the next few weeks. This change will be part of both.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

9 participants