Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Filter Not working? #251

Closed
1 of 3 tasks
tresero opened this issue Oct 14, 2017 · 6 comments
Closed
1 of 3 tasks

Filter Not working? #251

tresero opened this issue Oct 14, 2017 · 6 comments
Labels

Comments

@tresero
Copy link

tresero commented Oct 14, 2017

This is:

What is the expected behavior?

Using the code in the docs, this should only display rows 3 - 5.

What is the current behavior?

This outputs all rows up to 5.

What are the steps to reproduce?

Please provide a Minimal, Complete, and Verifiable example of code that exhibits the issue without relying on an external Excel file or a web server:

<?php

require __DIR__ . '/vendor/autoload.php';

// Create new Spreadsheet object
$spreadsheet = new \PhpOffice\PhpSpreadsheet\Spreadsheet();

class MyReadFilter implements \PhpOffice\PhpSpreadsheet\Reader\IReadFilter
{
    private $startRow = 0;
    private $endRow   = 0;
    private $columns  = array();

    /**  Get the list of rows and columns to read  */
    public function __construct($startRow, $endRow, $columns) {
        $this->startRow = $startRow;
        $this->endRow   = $endRow;
        $this->columns  = $columns;
    }

    public function readCell($column, $row, $worksheetName = '') {
        //  Only read the rows and columns that were configured
        if ($row >= $this->startRow && $row <= $this->endRow) {
            if (in_array($column,$this->columns)) {
                return true;
            }
        }
        return false;
    }
}

$filterSubset = new MyReadFilter(3,5,range('A','O'));
$reader->setReadFilter($filterSubset);

/**  Load $inputFileName to a Spreadsheet Object  **/
$spreadsheet = $reader->load($inputFileName);
// here is the loop right after.

echo '<table style="border: 1px solid black;">' . PHP_EOL;
foreach ($worksheet->getRowIterator() as $row) {
    echo '<tr style="border: 1px solid black;">' . PHP_EOL;
    $cellIterator = $row->getCellIterator();
    $cellIterator->setIterateOnlyExistingCells(FALSE); // This loops through all cells,
                                                       //    even if a cell value is not set.
                                                       // By default, only cells that have a value
                                                       //    set will be iterated.
    foreach ($cellIterator as $cell) {
        echo '<td style="border: 1px solid black;">' .
             $cell->getValue() .
             '</td>' . PHP_EOL;
    }
    echo '</tr>' . PHP_EOL;
}
echo '</table>' . PHP_EOL;

Output displays 1st 5 rows. The first 2 are garbage and that is what I'm trying to get rid of.

Which versions of PhpSpreadsheet and PHP are affected?

1.0.0 beta

@PowerKiKi
Copy link
Member

Your MCVE did not run at all. But once adapted like so:

<?php

require __DIR__ . '/vendor/autoload.php';

use PhpOffice\PhpSpreadsheet\Spreadsheet;
use PhpOffice\PhpSpreadsheet\Writer\Xlsx;

class MyReadFilter implements \PhpOffice\PhpSpreadsheet\Reader\IReadFilter
{
    private $startRow = 0;
    private $endRow = 0;
    private $columns = [];

    /**  Get the list of rows and columns to read  */
    public function __construct($startRow, $endRow, $columns)
    {
        $this->startRow = $startRow;
        $this->endRow = $endRow;
        $this->columns = $columns;
    }

    public function readCell($column, $row, $worksheetName = '')
    {
        //  Only read the rows and columns that were configured
        if ($row >= $this->startRow && $row <= $this->endRow) {
            if (in_array($column, $this->columns)) {
                return true;
            }
        }

        return false;
    }
}

// Create new Spreadsheet on disk
$spreadsheet = new Spreadsheet();
foreach (range('A', 'C') as $column) {
    foreach (range(1, 3) as $row) {
        $coord = $column . $row;
        $spreadsheet->getActiveSheet()->setCellValue($coord, $coord);
    }
}
$originalFilename = '/tmp/original.xlsx';
$writer = new Xlsx($spreadsheet);
$writer->save($originalFilename);

// Reload spreadsheet from disk with filter
$reader = new \PhpOffice\PhpSpreadsheet\Reader\Xlsx();
$filterSubset = new MyReadFilter(2, 2, range('A', 'B'));
$reader->setReadFilter($filterSubset);
$spreadsheet = $reader->load($originalFilename);
foreach ($spreadsheet->getActiveSheet()->getRowIterator() as $row) {
    $cellIterator = $row->getCellIterator();
    $cellIterator->setIterateOnlyExistingCells(false);

    foreach ($cellIterator as $cell) {
        var_dump([$cell->getCoordinate(), $cell->getValue()]);
    }
}

It will then correctly ouput the following, effectively ignoring anything higher than row 2 and column B, and showing NULL values for rows lower than 2:

array(2) {
  [0]=>
  string(2) "A1"
  [1]=>
  NULL
}
array(2) {
  [0]=>
  string(2) "B1"
  [1]=>
  NULL
}
array(2) {
  [0]=>
  string(2) "A2"
  [1]=>
  string(2) "A2"
}
array(2) {
  [0]=>
  string(2) "B2"
  [1]=>
  string(2) "B2"
}

So the filter is working as intended as far as I can tell.

@tresero
Copy link
Author

tresero commented Oct 16, 2017

Well, that's what was exactly in the docs. Besides that, it seems counter intuitive that ignoring the rows just makes them null.

Anyway, thanks for answering at least. I will have to just loop and not use the bad rows.

@Benjact
Copy link

Benjact commented May 24, 2021

The problem is that if you want to make chunks of a large excel for the memory issue, and you are filling all the object inside for any row that you are filtering, at the end I have the issue again

In my case, I divided the excel in parts of 5000 rows, and each it's a complete new petition to the server. At the end of the first petition it consumes 208MB, at the end of the second (5000 nulls + 5000 useful rows) 412MB, at the end of the third 658MB... and each starts from 0.

I think I should have a way to ask only the rows of the filter.
Probably you did it in this way to maintain the exact number of each row, but I don't care about it... I don't want to have so ammount of memory consumed

@MarkBaker
Copy link
Member

MarkBaker commented May 24, 2021

The problem is that if you want to make chunks of a large excel for the memory issue, and you are filling all the object inside for any row that you are filtering, at the end I have the issue again

I don't want to have so ammount of memory consumed

That memory shouldn't be consumed if you use a filter, cells outside the filter range are not loaded into memory, so they don't consume memory, unless you then try to access any of those cells.... they don't contain nulls, they do not exist in memory at all.

Note that using toArray() will attempt to access those cells, and create them and assign the with a null value if they don't exist, increasing memory usage.... use rangeToArray() instead to avoid this.

What you're suggesting should happen is basically asking the loader to change the addresses of the cells that are actually loaded, so if rows 1-10 are filtered, then all cells in row 11 actually becomes cells in rows 1; i.e. cell A11 is physically re-addresses to become cell A1.... that can create a lot of additional loader overhead.
Consider that cells A11, B11 contain 1 and 2, and cell C11 contains a formula =$A11+$B11.... if cells A11, B11 and C11 are re-addressed as A1, B1, and C1 respectively, then either the formula =$A11+$B11 is now pointing to completely different values; or we have to modify the formula itself.... what difference would it make if the formula had been =$A$11+$B$11? What effect would re-addressing cells have on defined names? on Print Areas?

@Benjact
Copy link

Benjact commented May 24, 2021

First, thanks for the option of the rangeToArray(), I will check it.

I get your point regarding formulas, but if I'm using a Filter and there are formulas that include cells in the area outside the filter, the formula will give a wrong number anyway. So, it will be my problem to solve that (changing the Filter or whatever). What I mean is that you are giving a wonderful utility to minimize the memory use and at the same time using some of that space because the Cells should continue existing. In fact, it's interesting that if my filter is for the first 100 rows, the row 101 doesn't exist and you don't care if a formula in the cell A100 asks for the A101, but if I'm reading the second 100 rows, then the first group of 100 should exist because of any formula...

Sorry if I'm not expressing right

@MarkBaker
Copy link
Member

I get your point regarding formulas, but if I'm using a Filter and there are formulas that include cells in the area outside the filter, the formula will give a wrong number anyway

That would always be a problem; but in the example I cited for a formula in a cell inside the filtered range that references other cells that are also inside the filtered range, it would need to be updated otherwise I'd get 10x the number of complaints for wrong values; and with the differences between relative and absolute references, it might not be possible to update it; and any update would be a processing overhead, slowing the load time.

The general principle behind filtering was that if the filter said to load rows 101-200, then it would load only those rows, and they would still be loaded as rows addressable as 101-200... that seemed intuitive at the time, and I still believe that is how it should work.
It can more confusing if the filter said to load rows 1-100, 201-300, 401-500, etc; or only odd numbered rows - which are entirely possible and valid options when filtering.

My biggest mistake was that toArray() doesn't take the filter into account. I separated the logic of the filter completely from the logic of the main Spreadsheet class... perhaps the filter should have been stored in the loaded spreadsheet as well, so that toArray() could return the same filtered set.
However, even that becomes problematic with loading CSV files, where it would be possible to load multiple CSV files, each with different filters, into the same Worksheet object (which is possible). e.g. load entirety of first CSV, which includes header rows and unlimited lines; load filtered lines from CSV file 2, filtering out the heading row because that was already loaded in CSV file #1, and append the rows to the current worksheet, so row 2 from that csv might become row 101 in the worksheet if csv #1 contained 100 ows.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Development

No branches or pull requests

4 participants