Html Reader Non-UTF8 Charsets #4019

Merged (6 commits) on May 10, 2024
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -46,6 +46,7 @@ and this project adheres to [Semantic Versioning](https://semver.org).
 - Default Value for Conditional::$text [PR #3946](https://github.com/PHPOffice/PhpSpreadsheet/pull/3946)
 - Table Filter Buttons [Issue #3988](https://github.com/PHPOffice/PhpSpreadsheet/issues/3988) [PR #3992](https://github.com/PHPOffice/PhpSpreadsheet/pull/3992)
 - Improvements to Xml Reader [Issue #3999](https://github.com/PHPOffice/PhpSpreadsheet/issues/3999) [Issue #4000](https://github.com/PHPOffice/PhpSpreadsheet/issues/4000) [Issue #4001](https://github.com/PHPOffice/PhpSpreadsheet/issues/4001) [Issue #4002](https://github.com/PHPOffice/PhpSpreadsheet/issues/4002) [PR #4003](https://github.com/PHPOffice/PhpSpreadsheet/pull/4003) [PR #4007](https://github.com/PHPOffice/PhpSpreadsheet/pull/4007)
+- Html Reader non-UTF8 [Issue #3995](https://github.com/PHPOffice/PhpSpreadsheet/issues/3995) [Issue #866](https://github.com/PHPOffice/PhpSpreadsheet/issues/866) [Issue #1681](https://github.com/PHPOffice/PhpSpreadsheet/issues/1681) [PR #4019](https://github.com/PHPOffice/PhpSpreadsheet/pull/4019)

 ## 2.0.0 - 2024-01-04

4 changes: 2 additions & 2 deletions docs/topics/reading-and-writing-to-file.md
@@ -298,7 +298,7 @@ versions of Microsoft Excel.
 **Excel 2003 XML limitations** Please note that Excel 2003 XML format
 has some limits regarding to styling cells and handling large
 spreadsheets via PHP.
-Also, only files using charset UTF-8 are supported.
+Also, only files using charset UTF-8 or ISO-8859-* are supported.

 ### \PhpOffice\PhpSpreadsheet\Reader\Xml

@@ -718,7 +718,7 @@ extension.

 **HTML limitations** Please note that HTML file format has some limits
 regarding to styling cells, number formatting, ...
-Also, only files using charset UTF-8 are supported.
+Declared charsets compatible with ASCII in range 00-7F, and UTF-8/16 with BOM are supported.

 ### \PhpOffice\PhpSpreadsheet\Reader\Html

35 changes: 23 additions & 12 deletions src/PhpSpreadsheet/Reader/Html.php
@@ -30,6 +30,10 @@ class Html extends BaseReader
      */
     const TEST_SAMPLE_SIZE = 2048;

+    private const STARTS_WITH_BOM = '/^(?:\xfe\xff|\xff\xfe|\xEF\xBB\xBF)/';
+
+    private const DECLARES_CHARSET = '/ charset=/i';
+
     /**
      * Input encoding.
      */
@@ -144,6 +148,9 @@ public function canRead(string $filename): bool
         }

         $beginning = $this->readBeginning();
+        if (preg_match(self::STARTS_WITH_BOM, $beginning)) {
+            return true;
+        }
         $startWithTag = self::startsWithTag($beginning);
         $containsTags = self::containsTags($beginning);
         $endsWithTag = self::endsWithTag($this->readEnding());
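
As a rough illustration of the BOM short-circuit added to `canRead()` in this hunk, the standalone sketch below applies the same pattern to a few byte strings. Only the regular expression is taken from the diff; the helper name `startsWithBom` and the sample inputs are hypothetical and not part of PhpSpreadsheet.

```php
<?php

declare(strict_types=1);

// Pattern copied from the diff: UTF-16 BE, UTF-16 LE, or UTF-8 byte order mark.
// There is no /u modifier, so PCRE compares raw bytes.
const STARTS_WITH_BOM = '/^(?:\xfe\xff|\xff\xfe|\xEF\xBB\xBF)/';

// Hypothetical helper, for illustration only.
function startsWithBom(string $bytes): bool
{
    return preg_match(STARTS_WITH_BOM, $bytes) === 1;
}

// Sample inputs are made up; each starts with a different BOM (or none).
$samples = [
    'UTF-8 BOM' => "\xEF\xBB\xBF<table></table>",
    'UTF-16 BE BOM' => "\xFE\xFF\x00<\x00t",
    'UTF-16 LE BOM' => "\xFF\xFE<\x00t\x00",
    'no BOM' => '<table></table>',
];

foreach ($samples as $label => $bytes) {
    echo $label, ': ', startsWithBom($bytes) ? 'BOM' : 'no BOM', PHP_EOL;
}
```

A file that starts with a BOM does not start with `<` at the byte level, which is presumably why the check runs before the tag heuristics.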
@@ -638,12 +645,7 @@ public function loadIntoExisting(string $filename, Spreadsheet $spreadsheet): Sp
         // Reload the HTML file into the DOM object
         try {
             $convert = $this->getSecurityScannerOrThrow()->scanFile($filename);
-            $lowend = "\u{80}";
-            $highend = "\u{10ffff}";
-            $regexp = "/[$lowend-$highend]/u";
-            /** @var callable $callback */
-            $callback = [self::class, 'replaceNonAscii'];
-            $convert = preg_replace_callback($regexp, $callback, $convert);
+            $convert = self::replaceNonAsciiIfNeeded($convert);
             $loaded = ($convert === null) ? false : $dom->loadHTML($convert);
         } catch (Throwable $e) {
             $loaded = false;
@@ -736,6 +738,20 @@ private static function replaceNonAscii(array $matches): string
         return '&#' . mb_ord($matches[0], 'UTF-8') . ';';
     }

+    private static function replaceNonAsciiIfNeeded(string $convert): ?string
+    {
+        if (preg_match(self::STARTS_WITH_BOM, $convert) !== 1 && preg_match(self::DECLARES_CHARSET, $convert) !== 1) {
+            $lowend = "\u{80}";
+            $highend = "\u{10ffff}";
+            $regexp = "/[$lowend-$highend]/u";
+            /** @var callable $callback */
+            $callback = [self::class, 'replaceNonAscii'];
+            $convert = preg_replace_callback($regexp, $callback, $convert);
+        }
+
+        return $convert;
+    }
+
     /**
      * Spreadsheet from content.
      */
@@ -747,12 +763,7 @@ public function loadFromString(string $content, ?Spreadsheet $spreadsheet = null
         // Reload the HTML file into the DOM object
         try {
             $convert = $this->getSecurityScannerOrThrow()->scan($content);
-            $lowend = "\u{80}";
-            $highend = "\u{10ffff}";
-            $regexp = "/[$lowend-$highend]/u";
-            /** @var callable $callback */
-            $callback = [self::class, 'replaceNonAscii'];
-            $convert = preg_replace_callback($regexp, $callback, $convert);
+            $convert = self::replaceNonAsciiIfNeeded($convert);
             $loaded = ($convert === null) ? false : $dom->loadHTML($convert);
         } catch (Throwable $e) {
             $loaded = false;
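
The behavior of `replaceNonAsciiIfNeeded()` in the Html.php changes can be approximated outside the class. The sketch below is a minimal standalone version under a few assumptions: the function name `encodeNonAsciiUnlessDeclared` is hypothetical, the two patterns are copied from the diff, and the character class is written as `\x{80}-\x{10ffff}`, which should be equivalent to the interpolated `"\u{80}"` and `"\u{10ffff}"` range in the PR.

```php
<?php

declare(strict_types=1);

// Hypothetical standalone approximation of the logic in the diff.
function encodeNonAsciiUnlessDeclared(string $html): ?string
{
    $startsWithBom = preg_match('/^(?:\xfe\xff|\xff\xfe|\xEF\xBB\xBF)/', $html) === 1;
    $declaresCharset = preg_match('/ charset=/i', $html) === 1;
    if ($startsWithBom || $declaresCharset) {
        // The document states its own encoding, so leave the bytes alone.
        return $html;
    }

    // Treat the input as UTF-8 and turn every non-ASCII character into a
    // numeric character reference. Returns null if the input is not valid UTF-8.
    return preg_replace_callback(
        '/[\x{80}-\x{10ffff}]/u',
        static fn (array $m): string => '&#' . mb_ord($m[0], 'UTF-8') . ';',
        $html
    );
}

$plain = "<td>\u{C0}1</td>";                          // UTF-8, no declaration
$declared = "<meta charset='UTF-8'><td>\u{C0}1</td>"; // UTF-8, declared
$latin1 = "<td>\xC01</td>";                           // ISO-8859-1 bytes, no declaration

echo encodeNonAsciiUnlessDeclared($plain), PHP_EOL;               // <td>&#192;1</td>
var_dump(encodeNonAsciiUnlessDeclared($declared) === $declared);  // bool(true)
var_dump(encodeNonAsciiUnlessDeclared($latin1));                  // NULL
```

The `null` result in the last case corresponds to the `$convert === null` branch in the diff, where loading is treated as failed.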
45 changes: 45 additions & 0 deletions tests/PhpSpreadsheetTests/Reader/Html/HtmlCharsetTest.php
@@ -0,0 +1,45 @@
<?php

declare(strict_types=1);

namespace PhpOffice\PhpSpreadsheetTests\Reader\Html;

use PhpOffice\PhpSpreadsheet\Reader\Exception as ReaderException;
use PhpOffice\PhpSpreadsheet\Reader\Html;
use PHPUnit\Framework\TestCase;

class HtmlCharsetTest extends TestCase
{
    /**
     * @dataProvider providerCharset
     */
    public function testCharset(string $filename, string $expectedResult): void
    {
        if ($expectedResult === 'exception') {
            $this->expectException(ReaderException::class);
            $this->expectExceptionMessage('Failed to load');
        }
        $directory = 'tests/data/Reader/HTML';
        $reader = new Html();
        $spreadsheet = $reader->load("$directory/$filename");
        $sheet = $spreadsheet->getActiveSheet();
        self::assertSame($expectedResult, $sheet->getCell('A1')->getValue());
        $spreadsheet->disconnectWorksheets();
    }

    public static function providerCharset(): array
    {
        return [
            ['charset.ISO-8859-1.html', 'À1'],
            ['charset.ISO-8859-1.html4.html', 'À1'],
            ['charset.ISO-8859-2.html', 'Ŕ1'],
            ['charset.nocharset.html', 'À1'],
            ['charset.UTF-8.html', 'À1'],
            ['charset.UTF-8.bom.html', 'À1'],
            ['charset.UTF-16.bebom.html', 'À1'],
            ['charset.UTF-16.lebom.html', 'À1'],
            ['charset.gb18030.html', '电视机'],
            ['charset.unknown.html', 'exception'],
        ];
    }
}
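
The expected values in `providerCharset()` are easier to follow when the fixtures are viewed as raw bytes. The sketch below assumes (inferred from the expected results, not stated in the PR) that the ISO-8859-x fixtures store the first cell as byte 0xC0 followed by `1`, and that the gb18030 fixture stores its cell as the six bytes shown. It relies only on `mb_convert_encoding` from the mbstring extension; the variable names are illustrative.

```php
<?php

declare(strict_types=1);

// Assumed raw bytes of cell A1 in the ISO-8859-x fixtures.
$cell = "\xC01";

// The same byte decodes to different characters depending on the declared charset.
$asLatin1 = mb_convert_encoding($cell, 'UTF-8', 'ISO-8859-1'); // U+00C0 followed by "1"
$asLatin2 = mb_convert_encoding($cell, 'UTF-8', 'ISO-8859-2'); // U+0154 followed by "1"

// Assumed raw bytes of the gb18030 fixture cell.
$gbBytes = "\xB5\xE7\xCA\xD3\xBB\xFA";
$asGb18030 = mb_convert_encoding($gbBytes, 'UTF-8', 'GB18030');
// Reading the same bytes as ISO-8859-1 gives the text shown in the diff view.
$misread = mb_convert_encoding($gbBytes, 'UTF-8', 'ISO-8859-1');

echo $asLatin1, PHP_EOL;
echo $asLatin2, PHP_EOL;
echo $asGb18030, PHP_EOL;
echo $misread, PHP_EOL;
```

Under these assumptions, the ISO-8859-2 fixture yields `Ŕ1` rather than `À1` even though both fixtures look identical in the diff view.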
25 changes: 25 additions & 0 deletions tests/data/Reader/HTML/charset.ISO-8859-1.html
@@ -0,0 +1,25 @@
<!DOCTYPE html>
<html lang='en'>
<head>
<meta charset='ISO-8859-1'>
<title>ISO-8859-1</title>
</head>
<body>
<table>
<tbody>
<tr>
<td>À1</td>
<td>B1</td>
<td>ç1</td>
<td>D1</td>
</tr>
<tr>
<td>Ã2</td>
<td>B2</td>
<td>C2</td>
<td>Ð2</td>
</tr>
</tbody>
</table>
</body>
</html>
25 changes: 25 additions & 0 deletions tests/data/Reader/HTML/charset.ISO-8859-1.html4.html
@@ -0,0 +1,25 @@
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html lang='en'>
<head>
<meta http-equiv="Content-Type" content="text/html; CHARSET=ISO-8859-1">
<title>ISO-8859-1 Html4 Doctype and Meta</title>
</head>
<body>
<table>
<tbody>
<tr>
<td>À1</td>
<td>B1</td>
<td>ç1</td>
<td>D1</td>
</tr>
<tr>
<td>Ã2</td>
<td>B2</td>
<td>C2</td>
<td>Ð2</td>
</tr>
</tbody>
</table>
</body>
</html>
25 changes: 25 additions & 0 deletions tests/data/Reader/HTML/charset.ISO-8859-2.html
@@ -0,0 +1,25 @@
<!DOCTYPE html>
<html lang='en'>
<head>
<meta charset='ISO-8859-2'>
<title>ISO-8859-2</title>
</head>
<body>
<table>
<tbody>
<tr>
<td>À1</td>
<td>B1</td>
<td>ç1</td>
<td>D1</td>
</tr>
<tr>
<td>Ã2</td>
<td>B2</td>
<td>C2</td>
<td>Ð2</td>
</tr>
</tbody>
</table>
</body>
</html>
Binary file added tests/data/Reader/HTML/charset.UTF-16.bebom.html
Binary file not shown.
Binary file added tests/data/Reader/HTML/charset.UTF-16.lebom.html
Binary file not shown.
24 changes: 24 additions & 0 deletions tests/data/Reader/HTML/charset.UTF-8.bom.html
@@ -0,0 +1,24 @@
 <!DOCTYPE html>
<html lang='en'>
<head>
<title>UTF-8</title>
</head>
<body>
<table>
<tbody>
<tr>
<td>À1</td>
<td>B1</td>
<td>ç1</td>
<td>D1</td>
</tr>
<tr>
<td>Ã2</td>
<td>B2</td>
<td>C2</td>
<td>Ð2</td>
</tr>
</tbody>
</table>
</body>
</html>
25 changes: 25 additions & 0 deletions tests/data/Reader/HTML/charset.UTF-8.html
@@ -0,0 +1,25 @@
<!DOCTYPE html>
<html lang='en'>
<head>
<meta charset='UTF-8'>
<title>UTF-8</title>
</head>
<body>
<table>
<tbody>
<tr>
<td>À1</td>
<td>B1</td>
<td>ç1</td>
<td>D1</td>
</tr>
<tr>
<td>Ã2</td>
<td>B2</td>
<td>C2</td>
<td>Ð2</td>
</tr>
</tbody>
</table>
</body>
</html>
15 changes: 15 additions & 0 deletions tests/data/Reader/HTML/charset.gb18030.html
@@ -0,0 +1,15 @@
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=gb18030">
<title>gb18030</title>
</head>
<body>
<table>
<tbody>
<tr>
<td>µçÊÓ»ú</td>
</tr>
</tbody>
</table>
</body>
</html>
16 changes: 16 additions & 0 deletions tests/data/Reader/HTML/charset.nocharset.html
@@ -0,0 +1,16 @@
<table>
<tbody>
<tr>
<td>À1</td>
<td>B1</td>
<td>ç1</td>
<td>D1</td>
</tr>
<tr>
<td>Ã2</td>
<td>B2</td>
<td>C2</td>
<td>Ð2</td>
</tr>
</tbody>
</table>
25 changes: 25 additions & 0 deletions tests/data/Reader/HTML/charset.unknown.html
@@ -0,0 +1,25 @@
<!DOCTYPE html>
<html lang='en'>
<head>
<meta charset='unknown'>
<title>UTF-8</title>
</head>
<body>
<table>
<tbody>
<tr>
<td>À1</td>
<td>B1</td>
<td>ç1</td>
<td>D1</td>
</tr>
<tr>
<td>Ã2</td>
<td>B2</td>
<td>C2</td>
<td>Ð2</td>
</tr>
</tbody>
</table>
</body>
</html>