Merge pull request #11 from facelessuser/encoding

Encoding
facelessuser · Oct 17, 2018 · c4ccdfe
2 parents 8afba39 + 877564f
commit c4ccdfe
Showing 12 changed files with 126 additions and 45 deletions.
diff --git a/docs/src/dictionary/en-custom.txt b/docs/src/dictionary/en-custom.txt
@@ -44,4 +44,4 @@ sortable
 squidfunk
 subclassing
 sublicense
-wildcard
+wildcard
diff --git a/docs/src/markdown/changelog.md b/docs/src/markdown/changelog.md
@@ -1,5 +1,11 @@
 # Changelog
 
+## 0.2.0a4
+
+- **NEW**: Text filter can handle Unicode normalization and converting to other encodings.
+- **NEW**: Default encoding is now `utf-8` for all filters.
+- **FIX**: Internal encoding handling.
+
 ## 0.2.0a3
 
 - **FIX**: Text filter was returning old Parser name instead of new Filter name.

diff --git a/docs/src/markdown/filters.md b/docs/src/markdown/filters.md
@@ -14,15 +14,28 @@ PySpelling comes with a couple of built-in filters.
 
 ### Text
 
-This is a filter that simply retrieves the buffer's text and returns it as Unicode.  It takes a file or file buffer and returns a single `SourceText` object containing all the text in the file.  It is the default is no filter is specified and can be manually included via `pyspelling.filters.text`. When first in the chain, the file's default, assumed encoding of the is `ascii` unless otherwise overridden by the user.
+This is a filter that simply retrieves the buffer's text and returns it as Unicode.  It takes a file or file buffer and returns a single `SourceText` object containing all the text in the file.  It is the default if no filter is specified and can be manually included via `pyspelling.filters.text`. When first in the chain, the file's default, assumed encoding is `utf-8` unless otherwise overridden by the user.
 
-Options               | Type          | Default    | Description
---------------------- | ------------- | ---------- | -----------
-`disallow`            | [string]      | `#!py3 []` | `SourceText` names to avoid processing.
+Options               | Type          | Default          | Description
+--------------------- | ------------- | ---------------- | -----------
+`disallow`            | [string]      | `#!py3 []`       | `SourceText` names to avoid processing.
+`normalize`           | string        | `#!py3 ''`       | Performs Unicode normalization. Valid values are `NFC`, `NFD`, `NFKC`, and `NFKD`.
+`convert_encoding`    | string        | `#!py3 ''`       | Assuming a valid encoding, the text will be converted to the specified encoding.
+`errors`              | string        | `#!py3 'strict'` | Specifies what to do when a character can't be converted to the target encoding. Valid values are `strict`, `ignore`, `replace`, `xmlcharrefreplace`, `backslashreplace`, and `namereplace`.
+
+```yaml
+- name: Text
+  default_encoding: cp1252
+  filters:
+  - pyspelling.filters.text:
+      convert_encoding: utf-8
+  source:
+  - **/*.txt
+```
 
 ### Markdown
 
-The Markdown filter converts a text file's buffer using Python Markdown and returns a single `SourceText` object containing the text as HTML. It can be included via `pyspelling.filters.markdown`. When first in the chain, the file's default, assumed encoding of the is `utf-8` unless otherwise overridden by the user.
+The Markdown filter converts a text file's buffer using Python Markdown and returns a single `SourceText` object containing the text as HTML. It can be included via `pyspelling.filters.markdown`. When first in the chain, the file's default, assumed encoding is `utf-8` unless otherwise overridden by the user.
 
 Options               | Type          | Default    | Description
 --------------------- | ------------- | ---------- | -----------
@@ -107,7 +120,7 @@ Options          | Type     | Default       | Description
 
 ### JavaScript
 
-When first int the chain, the JavaScript filter uses no special encoding detection. It will assume `ascii` if no encoding BOM is found, and the user has not overridden the fallback encoding. Text is returned in blocks based on the context of the text depending on what is enabled.  The parser can return JSDoc comments, block comments, and/or inline comments. Each is returned as its own object.
+When first in the chain, the JavaScript filter uses no special encoding detection. It will assume `utf-8` if no encoding BOM is found, and the user has not overridden the fallback encoding. Text is returned in blocks based on the context of the text depending on what is enabled.  The parser can return JSDoc comments, block comments, and/or inline comments. Each is returned as its own object.
 
 Options          | Type     | Default       | Description
 ---------------- | -------- | ------------- | -----------

diff --git a/docs/src/markdown/index.md b/docs/src/markdown/index.md
@@ -151,7 +151,7 @@ Notice that `file_patterns` should be an array of values.
 
 ### Default Encoding
 
-When parsing a file, PySpelling only checks for low hanging fruit that it has 100% confidence in, such as UTF BOMs, and depending on the file parser, there may be additional logic like the file type's encoding declaration in the file header. If there is no BOM, encoding declaration, or other special logic, PySpelling will use the default encoding specified by the first filter which will initially parse the file. Depending on the file type, this could differ, but if you specify no filter, the `text` filter will be used which has a default of "ASCII" as the fallback. You can override the fallback with `default_encoding`:
+When parsing a file, PySpelling only checks for low hanging fruit that it has 100% confidence in, such as UTF BOMs, and depending on the file parser, there may be additional logic like the file type's encoding declaration in the file header. If there is no BOM, encoding declaration, or other special logic, PySpelling will use the default encoding specified by the first filter which will initially parse the file. Depending on the file type, this could differ, but if you specify no filter, the `text` filter will be used which has a default of `utf-8` as the fallback. You can override the fallback with `default_encoding`:
 
 ```yaml
 - name: Markdown

diff --git a/pyspelling/__init__.py b/pyspelling/__init__.py
@@ -1,7 +1,6 @@
 """Spell check with Aspell or Hunspell."""
 from __future__ import unicode_literals
 import os
-import codecs
 import importlib
 from . import util
 from . import __version__
@@ -53,17 +52,6 @@ def log(self, text, level):
         if self.verbose >= level:
             print(text)
 
-    def _normalize_utf(self, encoding):
-        """Normalize UTF encoding."""
-
-        if encoding == 'utf-8-sig':
-            encoding = 'utf-8'
-        if encoding.startswith('utf-16'):
-            encoding = 'utf-16'
-        elif encoding.startswith('utf-32'):
-            encoding = 'utf-32'
-        return encoding
-
     def setup_command(self, encoding, options, personal_dict):
         """Setup the command."""
 
@@ -95,15 +83,16 @@ def _check_spelling(self, sources, options, personal_dict):
             if source._has_error():
                 yield util.Results([], source.context, source.category, source.error)
             else:
+                encoding = source.encoding
                 if source._is_bytes():
                     text = source.text
                 else:
-                    text = source.text.encode(self._normalize_utf(source.encoding))
+                    text = source.text.encode(encoding)
                 self.log(text, 3)
-                cmd = self.setup_command(self._normalize_utf(source.encoding), options, personal_dict)
+                cmd = self.setup_command(encoding, options, personal_dict)
                 self.log(str(cmd), 2)
 
-                wordlist = util.call_spellchecker(cmd, input_text=text)
+                wordlist = util.call_spellchecker(cmd, input_text=text, encoding=encoding)
                 yield util.Results([w for w in sorted(set(wordlist.split('\n'))) if w], source.context, source.category)
 
     def compile_dictionary(self, lang, wordlists, output):
@@ -249,7 +238,7 @@ def setup_command(self, encoding, options, personal_dict):
         cmd = [
             self.binary,
             'list',
-            '--encoding', codecs.lookup(encoding).name
+            '--encoding', encoding
         ]
 
         if personal_dict:
@@ -341,7 +330,7 @@ def setup_command(self, encoding, options, personal_dict):
         cmd = [
             self.binary,
             '-l',
-            '-i', codecs.lookup(encoding).name
+            '-i', encoding
         ]
 
         if personal_dict:

diff --git a/pyspelling/__version__.py b/pyspelling/__version__.py
@@ -1,7 +1,7 @@
 """`PySpelling` package."""
 
 #   (major, minor, micro, release type, pre-release build, post-release build)
-version_info = (0, 2, 0, 'alpha', 3, 0)
+version_info = (0, 2, 0, 'alpha', 4, 0)
 
 
 def _version():

diff --git a/pyspelling/filters/__init__.py b/pyspelling/filters/__init__.py
@@ -38,6 +38,18 @@ class SourceText(namedtuple('SourceText', ['text', 'context', 'encoding', 'categ
     def __new__(cls, text, context, encoding, category, error=None):
         """Allow defaults."""
 
+        encoding = PYTHON_ENCODING_NAMES.get(encoding, encoding).lower()
+
+        if encoding == 'utf-8-sig':
+            encoding = 'utf-8'
+        if encoding.startswith('utf-16'):
+            encoding = 'utf-16'
+        elif encoding.startswith('utf-32'):
+            encoding = 'utf-32'
+
+        if encoding:
+            encoding = codecs.lookup(encoding).name
+
         return super(SourceText, cls).__new__(cls, text, context, encoding, category, error)
 
     def _is_bytes(self):
@@ -56,11 +68,11 @@ class Filter(object):
 
     MAX_GUESS_SIZE = 31457280
 
-    def __init__(self, config, default_encoding='ascii'):
+    def __init__(self, config, default_encoding='utf-8'):
         """Initialize."""
 
         self.config = config
-        self.default_encoding = PYTHON_ENCODING_NAMES.get(default_encoding, default_encoding)
+        self.default_encoding = PYTHON_ENCODING_NAMES.get(default_encoding, default_encoding).lower()
 
     def _is_very_large(self, size):
         """Check if content is very large."""
@@ -101,11 +113,11 @@ def _utf_strip_bom(self, encoding):
 
         if encoding is None:
             pass
-        elif encoding == 'utf-8':
+        elif encoding.lower() == 'utf-8':
             encoding = 'utf-8-sig'
-        elif encoding.startswith('utf-16'):
+        elif encoding.lower().startswith('utf-16'):
             encoding = 'utf-16'
-        elif encoding.startswith('utf-32'):
+        elif encoding.lower().startswith('utf-32'):
             encoding = 'utf-32'
         return encoding
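
The encoding handling that this change moves into `SourceText.__new__` can be tried in isolation. The following is a minimal sketch, not the project's code: the helper name `canonical_encoding` is hypothetical, and the `PYTHON_ENCODING_NAMES` lookup is left out because its contents are not shown in this diff.

```python
import codecs


def canonical_encoding(name):
    """Collapse BOM- and endian-specific UTF names, then canonicalize.

    Hypothetical helper mirroring the steps added to SourceText.__new__.
    """
    name = name.lower()
    if name == 'utf-8-sig':
        name = 'utf-8'
    if name.startswith('utf-16'):
        name = 'utf-16'
    elif name.startswith('utf-32'):
        name = 'utf-32'
    # codecs.lookup() resolves aliases to the codec's registered name.
    # An empty name is passed through untouched, as in the diff.
    return codecs.lookup(name).name if name else name


print(canonical_encoding('UTF-8-SIG'))  # utf-8
print(canonical_encoding('utf-16-le'))  # utf-16
print(canonical_encoding('UTF8'))       # utf-8
```

Canonicalizing once, when the `SourceText` is created, is what lets the spell checker commands receive `encoding` directly instead of calling `codecs.lookup` at each call site.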
 

diff --git a/pyspelling/filters/context.py b/pyspelling/filters/context.py
@@ -9,7 +9,7 @@
 class ContextFilter(filters.Filter):
     """Context filter."""
 
-    def __init__(self, options, default_encoding='ascii'):
+    def __init__(self, options, default_encoding='utf-8'):
         """Initialization."""
 
         self.context_visible_first = options.get('context_visible_first', False) is True

diff --git a/pyspelling/filters/javascript.py b/pyspelling/filters/javascript.py
@@ -25,7 +25,7 @@
 class JavaScriptFilter(filters.Filter):
     """JavaScript filter."""
 
-    def __init__(self, options, default_encoding='ascii'):
+    def __init__(self, options, default_encoding='utf-8'):
         """Initialization."""
 
         self.blocks = options.get('block_comments', True) is True

diff --git a/pyspelling/filters/python.py b/pyspelling/filters/python.py
@@ -19,8 +19,6 @@
 )
 RE_NON_PRINTABLE_ASCII = re.compile(br"[^ -~]+")
 
-DEFAULT_ENCODING = 'utf-8'
-
 
 class PythonFilter(filters.Filter):
     """Spelling Python."""
@@ -29,7 +27,7 @@ class PythonFilter(filters.Filter):
     FUNCTION = 1
     CLASS = 2
 
-    def __init__(self, options, default_encoding=DEFAULT_ENCODING):
+    def __init__(self, options, default_encoding='utf-8'):
         """Initialization."""
 
         self.comments = options.get('comments', True) is True
@@ -51,7 +49,7 @@ def header_check(self, content):
             elif m.group(2):
                 encode = m.group(2).decode('ascii')
         if encode is None:
-            encode = DEFAULT_ENCODING
+            encode = 'utf-8'
         return encode
 
     def get_ascii(self, text):

diff --git a/pyspelling/filters/text.py b/pyspelling/filters/text.py
@@ -1,9 +1,65 @@
 """Text filter."""
 from __future__ import unicode_literals
 from .. import filters
+import codecs
+import unicodedata
+
+
+class TextFilter(filters.Filter):
+    """Spelling Text."""
+
+    def __init__(self, options, default_encoding='utf-8'):
+        """Initialization."""
+
+        self.normalize = options.get('normalize', '').upper()
+        self.convert_encoding = options.get('convert_encoding', '').lower()
+        self.errors = options.get('errors', 'strict').lower()
+        if self.convert_encoding:
+            self.convert_encoding = codecs.lookup(
+                filters.PYTHON_ENCODING_NAMES.get(self.convert_encoding, self.convert_encoding).lower()
+            ).name
+
+            # Don't generate content with BOMs
+            if (
+                self.convert_encoding.startswith(('utf-32', 'utf-16')) and
+                not self.convert_encoding.endswith(('le', 'be'))
+            ):
+                self.convert_encoding += '-le'
+
+            if self.convert_encoding == 'utf-8-sig':
+                self.convert_encoding = 'utf-8'
+
+        super(TextFilter, self).__init__(options, default_encoding)
+
+    def convert(self, text, encoding):
+        """Convert the text."""
+
+        if self.normalize in ('NFC', 'NFKC', 'NFD', 'NFKD'):
+            text = unicodedata.normalize(self.normalize, text)
+        if self.convert_encoding:
+            text = text.encode(self.convert_encoding, self.errors).decode(self.convert_encoding)
+            encoding = self.convert_encoding
+        return text, encoding
+
+    def filter(self, source_file, encoding):  # noqa A001
+        """Open and filter the file from disk."""
+
+        with codecs.open(source_file, 'r', encoding=encoding) as f:
+            text = f.read()
+
+        text, encoding = self.convert(text, encoding)
+
+        return [filters.SourceText(text, source_file, encoding, 'text')]
+
+    def sfilter(self, source):
+        """Execute filter."""
+
+        text, encoding = self.convert(source.text, source.encoding)
+
+        return [filters.SourceText(text, source.context, encoding, 'text')]
 
 
 def get_filter():
     """Return the filter."""
 
-    return filters.Filter
+    return TextFilter
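
The effect of the `normalize`, `convert_encoding`, and `errors` options can be illustrated outside the filter. This is a rough sketch under stated assumptions: the standalone `convert` function below is hypothetical and only imitates the body of `TextFilter.convert`, using nothing beyond the standard library.

```python
import codecs
import unicodedata


def convert(text, normalize='', convert_encoding='', errors='strict'):
    """Hypothetical standalone imitation of TextFilter.convert."""
    if normalize in ('NFC', 'NFKC', 'NFD', 'NFKD'):
        text = unicodedata.normalize(normalize, text)
    if convert_encoding:
        # Round trip through the target encoding so that characters it
        # cannot represent are handled according to `errors`.
        text = text.encode(convert_encoding, errors).decode(convert_encoding)
    return text


# NFC composes 'e' + combining acute accent into a single code point.
decomposed = 'cafe\u0301'
composed = convert(decomposed, normalize='NFC')
print(len(decomposed), len(composed))  # 5 4

# Unrepresentable characters are replaced instead of raising an error.
print(convert('na\u00efve', convert_encoding='ascii', errors='replace'))            # na?ve
print(convert('na\u00efve', convert_encoding='ascii', errors='xmlcharrefreplace'))  # na&#239;ve

# Plain 'utf-16' prepends a BOM when encoding; the endian-specific codec
# does not, which is why the filter appends '-le'.
print('a'.encode('utf-16').startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)))  # True
print('a'.encode('utf-16-le'))  # b'a\x00'
```

With the default `errors='strict'`, the same `ascii` conversion raises `UnicodeEncodeError`, so a lossy target encoding needs an explicit `errors` value.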
diff --git a/pyspelling/util.py b/pyspelling/util.py
@@ -5,6 +5,7 @@
 import string
 import random
 import re
+import locale
 from collections import namedtuple
 
 RE_LAST_SPACE_IN_CHUNK = re.compile(rb'(\s+)(?=\S+\Z)')
@@ -35,19 +36,25 @@ def get_process(cmd):
     return process
 
 
-def get_process_output(process):
+def get_process_output(process, encoding=None):
     """Get the output from the process."""
 
     output = process.communicate()
     returncode = process.returncode
 
+    if encoding is None:
+        try:
+            encoding = sys.stdout.encoding
+        except Exception:
+            encoding = locale.getpreferredencoding()
+
     if returncode != 0:
-        raise RuntimeError("Runtime Error: %s" % (output[0].rstrip().decode('utf-8')))
+        raise RuntimeError("Runtime Error: %s" % (output[0].rstrip().decode(encoding)))
 
-    return output[0].decode('utf-8')
+    return output[0].decode(encoding)
 
 
-def call(cmd, input_file=None, input_text=None):
+def call(cmd, input_file=None, input_text=None, encoding=None):
     """Call with arguments."""
 
     process = get_process(cmd)
@@ -58,10 +65,10 @@ def call(cmd, input_file=None, input_text=None):
     if input_text is not None:
         process.stdin.write(input_text)
 
-    return get_process_output(process)
+    return get_process_output(process, encoding)
 
 
-def call_spellchecker(cmd, input_text):
+def call_spellchecker(cmd, input_text, encoding=None):
     """Call spell checker with arguments."""
 
     process = get_process(cmd)
@@ -86,7 +93,7 @@ def call_spellchecker(cmd, input_text):
                 if offset >= end:
                     break
 
-    return get_process_output(process)
+    return get_process_output(process, encoding)
 
 
 def random_name_gen(size=6):
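
For the `get_process_output` change above, the fallback order can be sketched as follows. This is an illustration rather than the committed code: `resolve_output_encoding` is a hypothetical name, and it additionally assumes that `sys.stdout.encoding` may be `None` or missing when stdout has been replaced, a case the `try`/`except` in the diff does not cover.

```python
import locale
import sys


def resolve_output_encoding(encoding=None):
    """Pick the encoding used to decode spell checker output.

    Hypothetical helper: prefer the caller's encoding, then stdout's,
    then the locale's preferred encoding.
    """
    if encoding is None:
        # `or` also covers a stdout whose encoding attribute is None.
        encoding = getattr(sys.stdout, 'encoding', None) or locale.getpreferredencoding()
    return encoding


# Output produced in the source's encoding decodes cleanly when that
# encoding is passed through explicitly.
raw = 'caf\u00e9'.encode('cp1252')
decoded = raw.decode(resolve_output_encoding('cp1252'))
print(decoded == 'caf\u00e9')  # True
```

Passing the source's encoding through `call_spellchecker` matters because the spell checker is told to emit that same encoding; decoding its output as `utf-8` unconditionally, as before this change, can fail for any other encoding.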