Merge pull request #11 from facelessuser/encoding
Encoding
facelessuser authored Oct 17, 2018
2 parents 8afba39 + 877564f commit c4ccdfe
Showing 12 changed files with 126 additions and 45 deletions.
2 changes: 1 addition & 1 deletion docs/src/dictionary/en-custom.txt
@@ -44,4 +44,4 @@ sortable
squidfunk
subclassing
sublicense
wildcard
wildcard
6 changes: 6 additions & 0 deletions docs/src/markdown/changelog.md
@@ -1,5 +1,11 @@
# Changelog

## 0.2.0a4

- **NEW**: Text filter can handle Unicode normalization and converting to other encodings.
- **NEW**: Default encoding is now `utf-8` for all filters.
- **FIX**: Internal encoding handling.

## 0.2.0a3

- **FIX**: Text filter was returning old Parser name instead of new Filter name.
25 changes: 19 additions & 6 deletions docs/src/markdown/filters.md
@@ -14,15 +14,28 @@ PySpelling comes with a couple of built-in filters.

### Text

This is a filter that simply retrieves the buffer's text and returns it as Unicode. It takes a file or file buffer and returns a single `SourceText` object containing all the text in the file. It is the default is no filter is specified and can be manually included via `pyspelling.filters.text`. When first in the chain, the file's default, assumed encoding of the is `ascii` unless otherwise overridden by the user.
This is a filter that simply retrieves the buffer's text and returns it as Unicode. It takes a file or file buffer and returns a single `SourceText` object containing all the text in the file. It is the default if no filter is specified and can be manually included via `pyspelling.filters.text`. When first in the chain, the file's default, assumed encoding is `utf-8` unless otherwise overridden by the user.

Options | Type | Default | Description
--------------------- | ------------- | ---------- | -----------
`disallow` | [string] | `#!py3 []` | `SourceText` names to avoid processing.
Options | Type | Default | Description
--------------------- | ------------- | ---------------- | -----------
`disallow` | [string] | `#!py3 []` | `SourceText` names to avoid processing.
`normalize` | string | `#!py3 ''` | Performs Unicode normalization. Valid values are `NFC`, `NFD`, `NFKC`, and `NFKD`.
`convert_encoding` | string | `#!py3 ''` | Assuming a valid encoding, the text will be converted to the specified encoding.
`errors` | string | `#!py3 'strict'` | Specifies what to do when converting the encoding, and a character can't be converted. Valid values are `strict`, `ignore`, `replace`, `xmlcharrefreplace`, `backslashreplace`, and `namereplace`.

```yaml
- name: Text
  default_encoding: cp1252
  filters:
  - pyspelling.filters.text:
      convert_encoding: utf-8
  source:
  - '**/*.txt'
```
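
As a rough illustration of what the `normalize`, `convert_encoding`, and `errors` options describe, the following Python sketch uses only the standard library. The helper name `normalize_and_convert` is hypothetical and is not part of PySpelling; it only assumes that conversion means a round trip through the target encoding.

```python
import unicodedata


def normalize_and_convert(text, normalize='', convert_encoding='', errors='strict'):
    """Hypothetical helper mirroring the three documented options."""

    # Only the four standard Unicode normalization forms are applied.
    if normalize in ('NFC', 'NFD', 'NFKC', 'NFKD'):
        text = unicodedata.normalize(normalize, text)
    if convert_encoding:
        # Round-trip through the target encoding so that characters it cannot
        # represent are handled according to `errors`.
        text = text.encode(convert_encoding, errors).decode(convert_encoding)
    return text


# 'e' followed by a combining acute accent becomes one code point under NFC.
composed = normalize_and_convert('cafe\u0301', normalize='NFC')

# With errors='replace', a character missing from the target encoding becomes '?'.
replaced = normalize_and_convert('caf\u00e9', convert_encoding='ascii', errors='replace')
```

With the default `errors='strict'`, the second call would raise `UnicodeEncodeError` instead of substituting a character.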

### Markdown

The Markdown filter converts a text file's buffer using Python Markdown and returns a single `SourceText` object containing the text as HTML. It can be included via `pyspelling.filters.markdown`. When first in the chain, the file's default, assumed encoding of the is `utf-8` unless otherwise overridden by the user.
The Markdown filter converts a text file's buffer using Python Markdown and returns a single `SourceText` object containing the text as HTML. It can be included via `pyspelling.filters.markdown`. When first in the chain, the file's default, assumed encoding is `utf-8` unless otherwise overridden by the user.

Options | Type | Default | Description
--------------------- | ------------- | ---------- | -----------
@@ -107,7 +120,7 @@ Options | Type | Default | Description

### JavaScript

When first int the chain, the JavaScript filter uses no special encoding detection. It will assume `ascii` if no encoding BOM is found, and the user has not overridden the fallback encoding. Text is returned in blocks based on the context of the text depending on what is enabled. The parser can return JSDoc comments, block comments, and/or inline comments. Each is returned as its own object.
When first in the chain, the JavaScript filter uses no special encoding detection. It will assume `utf-8` if no encoding BOM is found, and the user has not overridden the fallback encoding. Text is returned in blocks based on the context of the text depending on what is enabled. The parser can return JSDoc comments, block comments, and/or inline comments. Each is returned as its own object.

Options | Type | Default | Description
---------------- | -------- | ------------- | -----------
2 changes: 1 addition & 1 deletion docs/src/markdown/index.md
@@ -151,7 +151,7 @@ Notice that `file_patterns` should be an array of values.

### Default Encoding

When parsing a file, PySpelling only checks for low hanging fruit that it has 100% confidence in, such as UTF BOMs, and depending on the file parser, there may be additional logic like the file type's encoding declaration in the file header. If there is no BOM, encoding declaration, or other special logic, PySpelling will use the default encoding specified by the first filter which will initially parse the file. Depending on the file type, this could differ, but if you specify no filter, the `text` filter will be used which has a default of "ASCII" as the fallback. You can override the fallback with `default_encoding`:
When parsing a file, PySpelling only checks for low hanging fruit that it has 100% confidence in, such as UTF BOMs, and depending on the file parser, there may be additional logic like the file type's encoding declaration in the file header. If there is no BOM, encoding declaration, or other special logic, PySpelling will use the default encoding specified by the first filter which will initially parse the file. Depending on the file type, this could differ, but if you specify no filter, the `text` filter will be used which has a default of `utf-8` as the fallback. You can override the fallback with `default_encoding`:

```yaml
- name: Markdown
23 changes: 6 additions & 17 deletions pyspelling/__init__.py
@@ -1,7 +1,6 @@
"""Spell check with Aspell or Hunspell."""
from __future__ import unicode_literals
import os
import codecs
import importlib
from . import util
from . import __version__
@@ -53,17 +52,6 @@ def log(self, text, level):
if self.verbose >= level:
print(text)

def _normalize_utf(self, encoding):
"""Normalize UTF encoding."""

if encoding == 'utf-8-sig':
encoding = 'utf-8'
if encoding.startswith('utf-16'):
encoding = 'utf-16'
elif encoding.startswith('utf-32'):
encoding = 'utf-32'
return encoding

def setup_command(self, encoding, options, personal_dict):
"""Setup the command."""

@@ -95,15 +83,16 @@ def _check_spelling(self, sources, options, personal_dict):
if source._has_error():
yield util.Results([], source.context, source.category, source.error)
else:
encoding = source.encoding
if source._is_bytes():
text = source.text
else:
text = source.text.encode(self._normalize_utf(source.encoding))
text = source.text.encode(encoding)
self.log(text, 3)
cmd = self.setup_command(self._normalize_utf(source.encoding), options, personal_dict)
cmd = self.setup_command(encoding, options, personal_dict)
self.log(str(cmd), 2)

wordlist = util.call_spellchecker(cmd, input_text=text)
wordlist = util.call_spellchecker(cmd, input_text=text, encoding=encoding)
yield util.Results([w for w in sorted(set(wordlist.split('\n'))) if w], source.context, source.category)

def compile_dictionary(self, lang, wordlists, output):
@@ -249,7 +238,7 @@ def setup_command(self, encoding, options, personal_dict):
cmd = [
self.binary,
'list',
'--encoding', codecs.lookup(encoding).name
'--encoding', encoding
]

if personal_dict:
@@ -341,7 +330,7 @@ def setup_command(self, encoding, options, personal_dict):
cmd = [
self.binary,
'-l',
'-i', codecs.lookup(encoding).name
'-i', encoding
]

if personal_dict:
2 changes: 1 addition & 1 deletion pyspelling/__version__.py
@@ -1,7 +1,7 @@
"""`PySpelling` package."""

# (major, minor, micro, release type, pre-release build, post-release build)
version_info = (0, 2, 0, 'alpha', 3, 0)
version_info = (0, 2, 0, 'alpha', 4, 0)


def _version():
22 changes: 17 additions & 5 deletions pyspelling/filters/__init__.py
@@ -38,6 +38,18 @@ class SourceText(namedtuple('SourceText', ['text', 'context', 'encoding', 'categ
def __new__(cls, text, context, encoding, category, error=None):
"""Allow defaults."""

encoding = PYTHON_ENCODING_NAMES.get(encoding, encoding).lower()

if encoding == 'utf-8-sig':
encoding = 'utf-8'
if encoding.startswith('utf-16'):
encoding = 'utf-16'
elif encoding.startswith('utf-32'):
encoding = 'utf-32'

if encoding:
encoding = codecs.lookup(encoding).name

return super(SourceText, cls).__new__(cls, text, context, encoding, category, error)

def _is_bytes(self):
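
The encoding-name handling added to `SourceText.__new__` above can be sketched in isolation as follows. This is a simplified illustration under stated assumptions: `normalize_encoding_name` is a hypothetical name, and the `PYTHON_ENCODING_NAMES` mapping is left out because its contents are not shown in this diff.

```python
import codecs


def normalize_encoding_name(encoding):
    """Hypothetical helper: collapse UTF variants, then canonicalize the name."""

    encoding = encoding.lower()
    if encoding == 'utf-8-sig':
        encoding = 'utf-8'
    elif encoding.startswith('utf-16'):
        encoding = 'utf-16'
    elif encoding.startswith('utf-32'):
        encoding = 'utf-32'

    # codecs.lookup() raises LookupError for names Python does not know;
    # an empty name is passed through untouched.
    return codecs.lookup(encoding).name if encoding else encoding
```

Canonicalizing once, when the object is created, is presumably what lets the later `setup_command` hunks pass `encoding` straight to the spell checker instead of calling `codecs.lookup` at each call site.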
@@ -56,11 +68,11 @@ class Filter(object):

MAX_GUESS_SIZE = 31457280

def __init__(self, config, default_encoding='ascii'):
def __init__(self, config, default_encoding='utf-8'):
"""Initialize."""

self.config = config
self.default_encoding = PYTHON_ENCODING_NAMES.get(default_encoding, default_encoding)
self.default_encoding = PYTHON_ENCODING_NAMES.get(default_encoding, default_encoding).lower()

def _is_very_large(self, size):
"""Check if content is very large."""
@@ -101,11 +113,11 @@ def _utf_strip_bom(self, encoding):

if encoding is None:
pass
elif encoding == 'utf-8':
elif encoding.lower() == 'utf-8':
encoding = 'utf-8-sig'
elif encoding.startswith('utf-16'):
elif encoding.lower().startswith('utf-16'):
encoding = 'utf-16'
elif encoding.startswith('utf-32'):
elif encoding.lower().startswith('utf-32'):
encoding = 'utf-32'
return encoding

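
For context, `_utf_strip_bom` maps `utf-8` to `utf-8-sig` and the endian-specific UTF-16/32 names to the generic ones, so that decoding consumes a leading BOM. PySpelling's actual BOM detection is not part of this diff; the sketch below is only one way the "check for a BOM, otherwise fall back to the default encoding" behavior could look. The names `BOMS` and `guess_encoding` are assumptions made for this example.

```python
import codecs

# Longer BOMs come first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
BOMS = (
    (codecs.BOM_UTF32_LE, 'utf-32'),
    (codecs.BOM_UTF32_BE, 'utf-32'),
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF16_LE, 'utf-16'),
    (codecs.BOM_UTF16_BE, 'utf-16'),
)


def guess_encoding(raw, default_encoding='utf-8'):
    """Hypothetical helper: return a BOM-stripping codec name, else the default."""

    for bom, name in BOMS:
        if raw.startswith(bom):
            return name
    return default_encoding


# Decoding with the returned name removes the BOM from the resulting text.
data = codecs.BOM_UTF8 + 'hi'.encode('utf-8')
text = data.decode(guess_encoding(data))
```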
2 changes: 1 addition & 1 deletion pyspelling/filters/context.py
@@ -9,7 +9,7 @@
class ContextFilter(filters.Filter):
"""Context filter."""

def __init__(self, options, default_encoding='ascii'):
def __init__(self, options, default_encoding='utf-8'):
"""Initialization."""

self.context_visible_first = options.get('context_visible_first', False) is True
2 changes: 1 addition & 1 deletion pyspelling/filters/javascript.py
@@ -25,7 +25,7 @@
class JavaScriptFilter(filters.Filter):
"""JavaScript filter."""

def __init__(self, options, default_encoding='ascii'):
def __init__(self, options, default_encoding='utf-8'):
"""Initialization."""

self.blocks = options.get('block_comments', True) is True
6 changes: 2 additions & 4 deletions pyspelling/filters/python.py
@@ -19,8 +19,6 @@
)
RE_NON_PRINTABLE_ASCII = re.compile(br"[^ -~]+")

DEFAULT_ENCODING = 'utf-8'


class PythonFilter(filters.Filter):
"""Spelling Python."""
@@ -29,7 +27,7 @@ class PythonFilter(filters.Filter):
FUNCTION = 1
CLASS = 2

def __init__(self, options, default_encoding=DEFAULT_ENCODING):
def __init__(self, options, default_encoding='utf-8'):
"""Initialization."""

self.comments = options.get('comments', True) is True
@@ -51,7 +49,7 @@ def header_check(self, content):
elif m.group(2):
encode = m.group(2).decode('ascii')
if encode is None:
encode = DEFAULT_ENCODING
encode = 'utf-8'
return encode

def get_ascii(self, text):
58 changes: 57 additions & 1 deletion pyspelling/filters/text.py
@@ -1,9 +1,65 @@
"""Text filter."""
from __future__ import unicode_literals
from .. import filters
import codecs
import unicodedata


class TextFilter(filters.Filter):
    """Spelling Text."""

    def __init__(self, options, default_encoding='utf-8'):
        """Initialization."""

        self.normalize = options.get('normalize', '').upper()
        self.convert_encoding = options.get('convert_encoding', '').lower()
        self.errors = options.get('errors', 'strict').lower()
        if self.convert_encoding:
            # Canonicalize the requested target encoding (not the default encoding).
            self.convert_encoding = codecs.lookup(
                filters.PYTHON_ENCODING_NAMES.get(self.convert_encoding, self.convert_encoding).lower()
            ).name

            # Don't generate content with BOMs
            if (
                self.convert_encoding.startswith(('utf-32', 'utf-16')) and
                not self.convert_encoding.endswith(('le', 'be'))
            ):
                self.convert_encoding += '-le'

            if self.convert_encoding == 'utf-8-sig':
                self.convert_encoding = 'utf-8'

        super(TextFilter, self).__init__(options, default_encoding)

    def convert(self, text, encoding):
        """Convert the text."""

        if self.normalize in ('NFC', 'NFKC', 'NFD', 'NFKD'):
            text = unicodedata.normalize(self.normalize, text)
        if self.convert_encoding:
            text = text.encode(self.convert_encoding, self.errors).decode(self.convert_encoding)
            encoding = self.convert_encoding
        return text, encoding

    def filter(self, source_file, encoding):  # noqa A001
        """Open and filter the file from disk."""

        with codecs.open(source_file, 'r', encoding=encoding) as f:
            text = f.read()

        text, encoding = self.convert(text, encoding)

        return [filters.SourceText(text, source_file, encoding, 'text')]

    def sfilter(self, source):
        """Execute filter."""

        text, encoding = self.convert(source.text, source.encoding)

        return [filters.SourceText(text, source.context, encoding, 'text')]


def get_filter():
"""Return the filter."""

return filters.Filter
return TextFilter
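
The "Don't generate content with BOMs" step in `TextFilter.__init__` relies on how Python's codecs behave: the generic `utf-16` and `utf-32` codecs prepend a BOM when encoding, the endian-specific variants do not, and `utf-8-sig` prepends the UTF-8 signature. A small standard-library demonstration (the variable names are illustrative only):

```python
import codecs

# The generic codec writes a BOM in the platform's native byte order.
with_bom = 'a'.encode('utf-16')
has_bom = with_bom.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))

# The endian-specific codec writes only the character's code units.
without_bom = 'a'.encode('utf-16-le')

# 'utf-8-sig' adds the UTF-8 signature; plain 'utf-8' does not.
utf8_sig = 'a'.encode('utf-8-sig')
utf8_plain = 'a'.encode('utf-8')
```

This is presumably why the filter appends `-le` to a bare `utf-16` or `utf-32` target and swaps `utf-8-sig` for `utf-8`.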
21 changes: 14 additions & 7 deletions pyspelling/util.py
@@ -5,6 +5,7 @@
import string
import random
import re
import locale
from collections import namedtuple

RE_LAST_SPACE_IN_CHUNK = re.compile(rb'(\s+)(?=\S+\Z)')
@@ -35,19 +36,25 @@ def get_process(cmd):
return process


def get_process_output(process):
def get_process_output(process, encoding=None):
"""Get the output from the process."""

output = process.communicate()
returncode = process.returncode

if encoding is None:
try:
encoding = sys.stdout.encoding
except Exception:
encoding = locale.getpreferredencoding()

if returncode != 0:
raise RuntimeError("Runtime Error: %s" % (output[0].rstrip().decode('utf-8')))
raise RuntimeError("Runtime Error: %s" % (output[0].rstrip().decode(encoding)))

return output[0].decode('utf-8')
return output[0].decode(encoding)


def call(cmd, input_file=None, input_text=None):
def call(cmd, input_file=None, input_text=None, encoding=None):
"""Call with arguments."""

process = get_process(cmd)
@@ -58,10 +65,10 @@ def call(cmd, input_file=None, input_text=None):
if input_text is not None:
process.stdin.write(input_text)

return get_process_output(process)
return get_process_output(process, encoding)


def call_spellchecker(cmd, input_text):
def call_spellchecker(cmd, input_text, encoding=None):
"""Call spell checker with arguments."""

process = get_process(cmd)
@@ -86,7 +93,7 @@ def call_spellchecker(cmd, input_text):
if offset >= end:
break

return get_process_output(process)
return get_process_output(process, encoding)


def random_name_gen(size=6):
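
The fallback in `get_process_output` (use the caller's encoding, otherwise the console or locale encoding) can be illustrated on its own. The sketch below is a variation, not the committed code: `pick_output_encoding` is a hypothetical name, and it also guards against `sys.stdout.encoding` being `None`, which the `try`/`except` in the diff would not catch.

```python
import codecs
import locale
import sys


def pick_output_encoding(encoding=None):
    """Hypothetical helper: choose the codec used to decode process output."""

    if encoding is None:
        # sys.stdout can be None or lack a usable `encoding` (for example when
        # output is redirected), so fall back to the locale's preferred encoding.
        encoding = getattr(sys.stdout, 'encoding', None) or locale.getpreferredencoding()
    # Return Python's canonical name for whichever encoding was chosen.
    return codecs.lookup(encoding).name


# An explicit encoding always wins over the environment.
decoded = b'caf\xc3\xa9'.decode(pick_output_encoding('utf-8'))
```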