Changes:
- Updated README and docs with a more comprehensive — and correct — usage example; also added tests to ensure it doesn't get stale
- Updated requirements to the latest version of spaCy, and added matplotlib for viz
Bugfixes:
- textacy.preprocess.preprocess_text() is now, once again, imported at the top level, so easily reachable via textacy.preprocess_text() (@bretdabaker #14); see the snippet below
- viz subpackage now included in the docs' API reference
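For illustration, a minimal check that both access paths resolve to the same function; the sample text and the lowercase keyword are my own choices, not taken from the changelog:

```python
import textacy
from textacy import preprocess

raw = "Mr. Speaker, today I rise to talk about    health care reform!"

# the restored top-level shortcut and the fully qualified path are the same function
assert textacy.preprocess_text is preprocess.preprocess_text
print(textacy.preprocess_text(raw, lowercase=True))
```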
Changes:
- Added a viz subpackage, with two types of plots (so far):
- viz.draw_termite_plot(), typically used to evaluate and interpret topic models; conveniently accessible from the tm.TopicModel class
- viz.draw_semantic_network() for visualizing networks such as those output by representations.network
- Added a "Bernie & Hillary" corpus with 3000 congressional speeches made by Bernie Sanders and Hillary Clinton since 1996
corpora.fetch_bernie_and_hillary()
function automatically downloads to and loads from disk this corpus
- Modified the data.load_depechemood function; it now downloads data from the GitHub source if not found on disk
- Removed resources/ directory from GitHub, hence all the downloadin'
- Updated to spaCy v0.100.7
- German is now supported! (although some functionality is English-only)
- Added textacy.load_spacy() function for loading spaCy packages, taking advantage of the new spacy.load() API (see the sketch after this list); added a DeprecationWarning for textacy.data.load_spacy_pipeline()
- Proper nouns' and pronouns' .pos_ attributes are now correctly assigned 'PROPN' and 'PRON'; hence, modified regexes_etc.POS_REGEX_PATTERNS['en'] to include 'PROPN'
- Modified spacy_utils.preserve_case() to check for the language-agnostic 'PROPN' POS tag rather than the English-specific 'NNP' and 'NNPS' tags
- Added text_utils.clean_terms() function for cleaning up a sequence of single- or multi-word strings by stripping leading/trailing junk chars, handling dangling parens and odd hyphenation, etc.
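A rough sketch of the new loading path; the 'en' model name and the keyword on the deprecated call are assumptions, not something this changelog specifies:

```python
import textacy

# new: thin wrapper around the new spacy.load() API
nlp = textacy.load_spacy('en')

# old: still available for now, but emits a DeprecationWarning
# nlp = textacy.data.load_spacy_pipeline(lang='en')

doc = nlp("Bernie Sanders and Hillary Clinton spoke on the Senate floor.")
# proper nouns and pronouns now carry the 'PROPN' and 'PRON' pos_ tags
print([(tok.text, tok.pos_) for tok in doc])
```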
Bugfixes:
- textstats.readability_stats() now correctly gets the number of words in a doc from its generator function (@gryBox #8)
- Removed NLTK dependency, which wasn't actually required
- text_utils.detect_language() now warns via logging rather than a print() statement (see the logging snippet below)
- fileio.write_conll() documentation now correctly indicates that the filename param is not optional
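Because the warning now goes through logging rather than print(), it only shows up if a handler is configured; a minimal setup, assuming detect_language() takes a raw text string (the sample string is arbitrary):

```python
import logging

from textacy import text_utils

# without a handler, logged warnings are easy to miss; this routes them to stderr
logging.basicConfig(level=logging.WARNING)

print(text_utils.detect_language("Ceci n'est pas de l'anglais."))
```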
Changes:
- Added representations subpackage; includes modules for network and vector space model (VSM) document and corpus representations
- Document-term matrix creation now takes documents represented as a list of terms (rather than as spaCy Docs); splits the tokenization step from vectorization for added flexibility
- Some of this functionality was refactored from existing parts of the package
- Added tm (topic modeling) subpackage, with a main TopicModel class for training, applying, persisting, and interpreting NMF, LDA, and LSA topic models through a single interface (see the sketch after this list)
- Various improvements to TextDoc and TextCorpus classes:
  - TextDoc can now be initialized from a spaCy Doc
  - Removed caching from TextDoc, because it was a pain and weird and probably not all that useful
  - extract-based methods are now generators, like the functions they wrap
  - Added .as_semantic_network() and .as_terms_list() methods to TextDoc
  - TextCorpus.from_texts() now takes advantage of multithreading via spaCy, if available, and document metadata can be passed in as a paired iterable of dicts
- Added read/write functions for sparse scipy matrices
- Added read/write functions for sparse scipy matrices
- Added fileio.read.split_content_and_metadata() convenience function for splitting (text) content from associated metadata when reading data from disk into a TextDoc or TextCorpus
- Renamed fileio.read.get_filenames_in_dir() to fileio.read.get_filenames() and added functionality for matching/ignoring files by their names, file extensions, and ignoring invisible files
- Rewrote export.docs_to_gensim(), now significantly faster
- Imports in __init__.py files for main and subpackages now explicit
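A rough end-to-end sketch of the new terms-list -> document-term matrix -> TopicModel flow. Everything beyond the names that appear above (TextCorpus.from_texts(), .as_terms_list(), the tm.TopicModel class) is an assumption: in particular the from_texts() arguments, the representations.vsm.doc_term_matrix() helper, and the TopicModel constructor and method names are illustrative guesses, not verified API:

```python
import textacy
from textacy.tm import TopicModel

texts = [
    "I rise today to speak about health care.",
    "The middle class needs real tax relief.",
    "Climate change is the greatest challenge of our time.",
]
metadatas = [{"speaker": "A"}, {"speaker": "B"}, {"speaker": "A"}]

# multithreaded via spaCy if available; metadata passed as a paired iterable of dicts
corpus = textacy.TextCorpus.from_texts('en', texts, metadata=metadatas)

# tokenization is now separate from vectorization: hand over lists of terms, not spaCy Docs
terms_lists = (doc.as_terms_list() for doc in corpus)
doc_term_matrix, id2term = textacy.representations.vsm.doc_term_matrix(terms_lists)

# one interface for NMF, LDA, and LSA
model = TopicModel('nmf', n_topics=2)
model.fit(doc_term_matrix)
doc_topic_matrix = model.transform(doc_term_matrix)
print(doc_topic_matrix.shape)
```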
Bugfixes:
- textstats.readability_stats() no longer filters out stop words (@henningko #7)
- Wikipedia article processing now recursively removes nested markup
- extract.ngrams() now filters out ngrams with any space-only tokens
- Functions with an include_nps kwarg changed to include_ncs, to match the renaming of the associated function from extract.noun_phrases() to extract.noun_chunks()
Changes:
- Added corpora subpackage with wikipedia.py module; functions for streaming pages from a Wikipedia db dump as plain text or structured data
- Added fileio subpackage with functions for reading/writing content from/to disk in common formats (see the JSON sketch after this list):
  - JSON formats, both standard and streaming-friendly
  - text, optionally compressed
  - spaCy documents to/from binary
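The "streaming-friendly" JSON format here is presumably line-delimited JSON; as a point of reference, a plain-Python sketch of that pattern (this deliberately uses the standard library rather than guessing at textacy's fileio function names, which aren't spelled out above):

```python
import gzip
import json

pages = [
    {"title": "Page one", "text": "Plain text of page one."},
    {"title": "Page two", "text": "Plain text of page two."},
]

# write one JSON object per line, gzip-compressed
with gzip.open("pages.jsonl.gz", "wt", encoding="utf-8") as f:
    for page in pages:
        f.write(json.dumps(page) + "\n")

# read it back lazily, one record at a time
with gzip.open("pages.jsonl.gz", "rt", encoding="utf-8") as f:
    for page in (json.loads(line) for line in f):
        print(page["title"])
```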
Changes:
- Added export.py module for exporting textacy/spaCy objects into "third-party" formats; so far, just gensim and CoNLL-U
- Added compat.py module for Py2/3 compatibility hacks
- Renamed extract.noun_phrases() to extract.noun_chunks() to match spaCy's API (see the example after this list)
- Changed extract functions to generators, rather than returning lists
- Added TextDoc.merge() and spacy_utils.merge_spans() for merging spans into single tokens within a spacy.Doc, using spaCy's recent implementation
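A small example of the renamed extractor, assuming extract.noun_chunks() accepts a parsed spaCy Doc; the loading call shown is the later spacy.load() spelling, while spaCy releases contemporary with this change may have required spacy.en.English() instead:

```python
import spacy
from textacy import extract

nlp = spacy.load('en')  # model name / loading call are assumptions, see note above
doc = nlp("The quick brown fox jumps over the lazy dog.")

noun_chunks = extract.noun_chunks(doc)  # now a generator (formerly extract.noun_phrases())
print(list(noun_chunks))
```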
Bugfixes:
- Whitespace tokens now always filtered out of extract.words() lists
- Some Py2/3 str/unicode issues fixed
- Broken tests in test_extract.py no longer broken