Skip to content

Commit

Permalink
Merge pull request #2 from piskvorky/develop
Browse files Browse the repository at this point in the history
Merging DIffs
  • Loading branch information
Arttii committed Sep 28, 2015
2 parents 0436558 + 022bb30 commit 7f95027
Show file tree
Hide file tree
Showing 130 changed files with 36,374 additions and 1,755 deletions.
4 changes: 4 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -29,6 +29,7 @@
*.pkl
*.bak
*.npy
*.npz

# OS generated files #
######################
Expand All @@ -41,7 +42,10 @@ Thumbs.db
#########
.project
.pydevproject
.ropeproject
.settings/
.eggs
cython_debug
docs/src/_build/
docs/_static
dedan_gensim.tmproj
Expand Down
4 changes: 1 addition & 3 deletions .travis.yml
Original file line number Diff line number Diff line change
@@ -1,3 +1,4 @@
sudo: false
language: python
python:
- "2.6"
Expand All @@ -10,9 +11,6 @@ before_install:
- ./miniconda.sh -b
- export PATH=/home/travis/miniconda/bin:$PATH
- conda update --yes conda
# The next couple lines fix a crash with multiprocessing on Travis and are not specific to using Miniconda
- sudo rm -rf /dev/shm
- sudo ln -s /run/shm /dev/shm
install:
- conda create --yes -n gensim-test python=$TRAVIS_PYTHON_VERSION pip atlas numpy scipy
- source activate gensim-test
Expand Down
104 changes: 103 additions & 1 deletion CHANGELOG.txt
Original file line number Diff line number Diff line change
@@ -1,7 +1,109 @@
Changes
=======

0.10.0rc1

0.12.2

* tutorial on text summarization (Ólavur Mortensen, #436)
* more flexible vocabulary construction in word2vec & doc2vec (Philipp Dowling, #434)
* added support for sliced TransformedCorpus objects, so that after applying (for instance) TfidfModel the returned corpus remains randomly indexable. (Matti Lyra, #425)
* changed the LdaModel.save so that a custom `ignore` list can be passed in (Matti Lyra, #331)
* added support for NumPy style fancy indexing to corpus objects (Matti Lyra, #414)
* py3k fix in distributed LSI (spacecowboy, #433)
* Windows fix for setup.py (#428)
* fix compatibility for scipy 0.16.0 (#415)

0.12.1, 20/07/2015

* improvements to testing, switch to Travis CI containers
* support for loading old word2vec models (<=0.11.1) in 0.12+ (Gordon Mohr, #405)
* various bug fixes to word2vec, doc2vec (Gordon Mohr, #393, #386, #404)
* TextSummatization support for very short texts (Federico Barrios, #390)
* support for word2vec[['word1', 'word2'...]] convenience API calls (Satish Palaniappan, #395)
* MatrixSimilarity supports indexing generator corpora (single pass)

0.12.0, 06/07/2015

* complete API, performance, memory overhaul of doc2vec (Gordon Mohr, #356, #373, #380, #384)
- fast infer_vector(); optional memory-mapped doc vectors; memory savings with int doc IDs
- 'dbow_words' for combined DBOW & word skip-gram training; new 'dm_concat' mode
- multithreading & negative-sampling optimizations (also benefitting word2vec)
- API NOTE: doc vectors must now be accessed/compared through model's 'docvecs' field
(eg: "model.docvecs['my_ID']" or "model.docvecs.most_similar('my_ID')")
- https://github.com/piskvorky/gensim/blob/develop/docs/notebooks/doc2vec-IMDB.ipynb
* new "text summarization" module (PR #324: Federico Lopez, Federico Barrios)
- https://github.com/summanlp/docs/raw/master/articulo/articulo-en.pdf
* new matutils.argsort with partial sort
- performance speedups to all similarity queries (word2vec, Similarity classes...)
* word2vec can compute likelihood scores for classification (Mat Addy, #358)
- http://arxiv.org/abs/1504.07295
- http://nbviewer.ipython.org/github/taddylab/deepir/blob/master/w2v-inversion.ipynb
* word2vec supports "encoding" parameter when loading from C format, for non-utf8 models
* more memory-efficient word2vec training (#385)
* fixes to Python3 compatibility (Pavel Kalaidin #330, S-Eugene #369)
* enhancements to save/load format (Liang Bo Wang #363, Gordon Mohr #356)
- pickle defaults to protocol=2 for better py3 compatibility
* fixes and improvements to wiki parsing (Lukas Elmer #357, Excellent5 #333)
* fix to phrases scoring (Ikuya Yamada, #353)
* speed up of phrases generation (Dave Challis, #349)
* changes to multipass LDA training (Christopher Corley, #298)
* various doc improvements and fixes (Matti Lyra #331, Hongjoo Lee #334)
* fixes and improvements to LDA (Christopher Corley #323)

0.11.0 = 0.11.1 = 0.11.1-1, 10/04/2015

* added "topic ranking" to sort topics by coherence in LdaModel (jtmcmc, #311)
* new fast ShardedCorpus out-of-core corpus (Jan Hajic jr., #284)
* utils.smart_open now uses the smart_open package (#316)
* new wrapper for LDA in Vowpal Wabbit (Dave Challis, #304)
* improvements to the DtmModel wrapper (Yang Han, #272, #277)
* move wrappers for external modeling programs into a submodule (Christopher Corley, #295)
* allow transparent compression of NumPy files in save/load (Christopher Corley, #248)
* save/load methods now accept file handles, in addition to file names (macks22, #292)
* fixes to LdaMulticore on Windows (Feng Mai, #305)
* lots of small fixes & py3k compatibility improvements (Chyi-Kwei Yau, Daniel Nouri, Timothy Emerick, Juarez Bochi, Christopher Corley, Chirag Nagpal, Jan Hajic jr., Flávio Codeço Coelho)
* re-released as 0.11.1 and 0.11.1-1 because of a packaging bug

0.10.3, 17/11/2014

* added streamed phrases = collocation detection (Miguel Cabrera, #258)
* added param for multiple word2vec epochs (sebastienj, #243)
* added doc2vec (=paragraph2vec = extension of word2vec) model (Timothy Emerick, #231)
* initialize word2vec deterministically, for increased experiment reproducibility (KCzar, #240)
* all indexed corpora now allow full Python slicing syntax (Christopher Corley, #246)
* update distributed code for new Pyro4 API and py3k (Michael Brooks, Marco Bonzanini, #255, #249)
* fixes to six module version (Lars Buitinck, #259)
* fixes to setup.py (Maxim Avanov and Christopher Corley, #260, #251)
* ...and lots of minor fixes & updates all around

0.10.2, 18/09/2014

* new parallelized, LdaMulticore implementation (Jan Zikes, #232)
* Dynamic Topic Models (DTM) wrapper (Arttii, #205)
* word2vec compiled from bundled C file at install time: no more pyximport (#233)
* standardize show_/print_topics in LdaMallet (Benjamin Bray, #223)
* add new word2vec multiplicative objective (3CosMul) of Levy & Goldberg (Gordon Mohr, #224)
* preserve case in MALLET wrapper (mcburton, #222)
* support for matrix-valued topic/word prior eta in LdaModel (mjwillson, #208)
* py3k fix to SparseCorpus (Andreas Madsen, #234)
* fix to LowCorpus when switching dictionaries (Christopher Corley, #237)

0.10.1, 22/07/2014

* word2vec: new n_similarity method for comparing two sets of words (François Scharffe, #219)
* make LDA print/show topics parameters consistent with LSI (Bram Vandekerckhove, #201)
* add option for efficient word2vec subsampling (Gordon Mohr, #206)
* fix length calculation for corpora on empty files (Christopher Corley, #209)
* improve file cleanup of unit tests (Christopher Corley)
* more unit tests
* unicode now stored everywhere in gensim internally; accepted input stays either utf8 or unicode
* various fixes to the py3k ported code
* allow any dict-like input in Dictionary.from_corpus (Andreas Madsen)
* error checking improvements to the MALLET wrapper
* ignore non-articles during wiki parsig
* utils.lemmatize now (optionally) ignores stopwords

0.10.0 (aka "PY3K port"), 04/06/2014

* full Python 3 support (targeting 3.3+, #196)
* all internal methods now expect & store unicode, instead of utf8
Expand Down
4 changes: 3 additions & 1 deletion MANIFEST.in
Original file line number Diff line number Diff line change
Expand Up @@ -8,5 +8,7 @@ include COPYING
include COPYING.LESSER
include ez_setup.py
include gensim/models/voidptr.h
include gensim/models/word2vec_inner.c
include gensim/models/word2vec_inner.pyx
include gensim_addons/models/word2vec_inner.pyx
include gensim/models/doc2vec_inner.c
include gensim/models/doc2vec_inner.pyx
34 changes: 23 additions & 11 deletions README.rst
Original file line number Diff line number Diff line change
Expand Up @@ -6,9 +6,9 @@ gensim -- Topic Modelling in Python
|Downloads|_
|License|_

.. |Travis| image:: https://api.travis-ci.org/piskvorky/gensim.png?branch=develop
.. |Downloads| image:: https://pypip.in/d/gensim/badge.png
.. |License| image:: https://pypip.in/license/gensim/badge.png
.. |Travis| image:: https://img.shields.io/travis/piskvorky/gensim/develop.svg
.. |Downloads| image:: https://img.shields.io/pypi/dm/gensim.svg
.. |License| image:: https://img.shields.io/pypi/l/gensim.svg
.. _Travis: https://travis-ci.org/piskvorky/gensim
.. _Downloads: https://pypi.python.org/pypi/gensim
.. _License: http://radimrehurek.com/gensim/about.html
Expand All @@ -19,15 +19,15 @@ Target audience is the *natural language processing* (NLP) and *information retr
Features
---------

* All algorithms are **memory-independent** w.r.t. the corpus size (can process input larger than RAM),
* All algorithms are **memory-independent** w.r.t. the corpus size (can process input larger than RAM, streamed, out-of-core),
* **Intuitive interfaces**

* easy to plug in your own input corpus/datastream (trivial streaming API)
* easy to extend with other Vector Space algorithms (trivial transformation API)

* Efficient implementations of popular algorithms, such as online **Latent Semantic Analysis (LSA/LSI)**,
* Efficient multicore implementations of popular algorithms, such as online **Latent Semantic Analysis (LSA/LSI/SVD)**,
**Latent Dirichlet Allocation (LDA)**, **Random Projections (RP)**, **Hierarchical Dirichlet Process (HDP)** or **word2vec deep learning**.
* **Distributed computing**: can run *Latent Semantic Analysis* and *Latent Dirichlet Allocation* on a cluster of computers, and *word2vec* on multiple cores.
* **Distributed computing**: can run *Latent Semantic Analysis* and *Latent Dirichlet Allocation* on a cluster of computers.
* Extensive `HTML documentation and tutorials <http://radimrehurek.com/gensim/>`_.


Expand All @@ -45,19 +45,26 @@ It is also recommended you install a fast BLAS library before installing NumPy.

The simple way to install `gensim` is::

sudo easy_install gensim
pip install -U gensim

Or, if you have instead downloaded and unzipped the `source tar.gz <http://pypi.python.org/pypi/gensim>`_ package,
you'll need to run::
you'd run::

python setup.py test
sudo python setup.py install
python setup.py install


For alternative modes of installation (without root privileges, development
installation, optional install features), see the `documentation <http://radimrehurek.com/gensim/install.html>`_.

This version has been tested under Python 2.6, 2.7 and 3.3.
This version has been tested under Python 2.6, 2.7, 3.3 and 3.4 (support for Python 2.5 was dropped in gensim 0.10.0; install gensim 0.9.1 if you *must* use Python 2.5). Gensim's github repo is hooked to `Travis CI for automated testing <https://travis-ci.org/piskvorky/gensim>`_ on every commit push and pull request.

How come gensim is so fast and memory efficient? Isn't it pure Python, and isn't Python slow and greedy?
--------------------------------------------------------------------------------------------------------

Many scientific algorithms can be expressed in terms of large matrix operations (see the BLAS note above). Gensim taps into these low-level BLAS libraries, by means of its dependency on NumPy. So while gensim-the-top-level-code is pure Python, it actually executes highly optimized Fortran/C under the hood, including multithreading (if your BLAS is so configured).

Memory-wise, gensim makes heavy use of Python's built-in generators and iterators for streamed data processing. Memory efficiency was one of gensim's `design goals <http://radimrehurek.com/gensim/about.html>`_, and is a central feature of gensim, rather than something bolted on as an afterthought.

Documentation
-------------
Expand All @@ -69,4 +76,9 @@ It is also included in the source distribution package.
----------------

Gensim is open source software released under the `GNU LGPL license <http://www.gnu.org/licenses/lgpl.html>`_.
Copyright (c) 2009-2014 Radim Rehurek
Copyright (c) 2009-now Radim Rehurek

|Analytics|_

.. |Analytics| image:: https://ga-beacon.appspot.com/UA-24066335-5/your-repo/page-name
.. _Analytics: https://github.com/igrigorik/ga-beacon
4 changes: 0 additions & 4 deletions __init__.py

This file was deleted.

Loading

0 comments on commit 7f95027

Please sign in to comment.