Merge branch 'release/v.0.6.4'
evfro committed May 2, 2019

2 parents b10ec32 + 989e1b8 commit ae90f14
Showing 38 changed files with 1,968 additions and 875 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -1,3 +1,4 @@
*.pyc
polara.egg-info/
examples/.ipynb_checkpoints/
.ipynb_checkpoints/
42 changes: 20 additions & 22 deletions README.md
@@ -2,16 +2,15 @@
Polara is the first recommendation framework that allows a deeper analysis of recommender systems' performance, based on the idea of feedback polarity (by analogy with sentiment polarity in NLP).

In addition to the standard question of "how good a recommender system is at recommending relevant items", it allows assessing the ability of a recommender system to **avoid irrelevant recommendations** (and thus be less likely to disappoint a user). You can read more about this idea in the research paper [Fifty Shades of Ratings: How to Benefit from a Negative Feedback in Top-N Recommendations Tasks](http://arxiv.org/abs/1607.04228). The research results can be easily reproduced with this framework; visit the "fixed state" version of the code at https://github.com/Evfro/fifty-shades (there are also many usage examples).

The framework also features an efficient tensor-based implementation of an algorithm proposed in the paper that takes full advantage of the polarity-based formulation. Currently, there is an [online demo](http://coremodel.azurewebsites.net) (for test purposes only) that demonstrates the effect of taking feedback polarity into account.
The framework also features an efficient tensor-based implementation of an algorithm proposed in the paper that takes full advantage of the polarity-based formulation.


## Prerequisites
The current version of Polara supports both Python 2 and Python 3 environments. Future versions are likely to drop Python 2 support to make better use of Python 3 features.

The framework heavily depends on the `Pandas`, `Numpy`, `Scipy` and `Numba` packages. Better performance can be achieved with `mkl` (optional). It's also recommended to use `jupyter notebook` for experimentation. Visualization of results can be done with the help of `matplotlib` and, optionally, `seaborn`. The easiest way to get all of those at once is to use the latest [Anaconda distribution](https://www.continuum.io/downloads).
The framework heavily depends on the `Pandas`, `Numpy`, `Scipy` and `Numba` packages. Better performance can be achieved with `mkl` (optional). It's also recommended to use `jupyter notebook` for experimentation. Visualization of results can be done with the help of `matplotlib`. The easiest way to get all of those at once is to use the latest [Anaconda distribution](https://www.continuum.io/downloads).

If you use a separate `conda` environment for testing, the following command can be used to ensure that all required dependencies are in place (see [this](http://conda.pydata.org/docs/commands/conda-install.html) for more info):
If you use a separate `conda` environment for testing, the following command can be issued to ensure that all required dependencies are in place (see [this](http://conda.pydata.org/docs/commands/conda-install.html) for more info):

`conda install --file conda_req.txt`

@@ -98,30 +97,26 @@ random = RandomModel(data_model)
models = [i2i, svd, popular, random]

metrics = ['ranking', 'relevance'] # metrics for evaluation: NDCG, Precision, Recall, etc.
folds = [1, 2, 3, 4, 5] # use all 5 folds for cross-validation
folds = [1, 2, 3, 4, 5] # use all 5 folds for cross-validation (default)
topk_values = [1, 5, 10, 20, 50] # values of k to experiment with

# run experiment
topk_result = {}
for fold in folds:
data_model.test_fold = fold
topk_result[fold] = ee.topk_test(models, topk_list=topk_values, metrics=metrics)

# rearrange results into a more friendly representation
# this is just a dictionary of Pandas Dataframes
result = ee.consolidate_folds(topk_result, folds, metrics)
result.keys() # outputs ['ranking', 'relevance']
# run 5-fold CV experiment
result = ee.run_cv_experiment(models, folds, metrics,
fold_experiment=ee.topk_test,
topk_list=topk_values)

# calculate average values across all folds, e.g. for relevance metrics
result['relevance'].mean(axis=0).unstack() # use .std instead of .mean for standard deviation
scores = result.mean(axis=0, level=['top-n', 'model']) # use .std instead of .mean for standard deviation
scores.xs('recall', level='metric', axis=1).unstack('model')
```
which results in something like:

| metric/model |item-to-item | SVD | mostpopular | random |
| **model** | **MP** | **PureSVD** | **RND** | **item-to-item** |
| ---: |:---:|:---:|:---:|:---:|
| *precision* | 0.348212 | 0.600066 | 0.411126 | 0.016159 |
| *recall* | 0.147969 | 0.304338 | 0.182472 | 0.005486 |
| *miss_rate* | 0.852031 | 0.695662 | 0.817528 | 0.994514 |
| **top-n** |
| **1** | 0.017828 | 0.079428 | 0.000055 | 0.024673 |
| **5** | 0.086604 | 0.219408 | 0.001104 | 0.126013 |
| **10** | 0.138546 | 0.300658 | 0.001987 | 0.202134 |
| ... | ... | ... | ... | ... |

## Custom pipelines
@@ -137,7 +132,7 @@ Now you are ready to build your models (as in examples above) and export them to

### Warm-start and known-user scenarios
By default polara makes the testset and trainset disjoint by users, which allows evaluating models in the *user warm-start* scenario.
However, in some situations (for example, when polara is used within a larger pipeline) you might want to enforce a strictly *known user* scenario to assess the quality of your recommender system on the unseen (held-out) items of known users. The switch between these two scenarios is controlled by setting the `data_model.warm_start` attribute to `True` or `False`. See the [Warm-start and standard scenarios](examples/Warm-start and standard scenarios.ipynb) Jupyter notebook for an example.
However, in some situations (for example, when polara is used within a larger pipeline) you might want to enforce a strictly *known user* scenario to assess the quality of your recommender system on the unseen (held-out) items of known users. The switch between these two scenarios is controlled by setting the `data_model.warm_start` attribute to `True` or `False`. See the [Warm-start and standard scenarios](examples/Warm_start_and_standard_scenarios.ipynb) Jupyter notebook for an example.
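
A minimal sketch of toggling the scenario, assuming a `data_model` prepared as in the quick-start example above:

```python
# switch between the two evaluation scenarios
data_model.warm_start = True   # testset and trainset are disjoint by users
data_model.update()            # rebuild the data split

data_model.warm_start = False  # evaluate on held-out items of known users
data_model.update()
```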

### Externally provided test data
If you don't want polara to perform data splitting (for example, when your test data is already provided), you can use the `set_test_data` method of a `RecommenderData` instance. It has a number of input arguments that cover all major cases of externally provided data. For example, assuming that you have new users' preferences encoded in the `unseen_data` dataframe and the corresponding held-out preferences in the `holdout` dataframe, the following command allows you to include them in the data model:
@@ -150,4 +145,7 @@ svd.build()
svd.evaluate()
```
In this case the recommendations are generated based on the testset and evaluated against the holdout.
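
For reference, a hedged sketch of such a call; the keyword names follow the description above and should be verified against the current `RecommenderData` API:

```python
# assumed keyword arguments, based on the description above
data_model.set_test_data(testset=unseen_data,  # new users' preferences
                         holdout=holdout,      # their held-out items
                         warm_start=True)      # treat these users as new
```
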
See more usage examples in the [Custom evaluation](examples/Custom evaluation.ipynb) notebook.
See more usage examples in the [Custom evaluation](examples/Custom_evaluation.ipynb) notebook.

### Reproducing others' work
Polara offers even more options to customize the experimentation pipeline and tailor it to specific needs. See, for example, the [Reproducing EIGENREC results](examples/Reproducing_EIGENREC_results.ipynb) notebook to learn how Polara can be used to reproduce experiments from the *"[EIGENREC: generalizing PureSVD for effective and efficient top-N recommendations](https://arxiv.org/abs/1511.06033)"* paper.
2 changes: 1 addition & 1 deletion conda_req.txt
@@ -1,11 +1,11 @@
# This file may be used to create an environment using:
# $ conda create --name <env> --file <this file>

python>=3.6
jupyter>=1.0.0
numba>=0.21.0
numpy>=1.10.1
matplotlib>=1.4.3
pandas>=0.17.1
requests>=2.7.0
scipy>=0.16.0
seaborn>=0.6.0
Original file line number Diff line number Diff line change
@@ -45,7 +45,6 @@
"metadata": {},
"outputs": [],
"source": [
"from __future__ import print_function\n",
"import numpy as np\n",
"from polara.datasets.movielens import get_movielens_data"
]
@@ -178,7 +177,8 @@
"output_type": "stream",
"text": [
"Preparing data...\n",
"Done.\n"
"Done.\n",
"There are 766928 events in the training and 0 events in the holdout.\n"
]
}
],
@@ -466,7 +466,7 @@
"name": "stdout",
"output_type": "stream",
"text": [
"PureSVD training time: 0.0981479023320091s\n"
"PureSVD training time: 0.128s\n"
]
}
],
@@ -563,7 +563,10 @@
{
"data": {
"text/plain": [
"Hits(true_positive=4443, false_positive=1131, true_negative=15870, false_negative=19081)"
"[Relevance(precision=0.4671998676068703, recall=0.18790260795025798, fallout=0.0587545652398142, specifity=0.7075255418074651, miss_rate=0.7362722359391265),\n",
" Ranking(nDCG=0.16791941496631102, nDCL=0.07078245692187013),\n",
" Experience(coverage=0.14883215643671918),\n",
" Hits(true_positive=4443, false_positive=1131, true_negative=15870, false_negative=19081)]"
]
},
"execution_count": 19,
@@ -770,47 +773,47 @@
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>new</th>\n",
" <th>old</th>\n",
" <th>new</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>775</th>\n",
" <td>775</td>\n",
" <td>776</td>\n",
" <td>775</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1965</th>\n",
" <td>1965</td>\n",
" <td>1966</td>\n",
" <td>1965</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4137</th>\n",
" <td>4137</td>\n",
" <td>4138</td>\n",
" <td>4137</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4422</th>\n",
" <td>4422</td>\n",
" <td>4423</td>\n",
" <td>4422</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4746</th>\n",
" <td>4746</td>\n",
" <td>4747</td>\n",
" <td>4746</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" new old\n",
"775 775 776\n",
"1965 1965 1966\n",
"4137 4137 4138\n",
"4422 4422 4423\n",
"4746 4746 4747"
" old new\n",
"775 776 775\n",
"1965 1966 1965\n",
"4137 4138 4137\n",
"4422 4423 4422\n",
"4746 4747 4746"
]
},
"execution_count": 27,
@@ -1102,47 +1105,47 @@
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>new</th>\n",
" <th>old</th>\n",
" <th>new</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>4833</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>4834</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2</td>\n",
" <td>4835</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>3</td>\n",
" <td>4836</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>4</td>\n",
" <td>4837</td>\n",
" <td>4</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" new old\n",
"0 0 4833\n",
"1 1 4834\n",
"2 2 4835\n",
"3 3 4836\n",
"4 4 4837"
" old new\n",
"0 4833 0\n",
"1 4834 1\n",
"2 4835 2\n",
"3 4836 3\n",
"4 4837 4"
]
},
"execution_count": 35,
@@ -1180,47 +1183,47 @@
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>new</th>\n",
" <th>old</th>\n",
" <th>new</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2</td>\n",
" <td>3</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>3</td>\n",
" <td>4</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>4</td>\n",
" <td>5</td>\n",
" <td>4</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" new old\n",
"0 0 1\n",
"1 1 2\n",
"2 2 3\n",
"3 3 4\n",
"4 4 5"
" old new\n",
"0 1 0\n",
"1 2 1\n",
"2 3 2\n",
"3 4 3\n",
"4 5 4"
]
},
"execution_count": 36,
@@ -1603,7 +1606,10 @@
{
"data": {
"text/plain": [
"Hits(true_positive=1063, false_positive=245, true_negative=3505, false_negative=4666)"
"[Relevance(precision=0.48771352650892064, recall=0.1962177724147278, fallout=0.05871125477406844, specifity=0.6808813050133541, miss_rate=0.741780456106087),\n",
" Ranking(nDCG=0.17149113058615703, nDCL=0.06967069219097612),\n",
" Experience(coverage=0.10999456816947312),\n",
" Hits(true_positive=1063, false_positive=245, true_negative=3505, false_negative=4666)]"
]
},
"execution_count": 41,
@@ -1655,6 +1661,13 @@
"source": [
"svd.evaluate('relevance')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
637 changes: 292 additions & 345 deletions examples/Example_ML1M.ipynb

Large diffs are not rendered by default.


Original file line number Diff line number Diff line change
@@ -151,15 +151,15 @@
{
"data": {
"text/plain": [
"{'holdout_size': 3,\n",
" 'negative_prediction': False,\n",
" 'permute_tops': False,\n",
" 'random_holdout': False,\n",
"{'permute_tops': False,\n",
" 'warm_start': True,\n",
" 'holdout_size': 3,\n",
" 'test_sample': None,\n",
" 'shuffle_data': False,\n",
" 'random_holdout': False,\n",
" 'negative_prediction': False,\n",
" 'test_fold': 5,\n",
" 'test_ratio': 0.2,\n",
" 'test_sample': None,\n",
" 'warm_start': True}"
" 'test_ratio': 0.2}"
]
},
"execution_count": 4,
@@ -189,7 +189,8 @@
"output_type": "stream",
"text": [
"Preparing data...\n",
"Done.\n"
"Done.\n",
"There are 996585 events in the training and 3624 events in the holdout.\n"
]
}
],
@@ -251,13 +252,16 @@
"name": "stdout",
"output_type": "stream",
"text": [
"PureSVD training time: 0.12987111022293263s\n"
"PureSVD training time: 0.149s\n"
]
},
{
"data": {
"text/plain": [
"Hits(true_positive=512, false_positive=112, true_negative=1164, false_negative=1836)"
"[Relevance(precision=0.34864790286975716, recall=0.2008830022075055, fallout=0.05601545253863134, specifity=0.6227924944812362, miss_rate=0.7287527593818984),\n",
" Ranking(nDCG=0.1426077960282924, nDCL=0.04915993850533),\n",
" Experience(coverage=0.12169454937938479),\n",
" Hits(true_positive=512, false_positive=112, true_negative=1164, false_negative=1836)]"
]
},
"execution_count": 8,
@@ -294,17 +298,27 @@
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"WARNING:root:Intel MKL BLAS detected. Its highly recommend to set the environment variable 'export MKL_NUM_THREADS=1' to disable its internal multithreading\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"iALS training time: 1.5761851485s\n"
"iALS training time: 1.837s\n"
]
},
{
"data": {
"text/plain": [
"Hits(true_positive=514, false_positive=116, true_negative=1160, false_negative=1834)"
"[Relevance(precision=0.34864790286975716, recall=0.2015728476821192, fallout=0.06084437086092715, specifity=0.6179635761589404, miss_rate=0.7280629139072847),\n",
" Ranking(nDCG=0.14200497688853128, nDCL=0.055115784933481085),\n",
" Experience(coverage=0.14813815434430652),\n",
" Hits(true_positive=514, false_positive=118, true_negative=1158, false_negative=1834)]"
]
},
"execution_count": 10,
@@ -388,7 +402,7 @@
{
"data": {
"text/plain": [
"Relevance(precision=0.3536147902869757, recall=0.20516004415011035, fallout=0.06070640176600441, specifity=0.6181015452538631, miss_rate=0.7244757174392936)"
"Relevance(precision=0.34864790286975716, recall=0.2015728476821192, fallout=0.06084437086092715, specifity=0.6179635761589404, miss_rate=0.7280629139072847)"
]
},
"execution_count": 13,
@@ -426,9 +440,10 @@
"Preparing data...\n",
"19 unique movieid's within 26 testset interactions were filtered. Reason: not in the training data.\n",
"1 unique movieid's within 1 holdout interactions were filtered. Reason: not in the training data.\n",
"1 of 1208 userid's were filtered out from holdout. Reason: not enough items.\n",
"1 of 1208 userid's were filtered out from holdout. Reason: incompatible number of items.\n",
"1 userid's were filtered out from testset. Reason: inconsistent with holdout.\n",
"Done.\n"
"Done.\n",
"There are 807458 events in the training and 3621 events in the holdout.\n"
]
}
],
@@ -489,13 +504,16 @@
"name": "stdout",
"output_type": "stream",
"text": [
"PureSVD training time: 0.10033848882716256s\n"
"PureSVD training time: 0.107s\n"
]
},
{
"data": {
"text/plain": [
"Hits(true_positive=515, false_positive=111, true_negative=1164, false_negative=1831)"
"[Relevance(precision=0.34907484120408727, recall=0.2020160176746755, fallout=0.05661419497376415, specifity=0.6219276442971554, miss_rate=0.7275614471140569),\n",
" Ranking(nDCG=0.1425979962160517, nDCL=0.04954089721828449),\n",
" Experience(coverage=0.12042310821806347),\n",
" Hits(true_positive=515, false_positive=111, true_negative=1164, false_negative=1831)]"
]
},
"execution_count": 16,
@@ -528,18 +546,28 @@
"execution_count": 17,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"WARNING:root:Intel MKL BLAS detected. Its highly recommend to set the environment variable 'export MKL_NUM_THREADS=1' to disable its internal multithreading\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"iALS model is not ready. Rebuilding.\n",
"iALS training time: 1.25310956069s\n"
"iALS training time: 1.615s\n"
]
},
{
"data": {
"text/plain": [
"Hits(true_positive=509, false_positive=117, true_negative=1158, false_negative=1837)"
"[Relevance(precision=0.35183650925158794, recall=0.19994476663904998, fallout=0.05979011322838994, specifity=0.6187517260425297, miss_rate=0.7296326981496823),\n",
" Ranking(nDCG=0.1389285485663426, nDCL=0.05443458388828408),\n",
" Experience(coverage=0.14591809058855437),\n",
" Hits(true_positive=510, false_positive=117, true_negative=1158, false_negative=1836)]"
]
},
"execution_count": 17,
@@ -593,7 +621,7 @@
{
"data": {
"text/plain": [
"Relevance(precision=0.34907484120408727, recall=0.2020160176746755, fallout=0.056614194973764152, specifity=0.62192764429715541, miss_rate=0.7275614471140569)"
"Relevance(precision=0.34907484120408727, recall=0.2020160176746755, fallout=0.05661419497376415, specifity=0.6219276442971554, miss_rate=0.7275614471140569)"
]
},
"execution_count": 19,
@@ -613,7 +641,7 @@
{
"data": {
"text/plain": [
"Relevance(precision=0.34727975697321178, recall=0.19884009942004971, fallout=0.059375863021264838, specifity=0.61916597624965475, miss_rate=0.73073736536868272)"
"Relevance(precision=0.35183650925158794, recall=0.19994476663904998, fallout=0.05979011322838994, specifity=0.6187517260425297, miss_rate=0.7296326981496823)"
]
},
"execution_count": 20,
16 changes: 10 additions & 6 deletions polara/__init__.py
@@ -1,12 +1,16 @@
# import standard baseline models
from polara.recommender.models import RecommenderModel
from polara.recommender.models import SVDModel
from polara.recommender.models import CooccurrenceModel
from polara.recommender.models import RandomModel
from polara.recommender.models import PopularityModel
from polara.recommender.models import (RecommenderModel,
SVDModel,
ScaledSVD,
CooccurrenceModel,
RandomModel,
PopularityModel)

# import data model
from polara.recommender.data import RecommenderData

# import data management routines
from polara.datasets.movielens import get_movielens_data
from polara.datasets.bookcrossing import get_bx_data
from polara.datasets.bookcrossing import get_bookcrossing_data
from polara.datasets.netflix import get_netflix_data
from polara.datasets.amazon import get_amazon_data
25 changes: 25 additions & 0 deletions polara/datasets/amazon.py
@@ -0,0 +1,25 @@
from ast import literal_eval
import gzip
import pandas as pd


def parse_meta(path):
with gzip.open(path, 'rt') as gz:
for line in gz:
yield literal_eval(line)


def get_amazon_data(path=None, meta_path=None, nrows=None):
res = []
if path:
data = pd.read_csv(path, header=None,
names=['userid', 'asin', 'rating', 'timestamp'],
usecols=['userid', 'asin', 'rating'],
nrows=nrows)
res.append(data)
if meta_path:
meta = pd.DataFrame.from_records(parse_meta(meta_path), nrows=nrows)
res.append(meta)
if len(res) == 1:
res = res[0]
return res
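
A short usage sketch for the new Amazon loader. File names are hypothetical; the ratings file is a headerless CSV and the metadata file is a gzipped dump with one Python-literal record per line, as `parse_meta` expects:

```python
from polara.datasets.amazon import get_amazon_data

ratings = get_amazon_data('ratings_Electronics.csv')
ratings, meta = get_amazon_data('ratings_Electronics.csv',
                                meta_path='meta_Electronics.json.gz',
                                nrows=100000)
print(ratings.columns.tolist())  # ['userid', 'asin', 'rating']
```
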
2 changes: 1 addition & 1 deletion polara/datasets/bookcrossing.py
@@ -7,7 +7,7 @@
from zipfile import ZipFile


def get_bx_data(local_file=None, get_ratings=True, get_users=False, get_books=False):
def get_bookcrossing_data(local_file=None, get_ratings=True, get_users=False, get_books=False):
if not local_file:
# downloading data
from requests import get
51 changes: 51 additions & 0 deletions polara/datasets/epinions.py
@@ -0,0 +1,51 @@
import numpy as np
import scipy as sp
import pandas as pd


def compute_graph_laplacian(edges, index):
all_edges = set()
for a, b in edges:
try:
a = index.get_loc(a)
b = index.get_loc(b)
except KeyError:
continue
if a == b: # exclude self links
continue
# make graph undirectional
all_edges.add((a, b))
all_edges.add((b, a))

sp_edges = sp.sparse.csr_matrix((np.ones(len(all_edges)), zip(*all_edges)))
assert (sp_edges.diagonal() == 0).all()
return sp.sparse.csgraph.laplacian(sp_edges).tocsr(), sp_edges


def get_epinions_data(ratings_path=None, trust_data_path=None):
res = []
if ratings_path:
ratings = pd.read_csv(ratings_path,
delim_whitespace=True,
skiprows=[0],
skipfooter=1,
engine='python',
header=None,
skipinitialspace=True,
names=['user', 'film', 'rating'],
usecols=['user', 'film', 'rating'])
res.append(ratings)

if trust_data_path:
edges = pd.read_table(trust_data_path,
delim_whitespace=True,
skiprows=[0],
skipfooter=1,
engine='python',
header=None,
skipinitialspace=True,
usecols=[0, 1])
res.append(edges)

if len(res)==1: res = res[0]
return res
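
A usage sketch for the new Epinions helpers (file paths are hypothetical). `compute_graph_laplacian` expects an iterable of node pairs plus a pandas index mapping node ids to matrix positions:

```python
import pandas as pd
from polara.datasets.epinions import get_epinions_data, compute_graph_laplacian

ratings = get_epinions_data(ratings_path='ratings_data.txt')
trust_edges = get_epinions_data(trust_data_path='trust_data.txt')

user_index = pd.Index(ratings['user'].unique())
laplacian, adjacency = compute_graph_laplacian(trust_edges.values, user_index)
```
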
12 changes: 8 additions & 4 deletions polara/datasets/movielens.py
@@ -81,10 +81,14 @@ def get_movielens_data(local_file=None, get_ratings=True, get_genres=False,


def get_split_genres(genres_data):
genres_data.index.name = 'movie_idx'
genres_stacked = genres_data.genres.str.split('|', expand=True).stack().to_frame('genreid')
ml_genres = genres_data[['movieid', 'movienm']].join(genres_stacked).reset_index(drop=True)
return ml_genres
return (genres_data[['movieid', 'movienm']]
.join(pd.DataFrame([(i, x)
for i, g in enumerate(genres_data['genres'])
for x in g.split('|')
], columns=['index', 'genreid']
).set_index('index'))
.reset_index(drop=True))



def filter_short_head(data, threshold=0.01):
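
The rewritten `get_split_genres` melts the pipe-separated genre strings into one (movie, genre) pair per row; a small illustration with made-up data:

```python
import pandas as pd
from polara.datasets.movielens import get_split_genres

genres_data = pd.DataFrame({'movieid': [1, 2],
                            'movienm': ['Toy Story', 'Jumanji'],
                            'genres': ['Animation|Comedy', 'Adventure']})
ml_genres = get_split_genres(genres_data)  # 3 rows, one genre per row
```
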
53 changes: 38 additions & 15 deletions polara/datasets/netflix.py
@@ -2,22 +2,45 @@
import tarfile


def get_netflix_data(gz_file):
def get_netflix_data(gz_file, get_ratings=True, get_probe=False):
movie_data = []
movie_name = []
movie_inds = []
with tarfile.open(gz_file) as tar:
training_data = tar.getmember('download/training_set.tar')
with tarfile.open(fileobj=tar.extractfile(training_data)) as inner:
for item in inner.getmembers():
if item.isfile():
f = inner.extractfile(item.name)
df = pd.read_csv(f)
movieid = df.columns[0]
movie_name.append(movieid)
movie_data.append(df[movieid])
if get_ratings:
training_data = tar.getmember('download/training_set.tar')
# maybe try with threads, e.g.
# https://stackoverflow.com/questions/43727520/speed-up-json-to-dataframe-w-a-lot-of-data-manipulation
with tarfile.open(fileobj=tar.extractfile(training_data)) as inner:
for item in inner.getmembers():
if item.isfile():
f = inner.extractfile(item.name)
df = pd.read_csv(f)
movieid = df.columns[0]
movie_inds.append(int(movieid[:-1]))
movie_data.append(df[movieid])

data = pd.concat(movie_data, keys=movie_name)
data = data.reset_index().iloc[:, :3].rename(columns={'level_0': 'movieid',
'level_1': 'userid',
'level_2': 'rating'})
if get_probe:
probe_data = tar.getmember('download/probe.txt')
probe_file = tar.extractfile(probe_data)
probe = []
for line in probe_file:
line = line.strip()
if line.endswith(b':'):
movieid = int(line[:-1])
else:
userid = int(line)
probe.append((movieid, userid))

data = None
if movie_data:
data = pd.concat(movie_data, keys=movie_inds)
data = data.reset_index().iloc[:, :3].rename(columns={'level_0': 'movieid',
'level_1': 'userid',
'level_2': 'rating'})
if get_probe:
probe = pd.DataFrame.from_records(probe, columns=['movieid', 'userid'])
if data is not None:
data = (data, probe)
else:
data = probe
return data
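
With the new flags the loader can return the ratings, the probe subset, or both; the archive path is hypothetical:

```python
from polara.datasets.netflix import get_netflix_data

training = get_netflix_data('nf_prize_dataset.tar.gz')  # ratings only, as before
training, probe = get_netflix_data('nf_prize_dataset.tar.gz', get_probe=True)
probe_only = get_netflix_data('nf_prize_dataset.tar.gz',
                              get_ratings=False, get_probe=True)
```
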
113 changes: 56 additions & 57 deletions polara/evaluation/evaluation_engine.py
@@ -12,6 +12,9 @@
def sample_ci(df, coef=2.776, level=None): # 95% CI for sample under Student's t-test
# http://www.stat.yale.edu/Courses/1997-98/101/confint.htm
# example from http://onlinestatbook.com/2/estimation/mean.html
if isinstance(level, str):
level = df.index.names.index(level)

nlevels = df.index.nlevels
if (nlevels == 1) & (level is None):
n = df.shape[0]
@@ -46,16 +49,22 @@ def average_results(scores):
return averaged, errors


def evaluate_models(models, metrics, topk=None):
metric_scores = []
for metric in metrics:
model_scores = []
for model in models:
# print('model {}'.format(model.method))
scores = model.evaluate(method=metric, topk=topk)
model_scores.append(scores)
metric_scores.append(pd.DataFrame(model_scores, index=[model.method for model in models]).T)
return metric_scores
def evaluate_models(models, metrics, **kwargs):
scores = []
for model in models:
model_scores = model.evaluate(metric_type=metrics, **kwargs)
# ensure correct format
model_scores = model_scores if isinstance(model_scores, list) else [model_scores]
# concatenate all scores
name = [model.method]
metric_types = [s.__class__.__name__.lower() for s in model_scores]
scores_df = pd.concat([pd.DataFrame([s], index=name) for s in model_scores],
keys=metric_types, axis=1)
scores.append(scores_df)
res = pd.concat(scores, axis=0)
res.columns.names = ['type', 'metric']
res.index.names = ['model']
return res


def set_topk(models, topk):
@@ -69,69 +78,59 @@ def build_models(models, force=True):
model.build()


def consolidate(scores, params, metrics):
res = {}
for i, metric in enumerate(metrics):
res[metric] = pd.concat([scores[j][i] for j in range(len(params))],
keys=params).unstack().swaplevel(0, 1, 1).sort_index()
return res


def consolidate_folds(scores, folds, metrics, index_names=['fold', 'top-n']):
res = {}
for metric in metrics:
data = pd.concat([scores[j][metric] for j in folds], keys=folds)
data.index.names = index_names
res[metric] = data
return res


def holdout_test_pair(model1, model2, holdout_sizes=[1], metrics=['hits']):
holdout_scores = []
models = [model1, model2]

data1 = model1.data
data2 = model2.data
for i in holdout_sizes:
print(i, end=' ')
data1.holdout_size = i
data1.update()
data2.holdout_size = i
data2.update()

metric_scores = evaluate_models(models, metrics)
holdout_scores.append(metric_scores)

return consolidate(holdout_scores, holdout_sizes, metrics)
def consolidate(scores, level_name=None, level_keys=None):
level_names = [level_name] + scores[0].index.names
return pd.concat(scores, axis=0, keys=level_keys, names=level_names)


def holdout_test(models, holdout_sizes=[1], metrics=['hits'], force_build=True):
def holdout_test(models, holdout_sizes=[1], metrics='all'):
holdout_scores = []
data = models[0].data
assert all([model.data is data for model in models[1:]]) #check that data is shared across models

build_models(models, force_build)
for i in holdout_sizes:
data.holdout_size = i
data.update()

metric_scores = evaluate_models(models, metrics)
holdout_scores.append(metric_scores)

return consolidate(holdout_scores, holdout_sizes, metrics)
return consolidate(holdout_scores, level_name='hsize', level_keys=holdout_sizes)


def topk_test(models, topk_list=[10], metrics=['hits'], force_build=True):
def topk_test(models, **kwargs):
metrics = kwargs.pop('metrics', None) or 'all'
topk_list = kwargs.pop('topk_list', [10])
topk_scores = []
data = models[0].data
assert all([model.data is data for model in models[1:]]) #check that data is shared across models
assert all([model.data is data for model in models[1:]]) # check that data is shared across models

data.update()
topk_list = list(reversed(sorted(topk_list))) #start from max topk and rollback
topk_list_sorted = list(reversed(sorted(topk_list))) # start from max topk and rollback

build_models(models, force_build)
for topk in topk_list:
metric_scores = evaluate_models(models, metrics, topk)
for topk in topk_list_sorted:
kwargs['topk'] = topk
metric_scores = evaluate_models(models, metrics, **kwargs)
topk_scores.append(metric_scores)

return consolidate(topk_scores, topk_list, metrics)
level_name = 'top-n'
res = consolidate(topk_scores, level_name=level_name, level_keys=topk_list_sorted)
return res.sort_index(level=level_name, sort_remaining=False)


def run_cv_experiment(models, folds=None, metrics='all', fold_experiment=evaluate_models,
force_build=True, iterator=lambda x: x, **kwargs):
if not isinstance(models, (list, tuple)):
models = [models]

data = models[0].data
assert all([model.data is data for model in models[1:]]) # check that data is shared across models

if folds is None:
folds = range(1, int(1/data.test_ratio) + 1)

fold_results = []
for fold in iterator(folds):
data.test_fold = fold
data.update()
build_models(models, force_build)
fold_result = fold_experiment(models, metrics=metrics, **kwargs)
fold_results.append(fold_result)
return consolidate(fold_results, level_name='fold', level_keys=folds)
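
The new `run_cv_experiment` replaces the manual fold loop and the `consolidate_folds` post-processing from previous versions. A sketch of the intended call, mirroring the updated README example (`models` is assumed to be a list of recommenders sharing one data model, as the function itself asserts):

```python
from polara.evaluation import evaluation_engine as ee

result = ee.run_cv_experiment(models,
                              folds=[1, 2, 3, 4, 5],
                              metrics=['ranking', 'relevance'],
                              fold_experiment=ee.topk_test,
                              topk_list=[1, 5, 10, 20, 50])
# a single DataFrame indexed by (fold, top-n, model) comes back;
# average over folds (use .std for the spread)
scores = result.mean(axis=0, level=['top-n', 'model'])
```
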
168 changes: 158 additions & 10 deletions polara/evaluation/pipelines.py
@@ -2,8 +2,8 @@

from operator import mul as mul_op
from functools import reduce
from itertools import product
from random import choice
import pandas as pd


def random_chooser():
@@ -12,28 +12,176 @@ def random_chooser():
yield choice(values)


def random_grid(params, n=60, grid_cache=None):
def random_grid(params, n=60, grid_cache=None, skip_config=None):
if not isinstance(n, int):
raise TypeError('n must be an integer, not {}'.format(type(n)))
if n < 0:
raise ValueError('n should be >= 0')

grid = grid_cache or set()
max_n = reduce(mul_op, [len(vals) for vals in params.values()])
# fix names and order of parameters
param_names, param_values = zip(*params.items())
grid = set(grid_cache) if grid_cache is not None else set()
max_n = reduce(mul_op, [len(vals) for vals in param_values])
n = min(n if n > 0 else max_n, max_n)

skipped = set()
if skip_config is None:
def never_skip(config): return False
skip_config = never_skip

param_chooser = random_chooser()
try:
while len(grid) < n:
while len(grid) < (n-len(skipped)):
level_choice = []
for v in params.values():
for param_val in param_values:
next(param_chooser)
level_choice.append(param_chooser.send(v))
grid.add(tuple(level_choice))
level_choice.append(param_chooser.send(param_val))
level_choice = tuple(level_choice)
if skip_config(level_choice):
skipped.add(level_choice)
continue
grid.add(level_choice)
except KeyboardInterrupt:
print('Interrupted by user. Providing current results.')
return grid
return grid, param_names


def set_config(model, attributes, values):
for name, value in zip(attributes, values):
setattr(model, name, value)


def evaluate_models(models, target_metric='precision', metric_type='all', **kwargs):
if not isinstance(models, (list, tuple)):
models = [models]

model_scores = {}
for model in models:
scores = model.evaluate(metric_type, **kwargs)
scores = [scores] if not isinstance(scores, list) else scores
scores_df = pd.concat([pd.DataFrame([s]) for s in scores], axis=1)
if isinstance(target_metric, str):
model_scores[model.method] = scores_df[target_metric].squeeze()
elif callable(target_metric):
model_scores[model.method] = scores_df.apply(target_metric, axis=1).squeeze()
else:
raise NotImplementedError
return model_scores


def find_optimal_svd_rank(model, ranks, target_metric, return_scores=False,
protect_factors=True, config=None, verbose=False,
evaluator=None, iterator=lambda x: x, **kwargs):
evaluator = evaluator or evaluate_models
model_verbose = model.verbose
if config:
set_config(model, *zip(*config.items()))

model.rank = svd_rank = max(max(ranks), model.rank)
if not model._is_ready:
model.verbose = verbose
model.build()

if protect_factors:
svd_factors = dict(**model.factors) # avoid accidental overwrites

res = {}
try:
for rank in iterator(list(reversed(sorted(ranks)))):
model.rank = rank
res[rank] = evaluator(model, target_metric, **kwargs)[model.method]
# prevent previous scores caching when assigning svd_rank
model._recommendations = None
finally:
if protect_factors:
model._rank = svd_rank
model.factors = svd_factors
model.verbose = model_verbose

scores = pd.Series(res)
best_rank = scores.idxmax()
if return_scores:
scores.index.name = 'rank'
scores.name = model.method
return best_rank, scores.loc[ranks]
return best_rank


def find_optimal_tucker_ranks(model, tucker_ranks, target_metric, return_scores=False,
config=None, verbose=False, same_space=False,
evaluator=None, iterator=lambda x: x, **kwargs):
evaluator = evaluator or evaluate_models
model_verbose = model.verbose
if config:
set_config(model, *zip(*config.items()))

model.mlrank = tuple([max(mode_ranks) for mode_ranks in tucker_ranks])

if not model._is_ready:
model.verbose = verbose
model.build()

factors = dict(**model.factors)
tucker_rank = model.mlrank

res_score = {}
for r1 in iterator(tucker_ranks[0]):
for r2 in tucker_ranks[1]:
if same_space and (r2 != r1):
continue
for r3 in tucker_ranks[2]:
if (r1*r2 < r3) or (r1*r3 < r2) or (r2*r3 < r1):
continue
try:
model.mlrank = mlrank = (r1, r2, r3)
res_score[mlrank] = evaluator(model, target_metric, **kwargs)[model.method]
# prevent previous scores caching when assigning tucker_rank
model._recommendations = None
finally:
model._mlrank = tucker_rank
model.factors = dict(**factors)
model.verbose = model_verbose

scores = pd.Series(res_score).sort_index()
best_mlrank = scores.idxmax()
if return_scores:
scores.index.names = ['r1', 'r2', 'r3']
scores.name = model.method
return best_mlrank, scores
return best_mlrank


def find_optimal_config(model, param_grid, param_names, target_metric, return_scores=False,
init_config=None, reset_config=None, verbose=False, force_build=True,
evaluator=None, iterator=lambda x: x, **kwargs):
evaluator = evaluator or evaluate_models
model_verbose = model.verbose
if init_config:
set_config(model, *zip(*init_config.items()))

model.verbose = verbose
grid_results = {}
for params in iterator(param_grid):
try:
set_config(model, param_names, params)
if not model._is_ready or force_build:
model.build()
grid_results[params] = evaluator(model, target_metric, **kwargs)[model.method]
finally:
if reset_config is not None:
if isinstance(reset_config, dict):
set_config(model, *zip(*reset_config.items()))
elif callable(reset_config):
reset_config(model)
else:
raise NotImplementedError

model.verbose = model_verbose
# workaround non-orderable configs (otherwise pandas raises error)
scores = pd.Series(**dict(zip(('index', 'data'),
(zip(*grid_results.items())))))
best_config = scores.idxmax()
if return_scores:
scores.index.names = param_names
scores.name = model.method
return best_config, scores
return best_config
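
A hedged sketch of the extended tuning helpers; `model` and `svd_model` stand for already-built recommenders sharing a prepared data model, and the parameter names are illustrative model attributes:

```python
from polara.evaluation.pipelines import (random_grid, set_config,
                                         find_optimal_config,
                                         find_optimal_svd_rank)

params = {'rank': [10, 20, 30, 40], 'scaling': [0.2, 0.4, 0.6, 0.8]}
grid, param_names = random_grid(params, n=8)  # now also returns parameter names

best_config, scores = find_optimal_config(model, grid, param_names,
                                          target_metric='precision',
                                          return_scores=True)
set_config(model, param_names, best_config)

# SVD ranks are scanned cheaply by truncating the factors of a single model
best_rank = find_optimal_svd_rank(svd_model, [10, 20, 30], 'precision')
```
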
20 changes: 15 additions & 5 deletions polara/evaluation/plotting.py
@@ -8,6 +8,9 @@ def _plot_pair(scores, keys, titles=None, errors=None, err_alpha=0.2, figsize=(1
else:
show_legend = False

if 'model' in scores.index.names:
scores = scores.unstack('model')

left, right = keys
left_title, right_title = titles or keys

@@ -18,6 +21,8 @@ def _plot_pair(scores, keys, titles=None, errors=None, err_alpha=0.2, figsize=(1
plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))

if errors is not None:
if 'model' in errors.index.names:
errors = errors.unstack('model')
errG = errors[left]
errL = errors[right]
for method in errL.columns:
@@ -39,15 +44,15 @@ def _plot_pair(scores, keys, titles=None, errors=None, err_alpha=0.2, figsize=(1


def show_hits(all_scores, **kwargs):
scores = all_scores['hits']
scores = all_scores['hits'] if 'hits' in all_scores else all_scores
keys = ['true_positive', 'false_positive']
kwargs['titles'] = ['True Positive Hits @$n$', 'False Positive Hits @$n$']
kwargs['errors'] = kwargs['errors']['hits'] if kwargs.get('errors', None) is not None else None
_plot_pair(scores, keys, **kwargs)


def show_ranking(all_scores, **kwargs):
scores = all_scores['ranking']
scores = all_scores['ranking'] if 'ranking' in all_scores else all_scores
keys = ['nDCG', 'nDCL']
kwargs['titles'] = ['nDCG@$n$', 'nDCL@$n$']
kwargs['errors'] = kwargs['errors']['ranking'] if kwargs.get('errors', None) is not None else None
@@ -62,6 +67,9 @@ def _cross_plot(scores, keys, titles=None, errors=None, err_alpha=0.2, ROC_middl
else:
show_legend = False

if 'model' in scores.index.names:
scores = scores.unstack('model')

methods = scores.columns.levels[1]
x, y = keys
for method in methods:
@@ -72,6 +80,8 @@ def _cross_plot(scores, keys, titles=None, errors=None, err_alpha=0.2, ROC_middl
plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))

if errors is not None:
if 'model' in errors.index.names:
errors = errors.unstack('model')
for method in methods:
plot_data = scores.xs(method, 1, 1).sort_values(x)
error = errors.xs(method, 1, 1).sort_values(x)
@@ -97,7 +107,7 @@ def _cross_plot(scores, keys, titles=None, errors=None, err_alpha=0.2, ROC_middl


def show_hit_rates(all_scores, **kwargs):
scores = all_scores['relevance']
scores = all_scores['relevance'] if 'relevance' in all_scores else all_scores
keys = ['fallout', 'recall']
kwargs['titles'] = ['False Positive Rate', 'True Positive Rate']
kwargs['errors'] = kwargs['errors']['relevance'] if kwargs.get('errors', None) is not None else None
@@ -107,7 +117,7 @@ def show_hit_rates(all_scores, **kwargs):


def show_ranking_positivity(all_scores, **kwargs):
scores = all_scores['ranking']
scores = all_scores['ranking'] if 'ranking' in all_scores else all_scores
keys = ['nDCL', 'nDCG']
kwargs['titles'] = ['Negative Ranking', 'Positive Ranking']
kwargs['errors'] = kwargs['errors']['ranking'] if kwargs.get('errors', None) is not None else None
@@ -117,7 +127,7 @@ def show_ranking_positivity(all_scores, **kwargs):


def show_precision_recall(all_scores, limit=False, ignore_field_limit=None, **kwargs):
scores = all_scores['relevance']
scores = all_scores['relevance'] if 'relevance' in all_scores else all_scores
keys = ['recall', 'precision']
kwargs['titles'] = ['Recall', 'Precision']
kwargs['errors'] = kwargs['errors']['relevance'] if kwargs.get('errors', None) is not None else None
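
The plotting helpers now accept either the old dict-of-DataFrames or the consolidated DataFrame with a `model` index level, which they unstack internally. A hedged sketch, reusing `result` from the cross-validation example in the README:

```python
from polara.evaluation import plotting

# average the 'relevance' block over folds, keeping the (top-n, model) levels
averaged = result['relevance'].mean(axis=0, level=['top-n', 'model'])
plotting.show_hit_rates(averaged)
```
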
264 changes: 256 additions & 8 deletions polara/lib/optimize.py
@@ -1,8 +1,13 @@
from math import sqrt
import numpy as np
from numba import njit
from numba import jit, njit, prange

from polara.tools.timing import track_time


@njit(nogil=True)
def sgd_step(users_idx, items_idx, feedbacks, P, Q, eta, lambd):
def mf_sgd_sweep(users_idx, items_idx, feedbacks, P, Q, eta, lambd, *args,
adjust_gradient, adjustment_params):
cum_error = 0
for k, a in enumerate(feedbacks):
i = users_idx[k]
@@ -11,19 +16,25 @@ def sgd_step(users_idx, items_idx, feedbacks, P, Q, eta, lambd):
pi = P[i, :]
qj = Q[j, :]

e = a - np.dot(pi, qj)
err = a - pi @ qj

new_pi = pi + eta * (e*qj - lambd*pi)
new_qj = qj + eta * (e*pi - lambd*qj)
ngrad_p = err*qj - lambd*pi
adjusted_ngrad_p = adjust_gradient(ngrad_p, i, *adjustment_params[0])
new_pi = pi + eta * adjusted_ngrad_p

ngrad_q = err*pi - lambd*qj
adjusted_ngrad_q = adjust_gradient(ngrad_q, j, *adjustment_params[1])
new_qj = qj + eta * adjusted_ngrad_q

P[i, :] = new_pi
Q[j, :] = new_qj

cum_error += e*e
cum_error += err*err
return cum_error

@njit(nogil=True)
def sgd_step_biased(users_idx, items_idx, feedbacks, P, Q, b_user, b_item, mu, eta, lambd):
def mf_sgd_sweep_biased(users_idx, items_idx, feedbacks, P, Q, eta, lambd,
b_user, b_item, mu, *args):
cum_error = 0
for k, a in enumerate(feedbacks):
i = users_idx[k]
@@ -34,7 +45,7 @@ def sgd_step_biased(users_idx, items_idx, feedbacks, P, Q, b_user, b_item, mu, e
bi = b_user[i]
bj = b_item[j]

e = a - (np.dot(pi, qj) + bi + bj + mu)
e = a - (pi @ qj + bi + bj + mu)

new_pi = pi + eta * (e*qj - lambd*pi)
new_qj = qj + eta * (e*pi - lambd*qj)
@@ -50,3 +61,240 @@ def sgd_step_biased(users_idx, items_idx, feedbacks, P, Q, b_user, b_item, mu, e

cum_error += e*e
return cum_error



@njit(nogil=True)
def identity(x, *args): # used to fall back to standard SGD
return x


@njit(nogil=True)
def adagrad(grad, m, cum_sq_grad, smoothing=1e-6):
cum_sq_grad_update = cum_sq_grad[m, :] + grad * grad
cum_sq_grad[m, :] = cum_sq_grad_update
adjusted_grad = grad / (smoothing + np.sqrt(cum_sq_grad_update))
return adjusted_grad


@njit(nogil=True)
def rmsprop(grad, m, cum_sq_grad, gamma=0.9, smoothing=1e-6):
cum_sq_grad_update = gamma * cum_sq_grad[m, :] + (1 - gamma) * (grad * grad)
cum_sq_grad[m, :] = cum_sq_grad_update
adjusted_grad = grad / (smoothing + np.sqrt(cum_sq_grad_update))
return adjusted_grad


@njit(nogil=True)
def adam(grad, m, cum_grad, cum_sq_grad, step, beta1=0.9, beta2=0.999, smoothing=1e-6):
cum_grad_update = beta1 * cum_grad[m, :] + (1 - beta1) * grad
cum_grad[m, :] = cum_grad_update
cum_sq_grad_update = beta2 * cum_sq_grad[m, :] + (1 - beta2) * (grad * grad)
cum_sq_grad[m, :] = cum_sq_grad_update
step[m] = t = step[m] + 1
db1 = 1 - beta1**t
db2 = 1 - beta2**t
adjusted_grad = cum_grad_update/db1 / (smoothing + np.sqrt(cum_sq_grad_update/db2))
return adjusted_grad


@njit(nogil=True)
def adanorm(grad, m, smoothing=1e-6):
gnorm2 = grad @ grad
adjusted_grad = grad / sqrt(smoothing + gnorm2)
return adjusted_grad

@njit(nogil=True)
def gnprop(grad, m, cum_sq_norm, gamma=0.99, smoothing=1e-6):
cum_sq_norm_update = gamma * cum_sq_norm[m] + (1 - gamma) * (grad @ grad)
cum_sq_norm[m] = cum_sq_norm_update
adjusted_grad = grad / sqrt(smoothing + cum_sq_norm_update)
return adjusted_grad

@njit(nogil=True)
def gnpropz(grad, m, cum_sq_norm, smoothing=1e-6):
cum_sq_norm_update = cum_sq_norm[m] + grad @ grad
cum_sq_norm[m] = cum_sq_norm_update
adjusted_grad = grad / sqrt(smoothing + cum_sq_norm_update)
return adjusted_grad


@njit(nogil=True, parallel=False)
def generalized_sgd_sweep(row_idx, col_idx, values, P, Q,
eta, lambd, row_nnz, col_nnz,
transform, transform_params,
adjust_gradient, adjustment_params):
cum_error = 0
for k, val in enumerate(values):
m = row_idx[k]
n = col_idx[k]

pm = P[m, :]
qn = Q[n, :]

err = val - pm @ qn
row_lambda = lambd / row_nnz[m]
col_lambda = lambd / col_nnz[n]

kpm = transform(pm, P, m, *transform_params[0])
ngrad_p = err * qn - kpm * row_lambda
sqn = transform(qn, Q, n, *transform_params[1])
ngrad_q = err * pm - sqn * col_lambda

adjusted_ngrad_p = adjust_gradient(ngrad_p, m, *adjustment_params[0])
new_pm = pm + eta * adjusted_ngrad_p
adjusted_ngrad_q = adjust_gradient(ngrad_q, n, *adjustment_params[1])
new_qn = qn + eta * adjusted_ngrad_q

P[m, :] = new_pm
Q[n, :] = new_qn

cum_error += err*err
return cum_error


# @jit
def mf_sgd_boilerplate(interactions, shape, nonzero_count, rank,
lrate, lambd, num_epochs, tol,
sgd_sweep_func=None,
transform=None, transform_params=None,
adjust_gradient=None, adjustment_params=None,
seed=None, verbose=False,
iter_errors=None, iter_time=None):
assert isinstance(interactions, tuple) # required by numba
assert isinstance(nonzero_count, tuple) # required by numba

nrows, ncols = shape
row_shp = (nrows, rank)
col_shp = (ncols, rank)

rnds = np.random if seed is None else np.random.RandomState(seed)
row_factors = rnds.normal(scale=0.1, size=row_shp)
col_factors = rnds.normal(scale=0.1, size=col_shp)

sgd_sweep_func = sgd_sweep_func or generalized_sgd_sweep
transform = transform or identity
transform_params = transform_params or ((), ())
adjust_gradient = adjust_gradient or identity
adjustment_params = adjustment_params or ((), ())

nnz = len(interactions[-1])
last_err = np.finfo('f8').max
training_time = []
for epoch in range(num_epochs):
if adjust_gradient in [adagrad, rmsprop]:
adjustment_params = ((np.zeros(row_shp, dtype='f8'),),
(np.zeros(col_shp, dtype='f8'),)
)
if adjust_gradient is gnprop:
adjustment_params = ((np.zeros(nrows, dtype='f8'),),
(np.zeros(ncols, dtype='f8'),)
)
if adjust_gradient is adam:
adjustment_params = ((np.zeros(row_shp, dtype='f8'),
np.zeros(row_shp, dtype='f8'),
np.zeros(nrows, dtype='intp')),
(np.zeros(col_shp, dtype='f8'),
np.zeros(col_shp, dtype='f8'),
np.zeros(ncols, dtype='intp'))
)

with track_time(training_time, verbose=False):
new_err = sgd_sweep_func(*interactions, row_factors, col_factors,
lrate, lambd, *nonzero_count,
transform, transform_params,
adjust_gradient, adjustment_params)

refined = abs(last_err - new_err) / last_err
last_err = new_err
rmse = sqrt(new_err / nnz)
if iter_errors is not None:
iter_errors.append(rmse)
if verbose:
print('Epoch: {}. RMSE: {}'.format(epoch, rmse))
if refined < tol:
break
if iter_time is not None:
iter_time.extend(training_time)
return row_factors, col_factors


def simple_mf_sgd(interactions, shape, nonzero_count, rank,
lrate, lambd, num_epochs, tol,
adjust_gradient=None, adjustment_params=None,
seed=None, verbose=False,
iter_errors=None, iter_time=None):
#nonzero_count = ((), ())
nonzero_count = (np.ones(shape[0]), np.ones(shape[1]))
return mf_sgd_boilerplate(interactions, shape, nonzero_count, rank,
lrate, lambd, num_epochs, tol,
adjust_gradient=adjust_gradient,
adjustment_params=adjustment_params,
sgd_sweep_func=generalized_sgd_sweep,
seed=seed, verbose=verbose,
iter_errors=iter_errors, iter_time=iter_time)


def simple_pmf_sgd(interactions, shape, nonzero_count, rank,
lrate, sigma, num_epochs, tol,
adjust_gradient=None, adjustment_params=None,
seed=None, verbose=False,
iter_errors=None, iter_time=None):
lambd = 0.5 * sigma**2
return mf_sgd_boilerplate(interactions, shape, nonzero_count, rank,
lrate, lambd, num_epochs, tol,
adjust_gradient=adjust_gradient,
adjustment_params=adjustment_params,
seed=seed, verbose=verbose,
iter_errors=iter_errors, iter_time=iter_time)


def sp_kernel_update(pm, P, m, K):
k = K.getrow(m)
kp = k.dot(P).squeeze()
return kp + k[0, m] * pm

@njit(nogil=True, parallel=False)
def sparse_kernel_update(pm, P, m, kernel_ptr, kernel_ind, kernel_data):
lead_idx = kernel_ptr[m]
stop_idx = kernel_ptr[m+1]

kernel_update = np.zeros_like(pm)

for i in range(lead_idx, stop_idx):
index = kernel_ind[i]
value = kernel_data[i]
p_row = P[index, :]
if index == m: # diagonal value
p_row = p_row + pm # avoid rewriting original data
kernel_update += value * p_row
return kernel_update


def kernelized_pmf_sgd(interactions, shape, nonzero_count, rank,
lrate, sigma, num_epochs, tol,
kernel_matrices, kernel_update=None, sparse_kernel_format=True,
adjust_gradient=None, adjustment_params=None,
seed=None, verbose=False, iter_errors=None, iter_time=None):
kernel_update = kernel_update or sparse_kernel_update

row_kernel, col_kernel = kernel_matrices
if sparse_kernel_format:
row_kernel_data = (row_kernel.indptr, row_kernel.indices, row_kernel.data)
col_kernel_data = (col_kernel.indptr, col_kernel.indices, col_kernel.data)
else:
row_kernel_data = (row_kernel,)
col_kernel_data = (col_kernel,)

kernel_params = (row_kernel_data, col_kernel_data)

lambd = 0.5 * sigma**2
return mf_sgd_boilerplate(interactions, shape, nonzero_count, rank,
lrate, lambd, num_epochs, tol,
sgd_sweep_func=generalized_sgd_sweep,
transform=kernel_update,
transform_params=kernel_params,
adjust_gradient=adjust_gradient,
adjustment_params=adjustment_params,
seed=seed, verbose=verbose,
iter_errors=iter_errors, iter_time=iter_time)
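
A toy run of the generalized SGD machinery with the new ADAM gradient adjustment. Sizes and hyper-parameters are illustrative; the tuples are required by the numba-compiled sweep, and a numba version able to pass jitted functions as arguments is assumed (the module already relies on this):

```python
import numpy as np
from polara.lib.optimize import simple_pmf_sgd, adam

n_users, n_items, rank = 100, 50, 5
rnd = np.random.RandomState(0)
users = rnd.randint(n_users, size=1000)
items = rnd.randint(n_items, size=1000)
values = rnd.rand(1000)

errors = []
P, Q = simple_pmf_sgd((users, items, values),                # tuple required by numba
                      (n_users, n_items),
                      (np.ones(n_users), np.ones(n_items)),  # per-row/column weights
                      rank, lrate=0.01, sigma=1.0,
                      num_epochs=10, tol=1e-4,
                      adjust_gradient=adam,
                      seed=42, iter_errors=errors)
print(errors[-1])  # per-epoch RMSE is accumulated in `errors`
```
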
44 changes: 33 additions & 11 deletions polara/lib/similarity.py
@@ -47,9 +47,10 @@ def set_diagonal_values(mat, val=1):


def safe_inverse_root(d, dtype=None):
if (d < 0).any():
raise ValueError
return np.power(d, -0.5, where=d>0, dtype=dtype)
pos_d = d > 0
res = np.zeros(len(d), dtype=dtype)
np.power(d, -0.5, where=pos_d, dtype=dtype, out=res)
return res


def normalize_binary_features(feature_mat, dtype=None):
@@ -251,13 +252,25 @@ def build_indicator_matrix(labels, max_items=None):
return csr_matrix((data, indices, indprt), shape=shape)


def feature2sparse(feature_data, ranking=None, deduplicate=True):
def feature2sparse(feature_data, ranking=None, deduplicate=True, labels=None):
if deduplicate:
feature_data = feature_data.apply(uniquify_ordered if ranking else set)

feature_lbl = defaultdict(lambda: len(feature_lbl))
indices = [feature_lbl[item] for items in feature_data for item in items]
indptr = np.r_[0, feature_data.apply(len).cumsum().values]
if labels:
feature_lbl = labels
indices = []
indlens = []
for items in feature_data:
# will also remove unknown items to ensure index consistency
inds = [feature_lbl[item] for item in items if item in feature_lbl]
indices.extend(inds)
indlens.append(len(inds))
else:
feature_lbl = defaultdict(lambda: len(feature_lbl))
indices = [feature_lbl[item] for items in feature_data for item in items]
indlens = feature_data.apply(len).values

indptr = np.r_[0, np.cumsum(indlens)]

if ranking:
if ranking is True:
@@ -285,7 +298,7 @@ def feature2sparse(feature_data, ranking=None, deduplicate=True):
return feature_mat, dict(feature_lbl)


def get_features_data(meta_data, ranking=None, deduplicate=True):
def get_features_data(meta_data, ranking=None, deduplicate=True, labels=None):
feature_mats = OrderedDict()
feature_lbls = OrderedDict()
features = meta_data.columns
@@ -303,13 +316,16 @@ def get_features_data(meta_data, ranking=None, deduplicate=True):

for feature in features:
feature_data = meta_data[feature]
mat, lbl = feature2sparse(feature_data, ranking=ranking.get(feature, None), deduplicate=deduplicate)
mat, lbl = feature2sparse(feature_data,
ranking=ranking.get(feature, None),
deduplicate=deduplicate,
labels=labels[feature] if labels else None)
feature_mats[feature], feature_lbls[feature] = mat, lbl
return feature_mats, feature_lbls


def stack_features(features, add_identity=False, normalize=True, dtype=None, **kwargs):
feature_mats, feature_lbls = get_features_data(features, **kwargs)
def stack_features(features, add_identity=False, normalize=True, dtype=None, labels=None, stacked_index=False, **kwargs):
feature_mats, feature_lbls = get_features_data(features, labels=labels, **kwargs)

all_matrices = list(feature_mats.values())
if add_identity:
@@ -320,9 +336,15 @@ def stack_features(features, add_identity=False, normalize=True, dtype=None, **k

if normalize:
norm = stacked_features.getnnz(axis=1)
norm = norm.astype(np.promote_types(norm.dtype, 'f4'))
scaling = np.power(norm, -1, where=norm>0, dtype=dtype)
stacked_features = sp.sparse.diags(scaling).dot(stacked_features)

if stacked_index:
index_shift = identity.shape[1] if add_identity else 0
for feature, lbls in feature_lbls.items():
feature_lbls[feature] = {k:v+index_shift for k, v in lbls.items()}
index_shift += feature_mats[feature].shape[1]
return stacked_features, feature_lbls
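
A small illustration of the extended feature stacking with made-up meta-data; with `stacked_index=True` the per-feature label maps point directly into the columns of the stacked matrix:

```python
import pandas as pd
from polara.lib.similarity import stack_features

meta = pd.DataFrame({'genres': [['drama'], ['drama', 'comedy'], ['comedy']]})
features, labels = stack_features(meta, add_identity=True, stacked_index=True)
print(features.shape)    # (3, 5): a 3x3 identity block plus 2 genre columns
print(labels['genres'])  # label positions shifted past the identity block
```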


80 changes: 73 additions & 7 deletions polara/lib/sparse.py
@@ -1,19 +1,84 @@
# python 2/3 interoperability
try:
range = xrange
except NameError:
pass

import sys
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import as_completed
import numpy as np
from numpy import power
from scipy.sparse import csr_matrix
from scipy.sparse import diags
from scipy.sparse.linalg import norm as spnorm

from numba import jit, njit, guvectorize, prange
from numba import float64 as f8
from numba import intp as ip

from polara.recommender import defaults


tuplsize = sys.getsizeof(())
itemsize = np.dtype(np.intp).itemsize
pntrsize = sys.getsizeof(1.0)
# size of list of tuples of indices - to estimate when to convert sparse matrix to dense
# based on http://stackoverflow.com/questions/15641344/python-memory-consumption-dict-vs-list-of-tuples
# and https://code.tutsplus.com/tutorials/understand-how-much-memory-your-python-objects-use--cms-25609
def get_nnz_max():
return int(defaults.memory_hard_limit * (1024**3) / (tuplsize + 2*(pntrsize + itemsize)))


def check_sparsity(matrix, nnz_coef=0.5, tocsr=False):
if matrix.nnz > nnz_coef * matrix.shape[0] * matrix.shape[1]:
return matrix.toarray(order='C')
if tocsr:
return matrix.tocsr()
return matrix


def sparse_dot(left_mat, right_mat, dense_output=False, tocsr=False):
# scipy always returns a sparse result, even if the dot product is dense;
# this function offers a solution to that problem and also takes care
# of the sparse result w.r.t. further processing
if dense_output: # calculate dense result directly
# TODO matmat multiplication instead of iteration with matvec
res_type = np.result_type(right_mat.dtype, left_mat.dtype)
result = np.empty((left_mat.shape[0], right_mat.shape[1]), dtype=res_type)
for i in range(left_mat.shape[0]):
v = left_mat.getrow(i)
result[i, :] = csc_matvec(right_mat, v, dense_output=True, dtype=res_type)
else:
result = left_mat.dot(right_mat.T)
# NOTE even though not necessary for a symmetric i2i matrix,
# transpose helps to avoid expensive conversion to CSR (performed by scipy)
if result.nnz > get_nnz_max():
# too many nnz lead to undesired memory overhead in downvote_seen_items
result = result.toarray() # not using order='C' as it may consume memory
else:
result = check_sparsity(result, tocsr=tocsr)
return result


def rescale_matrix(matrix, scaling, axis, binary=True, return_scaling_values=False):
    '''Function to scale either rows or columns of the sparse rating matrix'''
    scaling_values = None
    if scaling == 1: # no scaling (standard SVD case)
        result = matrix
    else:
        if binary:
            norm = np.sqrt(matrix.getnnz(axis=axis)) # compute Euclidean norm as if values are binary
        else:
            norm = spnorm(matrix, axis=axis, ord=2) # compute Euclidean norm

        scaling_values = power(norm, scaling-1, where=norm != 0)
        scaling_matrix = diags(scaling_values)

        if axis == 0: # scale columns
            result = matrix.dot(scaling_matrix)
        if axis == 1: # scale rows
            result = scaling_matrix.dot(matrix)

    if return_scaling_values:
        result = (result, scaling_values)
    return result


# matvec implementation is based on
# http://stackoverflow.com/questions/18595981/improving-performance-of-multiplication-of-scipy-sparse-matrices
@njit(nogil=True)
@@ -161,7 +226,8 @@ def dttm_par(idx, val, mat1, mat2, mode1, mode2, unqs, inds, res):
res[i0, j1, j2] += vp * mat1[i1, j1] * mat2[i2, j2]


@jit(parallel=True)
# @jit(parallel=True) # numba up to v0.41.dev only supports the 1st argument
# https://numba.pydata.org/numba-doc/dev/reference/numpysupported.html
def arrange_index(array):
unqs, unq_inv, unq_cnt = np.unique(array, return_inverse=True, return_counts=True)
inds = np.split(np.argsort(unq_inv), np.cumsum(unq_cnt[:-1]))
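
A sketch of the new `rescale_matrix` helper on illustrative random data; `scaling=0.4` is an arbitrary value between the unscaled (`scaling=1`) and fully normalized cases:

```python
import scipy.sparse as sp
from polara.lib.sparse import rescale_matrix

matrix = sp.random(100, 50, density=0.2, format='csr')        # illustrative data
scaled, values = rescale_matrix(matrix, scaling=0.4, axis=0,  # scale columns
                                binary=True, return_scaling_values=True)
# scaling=1 short-circuits and returns the matrix unchanged
```
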
17 changes: 11 additions & 6 deletions polara/lib/tensor.py
@@ -34,7 +34,8 @@ def ttm3d_par(idx, val, shape, U, V, modes, unqs, inds, dtype=None):
return res


def hooi(idx, val, shape, core_shape, num_iters=25, parallel_ttm=False, growth_tol=0.01, verbose=False, seed=None):
def hooi(idx, val, shape, core_shape, return_core=True, num_iters=25,
parallel_ttm=False, growth_tol=0.01, verbose=False, seed=None):
'''
Compute Tucker decomposition of a sparse tensor in COO format
with the help of HOOI algorithm. Usage:
@@ -62,19 +63,20 @@ def log_status(msg):
u2 = np.linalg.qr(u2, mode='reduced')[0]

g_norm_old = 0
return_core_vectors = True if return_core else 'u'
for i in range(num_iters):
log_status('Step %i of %i' % (i+1, num_iters))

u0 = ttm[0](*tensor_data, u2, u1, ((2, 0), (1, 0)), *index_data[0]).reshape(shape[0], r1*r2)
uu = svds(u0, k=r0, return_singular_vectors='u')[0]
uu, ss, _ = svds(u0, k=r0, return_singular_vectors='u')
u0 = np.ascontiguousarray(uu[:, ::-1])

u1 = ttm[1](*tensor_data, u2, u0, ((2, 0), (0, 0)), *index_data[1]).reshape(shape[1], r0*r2)
uu = svds(u1, k=r1, return_singular_vectors='u')[0]
uu, ss, _ = svds(u1, k=r1, return_singular_vectors='u')
u1 = np.ascontiguousarray(uu[:, ::-1])

u2 = ttm[2](*tensor_data, u1, u0, ((1, 0), (0, 0)), *index_data[2]).reshape(shape[2], r0*r1)
uu, ss, vv = svds(u2, k=r2)
uu, ss, vv = svds(u2, k=r2, return_singular_vectors=return_core_vectors)
u2 = np.ascontiguousarray(uu[:, ::-1])

g_norm_new = np.linalg.norm(ss)
@@ -85,7 +87,10 @@ def log_status(msg):
log_status('Core is no longer growing. Norm of the core: %f' % g_norm_old)
break

g = np.ascontiguousarray((ss[:, np.newaxis] * vv)[::-1, :])
g = g.reshape(r2, r1, r0).transpose(2, 1, 0)
if return_core:
g = np.ascontiguousarray((ss[:, np.newaxis] * vv)[::-1, :])
g = g.reshape(r2, r1, r0).transpose(2, 1, 0)
else:
g = None
log_status('Done')
return u0, u1, u2, g
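
For reference, below is a minimal sketch of calling hooi on a random tensor in COO format; the sizes and the random data are purely illustrative and follow the signature above.

import numpy as np

shape = (50, 40, 30)                  # hypothetical tensor dimensions
idx = np.stack([np.random.randint(s, size=1000) for s in shape], axis=1)
val = np.random.rand(1000)            # values of the sampled nonzeros

u0, u1, u2, g = hooi(idx, val, shape, core_shape=(5, 4, 3),
                     num_iters=10, verbose=True, seed=42)

With return_core=False the last returned value is None, which avoids computing the full set of singular vectors on the final mode.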
81 changes: 10 additions & 71 deletions polara/recommender/coldstart/data.py
@@ -1,9 +1,10 @@
from collections import namedtuple, defaultdict
import numpy as np
import pandas as pd
from scipy.sparse import issparse

from polara.recommender.data import RecommenderData
from polara.lib.similarity import build_indicator_matrix
from polara.recommender.hybrid.data import (IdentityDiagonalMixin,
SideRelationsMixin)


class ItemColdStartData(RecommenderData):
@@ -150,11 +151,7 @@ def _verify_cold_items_features(self):
return

if self.meta_data.shape[1] > 1:
try: # agg is supported only starting from pandas v.0.20.0
features_melted = self.meta_data.agg('sum', axis=1)
except AttributeError: # fall back to much slower but more general option
features_melted = (self.meta_data.apply(lambda x: [x.sum()], axis=1)
.apply(lambda x: x[0]))
features_melted = self.meta_data.agg(lambda x: [f for l in x for f in l], axis=1)
else:
features_melted = self.meta_data.iloc[:, 0]

@@ -207,67 +204,6 @@ def _sort_by_cold_items(self):
holdout.sort_values(itemid_cold, inplace=True)



class FeatureSimilarityMixin(object):
def __init__(self, sim_mat, sim_idx, *args, **kwargs):
super(FeatureSimilarityMixin, self).__init__(*args, **kwargs)

entities = [self.fields.userid, self.fields.itemid]
self._sim_idx = {entity: pd.Series(index=idx, data=np.arange(len(idx)), copy=False)
if idx is not None else None
for entity, idx in sim_idx.items()
if entity in entities}
self._sim_mat = {entity: mat for entity, mat in sim_mat.items() if entity in entities}
self._similarity = dict.fromkeys(entities)

self.subscribe(self.on_change_event, self._clean_similarity)

def _clean_similarity(self):
self._similarity = dict.fromkeys(self._similarity.keys())

@property
def item_similarity(self):
entity = self.fields.itemid
return self.get_similarity_matrix(entity)

@property
def user_similarity(self):
entity = self.fields.userid
return self.get_similarity_matrix(entity)

def get_similarity_matrix(self, entity):
similarity = self._similarity.get(entity, None)
if similarity is None:
self._update_similarity(entity)
return self._similarity[entity]

def _update_similarity(self, entity):
sim_mat = self._sim_mat[entity]
if sim_mat is None:
self._similarity[entity] = None
else:
if self.verbose:
print('Updating {} similarity matrix'.format(entity))

entity_type = self.fields._fields[self.fields.index(entity)]
index_data = getattr(self.index, entity_type)

try: # check whether custom index is introduced
entity_idx = index_data.training['old']
except AttributeError: # fall back to standard case
entity_idx = index_data['old']

sim_idx = entity_idx.map(self._sim_idx[entity]).values
sim_mat = self._sim_mat[entity][:, sim_idx][sim_idx, :]

if issparse(sim_mat):
sim_mat.setdiag(1)
else:
np.fill_diagonal(sim_mat, 1)
self._similarity[entity] = sim_mat



class ColdSimilarityMixin(object):
@property
def cold_items_similarity(self):
@@ -280,7 +216,7 @@ def cold_users_similarity(self):
return self.get_cold_similarity(userid)

def get_cold_similarity(self, entity):
sim_mat = self._sim_mat[entity]
sim_mat = self._rel_mat[entity]

if sim_mat is None:
return None
@@ -289,11 +225,14 @@ def get_cold_similarity(self, entity):
entity_type = fields._fields[fields.index(entity)]
index_data = getattr(self.index, entity_type)

similarity_index = self._sim_idx[entity]
similarity_index = self._rel_idx[entity]
seen_idx = index_data.training['old'].map(similarity_index).values
cold_idx = index_data.cold_start['old'].map(similarity_index).values

return sim_mat[:, seen_idx][cold_idx, :]


class ColdStartSimilarityDataModel(ColdSimilarityMixin, FeatureSimilarityMixin, ItemColdStartData): pass
class ColdStartSimilarityDataModel(ColdSimilarityMixin,
IdentityDiagonalMixin,
SideRelationsMixin,
ItemColdStartData): pass
115 changes: 109 additions & 6 deletions polara/recommender/coldstart/models.py
@@ -1,23 +1,126 @@
import numpy as np
from polara.recommender.models import RecommenderModel

from polara import SVDModel
from polara.recommender.models import RecommenderModel, ScaledSVD
from polara.lib.similarity import stack_features
from polara.lib.sparse import sparse_dot

class ContentBasedColdStart(RecommenderModel):

class ItemColdStartEvaluationMixin:
def __init__(self, *args, **kwargs):
super(ContentBasedColdStart, self).__init__(*args, **kwargs)
self.method = 'CB'
super().__init__(*args, **kwargs)
self.filter_seen = False # there are no seen entities in cold start
self._prediction_key = '{}_cold'.format(self.data.fields.itemid)
self._prediction_target = self.data.fields.userid


class RandomModelItemColdStart(ItemColdStartEvaluationMixin, RecommenderModel):
def __init__(self, *args, **kwargs):
self.seed = kwargs.pop('seed', None)
super().__init__(*args, **kwargs)
self.method = 'RND(cs)'

def build(self):
seed = self.seed
self._random_state = np.random.RandomState(seed) if seed is not None else np.random

def get_recommendations(self):
repr_users = self.data.representative_users
if repr_users is None:
repr_users = self.data.index.userid.training
repr_users = repr_users.new.values
n_cold_items = self.data.index.itemid.cold_start.shape[0]
shape = (n_cold_items, len(repr_users))
users_matrix = np.lib.stride_tricks.as_strided(repr_users, shape,
(0, repr_users.itemsize))
random_users = np.apply_along_axis(self._random_state.choice, 1,
users_matrix, self.topk, replace=False)
return random_users


class PopularityModelItemColdStart(ItemColdStartEvaluationMixin, RecommenderModel):
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self.method = 'MP(cs)'

def build(self):
userid = self.data.fields.userid
user_activity = self.data.training[userid].value_counts(sort=False)
repr_users = self.data.representative_users
if repr_users is not None:
user_activity = user_activity.reindex(repr_users.new.values)
self.user_scores = user_activity.sort_values(ascending=False)

def get_recommendations(self):
topk = self.topk
shape = (self.data.index.itemid.cold_start.shape[0], topk)
top_users = self.user_scores.index[:topk].values
top_users_array = np.lib.stride_tricks.as_strided(top_users, shape,
(0, top_users.itemsize))
return top_users_array


class SimilarityAggregationItemColdStart(ItemColdStartEvaluationMixin, RecommenderModel):
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self.method = 'SIM(cs)'
self.implicit = False
self.dense_output = False

def build(self):
pass

def get_recommendations(self):
item_similarity_scores = self.data.cold_items_similarity

user_item_matrix = self.get_training_matrix()
user_item_matrix.data = np.ones_like(user_item_matrix.data)
if self.implicit:
user_item_matrix.data = np.ones_like(user_item_matrix.data)
scores = sparse_dot(item_similarity_scores, user_item_matrix, self.dense_output, True)
top_similar_users = self.get_topk_elements(scores).astype(np.intp)
return top_similar_users

scores = item_similarity_scores.dot(user_item_matrix.T).tocsr()

class SVDModelItemColdStart(ItemColdStartEvaluationMixin, SVDModel):
def __init__(self, *args, item_features=None, **kwargs):
super().__init__(*args, **kwargs)
self.method = 'PureSVD(cs)'
self.item_features = item_features
self.use_raw_features = item_features is not None

def build(self, *args, **kwargs):
super().build(*args, return_factors=True, **kwargs)

def get_recommendations(self):
userid = self.data.fields.userid
itemid = self.data.fields.itemid

u = self.factors[userid]
v = self.factors[itemid]
s = self.factors['singular_values']

if self.use_raw_features:
item_info = self.item_features.reindex(self.data.index.itemid.training.old.values,
fill_value=[])
item_features, feature_labels = stack_features(item_info, normalize=False)
w = item_features.T.dot(v).T
wwt_inv = np.linalg.pinv(w @ w.T)

cold_info = self.item_features.reindex(self.data.index.itemid.cold_start.old.values,
fill_value=[])
cold_item_features, _ = stack_features(cold_info, labels=feature_labels, normalize=False)
else:
w = self.data.item_relations.T.dot(v).T
wwt_inv = np.linalg.pinv(w @ w.T)
cold_item_features = self.data.cold_items_similarity

cold_items_factors = cold_item_features.dot(w.T) @ wwt_inv
scores = cold_items_factors @ (u * s[None, :]).T
top_similar_users = self.get_topk_elements(scores).astype(np.intp)
return top_similar_users


class ScaledSVDItemColdStart(ScaledSVD, SVDModelItemColdStart):
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self.method = 'PureSVDs(cs)'
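
The folding-in performed by SVDModelItemColdStart.get_recommendations treats item features as a linear image of the latent factors, f ≈ vW, and recovers a cold item's factors via the right pseudo-inverse, v = f Wᵀ(W Wᵀ)⁻¹. A standalone NumPy sketch of that projection (all matrices here are illustrative, not the model's attributes):

import numpy as np

F = np.random.rand(200, 30)    # known items x features (illustrative)
V = np.random.rand(200, 10)    # their latent factors from SVD
W = F.T.dot(V).T               # maps factors to feature space, shape (rank, n_features)

f_cold = np.random.rand(5, 30)                    # cold items x features
v_cold = f_cold @ W.T @ np.linalg.pinv(W @ W.T)   # folded-in cold item factors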
58 changes: 58 additions & 0 deletions polara/recommender/contextual/data.py
@@ -0,0 +1,58 @@
from polara.recommender.data import RecommenderData


class ItemPostFilteringData(RecommenderData):
def __init__(self, *args, item_context_mapping=None, **kwargs):
super().__init__(*args, **kwargs)
userid = self.fields.userid
itemid = self.fields.itemid
self.item_context_mapping = dict(**item_context_mapping)
self.context_data = {context: dict.fromkeys([userid, itemid])
for context in item_context_mapping.keys()}

def map_context_data(self, context):
if context is None:
return

userid = self.fields.userid
itemid = self.fields.itemid

context_mapping = self.item_context_mapping[context]
index_mapping = self.index.itemid.set_index('old').new
mapped_index = {itemid: lambda x: x[itemid].map(index_mapping)}
item_data = (context_mapping.loc[lambda x: x[itemid].isin(index_mapping.index)]
.assign(**mapped_index)
.groupby(context)[itemid]
.apply(list))
holdout = self.test.holdout
try:
user_data = holdout.set_index(userid)[context]
except AttributeError:
print(f'Unable to map {context}: holdout data is not recognized')
return
except KeyError:
print(f'Unable to map {context}: not present in holdout')
return
# deal with mismatch between user and item data
item_data = item_data.reindex(user_data.drop_duplicates().values, fill_value=[])

self.context_data[context][userid] = user_data
self.context_data[context][itemid] = item_data

def update_contextual_data(self):
holdout = self.test.holdout
if holdout is not None:
# assuming that for each user in holdout we have only 1 item
assert holdout.shape[0] == holdout[self.fields.userid].nunique()

for context in self.item_context_mapping.keys():
self.map_context_data(context)

def prepare(self, *args, **kwargs):
super().prepare(*args, **kwargs)
self.update_contextual_data()


def set_test_data(self, *args, **kwargs):
super().set_test_data(*args, **kwargs)
self.update_contextual_data()
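
A construction sketch for ItemPostFilteringData; the ratings frame, column names, and the month mapping are all hypothetical, and the holdout is expected to carry the same context column for each user:

import pandas as pd

item_month = pd.DataFrame({'itemid': [0, 1, 2, 3],
                           'month': [1, 1, 2, 2]})   # items relevant per month
data_model = ItemPostFilteringData(ratings, 'userid', 'itemid', 'rating',
                                   item_context_mapping={'month': item_month})
data_model.prepare()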
32 changes: 32 additions & 0 deletions polara/recommender/contextual/models.py
@@ -0,0 +1,32 @@
import numpy as np


class ItemPostFilteringMixin:
def upvote_context_items(self, context, scores, test_users):
if context is None:
return

userid = self.data.fields.userid
itemid = self.data.fields.itemid
context_data = self.data.context_data[context]
try:
upvote_items = context_data[userid].loc[test_users].map(context_data[itemid])
        except Exception:
print(f'Unable to upvote items in context "{context}"')
return
upvote_index = zip(*[(i, el) for i, l in enumerate(upvote_items) for el in l])

context_idx_flat = np.ravel_multi_index(list(upvote_index), scores.shape)
context_scores = scores.flat[context_idx_flat]

upscored = scores.max() + context_scores + 1
scores.flat[context_idx_flat] = upscored

def upvote_relevant_items(self, scores, test_users):
for context in self.data.context_data.keys():
self.upvote_context_items(context, scores, test_users)

def slice_recommendations(self, test_data, test_shape, start, stop, test_users):
scores, slice_data = super().slice_recommendations(test_data, test_shape, start, stop, test_users)
self.upvote_relevant_items(scores, test_users[start:stop])
return scores, slice_data
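
The mixin relies on cooperative inheritance, so it can be combined with any model exposing slice_recommendations; a minimal composition sketch (the class name is illustrative):

from polara import SVDModel

class PostFilteredSVD(ItemPostFilteringMixin, SVDModel):
    pass

model = PostFilteredSVD(data_model)   # data_model: an ItemPostFilteringData instance
model.build()
recommendations = model.get_recommendations()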
25 changes: 20 additions & 5 deletions polara/recommender/data.py
Original file line number Diff line number Diff line change
@@ -110,15 +110,17 @@ def __init__(self, data, userid, itemid, feedback=None, custom_order=None, seed=

if data is None:
cols = fields + [custom_order]
self._data = data = pd.DataFrame(columns=[c for c in cols if c])
else:
self._data = data
data = pd.DataFrame(columns=[c for c in cols if c])

if data.duplicated(subset=[f for f in fields if f]).any():
            # the duplicates check is unstable in pandas v.0.17.0 and works in other versions;
            # many parts of the code rely on deduplicated data, which makes processing more efficient
raise NotImplementedError('Data has duplicate values')

if not data.index.is_unique:
data = data.reset_index(drop=True)

self._data = data
self._custom_order = custom_order
self.fields = namedtuple('Fields', self._std_fields)
self.fields = self.fields(**dict(zip(self._std_fields, fields)))
@@ -199,9 +201,12 @@ def _verified_data_property(self, data_property):
return getattr(self, data_property)


def update(self):
def update(self, training_only=False):
if self._change_properties:
self.prepare()
if training_only:
self.prepare_training_only()
else:
self.prepare()


def prepare(self):
@@ -592,6 +597,16 @@ def _reindex_train_items(self):
items_index = self.reindex(self._training, itemid)
self.index = self.index._replace(itemid=items_index)

def get_entity_index(self, entity, index_id='training'):
entity_type = self.fields._fields[self.fields.index(entity)]
index_data = getattr(self.index, entity_type)

try: # check whether custom index is introduced (as in e.g. coldstart)
entity_idx = getattr(index_data, index_id)
except AttributeError: # fall back to standard case
entity_idx = index_data
return entity_idx

def _reindex_feedback(self):
self.index = self.index._replace(feedback=None)

6 changes: 5 additions & 1 deletion polara/recommender/defaults.py
@@ -48,7 +48,11 @@
#COMPUTATION
test_chunk_size = 1000 # to split tensor decompositions into smaller pieces in memory
max_test_workers = None # to compute recommendations in parallel for groups of test users

memory_hard_limit = 1 # in gigabytes, default=1, depends on hardware
# varying this value may significantly impact performance;
# setting it to None or to a large value typically reduces performance,
# as iterating over a few huge arrays takes longer
# than iterating over many smaller ones

def get_config(params):
this = sys.modules[__name__]
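
Since these settings are plain module-level attributes, they can be overridden before building models; a sketch, assuming the default import path:

from polara.recommender import defaults

defaults.memory_hard_limit = 2   # allow chunks of up to 2 GB
defaults.test_chunk_size = 500   # smaller slices of test users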
63 changes: 31 additions & 32 deletions polara/recommender/evaluation.py
@@ -14,6 +14,13 @@ def no_copy_csr_matrix(data, indices, indptr, shape, dtype):
return matrix


def safe_divide(a, b, mask=None, dtype=None):
pos = mask if mask is not None else a > 0
res = np.zeros(len(a), dtype=dtype)
np.divide(a, b, where=pos, out=res)
return res
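# A quick illustration with hypothetical values: entries where the mask
# (by default, a > 0) is False stay zero instead of producing nan or inf:
#
#   >>> safe_divide(np.array([1., 0., 2.]), np.array([2., 0., 4.]))
#   array([0.5, 0. , 0.5])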


def build_rank_matrix(recommendations, shape):
# handle singletone case for a single user
recommendations = np.array(recommendations, copy=False, ndmin=2)
@@ -80,11 +87,11 @@ def generate_hits_data(rank_matrix, eval_matrix_hits, eval_matrix_miss=None):
return hits_rank, miss_rank


def assemble_scoring_matrices(recommendations, eval_data, key, target, is_positive, feedback=None):
def assemble_scoring_matrices(recommendations, holdout, key, target, is_positive, feedback=None):
# handle the singleton case of a single user
recommendations = np.array(recommendations, copy=False, ndmin=2)
shape = (recommendations.shape[0], max(recommendations.max(), eval_data[target].max())+1)
eval_matrix = matrix_from_observations(eval_data, key, target, shape, feedback=feedback)
shape = (recommendations.shape[0], max(recommendations.max(), holdout[target].max())+1)
eval_matrix = matrix_from_observations(holdout, key, target, shape, feedback=feedback)
eval_matrix_hits, eval_matrix_miss = split_positive(eval_matrix, is_positive)
rank_matrix = build_rank_matrix(recommendations, shape)
hits_rank, miss_rank = generate_hits_data(rank_matrix, eval_matrix_hits, eval_matrix_miss)
@@ -128,13 +135,9 @@ def get_ndcr_score(eval_matrix, discounts_matrix, ideal_discounts, alternative=F
relevance = eval_matrix._with_data(np.exp2(eval_matrix.data)-1, copy=False)
else:
relevance = eval_matrix

dcr = relevance.multiply(discounts_matrix).sum(axis=1)
idcr = relevance.multiply(ideal_discounts).sum(axis=1)

with np.errstate(invalid='ignore'):
score = np.nansum(dcr/idcr) / relevance.shape[0]
return score
dcr = np.array(relevance.multiply(discounts_matrix).sum(axis=1), copy=False).squeeze()
idcr = np.array(relevance.multiply(ideal_discounts).sum(axis=1), copy=False).squeeze()
return safe_divide(dcr, idcr).mean()


def get_ndcg_score(eval_matrix, discounts_matrix, ideal_discounts, alternative=False):
@@ -193,30 +196,26 @@ def get_relevance_scores(rank_matrix, hits_rank, miss_rank, eval_matrix, eval_ma
true_negative, false_negative] = get_relevance_data(rank_matrix, hits_rank, miss_rank,
eval_matrix, eval_matrix_hits, eval_matrix_miss,
not_rated_penalty, True)
# non-zero mask for safe division
tpnz = true_positive > 0
fnnz = false_negative > 0
# true positive rate
precision = safe_divide(true_positive, true_positive + false_positive, tpnz).mean()
# sensitivity
recall = safe_divide(true_positive, true_positive + false_negative, tpnz).mean()
# false negative rate
miss_rate = safe_divide(false_negative, false_negative + true_positive, fnnz).mean()

with np.errstate(invalid='ignore'):
# true positive rate
precision = true_positive / (true_positive + false_positive)
# sensitivity
recall = true_positive / (true_positive + false_negative)
# false negative rate
miss_rate = false_negative / (false_negative + true_positive)
if true_negative is not None:
# false positive rate
fallout = false_positive / (false_positive + true_negative)
# true negative rate
specifity = true_negative / (false_positive + true_negative)
else:
fallout = specifity = None

n_keys = hits_rank.shape[0]
# average over all users
precision = np.nansum(precision) / n_keys
recall = np.nansum(recall) / n_keys
miss_rate = np.nansum(miss_rate) / n_keys
if true_negative is not None:
specifity = np.nansum(specifity) / n_keys
fallout = np.nansum(fallout) / n_keys
# non-zero mask for safe division
fpnz = false_positive > 0
tnnz = true_negative > 0
# false positive rate
fallout = safe_divide(false_positive, false_positive + true_negative, fpnz).mean()
# true negative rate
specifity = safe_divide(true_negative, false_positive + true_negative, tnnz).mean()
else:
fallout = specifity = None

scores = namedtuple('Relevance', ['precision', 'recall', 'fallout', 'specifity', 'miss_rate'])
scores = scores._make([precision, recall, fallout, specifity, miss_rate])
4 changes: 2 additions & 2 deletions polara/recommender/external/implicit/ialswrapper.py
@@ -7,7 +7,7 @@
import numpy as np
import implicit
from polara.recommender.models import RecommenderModel
from polara.tools.timing import Timer
from polara.tools.timing import track_time


class ImplicitALS(RecommenderModel):
@@ -55,7 +55,7 @@ def build(self):
matrix.data = self.confidence(matrix.data, alpha=self.alpha,
weight=self.weight_func, epsilon=self.epsilon)

with Timer(self.method, verbose=self.verbose):
with track_time(self.training_time, verbose=self.verbose, model=self.method):
# build the model
# implicit takes item_by_user matrix as input, need to transpose
self._model.fit(matrix.T)
29 changes: 19 additions & 10 deletions polara/recommender/external/lightfm/lightfmwrapper.py
@@ -2,10 +2,11 @@
from __future__ import print_function

import numpy as np
from numpy.lib.stride_tricks import as_strided
from lightfm import LightFM
from polara.recommender.models import RecommenderModel
from polara.lib.similarity import stack_features
from polara.tools.timing import Timer
from polara.tools.timing import track_time


class LightFMWrapper(RecommenderModel):
@@ -62,19 +63,27 @@ def build(self):
normalize=True,
dtype='f4')

with Timer(self.method, verbose=self.verbose):
with track_time(self.training_time, verbose=self.verbose, model=self.method):
fit(matrix, item_features=self._item_features_csr, user_features=self._user_features_csr)


def slice_recommendations(self, test_data, shape, start, stop, test_users=None):
if self.data.warm_start:
raise NotImplementedError
else:
slice_data = self._slice_test_data(test_data, start, stop)
all_items = self.data.index.itemid.new.values
n_users = stop - start
n_items = len(all_items)
scores = self._model.predict(np.repeat(test_users[start:stop], n_items),
np.tile(all_items, n_users),
item_features=self._item_features_csr).reshape(n_users, n_items)

slice_data = self._slice_test_data(test_data, start, stop)
all_items = self.data.index.itemid.new.values
n_users = stop - start
n_items = len(all_items)
# use stride tricks to avoid unnecessary copies of repeated indices
# have to conform with LightFM's dtype to avoid additional copies
itemsize = np.dtype('i4').itemsize
useridx = as_strided(test_users[start:stop].astype('i4', copy=False),
(n_users, n_items), (itemsize, 0))
itemidx = as_strided(all_items.astype('i4', copy=False),
(n_users, n_items), (0, itemsize))
scores = self._model.predict(useridx.ravel(), itemidx.ravel(),
user_features=self._user_features_csr,
item_features=self._item_features_csr
).reshape(n_users, n_items)
return scores, slice_data
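
The zero-stride trick used above materializes the repeated user and item indices as views rather than copies; a standalone illustration:

import numpy as np
from numpy.lib.stride_tricks import as_strided

items = np.array([10, 11, 12], dtype='i4')
# a zero stride along axis 0 repeats the same row without copying
view = as_strided(items, (2, 3), (0, items.itemsize))
print(view.ravel())   # [10 11 12 10 11 12]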
@@ -1,42 +1,44 @@
# python 2/3 interoperability
from __future__ import print_function

import numpy as np
from polara.recommender.models import RecommenderModel
import graphlab as gl
import turicreate as tc
from polara import RecommenderModel


class GraphlabFactorization(RecommenderModel):
class TuriFactorizationRecommender(RecommenderModel):
def __init__(self, *args, **kwargs):
self.item_side_info = kwargs.pop('item_side_info', None)
self.user_side_info = kwargs.pop('user_side_info', None)
super(GraphlabFactorization, self).__init__(*args, **kwargs)
super().__init__(*args, **kwargs)
self.tc_model = None
self._rank = 10
self.method = 'GLF'
self.method = 'TCF'
# side data
self._item_data = None
self._user_data = None
self.side_data_factorization = True
# randomization
self.seed = 61
# optimization
self.binary_target = False
self.solver = 'auto'
self.max_iterations = 30
# reglarization
self.max_iterations = 25
# regularization
self.regularization = 1e-10
self.linear_regularization = 1e-10
# adagrad
self.adagrad_momentum_weighting = 0.9
# sgd
self.sgd_step_size = 0
# ranking
self.ranking_optimization = False
self.ranking_regularization = 0.25
self.unobserved_rating_value = None
self.num_sampled_negative_examples = None
self.num_sampled_negative_examples = 4
# other parameters
self.other_gl_params = {}
self.with_data_feedback = True
self.other_tc_params = {}
self.data.subscribe(self.data.on_change_event, self._clean_metadata)

def _on_change(self):
super(GraphlabFactorization, self)._on_change()
def _clean_metadata(self):
self._item_data = None
self._user_data = None

@@ -71,7 +73,7 @@ def item_data(self):
.reset_index())
side_features[itemid] = side_features[itemid].map(index_map)

self._item_data = gl.SFrame(side_features)
self._item_data = tc.SFrame(side_features)
else:
self._item_data = None
return self._item_data
@@ -94,18 +96,17 @@ def user_data(self):
.reset_index())
side_features[userid] = side_features[userid].map(index_map)

self._user_data = gl.SFrame(side_features)
self._user_data = tc.SFrame(side_features)
else:
self._user_data = None
return self._user_data

def build(self):
item_data = self.item_data
user_data = self.user_data
side_fact = (item_data is not None) or (user_data is not None)
params = dict(item_data=item_data,
user_data=user_data,
side_data_factorization=side_fact,
side_data_factorization=self.side_data_factorization,
num_factors=self.rank,
binary_target=self.binary_target,
verbose=self.verbose,
@@ -115,75 +116,88 @@ def build(self):
# optimization
solver=self.solver,
max_iterations=self.max_iterations,
# adagrad
adagrad_momentum_weighting=self.adagrad_momentum_weighting,
# sgd
sgd_step_size=self.sgd_step_size,
# regularization
regularization=self.regularization,
linear_regularization=self.linear_regularization,
# other parameters
**self.other_gl_params)
**self.other_tc_params)

if self.ranking_optimization:
build_model = gl.ranking_factorization_recommender.create
build_model = tc.recommender.ranking_factorization_recommender.create
params.update(ranking_regularization=self.ranking_regularization,
num_sampled_negative_examples=self.num_sampled_negative_examples)
if self.unobserved_rating_value is not None:
params.update(unobserved_rating_value=self.unobserved_rating_value)
else:
build_model = gl.factorization_recommender.create
build_model = tc.factorization_recommender.create

self.gl_model = build_model(gl.SFrame(self.data.training),
target = self.data.fields.feedback if self.with_data_feedback else None
self.tc_model = build_model(tc.SFrame(self.data.training),
user_id=self.data.fields.userid,
item_id=self.data.fields.itemid,
target=self.data.fields.feedback,
target=target,
**params)
if self.training_time is not None:
self.training_time.append(self.tc_model.training_time)
if self.verbose:
print('{} training time: {}s'.format(self.method, self.gl_model.training_time))
print(f'{self.method} training time: {self.tc_model.training_time}s')

def get_recommendations(self):
if self.data.warm_start:
raise NotImplementedError

userid = self.data.fields.userid
test_users = self.data.test.holdout[userid].drop_duplicates().values

recommend = self.gl_model.recommend
top_recs = recommend(users=test_users,
k=self.topk,
exclude_known=self.filter_seen,
verbose=self.verbose)
top_recs = self.tc_model.recommend(users=test_users,
k=self.topk,
exclude_known=self.filter_seen,
verbose=self.verbose)
itemid = self.data.fields.itemid
top_recs = top_recs[itemid].to_numpy().reshape(-1, self.topk)
return top_recs

def evaluate_rmse(self):
if self.data.warm_start:
raise NotImplementedError
feedback = self.data.fields.feedback
holdout = gl.SFrame(self.data.test.holdout)
return self.gl_model.evaluate_rmse(holdout, feedback)['rmse_overall']
holdout = tc.SFrame(self.data.test.holdout)
return self.tc_model.evaluate_rmse(holdout, feedback)['rmse_overall']



class WarmStartRecommendationsMixin(object):
class WarmStartRecommendationsMixin:
def get_recommendations(self):
pass


class ColdStartRecommendationsMixin(object):
class ColdStartRecommendationsMixin:
def get_recommendations(self):
userid = self.data.fields.userid
itemid = self.data.fields.itemid
data_index = self.data.index

cold_items_index = self.data.index.itemid.cold_start.old
lower_index = self.data.index.itemid.training.new.max() + 1
cold_items_index = data_index.itemid.cold_start.old.values
lower_index = data_index.itemid.training.new.max() + 1
upper_index = lower_index + len(cold_items_index)
# prevent intersecting cold items index with known items
unseen_items_idx = np.arange(lower_index, upper_index)
new_item_data = gl.SFrame(self.item_side_info.loc[cold_items_index]
new_item_data = tc.SFrame(self.item_side_info.loc[cold_items_index]
.reset_index()
.assign(**{itemid: unseen_items_idx}))

repr_users = self.data.representative_users.new.values
repr_users = self.data.representative_users
try:
repr_users = repr_users.new.values
except AttributeError:
repr_users = data_index.userid.training.new.values
observation_idx = [a.flat for a in np.broadcast_arrays(repr_users, unseen_items_idx[:, None])]
new_observation = gl.SFrame(dict(zip([userid, itemid], observation_idx)))
new_observation = tc.SFrame(dict(zip([userid, itemid], observation_idx)))

scores = self.gl_model.predict(new_observation, new_item_data=new_item_data).to_numpy()
top_similar_idx = self.get_topk_items(scores.reshape(-1, len(repr_users)))
scores = self.tc_model.predict(new_observation, new_item_data=new_item_data).to_numpy()
top_similar_idx = self.get_topk_elements(scores.reshape(-1, len(repr_users)))
top_similar_users = repr_users[top_similar_idx.ravel()].reshape(top_similar_idx.shape)
return top_similar_users
69 changes: 69 additions & 0 deletions polara/recommender/hybrid/data.py
@@ -0,0 +1,69 @@
import numpy as np
import pandas as pd
from scipy.sparse import issparse

from polara.recommender.data import RecommenderData


class SideRelationsMixin:
def __init__(self, rel_mat, rel_idx, *args, **kwargs):
super().__init__(*args, **kwargs)

entities = [self.fields.userid, self.fields.itemid]
self._rel_idx = {entity: pd.Series(index=idx, data=np.arange(len(idx)), copy=False)
if idx is not None else None
for entity, idx in rel_idx.items()
if entity in entities}
self._rel_mat = {entity: mat for entity, mat in rel_mat.items() if entity in entities}
self._relations = dict.fromkeys(entities)

self.subscribe(self.on_change_event, self._clean_relations)

def _clean_relations(self):
self._relations = dict.fromkeys(self._relations.keys())

@property
def item_relations(self):
entity = self.fields.itemid
return self.get_relations_matrix(entity)

@property
def user_relations(self):
entity = self.fields.userid
return self.get_relations_matrix(entity)

def get_relations_matrix(self, entity):
relations = self._relations.get(entity, None)
if relations is None:
self._update_relations(entity)
return self._relations[entity]

def _update_relations(self, entity):
rel_mat = self._rel_mat[entity]
if rel_mat is None:
self._relations[entity] = None
else:
if self.verbose:
print(f'Updating {entity} relations matrix')

index_data = self.get_entity_index(entity)
entity_idx = index_data['old']

rel_idx = entity_idx.map(self._rel_idx[entity]).values
rel_mat = self._rel_mat[entity][:, rel_idx][rel_idx, :]

self._relations[entity] = rel_mat


class IdentityDiagonalMixin:
def _update_relations(self, *args, **kwargs):
super()._update_relations(*args, **kwargs)
for rel_mat in self._relations.values():
if rel_mat is not None:
if issparse(rel_mat):
rel_mat.setdiag(1)
else:
np.fill_diagonal(rel_mat, 1)


class SimilarityDataModel(IdentityDiagonalMixin, SideRelationsMixin, RecommenderData): pass
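
A construction sketch for SimilarityDataModel; the similarity matrix, index labels, and ratings frame are hypothetical, and entities without side information can simply map to None:

import numpy as np

rel_mat = {'itemid': np.eye(3), 'userid': None}          # item-item similarity
rel_idx = {'itemid': ['a', 'b', 'c'], 'userid': None}    # original item labels
data_model = SimilarityDataModel(rel_mat, rel_idx,
                                 ratings, 'userid', 'itemid', 'rating')
data_model.prepare()
sims = data_model.item_relations   # reindexed to internal item order, unit diagonal enforced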
102 changes: 102 additions & 0 deletions polara/recommender/hybrid/models.py
@@ -0,0 +1,102 @@
import scipy as sp
import numpy as np

from polara.recommender.models import RecommenderModel, ProbabilisticMF
from polara.lib.optimize import kernelized_pmf_sgd
from polara.lib.sparse import sparse_dot
from polara.tools.timing import track_time


class SimilarityAggregation(RecommenderModel):
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self.method = 'SIM'
self.implicit = False
self.dense_output = False
self.item_similarity_matrix = False

def build(self):
# use copy to prevent contaminating original data
self.item_similarity_matrix = self.data.item_relations.copy()
self.item_similarity_matrix.setdiag(0) # exclude self-links
self.item_similarity_matrix.eliminate_zeros()

def slice_recommendations(self, test_data, shape, start, stop, test_users=None):
test_matrix, slice_data = self.get_test_matrix(test_data, shape, (start, stop))
if self.implicit:
test_matrix.data = np.ones_like(test_matrix.data)
scores = sparse_dot(test_matrix, self.item_similarity_matrix, self.dense_output, True)
return scores, slice_data


class KernelizedRecommenderMixin:
'''Based on the work:
Kernelized Probabilistic Matrix Factorization: Exploiting Graphs and Side Information
http://people.ee.duke.edu/~lcarin/kpmf_sdm_final.pdf
'''
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self.kernel_type = 'reg'
self.beta = 0.01
self.gamma = 0.1
self.sigma = 1
self.kernel_update = None # will use default kernel update method
self.sparse_kernel_format = True

entities = [self.data.fields.userid, self.data.fields.itemid]
self.factor_sigma = dict.fromkeys(entities, 1)
self._kernel_matrices = dict.fromkeys(entities)

self.data.subscribe(self.data.on_change_event, self._clean_kernel_data)

def _compute_kernel(self, laplacian, kernel_type=None):
kernel_type = kernel_type or self.kernel_type
if kernel_type == 'dif': # diffusion
return sp.sparse.linalg.expm(self.beta * laplacian) # dense matrix
elif kernel_type == 'reg': # regularized laplacian
n_entities = laplacian.shape[0]
return sp.sparse.eye(n_entities).tocsr() + self.gamma * laplacian # sparse matrix
else:
raise ValueError

def _update_kernel_matrices(self, entity):
laplacian = self.data.get_relations_matrix(entity)
if laplacian is None:
sigma = self.factor_sigma[entity]
n_entities = self.data.get_entity_index(entity).shape[0]
kernel_matrix = (sigma**2) * sp.sparse.eye(n_entities).tocsr()
else:
kernel_matrix = self._compute_kernel(laplacian)
self._kernel_matrices[entity] = kernel_matrix

def _clean_kernel_data(self):
self._kernel_matrices = dict.fromkeys(self._kernel_matrices.keys())

@property
def item_kernel_matrix(self):
entity = self.data.fields.itemid
return self.get_kernel_matrix(entity)

@property
def user_kernel_matrix(self):
entity = self.data.fields.userid
return self.get_kernel_matrix(entity)

def get_kernel_matrix(self, entity):
kernel_matrix = self._kernel_matrices.get(entity, None)
if kernel_matrix is None:
self._update_kernel_matrices(entity)
return self._kernel_matrices[entity]


class KernelizedPMF(KernelizedRecommenderMixin, ProbabilisticMF):
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self.optimizer = kernelized_pmf_sgd
self.method = 'KPMF'

def build(self, *args, **kwargs):
kernel_matrices = (self.user_kernel_matrix, self.item_kernel_matrix)
kernel_config = dict(kernel_update=self.kernel_update,
sparse_kernel_format=self.sparse_kernel_format)
super().build(kernel_matrices, *args, **kernel_config, **kwargs)
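
A minimal end-to-end sketch for KernelizedPMF; the data model is assumed to be a relations-aware instance (e.g. the SimilarityDataModel above), so that Laplacians are available for kernel construction:

model = KernelizedPMF(data_model)
model.kernel_type = 'reg'   # regularized Laplacian kernel: K = I + gamma * L
model.gamma = 0.1
model.rank = 10
model.build()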