Refactor of ML Toolkit (#87)
* update to xval fitscore to support XGBoost models

* xval refactor

* xval refactor

* update to timeseries; Added save/load functionality

* review of xval

* review of graph

* updated param to camelcase

* review of graphing structure

* added utils function

* made naming verbose. Cleaned up code formatting for if statements

* update README

* add @Private for util funcs

* new fresh format

* new fresh format

* fixes to hyperparam json file for single values

* utils refactor

* feature function refactor

* final refactoring

* formatting

* formatting review

* timeseries windows tests

* updates to fresh tests

* python utilities

* refactor of util folder

* refactor review

* refactor of optimize section

* Addition of test script for line length exceeding 80 characters, updates in line with this and minor changes to aspects of the optimization code

* Added deprecation warning. Updated namespace for xval

* review of clust code

* updated cutDict comment

* upd cutK

* Minor changes to deprecation functionality

* updated functionMapping for clustering

* fixed sigfeat tests

* updates resulting from review of clustering refactor/update

* fix to scoring tests

* Fix bugs

* review of clust update

* Fixed hierarchical comments

* removed redundant min function

* added WLS,OLS functionality, updated describe function. Updated failing tests for fresh

* updated tests and describe function

* addition of stats folder

* update order of inputs

* updated format of fit/predict inputs

* added WLS fit function. Fixed inputs to OLS fit order

* fixed travis issue for mac

* resolved comments

* changed all fit/predict functions to the same format. Updated timeseries tests

* fixed indentation

* added stats tests to bash script

* added time series tests for windows

* resolve latest comments

* update function mapping and fixed comments

* addition of README

* updated describe function, fixed errors in timeseries and graphing

* fixed filelength

* ml utilities style review

* fixed line lengths

* fixed crossEntropy

* fixed .ml.i.findKey

* added changes from comments

* review of stats

* reviewed fresh and fixed stats test

* updating clustering and replying to comments on stats and fresh

* try to fix 'branch outside repository error'

* new commit on new branch

* changes to clust/utils.q after comments

* committing with kx email address

* committing with kx email address

* review optimize library

* fixed desc from fileoverview

* changes for comments

* fixed @type comments

* review remainder of ml libraries

* changes following comments

* fileoverview changed in pipeline file

* changed init file

* fixed .ml.i.ap added .ml.infReplace for all types

* added test for keyed table infReplace

* change predict -> transform

* fixed init file. Clash with AutoML if not in ml namespace

* resolved comments

Co-authored-by: Deanna Morgan <[email protected]>
Co-authored-by: dmorgankx <[email protected]>
Co-authored-by: Conor McCarthy <[email protected]>
Co-authored-by: Conor McCarthy <[email protected]>
Co-authored-by: unknown <Andrew Morrison>
Co-authored-by: Andrew Morrison <[email protected]>
Co-authored-by: Andrew Morrison <[email protected]>
7 people authored Mar 10, 2021
1 parent 46b5021 commit 5781cd1
Showing 118 changed files with 9,473 additions and 4,149 deletions.
3 changes: 2 additions & 1 deletion .travis.yml
@@ -26,6 +26,7 @@ install:
# grab latest embedpy
- if [[ "x$QLIC_KC" != "x" ]]; then
echo -n $QLIC_KC |base64 --decode > q/kc.lic;
pip install --upgrade pip;
pip -q install -r requirements.txt;
fi
before_script:
@@ -40,7 +41,7 @@ script:
- echo "Packaged as ml_$TRAVIS_OS_NAME-$TRAVIS_BRANCH.zip"
- if [[ "x$QLIC_KC" != "x" ]]; then
curl -fsSL -o test.q https://github.com/KxSystems/embedpy/raw/master/test.q;
q test.q fresh/tests/ util/tests/ xval/tests clust/tests/ graph/tests/ timeseries/tests/ optimize/tests/ -q;
bash tests/testFiles.bat;

else
echo No kdb+, no tests;
2 changes: 1 addition & 1 deletion build/test.bat
@@ -2,5 +2,5 @@ if defined QLIC_KC (
pip -q install -r requirements.txt
echo getting test.q from embedpy
curl -fsSL -o test.q https://github.com/KxSystems/embedpy/raw/master/test.q
q test.q fresh/tests/ util/tests/ xval/tests/ clust/tests/ graph/tests/ timeseries/tests/ optimize/tests/ -q
call "tests\testFiles.bat"
)
2 changes: 1 addition & 1 deletion clust/README.md
@@ -43,6 +43,6 @@ Documentation is available on the [clustering](https://code.kx.com/v2/ml/toolkit

## Status

The clustering library is still in development and is available here as a beta release. Further functionality and improvements will be made to the library in the coming months.
The clustering library is still in development. Further functionality and improvements will be made to the library on an ongoing basis.

If you have any issues, questions or suggestions, please write to [email protected].
230 changes: 47 additions & 183 deletions clust/aprop.q
@@ -1,197 +1,61 @@
\d .ml
// clust/init.q - Affinity propagation
// Copyright (c) 2021 Kx Systems Inc
//
// Clustering using affinity propagation.
// Affinity Propagation groups data based on the similarity
// between points and subsequently finds exemplars, which best
// represent the points in each cluster. The algorithm does
// not require the number of clusters be provided at run time,
// but determines the optimum solution by exchanging real-valued
// messages between points until a high-valued set of exemplars
// is produced.
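// For reference, the real-valued messages exchanged between points follow
// the standard affinity propagation update rules (Frey & Dueck, Science
// 2007), which the helpers in this file apply in damped form:
//   responsibility: r[i;k] <- s[i;k] - max over k'<>k of (a[i;k'] + s[i;k'])
//   availability:   a[i;k] <- 0 & r[k;k] + sum over i' outside (i;k) of 0|r[i';k]   (i<>k)
//                   a[k;k] <- sum over i'<>k of 0|r[i';k]
// where s, r and a are the similarity, responsibility and availability
// matrices. Points for which a+r is largest at their own index become the
// exemplars, and every other point is assigned to the exemplar maximising a+r.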

// Affinity Propagation
\d .ml

// @kind function
// @category clust
// @fileoverview Fit affinity propagation algorithm
// @param data {float[][]} Data in matrix format, each column is an individual datapoint
// @param df {symbol} Distance function name within '.ml.clust.df'
// @param dmp {float} Damping coefficient
// @param diag {func} Function applied to the similarity matrix diagonal
// @param iter {dict} Max number of overall iterations and iterations
// without a change in clusters. (::) can be passed in which case the defaults
// of (`total`nochange!200 15) will be used
// @return {dict} Data, input variables, clusters and exemplars
// (`data`inputs`clt`exemplars) required for the predict method
clust.ap.fit:{[data;df;dmp;diag;iter]
// @desc Fit affinity propagation algorithm
// @param data {float[][]} Each column of the data is an individual datapoint
// @param df {symbol} Distance function name within '.ml.clust.df'
// @param damp {float} Damping coefficient
// @param diag {fn} Function applied to the similarity matrix diagonal
// @param iter {dictionary} Max number of overall iterations and iterations
// without a change in clusters. (::) can be passed in which case the
// defaults of (`total`noChange!200 15) will be used
// @return {dictionary} Data, input variables, clusters and exemplars
// (`data`inputs`clust`exemplars) required, along with a projection of the
// predict function
clust.ap.fit:{[data;df;damp;diag;iter]
data:clust.i.floatConversion[data];
defaultDict:`run`total`nochange!0 200 15;
defaultDict:`run`total`noChange!0 200 15;
if[iter~(::);iter:()!()];
if[99h<>type iter;'"iter must be (::) or a dictionary"];
// update iteration dictionary with user changes
// Update iteration dictionary with user changes
updDict:defaultDict,iter;
// cluster data using AP algo
clust.i.runap[data;df;dmp;diag;til count data 0;updDict]
// Cluster data using AP algo
modelInfo:clust.i.runAp[data;df;damp;diag;til count data 0;updDict];
returnInfo:enlist[`modelInfo]!enlist modelInfo;
predictFunc:clust.ap.predict returnInfo;
returnInfo,enlist[`predict]!enlist predictFunc
}
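// Illustrative usage of the fit/predict interface (the values below are
// assumptions for demonstration only; assumes the toolkit and the clustering
// library have been loaded, e.g. \l ml/ml.q and .ml.loadfile`:clust/init.q):
//   data:2 100#200?10f                        / 100 2-dimensional points, one per column
//   model:.ml.clust.ap.fit[data;`nege2dist;0.5;med;::]
//   model[`modelInfo]`clust                   / cluster label for each training point
//   model[`predict]2 5#10?10f                 / predicted clusters for new points
// A custom iteration dictionary may be passed in place of (::), for example
// `total`noChange!150 10.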

// @kind function
// @category clust
// @fileoverview Predict clusters using AP config
// @param data {float[][]} Data in matrix format, each column is an individual datapoint
// @param cfg {dict} `data`inputs`clt`exemplars returned by clust.ap.fit
// @return {long[]} List of predicted clusters
clust.ap.predict:{[data;cfg]
// @desc Predict clusters using AP config
// @param config {dictionary} `data`inputs`clust`exemplars returned by the
// modelInfo key from the return of clust.ap.fit
// @param data {float[][]} Each column of the data is an individual datapoint
// @return {long[]} Predicted clusters
clust.ap.predict:{[config;data]
config:config`modelInfo;
data:clust.i.floatConversion[data];
if[-1~first cfg`clt;
'"'.ml.clust.ap.fit' did not converge, all clusters returned -1. Cannot predict new data."];
// retrieve cluster centres from training data
ex:cfg[`data][;distinct cfg`exemplars];
// predict testing data clusters
clust.i.appreddist[ex;cfg[`inputs]`df]each$[0h=type data;flip;enlist]data
}


// Utilities

// @kind function
// @category private
// @fileoverview Run affinity propagation algorithm
// @param data {float[][]} Data in matrix format, each column is an individual datapoint
// @param df {symbol} Distance function name within '.ml.clust.df'
// @param dmp {float} Damping coefficient
// @param diag {func} Function applied to the similarity matrix diagonal
// @param idxs {long[]} List of indices to find distances for
// @param iter {dict} Max number of overall iterations and iterations
// without a change in clusters. (::) can be passed in where the defaults
// of (`total`nochange!200 15) will be used
// @return {long[]} List of clusters
clust.i.runap:{[data;df;dmp;diag;idxs;iter]
// check negative euclidean distance has been given
if[df<>`nege2dist;clust.i.err.ap[]];
// calculate distances, availability and responsibility
info0:clust.i.apinit[data;df;diag;idxs];
// initialize exemplar matrix and convergence boolean
info0,:`emat`conv`iter!((count data 0;iter`nochange)#0b;0b;iter);
// run ap algo until maximum number of iterations completed or convergence
info1:clust.i.apstop clust.i.apalgo[dmp]/info0;
// return data, inputs, clusters and exemplars
inputs:`df`dmp`diag`iter!(df;dmp;diag;iter);
exemplars:info1`exemplars;
clt:$[info1`conv;clust.i.reindex exemplars;count[data 0]#-1];
`data`inputs`clt`exemplars!(data;inputs;clt;exemplars)
}
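// Note: if the run ends without converging (conv=0b), every point is given
// cluster -1 and clust.ap.predict will signal an error for the fitted model
// rather than attempting to predict on new data.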

// @kind function
// @category private
// @fileoverview Initialize matrices
// @param data {float[][]} Data in matrix format, each column is an individual datapoint
// @param df {symbol} Distance function name within '.ml.clust.df'
// @param diag {func} Function applied to the similarity matrix diagonal
// @param idxs {long[]} List of point indices
// @return {dict} Similarity, availability and responsibility matrices
// and keys for matches and exemplars to be filled during further iterations
clust.i.apinit:{[data;df;diag;idxs]
// calculate similarity matrix values
s:clust.i.dists[data;df;data]each idxs;
// update diagonal
s:@[;;:;diag raze s]'[s;k:til n:count data 0];
// create lists/matrices of zeros for other variables
`matches`exemplars`s`a`r!(0;0#0;s),(2;n;n)#0f
}
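// Note: with `nege2dist the similarities are negative squared euclidean
// distances, e.g. the points (0;0) and (3;4) give s = -25f. The diagonal
// values s[k;k], set here to diag applied to all similarities (for example
// the median), act as each point's 'preference' to become an exemplar:
// larger (less negative) values tend to produce more clusters.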

// @kind function
// @category private
// @fileoverview Run affinity propagation algorithm
// @param dmp {float} Damping coefficient
// @param info {dict} Similarity, availability, responsibility, exemplars,
// matches, iter dictionary, no_conv boolean and iter dict
// @return {dict} Updated inputs
clust.i.apalgo:{[dmp;info]
// update responsibility matrix
info[`r]:clust.i.updr[dmp;info];
// update availability matrix
info[`a]:clust.i.upda[dmp;info];
// find new exemplars
ex:imax each sum info`a`r;
// update `info` with new exemplars/matches
info:update exemplars:ex,matches:?[exemplars~ex;matches+1;0]from info;
// update iter dictionary
.[clust.i.apconv info;(`iter;`run);+[1]]
}

// @kind function
// @category private
// @fileoverview Check affinity propagation algorithm for convergence
// @param info {dict} Similarity, availability, responsibility, exemplars,
// matches, iter dictionary, no_conv boolean and iter dict
// @return {dict} Updated info dictionary
clust.i.apconv:{[info]
// iteration dictionary
iter:info`iter;
// exemplar matrix
emat:info`emat;
// existing exemplars
ediag:0<sum clust.i.diag each info`a`r;
emat[;iter[`run]mod iter`nochange]:ediag;
// check for convergence
if[iter[`nochange]<=iter`run;
unconv:count[info`s]<>sum(se=iter`nochange)+0=se:sum each emat;
conv:$[(iter[`total]=iter`run)|not[unconv]&sum[ediag]>0;1b;0b]];
// return updated info
info,`emat`conv!(emat;conv)
}
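// Broadly, the run is treated as converged once the choice of exemplars has
// remained unchanged for iter`nochange consecutive iterations (tracked
// column-wise in emat) and at least one exemplar is present.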

// @kind function
// @category private
// @fileoverview Retrieve diagonal from a square matrix
// @param m {any[][]} Square matrix
// @return {any[]} Matrix diagonal
clust.i.diag:{[m]
{x y}'[m;til count m]
}

// @kind function
// @category private
// @fileoverview Update responsibility matrix
// @param dmp {float} Damping coefficient
// @param info {dict} Similarity, availability, responsibility, exemplars,
// matches, iter dictionary, no_conv boolean and iter dict
// @return {float[][]} Updated responsibility matrix
clust.i.updr:{[dmp;info]
// create matrix with every points max responsibility
// diagonal becomes -inf, current max becomes second max
mxresp:{[x;i]@[count[x]#mx;j;:;]max@[x;i,j:x?mx:max x;:;-0w]};
mx:mxresp'[sum info`s`a;til count info`r];
// calculate new responsibility
(dmp*info`r)+(1-dmp)*info[`s]-mx
}
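// Intuitively, each responsibility r[i;k] measures how much better suited
// point k is to act as the exemplar for point i than i's strongest competing
// candidate, with damping blending the new matrix into the previous one.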

// @kind function
// @category private
// @fileoverview Update availability matrix
// @param dmp {float} Damping coefficient
// @param info {dict} Similarity, availability, responsibility, exemplars,
// matches, iter dictionary, no_conv boolean and iter dict
// @return {float[][]} Returns updated availability matrix
clust.i.upda:{[dmp;info]
// sum values in positive availability matrix
s:sum@[;;:;0f]'[pv:0|info`r;k:til n:count info`a];
// create a matrix using the negative values produced by the availability sum
// + responsibility diagonal - positive availability values
a:@[;;:;]'[0&(s+info[`r]@'k)-/:pv;k;s];
// calculate new availability
(dmp*info`a)+a*1-dmp
}
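// The damping coefficient (typically between 0.5 and 1) weights the previous
// matrices against the newly computed updates, dmp*old + (1-dmp)*new; values
// closer to 1 change the messages more slowly, helping to avoid oscillation
// between candidate exemplar sets.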

// @kind function
// @category private
// @fileoverview Stopping condition for affinity propagation algorithm
// @param info {dict} Similarity, availability, responsibility, exemplars,
// matches, iter dictionary, no_conv boolean and iter dict
// @return {bool} Indicates whether to continue or stop running AP (1/0b)
clust.i.apstop:{[info]
(info[`iter;`total]>info[`iter]`run)&not 1b~info`conv
}

// @kind function
// @category private
// @fileoverview Predict clusters using AP training exemplars
// @param ex {float[][]} Training cluster centres in matrix format,
// each column is an individual datapoint
// @param df {symbol} Distance function name within '.ml.clust.df'
// @param pt {float[]} Current data point
// @return {long[]} Predicted clusters
clust.i.appreddist:{[ex;df;pt]
d?max d:clust.i.dists[ex;df;pt]each til count ex 0
if[-1~first config`clust;
'"'.ml.clust.ap.fit' did not converge, all clusters returned -1.",
" Cannot predict new data."
];
// Retrieve cluster centres from training data
exemp:config[`data][;distinct config`exemplars];
// Predict testing data clusters
data:$[0h=type data;flip;enlist]data;
clust.i.apPredDist[exemp;config[`inputs]`df]each data
}