Refactor PCA and docs (JuliaStats#163)
* add new DR subtypes and subtype PCA class

* add `loadings` & fix tests (close JuliaStats#123)

* migrated PCA docs

* allow the nightly build to fail in CI

* fixed deprecated calls in tests

* Relax type-asserts in PCA (close JuliaStats#140, close JuliaStats#141)
wildart authored Oct 28, 2021
1 parent 16ceda5 commit c0f74de
Showing 10 changed files with 337 additions and 81 deletions.
9 changes: 8 additions & 1 deletion .github/workflows/ci.yml
@@ -9,19 +9,26 @@ jobs:
if: "!contains(github.event.head_commit.message, 'skip ci')"
name: Julia ${{ matrix.version }} - ${{ matrix.os }} - ${{ matrix.arch }} - ${{ github.event_name }}
runs-on: ${{ matrix.os }}
continue-on-error: ${{ matrix.allow-to-fail }}
strategy:
fail-fast: false
matrix:
version:
- '1.1'
- '1' # automatically expands to the latest stable 1.x release of Julia
- 'nightly'
# - 'nightly'
os:
- ubuntu-latest
# - macOS-latest
# - windows-latest
arch:
- x64
allow-to-fail: [false]
include:
- version: 'nightly'
os: ubuntu-latest
arch: x64
allow-to-fail: true
steps:
- uses: actions/checkout@v2
- uses: julia-actions/setup-julia@v1
2 changes: 2 additions & 0 deletions docs/Project.toml
@@ -1,6 +1,8 @@
[deps]
Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4"
MultivariateStats = "6f286f6a-111f-5878-ab1e-185364afe411"
Plots = "91a5bcdd-55d7-5caf-9e0b-520d859cae80"
RDatasets = "ce6b1742-4840-55fa-b093-852dadbb1d8b"
StatsBase = "2913bbd2-ae8a-5f71-8c99-4fb6c76f3a91"

[compat]
4 changes: 3 additions & 1 deletion docs/make.jl
@@ -7,7 +7,9 @@ end
makedocs(
sitename = "MultivariateStats.jl",
modules = [MultivariateStats],
pages = ["Home"=>"index.md", "whiten.md", "lda.md", "Development"=>"api.md"]
pages = ["Home"=>"index.md",
"whiten.md", "lda.md", "pca.md",
"Development"=>"api.md"]
)

deploydocs(
37 changes: 18 additions & 19 deletions docs/src/api.md
@@ -20,8 +20,8 @@ Table of the package models and corresponding function names used by these model
|loadings | ? | | | ? | x | x | ? | ? | ? |
|eigvals | | | | | ? | ? | ? | ? | x |
|eigvecs | | | | | ? | ? | ? | ? | ? |
|length | | | | | | | | | |
|size | | | | | | | | | |
|length | | | | | | | | | |
|size | | | | | | | | | |

Note: `?` refers to a possible implementation that is missing or called differently.

@@ -35,44 +35,43 @@ Note: `?` refers to a possible implementation that is missing or called differen
|indim | - | | | - | - | x | x | x | x | x | x |
|outdim | - | x | | - | - | x | x | x | x | x | x |
|mean | x | x | | x | x | x | x | x | x | ? | |
|var | | | | | | | x | x | ? | ? | ? |
|cov | | | | | | | x | ? | | | |
|var | | | | | | | x | x | x | ? | ? |
|cov | | | | | | | x | x | | | |
|cor | | x | | | | | | | | | |
|projection | ? | x | | x | x | | x | x | x | x | x |
|reconstruct | | | | | | | x | x | x | x | |
|loadings | | ? | | | | | x | x | ? | ? | ? |
|eigvals | | | | | + | | ? | ? | ? | ? | x |
|eigvecs | | | | | | | ? | ? | ? | ? | ? |
|loadings | | ? | | | | | x | x | x | ? | ? |
|eigvals | | | | | + | | ? | ? | x | ? | x |
|eigvecs | | | | | | | ? | ? | x | ? | ? |
|length | + | | + | + | + | | | | | | |
|size | + | | | + | + | | | | | | |
|size | + | | | + | + | | | | x | | |
| | | | | | | | | | | | |

- StatsBase.AbstractDataTransform
- Whitening
- Interface: fit, transfrom
- Interface: fit, transform
- New: length, mean, size
- StatsBase.RegressionModel
- *Interface:* fit, predict
- LinearDiscriminant
- Methods:
- Interface: fit, predict, coef, dof, weights
- New: evaluate, length
- Functions: coef, dof, weights, evaluate, length
- MulticlassLDA
- Methods: fit, predict, size, mean, projection
- New: length
- Functions: size, mean, projection, length
- SubspaceLDA
- Methods: fit, predict, size, mean, projection
- New: length, eigvals
- Functions: size, mean, projection, length, eigvals
- CCA
- Methods: fit, transfrom, indim, outdim, mean
- Functions: indim, outdim, mean
- Subtypes:
- AbstractDimensionalityReduction
- Methods: projection, var, reconstruct, loadings
- *Interface:* projection, var, reconstruct, loadings
- *Interface:* projection == weights
- Subtypes:
- LinearDimensionalityReduction
- Methods: ICA, PCA
- NonlinearDimensionalityReduction
- Methods: KPCA, MDS
- Functions: modelmatrix (X),
- LatentVariableModel or LatentVariableDimensionalityReduction
- Methods: FA, PPCA
- Methods: cov
- Functions: cov

2 changes: 1 addition & 1 deletion docs/src/index.md
@@ -11,7 +11,7 @@ end
[MultivariateStats.jl](https://github.com/JuliaStats/MultivariateStats.jl) is a Julia package for multivariate statistical analysis. It provides a rich set of useful analysis techniques, such as PCA, CCA, LDA, ICA, etc.

```@contents
Pages = ["whiten.md", "lda.md", "api.md"]
Pages = ["whiten.md", "lda.md", "pca.md", "api.md"]
Depth = 2
```

90 changes: 90 additions & 0 deletions docs/src/pca.md
@@ -0,0 +1,90 @@
# Principal Component Analysis

[Principal Component Analysis](http://en.wikipedia.org/wiki/Principal_component_analysis) (PCA) derives an orthogonal projection to convert a given set of observations to linearly uncorrelated variables, called *principal components*.
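
To make the definition concrete, here is a minimal sketch of the underlying linear algebra using only Julia's standard libraries. It is not the package's implementation: the toy matrix `X` and the names `Z`, `C`, `P`, `Y`, `Xr`, and `CY` are assumptions chosen for illustration, and the sketch takes the textbook route of an eigendecomposition of the sample covariance matrix.

```julia
using LinearAlgebra, Statistics

# Toy data (assumed for illustration): 3 variables (rows) × 6 observations (columns).
X = [2.5 0.5 2.2 1.9 3.1 2.3;
     2.4 0.7 2.9 2.2 3.0 2.7;
     1.0 0.2 0.9 1.1 1.4 1.2]

n = size(X, 2)
μ = mean(X, dims=2)               # per-variable mean
Z = X .- μ                        # centred data
C = Z * Z' / (n - 1)              # sample covariance matrix (3×3)

E = eigen(Symmetric(C))           # eigenvalues come back in ascending order
order = sortperm(E.values, rev=true)
λ = E.values[order]               # principal variances, largest first
P = E.vectors[:, order[1:2]]      # orthonormal basis of the first 2 principal directions

Y = P' * Z                        # principal components (2×6)
Xr = P * Y .+ μ                   # approximate reconstruction in the original space

# Covariance of the components: off-diagonal entries vanish (up to rounding),
# which is what "linearly uncorrelated" means here.
CY = Y * Y' / (n - 1)
```

The columns of `P` are orthonormal, so projecting is `P' * Z` and mapping back is `P * Y`; the reconstruction is only approximate because the third direction is discarded.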

## Example

Performing [`PCA`](@ref) on the *Iris* data set:

```@example PCAex
using MultivariateStats, RDatasets, Plots
plotly() # using plotly for 3D-interactive graphing
# load iris dataset
iris = dataset("datasets", "iris")
# split half to training set
Xtr = Matrix(iris[1:2:end,1:4])'
Xtr_labels = Vector(iris[1:2:end,5])
# split other half to testing set
Xte = Matrix(iris[2:2:end,1:4])'
Xte_labels = Vector(iris[2:2:end,5])
nothing # hide
```

Suppose `Xtr` and `Xte` are the training and testing data matrices, with each observation in a column.
We train a PCA model, allowing up to 3 dimensions:

```@example PCAex
M = fit(PCA, Xtr; maxoutdim=3)
```

Then, apply the PCA model to the testing set:

```@example PCAex
Yte = predict(M, Xte)
```

And reconstruct the testing observations (approximately) in the original space:

```@example PCAex
Xr = reconstruct(M, Yte)
```

Now, we group the results by testing set labels for color coding and visualize the first 3 principal
components in a 3D interactive plot:

```@example PCAex
setosa = Yte[:,Xte_labels.=="setosa"]
versicolor = Yte[:,Xte_labels.=="versicolor"]
virginica = Yte[:,Xte_labels.=="virginica"]
p = scatter(setosa[1,:],setosa[2,:],setosa[3,:],marker=:circle,linewidth=0)
scatter!(versicolor[1,:],versicolor[2,:],versicolor[3,:],marker=:circle,linewidth=0)
scatter!(virginica[1,:],virginica[2,:],virginica[3,:],marker=:circle,linewidth=0)
plot!(p,xlabel="PC1",ylabel="PC2",zlabel="PC3")
```

## Linear Principal Component Analysis

This package uses the [`PCA`](@ref) type to define a linear PCA model:

```@docs
PCA
```

This type comes with several methods, where ``M`` is an instance of [`PCA`](@ref),
``d`` is the dimension of observations, and ``p`` is the output dimension (*i.e.* the dimension of the principal subspace).

```@docs
fit(::Type{PCA}, ::AbstractMatrix{T}; kwargs) where {T<:Real}
predict(::PCA, ::AbstractVecOrMat{T}) where {T<:Real}
reconstruct(::PCA, ::AbstractVecOrMat{T}) where {T<:Real}
size(::PCA)
mean(M::PCA)
projection(M::PCA)
var(M::PCA)
tprincipalvar(M::PCA)
tresidualvar(M::PCA)
r2(M::PCA)
loadings(M::PCA)
eigvals(M::PCA)
eigvecs(M::PCA)
```

Auxiliary functions:
```@docs
pcacov
pcasvd
```
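
The names `pcacov` and `pcasvd` suggest a covariance-based route and an SVD-based route to the same model. Assuming that reading, the sketch below shows with standard-library calls only why the two routes agree on the principal variances. It does not call either function; the toy data and the names `λ_cov` and `λ_svd` are illustrative assumptions.

```julia
using LinearAlgebra, Statistics

# Toy data (assumed for illustration): 3 variables (rows) × 6 observations (columns).
X = [2.5 0.5 2.2 1.9 3.1 2.3;
     2.4 0.7 2.9 2.2 3.0 2.7;
     1.0 0.2 0.9 1.1 1.4 1.2]

n = size(X, 2)
Z = X .- mean(X, dims=2)          # centred data

# Route 1: eigenvalues of the sample covariance matrix, largest first.
λ_cov = sort(eigvals(Symmetric(Z * Z' / (n - 1))), rev=true)

# Route 2: singular values of the centred data (returned in descending order),
# squared and rescaled by the same n - 1 factor.
σ = svdvals(Z)
λ_svd = σ .^ 2 ./ (n - 1)
```

The SVD route avoids forming `Z * Z'` explicitly, which tends to be numerically gentler when variables are nearly collinear.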
15 changes: 10 additions & 5 deletions src/MultivariateStats.jl
@@ -4,9 +4,10 @@ module MultivariateStats
AbstractDataTransform, pairwise!
import Statistics: mean, var, cov, covm
import Base: length, size, show, dump
import StatsBase: fit, predict, predict!, ConvergenceException, dof_residual, coef, weights, dof, pairwise
import StatsBase: fit, predict, predict!, ConvergenceException, coef, weights,
dof, pairwise, r2
import SparseArrays
import LinearAlgebra: eigvals
import LinearAlgebra: eigvals, eigvecs

export

@@ -45,7 +46,7 @@ module MultivariateStats

tprincipalvar, # total principal variance, i.e. sum(principalvars(M))
tresidualvar, # total residual variance
tvar, # total variance
loadings, # model loadings

## ppca
PPCA, # Type: the Probabilistic PCA model
@@ -97,8 +98,7 @@ module MultivariateStats
betweenclass_scatter, # between-class scatter matrix
multiclass_lda_stats, # compute statistics for multiclass LDA training
multiclass_lda, # train multi-class LDA based on statistics
mclda_solve, # solve multi-class LDA projection given scatter matrices
mclda_solve!, # solve multi-class LDA projection (inputs are overriden)
mclda_solve, # solve multi-class LDA projection given scatter matrices

## ica
ICA, # Type: the Fast ICA model
@@ -113,6 +113,7 @@ module MultivariateStats
facm # EM algorithm for probabilistic PCA

## source files
include("types.jl")
include("common.jl")
include("lreg.jl")
include("whiten.jl")
@@ -132,6 +133,10 @@ module MultivariateStats
@deprecate outdim(f::MulticlassLDA) size(f::MulticlassLDA)[2]
@deprecate indim(f::SubspaceLDA) size(f::SubspaceLDA)[1]
@deprecate outdim(f::SubspaceLDA) size(f::SubspaceLDA)[2]
@deprecate indim(f::PCA) size(f::PCA)[1]
@deprecate outdim(f::PCA) size(f::PCA)[2]
@deprecate tvar(f::PCA) var(f::PCA) # total variance
@deprecate transform(f::PCA, x) predict(f::PCA, x) #ex=false
# @deprecate transform(m, x; kwargs...) predict(m, x; kwargs...) #ex=false
# @deprecate transform(m; kwargs...) predict(m; kwargs...) #ex=false
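
The deprecations above forward the old accessors (`indim`, `outdim`) to the new `size`-based interface. The following self-contained sketch shows how that pattern behaves, using a hypothetical `ToyModel` type rather than the package's `PCA`; the type, its `proj` field, and the variable `m` are assumptions made for illustration.

```julia
# Hypothetical stand-in for a fitted model; not a MultivariateStats type.
struct ToyModel
    proj::Matrix{Float64}   # d×p projection matrix
end

# New-style accessor: returns (input dimension, output dimension).
Base.size(m::ToyModel) = size(m.proj)

# Old-style accessors forward to `size`. The trailing `false` keeps the
# deprecated names unexported.
@deprecate indim(m::ToyModel) size(m)[1] false
@deprecate outdim(m::ToyModel) size(m)[2] false

m = ToyModel(zeros(4, 3))
d, p = size(m)              # preferred call after the refactor
```

Calling `indim(m)` still returns the input dimension, and Julia reports a deprecation warning when it is started with `--depwarn=yes`, so callers can migrate to `size(m)[1]` at their own pace.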
