This MATLAB code implements iterative methods for Principal Component Regression based on the work in Principal Component Projection Without Principal Component Analysis.
Download fastpcr.m
,lanczos.m
,ridgeInv.m
, and robustReg.m
, add to MATLAB path, or include directly in project directory.
Principal Component Regression (PCR) is a common and effective form of regularized linear regression. Given an n x d matrix A and threshold λ, let Pλ be a d x d projection matrix onto the span of all right singular vectors (i.e. principal components) of A with corresponding squared singular value > λ. PCR computes x = Pλ(ATA)-1ATb. In other words, it solves a standard linear regression problem but restricts the solution to lie in the space spanned by A's top singular vectors.
Most implementations of PCR first perform an eigendecoposition of ATA in order to compute Pλ before seperately solving a regression problem. The eigendecomposition is a computational bottle neck.
fastpcr
avoids this step entirely through matrix polynomial methods (either explicit or implicit via the Lanczos method). To do so, it requires access to an algorithm for standard ℓ2 regularized regression. It uses a few calls to this algorithm to construct a solution to the Principal Component Regression problem.
Input:
fastpcr(A, b, lambda, iter, solver, method, tol)
A
: design matrixb
: response vectorlambda
: eigenvalue cut off, default = ||A
TA
||2/100. All eigenvectors ofA
TA
with eigenvalue <lambda
(i.e., all singular vectors ofA
with squared singular value <lambda
) will be ignored for the regression.iter
: number of iterations, default = 40. Each iteration requires the solution of one ridge regression problem onA
with ridge parameterlambda
.solver
: black box routine for ridge regression, default = 'CG'. Set to 'CG' for Conjugate Gradient solver, 'SVRG' for Stochastic Variance Reduced Gradient solver, or any other solver implemented inridgeInv.m
.method
: the technique used for applying matrix polynomials, default = 'LANCZOS'. Set to 'LANCZOS' for a standard Lanczos method analyzed here, or 'EXPLICIT' for the explicit method analyzed in our ICML paper.tol
: accuracy for calls to ridge regression, default 1e-3
Output:
x
: approximate solution to PCR with parameter lambda.
Approximate Principal Component Regression on random dataset
% generate random test problem
A = randn(10000,4000); b = randn(10000,1);
[U,S,V] = svd(A,'econ');
% modify to have (slightly) decaying spectrum
k = 500;
s = diag(S); s(1:k) = s(1:k)+10;
A = U*diag(s)*V';
lambda = (s(k)-5)^2;
% compute approximate PCR solution
tic; x = fastpcr(A, b, lambda); toc;
Elapsed time is 10.399166 seconds.
fastpcr
is typically faster than a direct PCR solve, but is still very accurate.
% compare to direct method
tic;
[V,D] = eig(A'*A);
xDirect = V*(D>lambda)*V'*(A\b);
toc;
Elapsed time is 37.365212 seconds.
% parameter vector error
norm(x-xDirect)/norm(xDirect)
0.0092
% projection error
norm(x - V*(D>lambda)*V'*x)/norm(x)
0.0048
If a higher accuracy solution is required, the iter
and tol
parameters should be increased from the defaults of 40
and 1e-3
:
x = fastpcr(A, b, lambda, 200, 'CG', 'LANCZOS', 1e-8);
norm(x-xDirect)/norm(xDirect)
1.0304e-08
Doing so will slow down fastpcr
: the algorithm requires iter
calls to a ridge regression algorithm that runs to accuracy tol
. The most effective way to improve the runtime of fastpcr
is to provide a faster ridge regression algorithm in ridgeInv.m
. This algorithm can be customized to your dataset and computational environment.
We do not recommend running fastpcr
with the method
option set to EXPLICIT
. Doing so will almost always result in a less accurate solution. A full error analysis of the LANCZOS
method can be found in Stability of the Lanczos Method for Matrix Function Approximation.