The Elements of Statistical Learning
Chapter 14
Unsupervised Learning
Episode 3 of 3 (§4 to §7)
BCI
Unsupervised Learning (§4 to §7)
14.4 Self-Organizing Maps
14.5 Principal Components
Principal Curves
Principal Surfaces
14.6 Factor Analysis
Independent Component Analysis
Exploratory Projection Pursuit
14.7 Multidimensional Scaling
14.4 Self-Organizing Maps (SOMs)
Constrained version of K-means (VQ)
with prototypes on a topological map (grid)
For K prototypes mj in ℝp placed on a given map :
- Find the closest mj to the considered xi
- Move mj and its neighbours mk towards xi
Running the algorithm moves the prototypes
inside the input distribution w.r.t. map constraints
$m_k \leftarrow m_k + \alpha \, (x_i - m_k)$
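A minimal numpy sketch of this online update (the 5x5 grid, the fixed α and r, and the toy data are illustrative assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup: a 5x5 map of prototypes for 2-D inputs
q1, q2, p = 5, 5, 2
grid = np.array([(a, b) for a in range(q1) for b in range(q2)])  # map coordinates l_j
M = rng.uniform(size=(q1 * q2, p))                               # prototypes m_j in R^p
X = rng.uniform(size=(1000, p))                                  # toy inputs x_i

alpha, r = 0.5, 2.0  # in practice both should decrease over time
for x in X:
    j = np.argmin(((M - x) ** 2).sum(axis=1))       # closest prototype m_j
    near = np.abs(grid - grid[j]).sum(axis=1) <= r  # neighbours m_k within map distance r
    M[near] += alpha * (x - M[near])                # m_k <- m_k + alpha (x_i - m_k)
```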
SOMs have two parameters
Neighbours are found according to the prototypes'
parametrization $l_j \in \{1, 2, \ldots, q_1\} \times \{1, 2, \ldots, q_2\}$
Neighbourhood relationship limited
to a maximum distance r between mj and mk
‘Performance’ depends on the learning rate α,
the maximum neighbourhood distance r,
(and the way those two parameters decrease)
SOMs can be extended
Presented version is the ‘basic online’ algorithm
Enhanced version :
$m_k \leftarrow m_k + \alpha \, h(\| l_j - l_k \|) \, (x_i - m_k)$
Batch one :
$m_j \leftarrow \frac{\sum_k w_k x_k}{\sum_k w_k}$
Further developments : WEBSOM, …
Can SOMs separate sphere data ?
Well … and so what ?
VQ aims to reduce the inputs (K≤N)
Vector Quantization (VQ) from 14.3.9 :
Neural VQ algorithms are
an adaptive (iterated) version of K-means
We aim to :
We use :
SOMs are a constrained VQ algorithm
Intuition
Input space (concrete) Predefined map (abstract)
On ‘simple drawings’ and ‘long speeches’ … ;)
It’s just an algorithm … so run it !
A ‘dynamic’ example
(better) illustrates the SOMs
Source (modified) : Growing Neural Gas
http://www.neuroinformatik.ruhr-uni-bochum.de/ini/VDM/research/gsn/DemoGNG/GNG_2.html
14.5 Principal Components (PCA)
Set of q directions towards which
p-dimensional data are linearly projected
-> dimension reduction since q ≤ p
Data : x1, x2, …, xn
Model : f(λ) = µ + Vq λ
Model fitted by least squares
w.r.t. reconstruction error
$\min_{\mu, \{\lambda_i\}, V_q} \sum_{i=1}^{N} \left\| x_i - \mu - V_q \lambda_i \right\|^2$
Principal Components are computed
from matrix operations
Considering $\hat{\mu} = \bar{x}$ and $\hat{\lambda}_i = V_q^T (x_i - \bar{x})$, solve
$\min_{V_q} \sum_{i=1}^{N} \left\| (x_i - \bar{x}) - V_q V_q^T (x_i - \bar{x}) \right\|^2$
Data are usually centered,
and $V_q V_q^T = H_q$ is the projection matrix
The solution $V_q$ is the first q columns of V obtained from the SVD $X = U D V^T$
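A short numpy sketch of these operations (the toy data and function name are illustrative):

```python
import numpy as np

def pca(X, q):
    """Principal components via the SVD X = U D V^T of the centered data."""
    mu = X.mean(axis=0)                                   # mu_hat = x_bar
    U, D, Vt = np.linalg.svd(X - mu, full_matrices=False)
    Vq = Vt[:q].T                                         # first q columns of V
    lam = (X - mu) @ Vq                                   # lambda_hat_i = Vq^T (x_i - x_bar)
    return mu, Vq, lam

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
mu, Vq, lam = pca(X, q=2)
X_hat = mu + lam @ Vq.T  # reconstruction f(lambda) = mu + Vq lambda
```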
PCA can project data
PCA can separate sphere data
PCA can be used for handwritten digits
Data in ℝ256 → 2-D projections
$\hat{f}(\lambda) = \bar{x} + \lambda_1 v_1 + \lambda_2 v_2$
PCA is already known
Chapter 3 :
SVD of X and the decomposition $X^T X = V D^2 V^T$
Chapter 3 :
PCA regression and ridge regression
Known tool …
14.5 Principal Curves (PCurves)
Let f(λ) be a parametrized smooth curve in ℝp
Each coordinate of p-dimensional f(λ)
is a smooth function of λ (e.g. arc-length)
Then $f(\lambda) = E\left[ X \mid \lambda_f(X) = \lambda \right]$ is a principal curve
Related concepts :
- responsibility
- principal points
PCurves are a smooth
generalization of (linear) PCA
PCurves can be obtained automatically
Algorithm principle : alternate two steps
1. Consider the PCurve coordinates one by one :
$\hat{f}_j(\lambda) \leftarrow E\left[ X_j \mid \lambda(X) = \lambda \right]$
2. Reassign each point to its closest curve point :
$\hat{\lambda}_f(x) \leftarrow \arg\min_{\lambda'} \| x - \hat{f}(\lambda') \|$
Those two steps are repeated until convergence
(a sketch follows below)
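A toy numpy sketch of this two-step loop, assuming a crude moving-average smoother and initialization by the first principal component score (both are arbitrary choices, not prescribed by the slides):

```python
import numpy as np

def principal_curve(X, n_iter=10, span=0.1):
    """Two-step principal-curve sketch: smooth each coordinate, then reproject."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    lam = Xc @ Vt[0]                # init: first PC score as lambda
    k = max(3, int(span * len(X)))  # moving-average window size
    f = np.empty_like(Xc)
    for _ in range(n_iter):
        order = np.argsort(lam)
        # Step 1: smooth each coordinate X_j against lambda (boundary effects ignored)
        for j in range(X.shape[1]):
            f[order, j] = np.convolve(Xc[order, j], np.ones(k) / k, mode="same")
        # Step 2: set lambda(x) to the index of the closest point on the fitted curve
        curve = f[order]
        d2 = ((Xc[:, None, :] - curve[None, :, :]) ** 2).sum(axis=2)
        lam = np.argmin(d2, axis=1).astype(float)
    return f, lam
```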
14.5 Principal Surfaces (PSurfaces)
Generalization to more than one parameter
(but rarely more than two)
Algorithm for PCurves can be applied
for PSurfaces
Links with the SOMs
$f(\lambda_1, \lambda_2) = \left[ f_1(\lambda_1, \lambda_2),\, f_2(\lambda_1, \lambda_2),\, \ldots,\, f_p(\lambda_1, \lambda_2) \right]$
PSurfaces can separate sphere data
14.6 Factor Analysis (FA)
Goal : find the latent variables
within the observed ones
Latent representation of data (using SVD) :
$X = U D V^T = S A^T$
with $S = \sqrt{N} \, U$ and $A^T = D V^T / \sqrt{N}$
PCA estimates a latent variable model
$X_1 = a_{11} S_1 + a_{12} S_2 + \cdots + a_{1p} S_p$
$X_2 = a_{21} S_1 + a_{22} S_2 + \cdots + a_{2p} S_p$
⋮
$X_p = a_{p1} S_1 + a_{p2} S_2 + \cdots + a_{pp} S_p$
or $X = A S$
Due to the hypotheses on X there are many solutions, as
$X = A S = A R^T R S = A^* S^*$
for any orthogonal R : $R R^T = I$ and $\mathrm{Cov}(S^*) = \mathrm{Cov}(S)$
(a numerical check follows below)
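A tiny numpy check of this rotation ambiguity (the rotation angle and toy sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
S = rng.normal(size=(2, 500))  # latent variables (one per row)
A = rng.normal(size=(2, 2))    # loadings
theta = 0.7                    # any rotation angle works
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
# A* = A R^T and S* = R S reproduce X exactly
print(np.allclose(A @ S, (A @ R.T) @ (R @ S)))  # True
```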
Factor Analysis avoids
the multi-solution problem
Idea : use q < p latent variables
$X_1 = a_{11} S_1 + \cdots + a_{1q} S_q + \varepsilon_1$
$X_2 = a_{21} S_1 + \cdots + a_{2q} S_q + \varepsilon_2$
⋮
$X_p = a_{p1} S_1 + \cdots + a_{pq} S_q + \varepsilon_p$
or $X = A S + \varepsilon$, with $\varepsilon_j$ the unique information of $X_j$
Parameters are given by the covariance matrix
$\Sigma = A A^T + D_\varepsilon$
with $D_\varepsilon = \mathrm{diag}\left[ \mathrm{Var}(\varepsilon_1), \ldots, \mathrm{Var}(\varepsilon_p) \right]$
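A hedged sketch with scikit-learn's FactorAnalysis, assuming toy data generated from the model X = A S + ε (sizes and noise level are made up):

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
S = rng.normal(size=(500, 2))                  # q = 2 latent factors
A = rng.normal(size=(6, 2))                    # loadings, p = 6 observed variables
X = S @ A.T + 0.3 * rng.normal(size=(500, 6))  # X = A S + eps

fa = FactorAnalysis(n_components=2).fit(X)
# Model covariance Sigma = A A^T + D_eps, rebuilt from the fitted parameters
Sigma_hat = fa.components_.T @ fa.components_ + np.diag(fa.noise_variance_)
print(np.round(Sigma_hat - np.cov(X, rowvar=False), 2))  # ~ 0 everywhere
```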
Factor Analysis avoids only partially
the multi-solution problem
Still many possible solutions,
since $A$ and $A R^T$ are equivalent in $\Sigma = A A^T + D_\varepsilon$
Presented as an ‘advantage’ of the method :
allows rotation of the selected factors
(interpretation purpose)
PCA and FA are related
FA can be generalized
PCA : all variability in an item should be used
FA : only the variability in common is used
In most cases, the methods yield similar results
PCA is preferred for data reduction
FA is preferred for structure detection
$X_1 = a_{11} F_1 + a_{12} F_2 + \cdots + a_{1q} F_q + \varepsilon_1$
⋮
$X_p = a_{p1} F_1 + a_{p2} F_2 + \cdots + a_{pq} F_q + \varepsilon_p$
14.6 Independent Component Analysis (ICA)
Goal : find source signals S from mixed ones X
ICA helps to recover source signals
from mixed ones
Random variables are
- uncorrelated if second cross-moment is zero
- independent if all cross-moments are zero
$X_1 = a_{11} S_1 + a_{12} S_2 + \cdots + a_{1p} S_p$
$X_2 = a_{21} S_1 + a_{22} S_2 + \cdots + a_{2p} S_p$
⋮
$X_p = a_{p1} S_1 + a_{p2} S_2 + \cdots + a_{pp} S_p$
ICA is based on independence
(and not orthogonality)
Thus solve $X = A S$ and find $A$
s.t. the $S_j$ are independent
Then $S = B X$ with $B = A^{-1}$
But an alternative formulation is needed :
$A$ and $S$ are both unknown !
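An illustrative unmixing run with scikit-learn's FastICA (the two toy sources and the mixing matrix are made up for the example):

```python
import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 8, 2000)
S = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t))]  # two nongaussian sources
A = np.array([[1.0, 0.5],
              [0.5, 2.0]])                        # 'unknown' mixing matrix
X = S @ A.T                                       # observed mixtures X = A S

ica = FastICA(n_components=2, random_state=0)
S_hat = ica.fit_transform(X)  # recovered sources (up to order and scale)
A_hat = ica.mixing_           # estimated mixing matrix
```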
ICA is also subject to two ambiguities
There is an infinity of solutions :
$(A / c)$ compensates for $(c \cdot S)$
The variance of $S_j$ is thus fixed to be 1
The order of the components cannot be determined :
$X = A S = A P^{-1} P S = A' S'$ for any permutation P
Does order really matter ?
ICA can not cope with Gaussian signals
Suppose $A$ is orthogonal, $s_1$ and $s_2$ are Gaussian
$x_1$ and $x_2$ are therefore Gaussian,
uncorrelated, with unit variance
The joint density is
$p(x_1, x_2) = \frac{1}{2\pi} \exp\left( - \frac{x_1^2 + x_2^2}{2} \right)$
which is symmetric : no information on $A$
The key to ICA model :
maximize nongaussianity
Let X be a mixture of independent components
A single component can be found :
$Y = B^T X = \sum_i b_i x_i$
Let $z = A^T B$, then
$Y = B^T X = B^T A \, s = z^T s$
Then refer to a broad interpretation of the CLT :
a sum of independent variables is ‘more Gaussian’
than the variables themselves, so Y is closest
to a single source where it is least Gaussian
Kurtosis is a measure of nongaussianity
Simple but not robust to outliers
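A quick scipy illustration (excess kurtosis, Fisher definition; the sample sizes are arbitrary):

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(0)
print(kurtosis(rng.normal(size=100_000)))   # ~  0    (Gaussian)
print(kurtosis(rng.uniform(size=100_000)))  # ~ -1.2  (sub-Gaussian)
print(kurtosis(rng.laplace(size=100_000)))  # ~  3    (super-Gaussian)
```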
Negentropy is a measure of nongaussianity
The differential entropy is
$H(Y) = - \int g(y) \log g(y) \, \mathrm{d}y$
From Information Theory :
entropy is maximum for Gaussian signals
Use negentropy instead,
$J(Y) = H(Y_{\mathrm{gauss}}) - H(Y)$
but it needs to be estimated …
Mutual Information
can also measure nongaussianity
$I(Y) = \sum_{j=1}^{p} H(Y_j) - H(Y)$
a.k.a. Kullback-Leibler distance
between g(y) and the product of its marginal densities
Finding $B$ that minimizes mutual information
is equivalent to finding directions
in which negentropy is maximized
Preprocessing improves ICA
Centering : X has zero mean
Whitening : find $X'$ s.t. $E[X' X'^T] = I$ (sketched below)
Others : application dependent
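A minimal sketch of centering plus PCA-based whitening (one of several possible whitening transforms; the toy data are illustrative):

```python
import numpy as np

def whiten(X):
    """Center X, then rotate/rescale so that the sample covariance becomes I."""
    Xc = X - X.mean(axis=0)          # centering: zero mean
    C = np.cov(Xc, rowvar=False)     # sample covariance
    eigval, E = np.linalg.eigh(C)    # C = E diag(eigval) E^T
    return Xc @ E / np.sqrt(eigval)  # whitened data X'

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3)) @ rng.normal(size=(3, 3))
print(np.round(np.cov(whiten(X), rowvar=False), 2))  # ~ identity
```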
ICA works better than PCA
ICA can be used for handwritten digits
14.6 Exploratory Projection Pursuit
Chapter 11
Find ‘interesting’ directions onto which
multidimensional data can be projected
‘Interesting’ means the projection displays
some structure of the data distribution
Gaussian distributions are the least interesting
(from a structure point of view),
thus find nongaussian distributions (as in ICA)
14.6 A different approach for ICA
The data are assigned to a class G = 1 and
compared to a ‘reference signal’ of class G = 0
Kind of generalized logistic regression
Projection Pursuit Regression techniques
can therefore be applied
$\log \frac{\Pr(G=1)}{1 - \Pr(G=1)} = f_1(a_1^T X) + f_2(a_2^T X)$
14.7 Multidimensional Scaling (MDS)
Aims to project data, as SOMs, PCurves, …
but tries to preserve pairwise distances
Based on distance or dissimilarity (only)
Goal : find values minimizing the stress function
Least squares or Kruskal-Shephard scaling :
$d_{ij} = \| x_i - x_j \|$
$S_D(z_1, z_2, \ldots, z_N) = \sum_{i \neq j} \left( d_{ij} - \| z_i - z_j \| \right)^2$
(an illustration follows below)
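An illustrative run with scikit-learn's metric MDS on a precomputed distance matrix (toy data; scikit-learn's SMACOF solver minimizes this kind of stress):

```python
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))                        # high-dimensional data
D = np.linalg.norm(X[:, None] - X[None, :], axis=2)  # d_ij = ||x_i - x_j||

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
Z = mds.fit_transform(D)  # 2-D configuration z_i found by minimizing the stress
print(mds.stress_)        # final stress value
```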
MDS can be used with other scalings
Classical scaling minimizes (over similarities $s_{ij}$)
$\sum_{i \neq j} \left( s_{ij} - \langle z_i - \bar{z},\, z_j - \bar{z} \rangle \right)^2$
Sammon’s mapping minimizes
$\sum_{i \neq j} \frac{\left( d_{ij} - \| z_i - z_j \| \right)^2}{d_{ij}}$
Shephard-Kruskal nonmetric scaling minimizes
$\frac{\sum_{i \neq j} \left[ \| z_i - z_j \| - \theta(d_{ij}) \right]^2}{\sum_{i \neq j} d_{ij}^2}$
MDS can separate sphere data
Unsupervised Learning (§4 to §7)
SOMs : a constrained VQ algorithm
PCA : finds (linearly) orthogonal components
PCurves : a smooth generalization of PCA
PSurfaces : a 2(+)-D generalization of PCurves
FA : finds common (correlated) component parts
ICA : finds independent components
EPP : can be related to ICA
MDS : tries to preserve pairwise distances