中国科技论文在线 (Sciencepaper Online)
A Study of Kernel Selection for the Relevance Vector Machine
PENG Yang, LI Dehua
School of Automation, Huazhong University of Sci. & Tech., Wuhan
Abstract: In this paper, a method of kernel selection for the relevance vector machine (RVM) is proposed. Former research on the RVM usually focuses on the advantages of the algorithm in sparsity, which gives it less opportunity for over-learning; on how to accelerate the algorithm for needs such as large-scale sets; and on how to improve the precision of the algorithm by combining it with other algorithms such as PCA or SVM. The parameters of the kernel functions are regarded as more important than the functions themselves, so how to select an appropriate kernel and its parameters remains a confusing question in the application of the RVM. In this paper, a method of kernel selection for the RVM is proposed. Firstly, by simulation experiment, the relationship between σ in the Gaussian kernel function and the regression result is discovered. Secondly, by analyzing the reason for the experimental result, a common principle for kernel selection is presented. Finally, based on this principle, a composite kernel and the relation between different parameters, the number of RVs, and the error rate in regression are tested by simulation experiment.
Key words: Pattern Recognition; Relevance Vector Machine; Kernel Function
0 Introduction
The support vector machine (SVM) is a very successful approach to supervised learning. The key feature of the SVM in the classification case is that its target function attempts to minimize a measure of error on the training set while simultaneously maximizing the 'margin' between the separated classes[1].

However, the support vector methodology does exhibit significant disadvantages:

(1) Predictions are not probabilistic. In regression the SVM outputs a point estimate, and in classification a 'hard' binary decision. Ideally, we desire to estimate the conditional distribution p(t|x) in order to capture uncertainty in our prediction. In regression this may take the form of 'error bars', but it is particularly crucial in classification, where posterior probabilities of class membership are necessary to adapt to varying class priors and asymmetric misclassification costs.

(2) Although relatively sparse, SVMs make liberal use of kernel functions, the requisite number of which grows steeply with the size of the training set.

(3) It is necessary to estimate the error/margin trade-off parameter 'C' and, in regression, the insensitivity parameter 'ε' too. This generally entails a cross-validation procedure, which is wasteful both of data and computation.

(4) The kernel function k(·,·) must satisfy Mercer's condition [2][3].
In 2000, Michael E. Tipping introduced the 'relevance vector machine' (RVM), a probabilistic sparse kernel model identical in functional form to the SVM. In the RVM, a Bayesian approach to learning is adopted, where a prior over the weights is governed by a set of hyperparameters, one associated with each weight, whose most probable values are iteratively estimated from the data. In practice we find that the posterior distributions of many of the weights are sharply peaked around zero.

1 Background

1.1 Comparison of SVM and RVM

Tipping (2001)[1] compared the SVM and RVM for regression and classification; the results are in the tables below:

Brief author introduction: PENG Yang (1989-), Male, Master. Main research: artificial intelligence.
Correspondence author: LI Dehua (1946-), Male, Professor. Main research: artificial intelligence. E-mail: 727643697@
Tab.1 Comparison of RVM & SVM in Regression[1]

                               errors          vectors
Data set               N    d    SVM     RVM     SVM    RVM
Sinc (Gaussian noise)  100  1    0.378   0.326   45.2   6.7
Sinc (Uniform noise)   100  1    0.215   0.187   44.3   7.0
Friedman #2            240  4    4140    3505    110.3  6.9
Friedman #3            240  4    0.0202  0.0164  106.5  11.5
Boston Housing         481  13   8.04    7.46    142.8  39.0
Normalized Mean                  1.00    0.86    1.00   0.15
Tab.2 Comparison of RVM & SVM in Classification[1]

                             errors          vectors
Data set        N     d    SVM     RVM     SVM     RVM
Pima Diabetes   200   8    20.1%   19.6%   109     4
U.S.P.S.        7291  256  4.4%    5.1%    2540    316
Banana          400   2    10.9%   10.8%   135.2   11.4
Breast Cancer   200   9    26.9%   29.9%   116.7   6.3
Titanic         150   3    22.1%   23.0%   93.7    65.3
Waveform        400   21   10.3%   10.9%   116.7   6.3
German          700   20   22.6%   22.2%   411.2   12.5
Image           1300  18   3.0%    3.9%    166.6   34.6
Normalized Mean            1.00    1.08    1.00    0.17
Wei Miao Yu et al. (2004)[4] did experiments for regression and classification using the RBF kernel. With 400 training samples and 1000 testing samples, the performance of SVM and RVM on the Banana data is in the table below:

Tab.3 Performance of SVM and RVM for the Banana Data[4]

Method                    SVM             RVM
Accuracy                  89.08%          93.25%
Number of Vectors         140             11
Optimal Hyperparameters   C=1.0, G=1.0    D=1.0
With a total of 877 training samples and 7888 testing samples in practice, they found that the presence of unstable data points easily causes the Hessian to become nearly semi-definite. The performance of the RVM for the optimal-control regression is in the table below:

Tab.4 The Performance of RVM for the Optimal Control Regression[4]

No. of   Number of RVs    RMS of Fitting    Hyperparameter
VC       RVM    SVM       RVM     SVM       RVM    SVM (C, G)
65       15     17        2.256   2.312     2.2    45, 0.2
196      30     64        0.752   0.976     1.4    40, 0.1
327      75     143       0.940   0.943     0.7    30, 0.1
532      146    253       1.300   1.243     0.5    60, 0.1
With a total of 286 training samples and 1137 testing samples in practice, the performance of the RVM on the optimal-control classification data is in the table below:

Tab.5 The Performance of RVM for the Optimal Control Classification[4]

No. of   Number of RVs    RMS of Fitting    Hyperparameter
VC       RVM    SVM       RVM     SVM       RVM    SVM (C, G)
71       4      15        85.66   85.22     1.8    50, 0.1
143      8      32        92.79   93.58     1.8    30, 0.1
214      8      34        90.24   91.38     1.8    30, 0.1
286      10     47        95.87   95.25     2.2    80, 0.1
Generally, the RVM is efficient at achieving a sparse solution for regression and classification. The performance of the RVM is comparable to the SVM and is determined by the tuning of the hyperparameters and the nature of the problem itself. Typically, RVM training will be slower than the SVM algorithm; even the fast training of the RVM, which is much faster than the original method, is still slower than the SVM.

1.2 Fast Marginal Likelihood Maximization for Sparse Bayesian Models
Tipping and Anita C. Faul (2003)[5] described a new and highly accelerated algorithm which exploits recently elucidated properties of the marginal likelihood function to enable maximization via a principled and efficient sequential addition and deletion of candidate basis functions:

1. If regression, initialize σ² to some sensible value.
2. Initialize with a single basis vector φ_i, setting α_i = ‖φ_i‖² / (‖φ_i^T t‖²/‖φ_i‖² − σ²). All other α_m are notionally set to infinity.
3. Explicitly compute Σ and μ (which are scalars initially), along with the initial values of s_m and q_m for all M bases φ_m.
4. Select a candidate basis vector φ_i from the set of all M.
5. Compute θ_i = q_i² − s_i.
6. If θ_i > 0 and α_i < ∞, re-estimate α_i.
7. If θ_i > 0 and α_i = ∞, add φ_i to the model with updated α_i.
8. If θ_i ≤ 0 and α_i < ∞, then delete φ_i from the model and set α_i = ∞.
9. If regression and estimating the noise level, update σ² = ‖t − y‖² / (N − M + Σ_m α_m Σ_mm).
10. Recompute/update Σ and μ and all s_m and q_m.
11. If converged, terminate; otherwise go to step 4.

The new marginal likelihood maximization algorithm appears to operate highly effectively. It offers a very clear speed advantage over the originally proposed approach. Given that the sparse Bayesian framework arguably offers a number of advantageous features when compared with the popular SVM, it is salient that its main disadvantage, that of grossly extended training time for large data sets, can be largely overcome.
1.3 RVM Expansion Methods for Large Sets
Catarina Silva (2008)[6] developed and analyzed methods for expanding automated learning of relevance vector machines to large-scale text sets. RVM learning has a computational cost of O(n³), where n is the number of training instances, which results in huge computational cost for large data sets; furthermore, computing a test case requires O(n) cost for calculating the mean and O(n²) cost for computing the variance. These heavy scaling properties are the main hurdles in the use of the RVM in large-scale problems. The rationale is to preserve the RVM's probabilistic Bayesian nature, together with the sparse solutions achieved, while improving classification performance by using all available training documents. The figure below depicts a schematic representation of the three methods.
[Figure: three block diagrams in which all documents flow into (a) chunked models whose relevance vectors train a final model, (b) boosted models combined by weights, and (c) n sampled models combined by voting.]
Fig.1 (a) Incremental RVM (b) RVM Boosting (c) RVM Ensemble
In Incremental RVM, the dataset is divided into evenly sized smaller subsets. These chunks are then trained independently, and if possible in parallel, resulting in a set of RVM models. The relevance vectors yielded by each model are gathered and constitute the training set of a new RVM model, which will be the final incremental RVM. The size of each chunk and the number of chunks should be determined according to the available computational power. For instance, if there
is a distributed platform with available resources, the procedure can be sped up.

The boosting algorithm assigns different importance weights to different training examples. The algorithm proceeds by incrementally assigning increasing significance to examples which are hard to classify, while easier training examples get lower weights. This weighting strategy is the basis of the weak learners' evaluation. The final combined hypothesis classifies a new test example by computing the prediction of each of the weak hypotheses and taking a vote on these predictions.

The RVM Ensemble starts by constructing several smaller evenly sized training sets, randomly sampled from the available training set. The size and number of the training sets depend on the available computational power, but using more training examples usually results in more diversity and better performance. Then a model is learned from each training set, and these models constitute the ensemble's individual classifiers. After this learning phase, a majority voting scheme determines the ensemble's output decision, taking as output value the average value of the classifiers that corroborated the majority decision.
1.4 SVR+RVR

As SVR (support vector regression) is a robust kernel regression method, there are many redundant components in the SV set of the SVR. The learned RVR (relevance vector regression) model is typically highly sparse, but it is not robust to outliers. Gai Ying Zhang et al. (2010)[7] concatenated SVR and RVR as below:

(1) Given the training set

D = {(x_i, y_i)}, i = 1, …, l    (1-3)

train an SVR machine on it. The regression function of the SVR can be written as

f_SVR(x) = Σ_{x_i ∈ SV} α_i k(x, x_i) + b    (1-4)

where SV is the support vector set of the SVR machine, b is the bias, and k(·,·) is the kernel function used by the SVR.

(2) Let

E = {(x_i, f_SVR(x_i))}, x_i ∈ SV    (1-5)

and train an RVR machine on the new training set E. At last, we obtain the regression function of the RVR machine,

f_RVR(x) = Σ_{x_i ∈ RV} w_i k′(x, x_i) + b′    (1-6)

where RV is the relevance vector set of the RVR machine, b′ is the bias, and k′(·,·) is the kernel function used by the RVR.
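As a rough sketch of this cascade: scikit-learn provides SVR but no RVR estimator, so below ARDRegression fitted over an RBF kernel expansion of the support-vector set stands in for the RVR stage (an RVM is ARD regression over kernel features). The data and the γ, C, ε values are illustrative assumptions, not the settings of Zhang et al.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.linear_model import ARDRegression
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (200, 1))
y = np.sinc(10 * (X[:, 0] - 0.5)) + 0.05 * rng.standard_normal(200)

# Stage 1 (Eq. 1-4): train the SVR and keep its support vectors.
svr = SVR(kernel="rbf", gamma=50.0, C=10.0, epsilon=0.05).fit(X, y)
X_sv = svr.support_vectors_

# Stage 2 (Eqs. 1-5, 1-6): fit the sparse Bayesian stand-in on
# E = {(x_i, f_SVR(x_i)) : x_i in SV}.
t_sv = svr.predict(X_sv)
rvr = ARDRegression().fit(rbf_kernel(X_sv, X_sv, gamma=50.0), t_sv)

def f_rvr(X_new):
    return rvr.predict(rbf_kernel(X_new, X_sv, gamma=50.0))

print(X_sv.shape[0], "SVs ->", int(np.sum(rvr.lambda_ < 1e4)), "retained kernels")
```

The second stage typically keeps far fewer kernels than the SVR's support-vector set, which is the point of the cascade.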
1.5 Application of RVM

Xiaodong Wang et al. (2009)[8] investigated an approach to classify data from an electronic nose, based on the RVM. The electronic-nose data are first converted into principal components using the principal component analysis (PCA) method and then sent directly as inputs to an RVM classifier.

Pijush Samui (2007)[9] introduced the RVM to predict seismic attenuation based on rock properties. The results show that the RVM approach has the potential to be a practical tool for the determination of seismic attenuation.

Catarina Silva (2008)[6] demonstrated that, by exploring incremental, ensemble and boosting strategies, it is possible to exploit RVM advantages, such as predictive distributions for test instances and sparse solutions, while maintaining and even improving the competitive edge in classification performance.

A. Wahab et al. (2010)[10] used the relevance vector machine (RVM) as a classifier, based on their preliminary work on time-domain feature extraction, for emotion recognition using electroencephalographic (EEG) signals.

From these studies, we can find that the RVM is a useful method in regression and classification. The authors mainly focus on the sparsity of the relevance vector set and the long training time. We can conclude that the RVM is sparse, and that the challenge is how to accelerate the training speed, while there is little research concerning the tolerance and the kernel function parameters themselves.

2 The Kernel of the RVM
To train an SVM or RVM model, three types of kernel function are generally used:

(1) Polynomial
(2) Gaussian
(3) Spline

Pijush Samui (2007)[9] studied the influence of different kernel functions on RVM regression. In that study, the coefficient of correlation R is the main criterion used to evaluate the performance of the developed RVM models. The result is in the table below:

Tab.6 General performance of RVM for different kernels

Kernel                 Training Performance (R)  Testing Performance (R)  Number of Relevance Vectors
Gaussian (width 0.8)   0.977                     0.954                    6
Polynomial (degree 4)  0.996                     0.977                    2
Spline                 0.986                     0.948                    2
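Written out for scalar inputs, the three kernel families (with Tab.6's width 0.8 and degree 4 as defaults) look like this; the spline expression follows Vapnik's univariate spline kernel, an assumption since the text does not state the formula used.

```python
import numpy as np

def polynomial_kernel(x, y, degree=4):
    # Inhomogeneous polynomial kernel of the given degree.
    return (x * y + 1.0) ** degree

def gaussian_kernel(x, y, sigma=0.8):
    # Gaussian (RBF) kernel with width sigma.
    return np.exp(-(x - y) ** 2 / sigma ** 2)

def spline_kernel(x, y):
    # Vapnik's univariate spline kernel (assumed variant).
    m = np.minimum(x, y)
    return 1 + x * y + x * y * m - (x + y) / 2 * m ** 2 + m ** 3 / 3

# Gram matrix example on five evenly spaced points.
x = np.linspace(0, 1, 5)
K = gaussian_kernel(x[:, None], x[None, :])
print(K.shape)
```

All three accept numpy arrays, so the same functions build full Gram matrices by broadcasting, as in the last two lines.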
Gustavo Camps-Valls et al. (2006)[11] presented a framework of composite kernel machines for enhanced classification of hyperspectral images. Brian Mak et al. (2004)[12] proposed various composite kernels to make kernel eigenvoice speaker adaptation more effective. Gustavo Camps-Valls et al. (2007)[13] extended this to the sparse Bayesian framework, in which some attractive additional properties are available, such as the improved sparsity of the solution and the confidence intervals provided for the predictions.

3 The Relationship between σ in the Gaussian Kernel Function and the Regression Result
We select 200 points in [0, 1] as input samples, of which the odd-numbered ones are used as the training set and the even-numbered ones as the testing set. The training outputs are

y = sinc[10(x − 0.5)] + k·ξ·sinc[10(x − 0.5)]    (3-1)

where

k ∈ [0, 1]    (3-2)

ξ ~ U[−1, 1]    (3-3)

Let k = 0, and take the kernel function

K(x_i, x_j) = e^{−(x_i − x_j)²/σ²}    (3-4)
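The sweep over σ can be sketched as follows. The classic fixed-point update α_i ← γ_i/μ_i² (Tipping, 2001) stands in for the exact trainer used in the paper, and only three σ values are shown; the data follow (3-1) with k = 0 and the kernel (3-4).

```python
import numpy as np

def rvm_fit(Phi, t, sigma2=1e-3, iters=200, prune=1e6):
    # Classic RVM fixed-point iteration with pruning of weights whose
    # precision alpha diverges (i.e. weights pinned to zero).
    alpha = np.ones(Phi.shape[1])
    keep = np.arange(Phi.shape[1])
    for _ in range(iters):
        P = Phi[:, keep]
        Sigma = np.linalg.inv(np.diag(alpha[keep]) + P.T @ P / sigma2)
        mu = Sigma @ P.T @ t / sigma2
        gamma = np.clip(1 - alpha[keep] * np.diag(Sigma), 1e-12, None)
        alpha[keep] = gamma / (mu ** 2 + 1e-12)
        keep = keep[alpha[keep] < prune]
    P = Phi[:, keep]
    Sigma = np.linalg.inv(np.diag(alpha[keep]) + P.T @ P / sigma2)
    mu = Sigma @ P.T @ t / sigma2
    return keep, mu

x = np.linspace(0, 1, 100)            # 100 evenly spaced training inputs
t = np.sinc(10 * (x - 0.5))           # k = 0: noise-free targets (3-1)
results = {}
for s in (0.05, 0.1, 0.2):            # three values of sigma in (3-4)
    Phi = np.exp(-(x[:, None] - x[None, :]) ** 2 / s ** 2)
    rv, mu = rvm_fit(Phi, t)
    err = np.abs(Phi[:, rv] @ mu - t).mean()
    results[s] = (len(rv), err)
    print(f"sigma={s}: {len(rv)} RVs, mean tolerance {err:.4f}")
```

Plotting the relevance-vector count and the mean tolerance over a fine grid of σ reproduces the shape of the curves discussed below.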
The result of the experiment is shown below.

[Figure: two stacked plots of the Number of Relevance Vectors and the Average Tolerance versus σ over [0, 1].]
Fig.2 Relationship between σ and the Number of RVs and the Average Tolerance

We can easily see that the learning machine becomes sparser when the kernel is wider (σ gets larger). The tolerance grows at the same time. If we limit the tolerance to less than 4% and choose the parameter σ where the fewest relevance vectors are needed, the result of the experiment is:
[Figure: two panels; left, the training samples; right, the predicted and true curves with the relevance vectors circled.]
Fig.3 RVM Regression without Noise (σ = 0.216)

The dots in the left figure are the training samples, while the curves in the right figure are the predicted and real outputs, and the circles are the relevance vectors.

When the training set is noisy, for example k = 5%, the result of the experiment is:
[Figure: two stacked plots of the Number of Relevance Vectors and the Average Tolerance versus σ over [0, 0.45].]
Fig.4 Relationship between σ and the Number of RVs and the Average Tolerance in the Noisy Model

We can see that the result is similar to the prior model. Similarly, we limit the tolerance to less than 4% and choose the parameter σ where the fewest relevance vectors are needed; the result of the experiment is:
[Figure: two panels analogous to Fig.3, for the noisy training set.]
Fig.5 RVM Regression with 5% Noise (σ = 0.115)

4 Analysis of the Result

When we used the polynomial kernel in the prior experiment, we could not keep the tolerance under 4%, but the number of relevance vectors was smaller. If there are k relevance vectors in the set, let
are k relevance vectors in the set , let
Xx xx x 4 ?1R n n n n
12 ki
The prediction model is as:
k
230 yxw Kx, x w 42nn 0
ii
i ?1
In this space, we hope the vectors could be linear. Taylor expansion
of 4-2
2
kk
dKx, x d Kx, x 1
nn
2
ii
yxyx w xx w xx 0??nn 0 0
2
ii
dx 2! dx
ii 11
xx0 xx0 44
let
x0 4 5
0
235 4-4 can be written as
2
kk
dKx, x d Kx, x 1
nn
2
iiyxy0w xw x 46
nn
2
ii
dx 2! dx ii 11
xxxx0
0
let
i
Pxa x47? i
i ?0
and
240 yxPx 4 8
then
- 8 - 中国 科 技 论 文 在 线////.
j
k
d Kx, x 1
n
i
aw4 9
inj
i
i! dx
j ?1
xx0The differential of2
xa?
2kxe4 10
245 is
2
xa?
2xa
2?
k x? e 4 ?11 2
That is
2xak x? kx 4 ?12
2
The equation 4-10 with both sides seeking derivative of order n,
2xa 2
n1 n n 1
k x? k xn k x4 ?13 22
250
That is
2
n2 n 1 n
k x? [xak xn ?1k x]4 ?14
2
then
1
a4 15
i
k
255 where
ki 4 16As
n
a
lim0 ?a 4 ?17
n
n!a is convergent. That is to say, the less σ is, the more information
of high- dimensional presentsi
260 So we need more relevance vectors to express it and it will be more accurate, of course
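A quick numerical check of the recursion (4-13): generating k^(n)(0) from it and comparing the low orders against finite differences confirms the formula, and evaluating a higher-order derivative for two widths shows how shrinking σ inflates the high-order (high-dimensional) terms. The constants a and σ below are arbitrary illustrative choices.

```python
import numpy as np

def gauss_derivs_at_zero(a, sigma, n_max):
    # k^(n+1)(x) = -(2/sigma^2) [(x - a) k^(n)(x) + n k^(n-1)(x)], at x = 0.
    k = [np.exp(-a * a / sigma ** 2)]            # k^(0)(0)
    k.append(-(2 / sigma ** 2) * (-a) * k[0])    # k^(1)(0) from (4-12)
    for n in range(1, n_max):
        k.append(-(2 / sigma ** 2) * (-a * k[n] + n * k[n - 1]))
    return k

a, sigma, h = 0.3, 0.5, 1e-4
f = lambda x: np.exp(-(x - a) ** 2 / sigma ** 2)
d1 = (f(h) - f(-h)) / (2 * h)                    # central-difference k'(0)
d2 = (f(h) - 2 * f(0) + f(-h)) / h ** 2          # central-difference k''(0)
ks = gauss_derivs_at_zero(a, sigma, 4)
print(ks[1], d1)   # recursion vs finite difference, first order
print(ks[2], d2)   # recursion vs finite difference, second order

# Derivatives blow up as sigma shrinks: more high-order information,
# hence more relevance vectors are needed, as argued above.
for s in (0.5, 0.25):
    print(s, abs(gauss_derivs_at_zero(a, s, 6)[5]))
```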
5 Composite Kernel in Regression

Since the polynomial kernel can be used to control the number of relevance vectors and the Gaussian kernel performs well in accuracy, we propose a composite kernel of the two parts for the RVM. Then we can emphasize the low-dimensional components and keep the high-dimensional components.

For example, when the kernel function is

K(x_i, x_j) = 50 (x_i x_j)³ + e^{−(x_i − x_j)²/σ²}    (5-1)

we repeat the experiment.
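A composite kernel such as (5-1) is straightforward to construct, and since it is a sum of two positive semi-definite kernels (the cubic term is rank-one: (x_i x_j)³ = x_i³ · x_j³), its Gram matrix remains a valid kernel matrix for the RVM. The σ value below is an illustrative choice.

```python
import numpy as np

def composite_kernel(xi, xj, sigma=0.2, c=50.0):
    # Degree-3 polynomial term (emphasizes low-dimensional components)
    # plus a Gaussian term (keeps the high-dimensional components), as in (5-1).
    return c * (xi * xj) ** 3 + np.exp(-(xi - xj) ** 2 / sigma ** 2)

x = np.linspace(0, 1, 100)
K = composite_kernel(x[:, None], x[None, :])
# A valid kernel: symmetric with a positive semi-definite Gram matrix,
# so it can be dropped into the RVM training routine unchanged.
print(np.allclose(K, K.T), np.linalg.eigvalsh(K).min() > -1e-6)
```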