A Study of Kernel Select for the Relevance Vector Machine

PENG Yang, LI Dehua
School of Automation, Huazhong University of Sci. & Tech., Wuhan

Brief author introduction: PENG Yang (1989-), male, Master, main research: artificial intelligence. Corresponding author: LI Dehua (1946-), male, Professor, main research: artificial intelligence. E-mail: 727643697@

Abstract: In this paper a method of kernel selection for the relevance vector machine (RVM) is proposed. Former research on RVM usually focuses on the advantages of the algorithm in sparsity, which gives fewer opportunities for over-learning, on how to accelerate the algorithm for needs such as large-scale sets, and on how to improve the precision of the algorithm by combining it with other algorithms such as PCA or SVM. The parameters of the kernel functions are regarded as more important than the functions themselves, so how to select the appropriate kernel and its parameters remains confusing in applications of RVM. In this paper, a method of kernel selection for the RVM is proposed. Firstly, by simulation experiment, the relationship between σ in the Gaussian kernel function and the regression result is discovered. Secondly, by analyzing the reason for the experimental result, a common principle for kernel selection is presented. Finally, based on the principle, a composite kernel and the relation between different parameters, the number of RVs and the error rate in regression are tested by simulation experiment.
Key words: Pattern Recognition; Relevance Vector Machine; Kernel Function

0 Introduction
The support vector machine (SVM) is a very successful approach to supervised learning. The key feature of SVM in the classification case is that its target function attempts to minimize a measure of error on the training set while simultaneously maximizing the 'margin' between the separated classes [1].
However, the support vector methodology does exhibit significant disadvantages:
(1) Predictions are not probabilistic. In regression the SVM outputs a point estimate, and in classification a 'hard' binary decision. Ideally, we desire to estimate the conditional distribution p(t|x) in order to capture uncertainty in our prediction. In regression this may take the form of 'error bars', but it is particularly crucial in classification, where posterior probabilities of class membership are necessary to adapt to varying class priors and asymmetric misclassification costs.
(2) Although relatively sparse, SVMs make liberal use of kernel functions, the requisite number of which grows steeply with the size of the training set.
(3) It is necessary to estimate the error/margin trade-off parameter 'C' (and, in regression, the insensitivity parameter 'ε' too). This generally entails a cross-validation procedure, which is wasteful both of data and of computation.
(4) The kernel function k(·,·) must satisfy Mercer's condition [2][3].
In 2000, Michael E. Tipping introduced the 'relevance vector machine' (RVM), a probabilistic sparse kernel model identical in functional form to the SVM. In the RVM a Bayesian approach to learning is adopted, where a prior over the weights is governed by a set of hyperparameters, one associated with each weight, whose most probable values are estimated from the data; in practice we find that the posterior distributions of many of the weights are sharply peaked around zero, which yields the sparsity.

1 Background
1.1 Comparison of SVM and RVM
Tipping (2001) [1] compared SVM and RVM for regression and classification; the results are in the tables below:
Tab.1 Comparison of RVM & SVM in regression [1]

Data set                N     d    Errors (SVM)  Errors (RVM)  Vectors (SVM)  Vectors (RVM)
Sinc (Gaussian noise)   100   1    0.378         0.326         45.2           6.7
Sinc (Uniform noise)    100   1    0.215         0.187         44.3           7.0
Friedman #2             240   4    4140          3505          110.3          6.9
Friedman #3             240   4    0.0202        0.0164        106.5          11.5
Boston Housing          481   13   8.04          7.46          142.8          39.0
Normalized mean                    1.00          0.86          1.00           0.15

Tab.2 Comparison of RVM & SVM in classification [1]

Data set        N     d    Errors (SVM)  Errors (RVM)  Vectors (SVM)  Vectors (RVM)
Pima Diabetes   200   8    20.1%         19.6%         109            4
U.S.P.S.        7291  256  4.4%          5.1%          2540           316
Banana          400   2    10.9%         10.8%         135.2          11.4
Breast Cancer   200   9    26.9%         29.9%         116.7          6.3
Titanic         150   3    22.1%         23.0%         93.7           65.3
Waveform        400   21   10.3%         10.9%         116.7          6.3
German          700   20   22.6%         22.2%         411.2          12.5
Image           1300  18   3.0%          3.9%          166.6          34.6
Normalized mean            1.00          1.08          1.00           0.17

Wei Miao Yu et al. (2004) [4] ran regression and classification experiments using the RBF kernel. With 400 training samples and 1000 testing samples, the performance of SVM and RVM on the Banana data is given in the table below:

Tab.3 Performance of SVM and RVM on the Banana data [4]

Method                    SVM             RVM
Accuracy                  89.08%          93.25%
Number of vectors         140             11
Optimal hyperparameters   C=1.0, G=1.0    D=1.0

With a total of 877 training samples and 7888 testing samples from practice, they found that the presence of unstable data points easily makes the Hessian nearly semi-definite. The performance of RVM for the optimal control regression is given in the table below:

Tab.4 Performance of RVM for the optimal control regression [4]

No. of VC  RVs (RVM)  SVs (SVM)  RMS of fitting (RVM)  RMS of fitting (SVM)  Hyperparameter (RVM)  SVM (C, G)
65         15         17         2.256                 2.312                 2.2                   45, 0.2
196        30         64         0.752                 0.976                 1.4                   40, 0.1
327        75         143        0.940                 0.943                 0.7                   30, 0.1
532        146        253        1.300                 1.243                 0.5                   60, 0.1

With a total of 286 training samples and 1137 testing samples from practice, the performance of RVM for the optimal control classification data is given in the table below:

Tab.5 Performance of RVM for the optimal control classification [4]

No. of VC  RVs (RVM)  SVs (SVM)  Fitting (RVM)  Fitting (SVM)  Hyperparameter (RVM)  SVM (C, G)
71         4          15         85.66          85.22          1.8                   50, 0.1
143        8          32         92.79          93.58          1.8                   30, 0.1
214        8          34         90.24          91.38          1.8                   30, 0.1
286        10         47         95.87          95.25          2.2                   80, 0.1

Generally, RVM is efficient at achieving a sparse solution for regression and classification, and its performance is comparable to that of SVM. The performance is determined by the tuning of the hyperparameters and by the nature of the problem itself. Typically, the RVM is slower to train than the SVM; even though the fast training of the RVM is much faster than the original method, it is still slower than SVM.

1.2 Fast Marginal Likelihood Maximization for Sparse Bayesian Models
Tipping and Anita C. Faul (2003) [5] described a new and highly accelerated algorithm which exploits recently elucidated properties of the marginal likelihood function to enable maximization via a principled and efficient sequential addition and deletion of candidate basis functions:
1. If regression, initialize σ² to some sensible value.
2. Initialize with a single basis vector φ_i, setting α_i = ‖φ_i‖² / (‖φ_i^T t‖² / ‖φ_i‖² − σ²). All other α_m are notionally set to infinity.
3. Explicitly compute Σ and μ (which are scalars initially), along with the initial values of s_m and q_m for all M bases φ_m.
4. Select a candidate basis vector φ_i from the set of all M.
5. Compute θ_i = q_i² − s_i.
6. If θ_i > 0 and α_i < ∞ (φ_i is already in the model), re-estimate α_i.
7. If θ_i > 0 and α_i = ∞, add φ_i to the model with the updated α_i.
8. If θ_i ≤ 0 and α_i < ∞, delete φ_i from the model and set α_i = ∞.
9. If regression and estimating the noise level, update σ² = ‖t − y‖² / (N − M + Σ_m α_m Σ_mm).
10. Recompute/update Σ and μ and all s_m and q_m.
11. If converged, terminate; otherwise go to step 4.
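The steps above can be made concrete with a small numerical sketch. The following Python/NumPy code is a minimal, illustrative implementation of the sequential procedure for regression. It recomputes the posterior statistics Σ, μ and the sparsity/quality factors s_m, q_m from scratch at every iteration instead of using the efficient incremental updates of [5], and the candidate-selection rule (simple cycling over the basis set), the fixed iteration budget and the numerical guards are assumptions for illustration, not part of the original algorithm description.

```python
import numpy as np

def sequential_sbl_regression(Phi, t, n_iter=1000, sigma2=None):
    """Sketch of sequential sparse Bayesian learning for regression.
    Phi is the N x M design matrix (e.g. a kernel matrix), t the N targets.
    Returns active basis indices, their posterior mean weights, alpha, sigma2."""
    N, M = Phi.shape
    if sigma2 is None:
        sigma2 = 0.1 * np.var(t)                 # step 1: sensible initial noise level
    beta = 1.0 / sigma2

    # step 2: start from the single basis vector best aligned with the targets
    phi_sq = np.sum(Phi ** 2, axis=0)
    proj = (Phi.T @ t) ** 2 / phi_sq
    i0 = int(np.argmax(proj))
    alpha = np.full(M, np.inf)
    alpha[i0] = phi_sq[i0] / max(proj[i0] - sigma2, 1e-12)

    for it in range(n_iter):
        active = np.flatnonzero(np.isfinite(alpha))
        Phi_a = Phi[:, active]
        alpha_a = alpha[active]
        # steps 3/10: posterior covariance and mean over the active weights
        Sigma = np.linalg.inv(np.diag(alpha_a) + beta * Phi_a.T @ Phi_a)
        mu = beta * Sigma @ Phi_a.T @ t

        # sparsity (s) and quality (q) factors for every basis function
        P = Phi.T @ Phi_a
        S = beta * phi_sq - beta ** 2 * np.sum((P @ Sigma) * P, axis=1)
        Q = beta * (Phi.T @ t) - beta ** 2 * P @ (Sigma @ (Phi_a.T @ t))
        s, q = S.copy(), Q.copy()
        denom = alpha_a - S[active]
        s[active] = alpha_a * S[active] / denom
        q[active] = alpha_a * Q[active] / denom

        i = it % M                               # step 4: candidate basis (simple cycling)
        theta = q[i] ** 2 - s[i]                 # step 5
        if theta > 0:
            alpha[i] = s[i] ** 2 / theta         # steps 6/7: re-estimate or add
        elif np.isfinite(alpha[i]) and active.size > 1:
            alpha[i] = np.inf                    # step 8: delete from the model

        # step 9: re-estimate the noise level from the current fit
        resid = t - Phi_a @ mu
        sigma2 = max(float(resid @ resid)
                     / (N - active.size + float(alpha_a @ np.diag(Sigma))), 1e-8)
        beta = 1.0 / sigma2

    # final posterior for the surviving (relevance) basis functions
    active = np.flatnonzero(np.isfinite(alpha))
    Phi_a = Phi[:, active]
    Sigma = np.linalg.inv(np.diag(alpha[active]) + beta * Phi_a.T @ Phi_a)
    mu = beta * Sigma @ Phi_a.T @ t
    return active, mu, alpha, sigma2
```

When Phi is a kernel design matrix with columns K(·, x_m), the indices returned in `active` play the role of the relevance vectors; this sketch is reused in the experiments below.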
The new marginal likelihood maximization algorithm appears to operate highly effectively and offers a very clear speed advantage over the originally proposed approach. Given that the sparse Bayesian framework arguably offers a number of advantageous features compared with the popular SVM, it is salient that its main disadvantage, the grossly extended training time for large data sets, can be largely overcome.

1.3 RVM Expansion Methods for Large Sets
Catarina Silva (2008) [6] developed and analyzed methods for expanding automated learning with Relevance Vector Machines to large-scale text sets. The computational cost of RVM learning is O(n³), where n is the number of training instances, which is enormous for large data sets; furthermore, computing a test case requires O(n) cost for the mean and O(n²) cost for the variance. These heavy scaling properties are the main hurdle to the use of RVM in large-scale problems. The rationale is to preserve the RVM's probabilistic Bayesian nature, together with the sparse solutions achieved, while improving classification performance by using all available training documents. The three methods are depicted schematically in the figure below.

Fig.1 (a) Incremental RVM (b) RVM Boosting (c) RVM Ensemble

In the Incremental RVM, the dataset is divided into evenly sized smaller subsets. These chunks are then trained independently, and if possible in parallel, resulting in a set of RVM models. The relevance vectors yielded by each model are gathered and constitute the training set of a new RVM model, which becomes the final incremental RVM. The size of each chunk and the number of chunks should be determined according to the available computational power; for instance, if a distributed platform with available resources exists, the procedure can be sped up.
The boosting algorithm assigns different importance weights to different training examples. The algorithm proceeds by incrementally assigning increasing significance to examples which are hard to classify, while easier training examples get lower weights. This weighting strategy is the basis of the weak learners' evaluation. The final combined hypothesis classifies a new test example by computing the prediction of each of the weak hypotheses and taking a vote on these predictions.
The RVM Ensemble starts by constructing several smaller, evenly sized training sets, randomly sampled from the available training set. The size and number of the training sets depend on the available computational power, but using more training examples usually results in more diversity and better performance. From each training set a model is learned, and these models constitute the ensemble's individual classifiers. After this learning phase, a majority voting scheme determines the ensemble's output decision, taking as output value the average value of the classifiers that corroborated the majority decision.
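As an illustration of the incremental strategy above, the sketch below splits a training set into chunks, fits one model per chunk using the `sequential_sbl_regression` sketch from section 1.2 with a Gaussian kernel design matrix, pools the resulting relevance vectors, and retrains a final model on the pooled set. The helper names and the chunking and width parameters are illustrative assumptions, not taken from [6].

```python
import numpy as np

def gaussian_design(X, centers, sigma):
    """Gaussian kernel design matrix K[n, m] = exp(-||x_n - c_m||^2 / sigma^2)."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / sigma ** 2)

def incremental_rvm(X, t, n_chunks=4, sigma=0.2, seed=0):
    """Incremental RVM sketch: per-chunk models -> pool relevance vectors ->
    final model trained on the pooled relevance vectors."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    pooled = []
    for idx in np.array_split(order, n_chunks):     # chunks could be trained in parallel
        Phi = gaussian_design(X[idx], X[idx], sigma)
        active, _, _, _ = sequential_sbl_regression(Phi, t[idx])
        pooled.append(idx[active])                  # this chunk's relevance vectors
    pooled = np.concatenate(pooled)
    Xp, tp = X[pooled], t[pooled]                   # training set of the final model
    Phi = gaussian_design(Xp, Xp, sigma)
    active, mu, alpha, sigma2 = sequential_sbl_regression(Phi, tp)
    return Xp[active], mu, sigma2                   # final relevance vectors and weights
```

The boosting and ensemble variants differ only in how the per-chunk models are combined (weighted re-sampling and voting, respectively) rather than in the per-chunk training step shown here.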
1.4 SVR+RVR
Since SVR (support vector regression) is a robust kernel regression method, there are many redundant components in the SV set of an SVR, whereas the learned RVR (relevance vector regression) model is typically highly sparse but not robust to outliers. Gai Ying Zhang et al. (2010) [7] therefore concatenated SVR and RVR as follows:
(1) Given the training set
D = {(x_i, y_i)}, i = 1, ..., l   (1-3)
train an SVR machine on it. The regression function of the SVR can be written as
f_SVR(x) = Σ_{x_i ∈ SV} β_i k(x, x_i) + b   (1-4)
where SV is the support vector set of the SVR machine, β_i are the SVR coefficients, b is the bias, and k(·,·) is the kernel function used by the SVR.
(2) Let
E = {(x_i, f_SVR(x_i))}, x_i ∈ SV   (1-5)
and train an RVR machine on the new training set E. At last, we obtain the regression function of the RVR machine
f_RVR(x) = Σ_{x_i ∈ RV} w_i k'(x, x_i) + b'   (1-6)
where RV is the relevance vector set of the RVR machine, b' is the bias, and k'(·,·) is the kernel function used by the RVR.

1.5 Applications of RVM
Xiaodong Wang et al. (2009) [8] investigated an approach, based on the RVM, to classify the data from an electronic nose. The electronic nose data are first converted into principal components using principal component analysis (PCA) and then sent directly as inputs to an RVM classifier.
Pijush Samui (2007) [9] introduced RVM to predict seismic attenuation based on rock properties. The results show that the RVM approach has the potential to be a practical tool for the determination of seismic attenuation.
Catarina Silva (2008) [6] demonstrated that, by exploring incremental, ensemble and boosting strategies, it is possible to make use of RVM advantages, such as predictive distributions for test instances and sparse solutions, while maintaining and even improving the competitive edge in classification performance.
A. Wahab et al. (2010) [10] used the relevance vector machine as a classifier, based on their preliminary work on time-domain feature extraction, for emotion recognition using electroencephalographic (EEG) signals.
From these studies, we can see that RVM is a useful method in regression and classification. The authors mainly focus on the sparseness of the relevance vector set and the long training time. We can conclude that RVM is sparse and that the main challenge is how to accelerate training, while there has been little research concerning the tolerance and the kernel function parameters themselves.

2 The Kernel of the RVM
To train an SVM or RVM model, three types of kernel function are generally used: (1) polynomial, (2) Gaussian, (3) spline.
Pijush Samui (2007) [9] studied the influence of different kernel functions on RVM regression. In that study, the coefficient of correlation R is the main criterion used to evaluate the performance of the developed RVM models. The result is in the table below:

Tab.6 General performance of RVM for different kernels [9]

Kernel                  Training performance (R)  Testing performance (R)  Number of relevance vectors
Gaussian (width 0.8)    0.977                     0.954                    6
Polynomial (degree 4)   0.996                     0.977                    2
Spline                  0.986                     0.948                    2

Gustavo Camps-Valls et al. (2006) [11] presented a framework of composite kernel machines for enhanced classification of hyperspectral images. Brian Mak et al. (2004) [12] proposed that various composite kernels for kernel eigenvoice speaker adaptation could be more effective. Gustavo Camps-Valls et al. (2007) [13] extended this to the sparse Bayesian framework, in which some attractive additional properties are available, such as the improved sparsity of the solution and the confidence intervals provided for the predictions.
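For concreteness, these kernel types can be written as simple NumPy functions. The Gaussian form was already given as `gaussian_design()` in section 1.3; the polynomial form below follows the expressions used in sections 3 and 5, while the spline kernel is shown in one common univariate form as an assumption, since its exact expression is not spelled out here.

```python
import numpy as np

def polynomial_kernel(X, Z, degree=3, scale=1.0):
    """K[i, j] = scale * (x_i . z_j)^degree."""
    return scale * (X @ Z.T) ** degree

def spline_kernel(x, z):
    """One common univariate spline kernel (an assumption, not a formula from this paper):
    k(x, z) = 1 + xz + xz*min(x,z) - (x+z)/2*min(x,z)^2 + min(x,z)^3/3."""
    xm = np.minimum(x[:, None], z[None, :])
    xz = x[:, None] * z[None, :]
    return 1 + xz + xz * xm - (x[:, None] + z[None, :]) / 2 * xm ** 2 + xm ** 3 / 3
```

Evaluated between the training inputs and themselves, any of these kernels yields the design matrix Φ fed to the learning sketch of section 1.2.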
3 The Relationship between σ in the Gaussian Kernel Function and the Regression Result
We select 200 points in [0, 1] as input samples, of which the odd-numbered ones are used as the training set and the even-numbered ones as the testing set. The training outputs are
y = sinc[10(x − 0.5)] + k·ξ·sinc[10(x − 0.5)]   (3-1)
where
k ∈ [0, 1]   (3-2)
ξ ~ U[−1, 1]   (3-3)
Let k = 0, and take the kernel function
K(x_i, x_j) = exp(−‖x_i − x_j‖² / σ²)   (3-4)
The result of the experiment is shown below.

Fig.2 Relationship between σ and the number of RVs and the average tolerance

We can easily see that the learning machine becomes sparser as the kernel gets wider (σ gets larger); the tolerance gets larger at the same time. If we limit the tolerance to less than 4% and choose the parameter σ for which the fewest relevance vectors are needed, the result of the experiment is:

Fig.3 RVM regression without noise

where σ = 0.216. The dots in the left figure are the training samples, while the curves in the right figure are the predicted and real outputs, and the circles are the relevance vectors.
When the training set is noisy, for example with k = 5%, the result of the experiment is:

Fig.4 Relationship between σ and the number of RVs and the average tolerance in the noisy model

We can see that the result is similar to the noise-free case. Similarly, we limit the tolerance to less than 4% and choose the parameter σ for which the fewest relevance vectors are needed; the result of the experiment is:

Fig.5 RVM regression with 5% noise

where σ = 0.115.

4 Analysis of the Result
When we repeat the experiment above with the polynomial kernel, we cannot keep the tolerance under 4%, but the number of relevance vectors is smaller. Suppose there are k relevance vectors in the set
X_R = {x_{n_1}, x_{n_2}, ..., x_{n_k}}   (4-1)
The prediction model is
y(x) = Σ_{i=1}^{k} w_i K(x, x_{n_i}) + w_0   (4-2)
In this space, we hope the prediction can be expressed linearly. The Taylor expansion of (4-2) is
y(x) = y(x_0) + Σ_{i=1}^{k} w_i [dK(x, x_{n_i})/dx]_{x=x_0} (x − x_0) + (1/2!) Σ_{i=1}^{k} w_i [d²K(x, x_{n_i})/dx²]_{x=x_0} (x − x_0)² + ...   (4-4)
Let
x_0 = 0   (4-5)
Then (4-4) can be written as
y(x) = y(0) + Σ_{i=1}^{k} w_i [dK(x, x_{n_i})/dx]_{x=x_0} x + (1/2!) Σ_{i=1}^{k} w_i [d²K(x, x_{n_i})/dx²]_{x=x_0} x² + ...   (4-6)
Let
P(x) = Σ_i a_i x^i   (4-7)
and
y(x) = P(x)   (4-8)
Then
a_i = (1/i!) Σ_{j=1}^{k} w_j [d^i K(x, x_{n_j})/dx^i]_{x=x_0}   (4-9)
The derivative of
k(x) = exp(−(x − a)²/σ²)   (4-10)
is
k'(x) = −(2(x − a)/σ²) exp(−(x − a)²/σ²)   (4-11)
that is,
k'(x) = −(2(x − a)/σ²) k(x)   (4-12)
Differentiating both sides n times gives
k^(n+1)(x) = −(2/σ²)[(x − a) k^(n)(x) + n k^(n−1)(x)]   (4-13)
that is,
k^(n)(x) = −(2/σ²)[(x − a) k^(n−1)(x) + (n − 1) k^(n−2)(x)]   (4-14)
so that
a_i = k_i / σ^i   (4-15)
where k_i collects the remaining, σ-independent factors from (4-9)   (4-16)
As n → ∞, because of the n! in the denominator,
lim a_n = 0   (4-17)
so the sequence of coefficients a_i is convergent. That is to say, the smaller σ is, the more high-dimensional (higher-order) information is present, so more relevance vectors are needed to express it, and the result is of course more accurate.
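The sweep of section 3 can be reproduced with the earlier sketches. The snippet below generates the 200-point sinc data, splits it into odd/even samples, and records the number of relevance vectors and the mean absolute test error for each σ. The σ grid, the random seed, the use of NumPy's normalized sinc, and the multiplicative reading of the noise term in (3-1) are assumptions made for the sketch.

```python
import numpy as np

def sigma_sweep(k=0.0, sigmas=np.linspace(0.05, 1.0, 20), seed=0):
    """Sweep the Gaussian kernel width and record (sigma, number of RVs,
    average tolerance), reusing gaussian_design() and
    sequential_sbl_regression() from the earlier sketches."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, 200)[:, None]
    clean = np.sinc(10 * (x[:, 0] - 0.5))            # np.sinc(z) = sin(pi z)/(pi z)
    y = clean * (1 + k * rng.uniform(-1, 1, 200))    # eq. (3-1) read as multiplicative noise
    Xtr, ytr = x[0::2], y[0::2]                      # odd-numbered samples: training
    Xte, yte = x[1::2], y[1::2]                      # even-numbered samples: testing
    results = []
    for s in sigmas:
        Phi = gaussian_design(Xtr, Xtr, s)
        active, mu, _, _ = sequential_sbl_regression(Phi, ytr)
        pred = gaussian_design(Xte, Xtr[active], s) @ mu
        results.append((s, active.size, float(np.mean(np.abs(pred - yte)))))
    return results

# noise-free and 5%-noise sweeps, roughly as in Figs. 2 and 4
for s, n_rv, tol in sigma_sweep(k=0.0):
    print(f"sigma={s:.2f}  RVs={n_rv}  tolerance={tol:.3f}")
```

Plotting the number of relevance vectors and the tolerance against σ gives curves of the same shape as Figs. 2 and 4: wider kernels give sparser but less accurate models.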
5 Composite Kernel in Regression
Since the polynomial kernel can be used to control the number of relevance vectors and the Gaussian kernel performs well in accuracy, we propose a composite kernel of the two parts in the RVM. Then we can emphasize the low-dimensional components and keep the high-dimensional components.
For example, when the kernel function is
K(x_i, x_j) = 50 (x_i · x_j)³ + exp(−‖x_i − x_j‖² / σ²)   (5-1)
we repeat the experiment.
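A sketch of this composite kernel, using the constants of (5-1), is given below; swapping it for `gaussian_design()` in the `sigma_sweep()` sketch of section 3 reruns the experiment with the composite kernel. The weighting 50 and the cubic degree follow the example above; everything else is an assumption carried over from the earlier sketches.

```python
import numpy as np

def composite_design(X, centers, sigma, scale=50.0, degree=3):
    """Composite kernel of eq. (5-1): scale*(x_i . x_j)^degree + Gaussian term."""
    poly = scale * (X @ centers.T) ** degree
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return poly + np.exp(-d2 / sigma ** 2)

# e.g. Phi = composite_design(Xtr, Xtr, sigma)   # drop-in replacement for gaussian_design
```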