 Research
 Open Access
Distance-based features in pattern classification
EURASIP Journal on Advances in Signal Processing volume 2011, Article number: 62 (2011)
Abstract
In data mining and pattern classification, feature extraction and representation methods are a very important step since the extracted features have a direct and significant impact on classification accuracy. In the literature, a number of novel feature extraction and representation methods have been proposed. However, many of them focus only on specific domain problems. In this article, we introduce a novel distance-based feature extraction method for various pattern classification problems. Specifically, two distances are extracted, which are based on (1) the distance between the data and its intra-cluster center and (2) the distance between the data and its extra-cluster centers. Experiments based on ten datasets containing different numbers of classes, samples, and dimensions are examined. The experimental results using naïve Bayes, kNN, and SVM classifiers show that concatenating the original features provided by the datasets with the distance-based features can improve classification accuracy, except on image-related datasets. In particular, the distance-based features are suitable for datasets that have smaller numbers of classes, smaller numbers of samples, and lower dimensionality of features. Moreover, two datasets with similar characteristics are further used to validate this finding. The result is consistent with the first experiment: adding the distance-based features can improve classification performance.
1. Introduction
Data mining has received unprecedented focus in recent years. It can be utilized in analyzing a huge amount of data and finding valuable information. Particularly, data mining can extract useful knowledge from the collected data and provide useful information for making decisions [1, 2]. With the rapid increase in the size of organizations' databases and data warehouses, developing efficient and accurate mining techniques has become a challenging problem.
Pattern classification is an important research topic in the fields of data mining and machine learning. In particular, it focuses on constructing a model so that the input data can be assigned to the correct category. Here, the model is also known as a classifier. Classification techniques, such as support vector machine (SVM) [3], can be used in a wide range of applications, e.g., document classification, image recognition, web mining, etc. [4]. Most of the existing approaches perform data classification based on a distance measure in a multivariate feature space.
Because of the importance of classification techniques, the focus of our attention is placed on the approach for improving classification accuracy. For any pattern classification problem, it is very important to choose appropriate or representative features since they have a direct impact on the classification accuracy. Therefore, in this article, we introduce novel distance-based features to improve classification accuracy. Specifically, the distances between the data and cluster centers are considered. This leads to the intra-cluster distance between the data and the cluster center in the same cluster, and the extra-cluster distance between the data and other cluster centers.
The idea behind the distance-based features is to extend and take advantage of the centroid-based classification approach [5]; i.e., all the centroids over a given dataset usually have discrimination capabilities for distinguishing data between different classes. Therefore, the distance between a specific data sample and its nearest centroid, together with the distances between that sample and the other centroids, should be able to provide valuable information for classification.
The rest of the article is organized as follows. Section 2 briefly describes feature selection and several classification techniques; related work focusing on extracting novel features is also reviewed. Section 3 introduces the proposed distance-based feature extraction method. Section 4 presents the experimental setup and results. Finally, the conclusion is provided in Section 5.
2. Literature review
2.1. Feature selection
Feature selection can be considered a combinatorial optimization problem. The goal of feature selection is to select the most discriminative features from the original features [6]. In many pattern classification problems, we are often confronted with the curse of dimensionality; i.e., the raw data contain too many features. Therefore, it is common practice to remove redundant features so that efficiency and accuracy can be improved [7, 8].
To perform appropriate feature selection, the following considerations should be taken into account [9]:

1.
Accuracy: Feature selection can help us exclude irrelevant features from the raw data. These irrelevant features usually have a disruptive effect on classification accuracy. Therefore, classification accuracy can be improved by filtering out the irrelevant features.

2.
Operation time: In general, the operation time is proportional to the number of selected features. Therefore, we can effectively improve classification efficiency using feature selection.

3.
Sample size: The more samples we have, the more features can be selected.

4.
Cost: Since it takes time and money to collect data, excessive features would definitely incur additional cost. Therefore, feature selection can help us to reduce the cost in collecting data.
In general, there are two approaches to dimensionality reduction, namely, feature selection and feature extraction. In contrast to feature selection, feature extraction performs a transformation or combination of the original features [10]. In other words, feature selection finds the best feature subset from the original feature set, whereas feature extraction projects the original features onto a subspace where classification accuracy can be improved.
In the literature, there are many approaches to dimensionality reduction. Principal component analysis (PCA) is one of the most widely used techniques to perform this task [11–13].
The origin of PCA can be traced back to 1901 [14], and it is an approach for multivariate analysis. In a real-world application, the features from different sources are more or less correlated. Therefore, one can develop a more efficient solution by taking these correlations into account. The PCA algorithm is based on the correlation between features and finds a lower-dimensional subspace where the variance is maximized. The goal of PCA is to use a few extracted features to represent the distribution of the original data. The PCA algorithm can be summarized in the following steps:

1.
Compute the mean vector μ and the covariance matrix S of the input data.

2.
Compute the eigenvalues and eigenvectors of S. The eigenvalues and the corresponding eigenvectors are sorted in descending order of the eigenvalues.

3.
The transformation matrix contains the sorted eigenvectors. The number of eigenvectors preserved in the transformation matrix can be adjusted by users.

4.
A lower-dimensional feature vector is obtained by subtracting the mean vector μ from an input datum and then multiplying the result by the projection matrix.
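The four steps above can be sketched as follows. This is a minimal NumPy illustration, not the implementation used in the experiments; the function name and arguments are our own:

```python
import numpy as np

def pca_transform(X, n_components):
    """Illustrative sketch of the four PCA steps described above."""
    # 1. Compute the mean vector and the covariance matrix of the input data
    mu = X.mean(axis=0)
    S = np.cov(X, rowvar=False)
    # 2. Eigen-decomposition of S, sorted by descending eigenvalue
    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]
    # 3. The transformation matrix keeps a user-chosen number of eigenvectors
    W = eigvecs[:, order[:n_components]]
    # 4. Subtract the mean, then multiply by the projection matrix
    return (X - mu) @ W
```

The number of retained eigenvectors (`n_components`) trades reconstruction fidelity for dimensionality, exactly as step 3 describes.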
2.2. Pattern clustering
The aim of clustering analysis is to find groups of data samples having similar properties. This is an unsupervised learning method because it does not require the category information associated with each sample [15]. In particular, clustering algorithms can be divided into five categories [16], namely, hierarchical, partitioning, density-based, grid-based, and model-based methods.
The k-means algorithm is a representative approach belonging to the partitioning category. In addition, it is a simple, efficient, and widely used clustering method. Given k clusters, each sample is first randomly assigned to a cluster. By doing so, we can find the initial locations of the cluster centers. We can then reassign each sample to the nearest cluster center. After the reassignment, the locations of the cluster centers are updated. The previous steps are iterated until some termination condition is satisfied.
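A minimal sketch of these steps in NumPy (illustrative only; the names and the empty-cluster re-seeding are our own choices, not from the article):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """k-means sketch: random initial assignment, then alternate
    reassignment to the nearest center and center updates."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=len(X))          # random initial assignment
    for _ in range(n_iter):
        # Update each cluster center (re-seed a center if its cluster is empty)
        centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j)
            else X[rng.integers(len(X))]
            for j in range(k)])
        # Reassign every sample to its nearest cluster center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if np.array_equal(new_labels, labels):        # termination condition
            break
        labels = new_labels
    return centers, labels
```

At termination, every sample is assigned to its nearest center, which is the fixed point the iteration converges to.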
2.3. Pattern classification
The goal of pattern classification is to predict the category of the input data using its attributes. In particular, a certain number of training samples are available for each class, and they are used to train the classifier. In addition, each training sample is represented by a number of measurements (i.e., a feature vector) corresponding to a specific class. This is known as supervised learning [15, 17].
In this article, we will utilize three popular classification techniques, namely, naïve Bayes, SVM, and k-nearest neighbor (kNN), to evaluate the proposed distance-based features.
2.3.1. Naïve Bayes
The naïve Bayes classifier is a probabilistic classifier based on Bayes' theorem [15]. It requires all assumptions to be explicitly built into models which are then used to derive 'optimal' decision/classification rules. It can be used to represent the dependence between random variables (features) and to give a concise and tractable specification of the joint probability distribution for a domain. It is constructed by using the training data to estimate the probability of each class given the feature vector of a new instance. Given an example represented by the feature vector X, Bayes' theorem provides a method to compute the probability that X belongs to class C_i, denoted p(C_i|X):

p(C_i|X) = p(X|C_i) p(C_i) / p(X)
i.e., the naïve Bayes classifier learns the conditional probability of each attribute x_j (j = 1, 2,..., N) of X given the class label C_i. Therefore, the classification problem can be stated as 'given a set of observed features x_j from an object X, classify X into one of the classes'.
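A minimal sketch of such a classifier, under the common assumption of Gaussian class-conditional densities with conditionally independent features (the class and method names are illustrative, not from the article):

```python
import numpy as np

class GaussianNaiveBayes:
    """Naive Bayes sketch: per-class Gaussian per feature, maximum a
    posteriori prediction (illustrative, not the experimental code)."""
    def fit(self, X, y):
        self.classes = np.unique(y)
        self.mu = np.array([X[y == c].mean(axis=0) for c in self.classes])
        # Small floor on the variance avoids division by zero
        self.var = np.array([X[y == c].var(axis=0) + 1e-9 for c in self.classes])
        self.prior = np.array([np.mean(y == c) for c in self.classes])
        return self

    def predict(self, X):
        # log p(C_i) + sum_j log p(x_j | C_i) for each class; pick the argmax
        log_lik = -0.5 * (np.log(2 * np.pi * self.var)[None]
                          + (X[:, None, :] - self.mu[None]) ** 2
                          / self.var[None]).sum(axis=2)
        return self.classes[np.argmax(np.log(self.prior) + log_lik, axis=1)]
```

The per-feature product of likelihoods (a sum in log space) is exactly the conditional-independence assumption stated above.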
2.3.2. Support vector machines
An SVM [3] has been widely applied in many pattern classification problems. It is designed to separate a set of training vectors belonging to two different classes, (x_1, y_1), (x_2, y_2),..., (x_m, y_m), where x_i ∈ R^d denotes a vector in a d-dimensional feature space and y_i ∈ {−1, +1} is a class label. In particular, the input vectors are mapped into a new, higher-dimensional feature space, denoted Φ: R^d → H^f, where d < f. Then, an optimal separating hyperplane in the new feature space is constructed using a kernel function K(x_i, x_j), which computes the inner product between input vectors x_i and x_j, where K(x_i, x_j) = Φ(x_i) · Φ(x_j).
All vectors lying on one side of the hyperplane are labeled '−1', and all vectors lying on the other side are labeled '+1'. The training instances that lie closest to the hyperplane in the transformed space are called support vectors.
2.3.3. k-nearest neighbor
The kNN classifier is a conventional nonparametric classifier [15]. To classify an unknown instance, represented by some feature vector as a point in the feature space, the kNN classifier calculates the distances between the point (i.e., the unknown instance) and the points in the training dataset. Then, it assigns the point to the majority class among its k nearest neighbors (where k is an integer).
In the process of creating a kNN classifier, k is an important parameter, and different k values lead to different performances. If k is very large, the large number of neighbors used for classification increases the classification time and can degrade the classification accuracy.
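The distance-then-vote procedure can be sketched as follows (an illustrative NumPy version with Euclidean distances; the function name is ours):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, X_test, k=3):
    """kNN sketch: majority vote among the k nearest training points."""
    preds = []
    for x in X_test:
        dists = np.linalg.norm(X_train - x, axis=1)   # distances to all training points
        nearest = np.argsort(dists)[:k]               # indices of the k nearest
        preds.append(Counter(y_train[nearest]).most_common(1)[0][0])
    return np.array(preds)
```

Note the O(mn) cost per query batch: every test point is compared against all m training points, which is why large k and large training sets increase classification time.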
2.4. Related work of feature extraction
In this study, the main focus is placed on extracting novel distancebased features so that classification accuracy can be improved. The followings summarize some related studies proposing new feature extraction and representation methods for some pattern classification problems. In addition, the contributions of these research works are briefly discussed.
Tsai and Lin [18] propose a triangle area-based nearest neighbor approach and apply it to the problem of intrusion detection. Each data sample is represented by a number of triangle areas as its feature vector, in which a triangle area is based on the data sample, its cluster center, and one of the other cluster centers. Their approach achieves a high detection rate and a low false positive rate on the KDD Cup 99 dataset.
Lin [19] proposes an approach called centroid-based and nearest neighbor (CANN). This approach uses cluster centers and their nearest neighbors to yield a one-dimensional feature and can effectively improve the performance of an intrusion detection system. The experimental results over the KDD Cup 99 dataset indicate that CANN can improve the detection rate and reduce computational cost.
Zeng et al. [20] propose a novel feature extraction method based on the Delaunay triangle. In particular, a topological structure associated with the handwritten shape can be represented by the Delaunay triangle. Then, an HMM-based recognition system is used to demonstrate that their representation can achieve good performance in the handwriting recognition problem.
Xue et al. [21] propose a Bayesian shape model for facial feature extraction. Their model can tolerate local and global deformation on a human face. The experimental results demonstrate that their approach provides better accuracy in locating facial features than the active shape model.
Choi and Lee [22] propose a feature extraction method based on the Bhattacharyya distance. They consider the classification error as a criterion for extracting features, and an iterative gradient descent algorithm is utilized to minimize the estimated classification error. Their feature extraction method compares favorably with conventional methods over remotely sensed data.
To sum up, the limitation of much of the related work on extracting novel features is that it focuses only on solving specific domain problems. In addition, these studies use their proposed features to compare directly with the original features in terms of classification accuracy and/or errors; i.e., they do not consider 'fusing' the original and novel features into another new feature representation for further comparisons. Therefore, the novel distance-based features proposed in this article are examined over a number of different pattern classification problems, and the distance-based features and the original features are concatenated into another new feature representation for classification.
3. Distancebased features
In this section, we will describe the proposed method in detail. The aim of our approach is to augment new features to the raw data so that the classification accuracy can be improved.
3.1. The extraction process
The proposed distance-based feature extraction method can be divided into three main steps. In the first step, given a dataset, the cluster center or centroid for every class is identified. Then, in the second step, the distances between each data sample and the centroids are calculated. The final step is to extract the two distance-based features from the distances calculated in the second step. The first distance-based feature is the distance between the data sample and its own cluster center. The second one is the sum of the distances between the data sample and the other cluster centers.
As a result, each of the data samples in the dataset can be represented by the two distance-based features. There are two strategies to examine the discrimination power of these two distance-based features. The first one is to use the two distance-based features alone for classification. The second one is to combine the original features with the new distance-based features as a longer feature vector for classification.
3.2. Cluster center identification
To identify the cluster centers from a given dataset, the k-means clustering algorithm is used to cluster the input data in this article. It is noted that the number of clusters is determined by the number of classes or categories in the dataset. For example, if the dataset consists of three categories, then the value of k in the k-means algorithm is set to 3.
3.3. Distances from intracluster center
After the cluster center for each class is identified, the distance between a data sample and its cluster center (or intra-cluster center) can be calculated. In this article, the Euclidean distance is utilized. Given two data points A = [a_1, a_2,..., a_n] and B = [b_1, b_2,..., b_n], the Euclidean distance between A and B is given by

dis(A, B) = √( Σ_{i=1}^{n} (a_i − b_i)² )
Figure 1 shows an example of the distance between a data sample and its cluster center, where cluster centers are denoted by {C_j | j = 1, 2, 3} and data samples are denoted by {D_i | i = 1, 2,..., 8}. In this example, data point D_7 is assigned to the third cluster (C_3) by the k-means algorithm. As a result, the distance from D_7 to its intra-cluster center (C_3) is determined by the Euclidean distance from D_7 to C_3.
In this article, we will utilize the distance between a data sample and its intra-cluster center as a new feature, called Feature 1. Given a datum D_i belonging to cluster C_j, its Feature 1 is given by

Feature 1 = dis(D_i, C_j)
where dis(D_{ i } , C_{ j } ) denotes the Euclidean distance from D_{ i } to C_{ j } .
3.4. Distances from extracluster center
On the other hand, we also calculate the sum of the distances between the data sample and its extra-cluster centers and use it as the second feature. Let us look at the graphical example shown in Figure 2, where cluster centers are denoted by {C_j | j = 1, 2, 3} and data samples are denoted by {D_i | i = 1, 2,..., 8}. Since the datum D_6 is assigned to the second cluster (C_2) by the k-means algorithm, the distances between D_6 and its extra-cluster centers include dis(D_6, C_1) and dis(D_6, C_3).
Here, we define another new feature, called Feature 2, as the sum of the distances between a data sample and its extra-cluster centers. Given a datum D_i belonging to cluster C_j, its Feature 2 is given by

Feature 2 = Σ_{l=1, l≠j}^{k} dis(D_i, C_l)
where k is the number of clusters identified and dis(D_i, C_l) denotes the Euclidean distance from D_i to cluster center C_l.
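Given a dataset, its cluster assignments, and the cluster centers, the two features can be computed roughly as follows (an illustrative NumPy sketch; the article does not provide code, and the function name is ours):

```python
import numpy as np

def distance_based_features(X, labels, centers):
    """Append Feature 1 (distance to the intra-cluster center) and
    Feature 2 (sum of distances to the extra-cluster centers) to X."""
    # Pairwise Euclidean distances from every sample to every center: (n, k)
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    f1 = dists[np.arange(len(X)), labels]   # distance to the sample's own center
    f2 = dists.sum(axis=1) - f1             # sum over all the other centers
    return np.column_stack([X, f1, f2])     # original features + the two new ones
```

The returned array corresponds to the '+2D' representation used in the experiments: the original feature vector concatenated with the two distance-based features.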
3.5. Theoretical analysis
To justify the use of the distance-based features, it is necessary to analyze their impact on classification accuracy. For the sake of simplicity, let us consider the results when the proposed features are applied to two-category classification problems. The generalization of these results to multi-category cases is straightforward, though much more involved. The classification accuracy can readily be evaluated if the class-conditional densities $\{p(x|C_k)\}_{k=1}^{2}$ are multivariate normal with identical covariance matrices, i.e.,
where x is a d-dimensional feature vector, μ^{(k)} is the mean vector associated with class k, and Σ is the covariance matrix. If the prior probabilities are equal, it follows that the Bayes error rate is given by

P(error) = (1/√(2π)) ∫_{r/2}^{∞} e^{−u²/2} du
where r is the Mahalanobis distance between the two class means:

r² = (μ^{(1)} − μ^{(2)})^T Σ^{−1} (μ^{(1)} − μ^{(2)})
In case the d features are conditionally independent, the squared Mahalanobis distance between the two means simplifies to

r² = Σ_{i=1}^{d} (μ_i^{(1)} − μ_i^{(2)})² / σ_i²
where ${\mu}_i^{(k)}$ denotes the mean of the ith feature belonging to class k, and ${\sigma}_i^{2}$ denotes the variance of the ith feature. This shows that adding a new feature, whose mean values for the two categories are different, can help to reduce the error rate.
Now we can calculate the expected values of the proposed features and see what the implications of this result are for the classification performance. We know that Feature 1 is defined as the distance between each data point and its class mean, i.e.,
Thus, the mean of Feature 1 is given by
This reveals that the mean value of Feature 1 is determined by the trace of the covariance matrix associated with each category. In practical applications, the covariance matrices are generally different for each category. Naturally, one can expect to improve classification accuracy by augmenting Feature 1 to the raw data. If the classconditional densities are distributed more differently, then the Feature 1 will contribute more to reducing error rate.
Similarly, Feature 2 is defined as the sum of the distances from a data point to the centroids of other categories. Given a data point x belonging to class k, we obtain
This allows us to write the mean of Feature 2 as
where K denotes the number of categories and ‖·‖ denotes the L_2 norm. As mentioned before, the first term in Equation 12 usually differs for each category. On the other hand, the distances between class means are unlikely to be identical in real-world applications, and thus the second term in Equation 12 tends to differ between classes. So, we may conclude that Feature 2 also contributes to reducing the probability of classification error.
4. Experiments
4.1. Experimental setup
4.1.1. The datasets
To evaluate the effectiveness of the proposed distance-based features, ten different datasets from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/index.html) are considered for the following experiments. They are Abalone, Balance Scale, Corel, Tic-Tac-Toe Endgame, German, Hayes-Roth, Ionosphere, Iris, Optical Recognition of Handwritten Digits, and Teaching Assistant Evaluation. More details regarding the downloaded datasets, including the number of classes, the number of data samples, and the dimensionality of the feature vectors, are summarized in Table 1.
4.1.2. The classifiers
For pattern classification, three popular classification algorithms are applied, namely, SVM, kNN, and naïve Bayes. These classifiers are trained and tested by tenfold cross validation. One research objective is to investigate whether different classification approaches yield consistent results. It is worth noting that the parameter values associated with each classifier have a direct impact on the classification accuracy. To perform a fair comparison, one should carefully choose appropriate parameter values to construct a classifier. The selection of the optimal parameter value for each classifier is described below.
For SVM, we utilize the LibSVM package [23]. It has been documented in the literature that the radial basis function (RBF) kernel achieves good classification performance in a wide range of applications. For this reason, RBF is used as the kernel function to construct the SVM classifier. For the RBF kernel, five gamma ('γ') values, i.e., 0, 0.1, 0.3, 0.5, and 1, are examined, so that the best SVM classifier, which provides the highest classification accuracy, can be identified.
For the kNN classifier, the choice of k is a critical step. In this article, the k values from 1 to 15 are examined. Similar to SVM, the value of k with the highest classification accuracy is used to compare with SVM and naïve Bayes.
Finally, the parameter values of naïve Bayes, i.e., mean and covariance of Gaussian distribution, are estimated by maximum likelihood estimators.
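The k selection procedure described above, trying k = 1 to 15 and keeping the value with the highest cross-validated accuracy, can be sketched as follows. This is an illustrative simplification with a plain Euclidean kNN vote, not the experimental code:

```python
import numpy as np

def select_k_by_cv(X, y, ks=range(1, 16), folds=10, seed=0):
    """Pick the k with the highest cross-validated kNN accuracy (sketch)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    splits = np.array_split(idx, folds)               # roughly equal folds
    best_k, best_acc = None, -1.0
    for k in ks:
        accs = []
        for f in range(folds):
            test = splits[f]
            train = np.concatenate(splits[:f] + splits[f + 1:])
            correct = 0
            for i in test:
                d = np.linalg.norm(X[train] - X[i], axis=1)
                nearest = y[train][np.argsort(d)[:k]]  # labels of k nearest
                vals, counts = np.unique(nearest, return_counts=True)
                correct += vals[np.argmax(counts)] == y[i]
            accs.append(correct / len(test))
        acc = float(np.mean(accs))
        if acc > best_acc:
            best_k, best_acc = k, acc
    return best_k, best_acc
```

The same grid-search pattern applies to the SVM gamma values: evaluate each candidate by cross validation and keep the best-performing setting.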
4.2. Pretest analyses
4.2.1. Principal component analysis
Before examining the classification performance, PCA [24] is used to analyze the level of variance (i.e., discrimination power) of the proposed distancebased features. In particular, the communality, which is the output of PCA, is used to analyze and compare the discrimination power of the distancebased features (also called variables here). The communality measures the percent of variance in a given variable explained by all the factors jointly and may be interpreted as the reliability of the indicator. In this experiment, we use the Euclidean distance to calculate the distancebased features. Table 2 shows the analysis result.
As shown in Table 2, adding the distance-based features can improve the discrimination power over most of the chosen datasets; i.e., the average communality when using the distance-based features is higher than that when using the original features alone. In addition, using the distance-based features provides an average communality above 0.7.
On the other hand, since the PCA result of Feature 1 is lower than that of Feature 2, the average standard deviation when using the distance-based features is slightly higher than when using the original features alone. However, since using the two distance-based features can provide a higher level of variance over most of the datasets, both features are considered together in this article as the main research focus.
4.2.2. Class separability
Furthermore, class separability [25] is considered before examining the classification performance. The class separability is given by

J = tr(S_W^{−1} S_B)

where

S_B = Σ_{j=1}^{k} N_j (C_j − C)(C_j − C)^T,  S_W = Σ_{j=1}^{k} Σ_{D_i ∈ C_j} (D_i − C_j)(D_i − C_j)^T

and N_j is the number of samples in class C_j and C is the mean of the total dataset. The class separability is large when the between-class scatter is large and the within-class scatter is small. Therefore, it can be regarded as a reasonable indicator of classification performance.
Besides examining the impact of the proposed distance-based features using the Euclidean distance on the classification performance, the chi-squared and Mahalanobis distances are also considered. This is because they have quite natural and useful interpretations in discriminant analysis. Consequently, we calculate the proposed distance-based features using these three distance metrics for the analysis.
For the chi-squared distance, given n-dimensional vectors a and b, the chi-squared distance between them can be defined as
or
On the other hand, the Mahalanobis distance from D_i to C_j is given by

dis_M(D_i, C_j) = √( (D_i − C_j)^T Σ_j^{−1} (D_i − C_j) )

where Σ_j is the covariance matrix of the jth cluster. It is particularly useful when each cluster has an asymmetric distribution.
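A short sketch of computing this distance, assuming the cluster covariance matrix is non-singular (illustrative only):

```python
import numpy as np

def mahalanobis(x, center, cov):
    """Mahalanobis distance from a point to a cluster center.
    Assumes cov is invertible; reduces to Euclidean distance when cov = I."""
    diff = x - center
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))
```

With the identity covariance matrix this coincides with the Euclidean distance; with a cluster-specific covariance it rescales each direction by the cluster's spread, which is what makes it useful for asymmetric clusters.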
In Table 3, the effect of using different distance-based features is rated in terms of class separability. It is noted that for the high-dimensional datasets, we encounter the small sample size problem, which results in the singularity of the within-class scatter matrix S_W [26]. For this reason, we cannot calculate the class separability for the high-dimensional datasets. 'Original' denotes the original feature vectors provided by the UCI Machine Learning Repository. '+2D' means that we add Features 1 and 2 to the original features.
As shown in Table 3, the class separability is consistently improved over that in the original space by adding the Euclidean distance-based features. For the chi-squared distance metric, the results of using $dis_{\chi_1^2}$ and $dis_{\chi_2^2}$ are denoted by 'chi-square 1' and 'chi-square 2', respectively. Evidently, the classification performance can always be further enhanced by replacing the Euclidean distance with one of the chi-squared distances. Moreover, reliable improvement can be achieved by augmenting the Mahalanobis distance-based feature to the original data.
4.3. Classification results
4.3.1. Classification accuracy
Table 4 shows the classification performance of naïve Bayes, kNN, and SVM based on the original features, the combined original and distance-based features, and the distance-based features alone, respectively, over the ten datasets. The distance-based features are calculated using the Euclidean distance. It is noted that in Table 4, '2D' denotes that the two distance-based features are used alone for classifier training and testing. For the column of dimensions, the numbers in parentheses give the dimensionality of the feature vectors utilized in a particular experiment.
As shown in Table 4, using the distance-based features alone yields the worst results. In other words, classification accuracy cannot be improved by utilizing the two new features while discarding the original features. However, when the original features are concatenated with the new distance-based features, the rate of classification accuracy is improved on average. It is worth noting that the improvement is observed across different classifiers. Overall, these experimental results agree well with our expectation; i.e., classification accuracy can be effectively improved by including the new distance-based features with the original features.
In addition, the results indicate that the distance-based features do not perform well on the high-dimensional image-related datasets, such as the Corel, Iris, and Optical Recognition of Handwritten Digits datasets. This is primarily due to the curse of dimensionality [15]. In particular, the demand for the amount of training samples grows exponentially with the dimensionality of the feature space. Therefore, adding new features beyond a certain limit has the consequence of insufficient training. As a result, we have worse rather than better performance on the high-dimensional datasets.
4.3.2. Comparisons and discussions
Table 5 compares different classification performances using the original features and the combined original and distancebased features. It is noted that the classification accuracy by the original features is the baseline for the comparison. This result clearly shows that considering the distancebased features can provide some level of performance improvements over the chosen datasets except the highdimensional ones.
We also calculate the proposed features using different distance metrics. By choosing a fixed classifier (1NN), we can evaluate the classification performance of different distance metrics over different datasets. The results are summarized in Table 6. Once again, we observe that the classification accuracy is generally improved by concatenating the distancebased features to the original feature. In some cases, e.g., Abalone, Balance Scale, German, and HayesRoth, the proposed features have led to significant improvements in classification accuracy.
Since we observe consistent improvement across the three different classifiers over five datasets (the Balance Scale, German, Ionosphere, Teaching Assistant Evaluation, and Tic-Tac-Toe Endgame datasets), the relationship between classification accuracy and these datasets' characteristics is examined. Table 7 shows the five datasets that yield classification improvements using the distance-based features. Here, another new feature is obtained by adding the two distance-based features together. Thus, we use '+3D' to denote that the original features have been augmented with the two distance-based features and their sum. It is noted that the distance-based features are calculated using the Euclidean distance.
Among these five datasets, the number of classes is smaller than or equal to 3, the dimension of the original features is smaller than or equal to 34, and the number of samples is smaller than or equal to 1,000. Therefore, this indicates that the proposed distance-based features are suitable for datasets whose numbers of classes, numbers of samples, and feature dimensionality are relatively small.
4.4. Further validations
Based on our observation in the previous section, two datasets that have similar characteristics to these five datasets are further used to verify our conjecture. These two datasets are the Australian and Japanese datasets, which are also available from the UCI Machine Learning Repository. Table 8 shows the information of these two datasets.
Table 9 shows the rates of classification accuracy obtained by naïve Bayes, kNN, and SVM using the 'original' and '+2D' features, respectively. Similar to the findings in the previous sections, classification accuracy is improved by concatenating the original features with the distance-based features.
5. Conclusion
Pattern classification is one of the most important research topics in the fields of data mining and machine learning. In addition, improving classification accuracy is the major research objective. Since feature extraction and representation have a direct and significant impact on classification performance, we introduce novel distance-based features to improve classification accuracy over datasets from various domains. In particular, the novel features are based on the distances between the data and its intra- and extra-cluster centers.
First, we show the discrimination power of the distance-based features through the PCA and class separability analyses. Then, the experiments using naïve Bayes, kNN, and SVM classifiers over ten datasets from various domains show that concatenating the original features with the distance-based features can provide some level of classification improvement over the chosen datasets, except for the high-dimensional image-related datasets. In addition, the datasets that produce higher rates of classification accuracy using the distance-based features have smaller numbers of data samples, smaller numbers of classes, and lower dimensionalities. Two validation datasets with similar characteristics are further used, and the result is consistent with this finding.
To sum up, the experimental results (see Table 7) demonstrate the applicability of our method to several real-world problems, especially when the datasets are relatively small. In other words, our method is most useful for problems whose datasets contain about 4-34 features and 150-1,000 data samples, e.g., bankruptcy prediction and credit scoring. However, many other problems, such as text classification, involve very large numbers of features and data samples. For these, our method can be applied after feature selection and instance selection have reduced the dimensionality and the number of samples, respectively; this issue will be considered in our future work. For example, given a large-scale dataset, a feature selection method such as a genetic algorithm can be employed to reduce its dimensionality. Once more representative features are selected, the next stage is to extract the proposed distance-based features from them. The classification performance can then be compared across the original dataset, the dataset with feature selection, and the dataset with the combination of feature selection and our method.
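The selection stage of this pipeline could be sketched as follows. The scoring rule below is a crude, hypothetical stand-in for the genetic-algorithm selection mentioned above, used only to show where dimensionality reduction would slot in before the distance-based features are extracted.

```python
import numpy as np

def select_top_k(X, y, k):
    """Rank features by a simple separation score (spread of the per-class
    means over the average within-class spread) and keep the top k.
    Hypothetical stand-in for the genetic-algorithm selection in the text."""
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    stds = np.array([X[y == c].std(axis=0) for c in classes])
    # Higher score = class means far apart relative to within-class spread
    score = np.ptp(means, axis=0) / (stds.mean(axis=0) + 1e-12)
    keep = np.argsort(score)[::-1][:k]   # indices of the k best features
    return X[:, keep], keep
```

The distance-based features would then be extracted from the reduced matrix rather than from the full-dimensional one.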
References
 1.
Fayyad UM, Piatetsky-Shapiro G, Smyth P: From data mining to knowledge discovery in databases. AI Mag 1996, 17(3):37-54.
 2.
Frawley WJ, Piatetsky-Shapiro G, Matheus CJ: Knowledge Discovery in Databases: An Overview. In Knowledge Discovery in Databases. AAAI Press, Menlo Park, CA; 1991:1-27.
 3.
Vapnik VN: The Nature of Statistical Learning Theory. Springer, New York; 1995.
 4.
Keerthi S, Chapelle O, DeCoste D: Building support vector machines with reduced classifier complexity. J Mach Learn Res 2006, 7:1493-1515.
 5.
Cardoso-Cachopo A, Oliveira A: Semi-supervised single-label text categorization using centroid-based classifiers. Proceedings of the ACM Symposium on Applied Computing 2007, 844-851.
 6.
Liu H, Motoda H: Feature Selection for Knowledge Discovery and Data Mining. Kluwer Academic Publishers, Boston; 1998.
 7.
Blum A, Langley P: Selection of relevant features and examples in machine learning. Artif Intell 1997, 97(1-2):245-271. 10.1016/S0004-3702(97)00063-5
 8.
Koller D, Sahami M: Toward optimal feature selection. Proceedings of the Thirteenth International Conference on Machine Learning 1996, 284-292.
 9.
Yang J, Honavar V: Feature subset selection using a genetic algorithm. IEEE Intell Syst 1998, 13(2):44-49. 10.1109/5254.671091
 10.
Jain AK, Duin RPW, Mao J: Statistical pattern recognition: a review. IEEE Trans Pattern Anal Mach Intell 2000, 22(1):4-37. 10.1109/34.824819
 11.
Canbas S, Cabuk A, Kilic SB: Prediction of commercial bank failure via multivariate statistical analysis of financial structures: the Turkish case. Eur J Oper Res 2005, 166:528-546. 10.1016/j.ejor.2004.03.023
 12.
Min SH, Lee J, Han I: Hybrid genetic algorithms and support vector machines for bankruptcy prediction. Expert Syst Appl 2006, 31:652-660. 10.1016/j.eswa.2005.09.070
 13.
Tsai CF: Feature selection in bankruptcy prediction. Knowledge-Based Syst 2009, 22(2):120-127. 10.1016/j.knosys.2008.08.002
 14.
Pearson K: On lines and planes of closest fit to systems of points in space. Philos Mag 1901, 2:559-572.
 15.
Duda RO, Hart PE, Stork DG: Pattern Classification. 2nd edition. Wiley, New York; 2001.
 16.
Han J, Kamber M: Data Mining: Concepts and Techniques. 2nd edition. Morgan Kaufmann Publishers, USA; 2001.
 17.
Baralis E, Chiusano S: Essential classification rule sets. ACM Trans Database Syst (TODS) 2004, 29(4):635-674. 10.1145/1042046.1042048
 18.
Tsai CF, Lin CY: A triangle area based nearest neighbors approach to intrusion detection. Pattern Recogn 2010, 43:222-229. 10.1016/j.patcog.2009.05.017
 19.
Lin JS: CANN: combining cluster centers and nearest neighbors for intrusion detection systems. Master's Thesis, National Chung Cheng University, Taiwan; 2009.
 20.
Zeng W, Meng XX, Yang CL, Huang L: Feature extraction for online handwritten characters using Delaunay triangulation. Comput Graph 2006, 30:779-786. 10.1016/j.cag.2006.07.007
 21.
Xue Z, Li SZ, Teoh EK: Bayesian shape model for facial feature extraction and recognition. Pattern Recogn 2003, 36:2819-2833. 10.1016/S0031-3203(03)00181-X
 22.
Choi E, Lee C: Feature extraction based on the Bhattacharyya distance. Pattern Recogn 2003, 36:1703-1709. 10.1016/S0031-3203(03)00035-9
 23.
Chang CC, Lin CJ: LIBSVM: a library for support vector machines. 2001. [http://www.csie.ntu.edu.tw/~cjlin/libsvm]
 24.
Hotelling H: Analysis of a complex of statistical variables into principal components. J Educ Psychol 1933, 24:498-520.
 25.
Fukunaga K: Introduction to Statistical Pattern Recognition. Academic Press; 1990.
 26.
Huang R, Liu Q, Lu H, Ma S: Solving the small sample size problem of LDA. International Conference on Pattern Recognition 2002, 3: 30029.
Acknowledgements
The authors have been partially supported by the National Science Council, Taiwan (Grant Nos. 98-2221-E-194-039-MY3 and 99-2410-H-008-033-MY2).
Additional information
Competing interests
The authors declare that they have no competing interests.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Tsai, C.-F., Lin, W.-Y., Hong, Z.-F. et al. Distance-based features in pattern classification. EURASIP J. Adv. Signal Process. 2011, 62 (2011). https://doi.org/10.1186/1687-6180-2011-62
Keywords
 distance-based features
 feature extraction
 feature representation
 data mining
 cluster center
 pattern classification