GenClass: A portable tool for data classification based on Grammatical Evolution
GenClass is a software tool written entirely in ANSI C++. It constructs classification programs in a C-like programming language, producing simple if-else rules that classify the input data. The dataset should conform to the format outlined below:
D
M
x11 x12 ... x1D y1
x21 x22 ... x2D y2
...
xM1 xM2 ... xMD yM
The integer value D determines the dimensionality of the problem and the value M determines the number of points in the file. Every subsequent line contains a pattern, and the final column is the real output (category) for this pattern. The number of classes is inferred from the file: the software scans it and identifies the distinct classes of the problem. The classes should be integers, with 0 assigned to the first class.
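The layout above can be read with a few lines of code. The following is a minimal Python sketch, not part of GenClass itself; the function name parse_dataset and the assumption that all values are separated by whitespace are illustrative only.

```python
from typing import List, Tuple


def parse_dataset(text: str) -> Tuple[List[List[float]], List[int], int]:
    """Parse a dataset in the D / M / patterns layout described above."""
    tokens = text.split()            # assumption: whitespace-separated values
    dimension = int(tokens[0])       # D: number of features per pattern
    count = int(tokens[1])           # M: number of patterns in the file
    values = tokens[2:]
    row_width = dimension + 1        # D features plus the class label
    if len(values) != count * row_width:
        raise ValueError(
            "expected %d values, found %d" % (count * row_width, len(values))
        )

    patterns: List[List[float]] = []
    labels: List[int] = []
    for i in range(count):
        row = values[i * row_width:(i + 1) * row_width]
        patterns.append([float(v) for v in row[:-1]])
        labels.append(int(float(row[-1])))   # classes are integers starting at 0

    # The number of classes is inferred from the distinct labels in the file.
    class_count = len(set(labels))
    return patterns, labels, class_count


# A tiny hypothetical file with D = 2 features and M = 3 patterns.
sample = """2
3
0.5 1.5 0
2.0 0.1 1
0.7 0.9 0
"""
patterns, labels, class_count = parse_dataset(sample)
print(patterns, labels, class_count)
```

A check like this can catch a wrong D or M header before the file is passed to the tool.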
The package is distributed as a zip file named GenClass-master.zip from GitHub (https://github.com/itsoulos/GenClass). On UNIX systems the user must execute the command: unzip GenClass-master. This command creates a directory named GenClass with the following contents:
- bin: A directory which is initially empty. After compilation of the package, it will contain the executable genclass.
- doc: This directory contains the documentation of the package (this file) in different formats: a LyX file, a LaTeX file and a PostScript file.
- examples: A directory that contains some test functions.
- include: A directory which contains the header files for all the classes of the package.
- src: A directory containing the source files of the package.
- Makefile: The input file to the make utility in order to build the tool. Usually, the user does not need to change this file.
- Makefile.inc: The file that contains some configuration parameters, such as the name of the C++ compiler. The user should review this file and, if needed, change it before installation.
The following steps are required in order to build the tool:
- Uncompress the tool as described in the previous section.
- cd GenClass
- Edit the file Makefile.inc and change (if needed) the configuration parameters.
- Type make.
The parameters in Makefile.inc are the following:
- CXX: The most important parameter. It specifies the name of the C++ compiler. On most systems running the GNU C++ compiler, this parameter must be set to g++.
- ROOTDIR: The location of the GenClass directory.
The outcome of the compilation is the executable genclass under the directory bin. The executable has the following command line parameters:
- -h: The program prints a help screen and then terminates.
- -c count: The integer parameter count determines the number of chromosomes in the genetic population. The default value for this parameter is 500.
- -g gens: The integer parameter gens determines the maximum number of generations allowed for the genetic algorithm. The default value is 200.
- -s srate: The double parameter srate specifies the selection rate used in the genetic algorithm. The default value for this parameter is 0.10 (10%).
- -m mrate: The double parameter mrate specifies the mutation rate used in the genetic algorithm. The default value for this parameter is 0.05 (5%).
- -l size: The integer parameter size determines the size of every chromosome in the genetic population. The default value for this parameter is 100.
- -p train_file: The string parameter train_file specifies the file containing the points that will be used as train data for the algorithm.
- -t test_file: The string parameter test_file specifies the file containing the test data for the particular problem. The file should be in the same format as the train_file.
- -w wrapping: The integer parameter wrapping determines the maximum number of wrapping events allowed. The default value for this parameter is 1.
- -f foldcount: The integer parameter foldcount specifies the number of folds to be used for cross validation. The default value for this parameter is 0 (no cross validation).
- -r seed: The integer parameter seed specifies the seed for the random number generator. It can assume any integer value.
- -o method: The string parameter method specifies the output method for the executable. The available options are:
  (a) simple: The program prints output only on termination.
  (b) csv: In every generation the program prints, in csv (comma separated values) format, the generation number, the train error and the test error. This is the default value for the string parameter method.
  (c) full: In every generation the program prints detailed information about the optimization procedure, as well as the classification error for every distinct class of the problem.
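The options listed above can be assembled programmatically when many runs are scripted. The following Python sketch only builds the argument list and uses nothing beyond the documented flags; the function name build_genclass_command and the executable path ../bin/genclass (relative to the examples directory) are assumptions made for illustration.

```python
from typing import List, Optional


def build_genclass_command(
    train_file: str,
    test_file: str,
    executable: str = "../bin/genclass",   # assumed path, relative to examples/
    chromosomes: int = 500,                 # -c, documented default
    generations: int = 200,                 # -g, documented default
    selection_rate: float = 0.10,           # -s, documented default
    mutation_rate: float = 0.05,            # -m, documented default
    folds: int = 0,                         # -f, 0 means no cross validation
    seed: Optional[int] = None,             # -r, omitted when not given
    output: str = "csv",                    # -o, documented default
) -> List[str]:
    """Return the argument list for one genclass run (nothing is executed)."""
    if output not in ("simple", "csv", "full"):
        raise ValueError("output must be simple, csv or full")
    command = [
        executable,
        "-p", train_file,
        "-t", test_file,
        "-c", str(chromosomes),
        "-g", str(generations),
        "-s", str(selection_rate),
        "-m", str(mutation_rate),
        "-f", str(folds),
        "-o", output,
    ]
    if seed is not None:
        command += ["-r", str(seed)]
    return command


command = build_genclass_command("ionosphere.train", "ionosphere.test", generations=10)
print(" ".join(command))
```

The resulting list can be passed to subprocess.run from the Python standard library once the executable has been built.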
In order to measure the efficiency of the proposed method, a series of experiments was conducted on some common classification problems. All experiments used 10-fold cross validation and were repeated 30 times, each time with a different seed for the random number generator, and the results were averaged. The following parameters were used:
- Number of chromosomes: 200
- Number of generations: 500
- Selection rate: 90%
- Mutation rate: 5%
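The evaluation protocol described above (10-fold cross validation repeated with 30 different seeds, then averaged) can be sketched as follows. This is a hedged Python illustration, not the code used for the experiments: the shuffling strategy, the function names k_fold_indices and repeated_cv_error, and the evaluate callback are all assumptions.

```python
import random
from typing import Callable, Iterable, Iterator, List, Tuple


def k_fold_indices(
    count: int, folds: int, seed: int
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists for each fold of a shuffled dataset."""
    indices = list(range(count))
    random.Random(seed).shuffle(indices)    # a different seed gives a different split
    for k in range(folds):
        test = indices[k::folds]            # every folds-th index, starting at k
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test


def repeated_cv_error(
    count: int,
    folds: int,
    seeds: Iterable[int],
    evaluate: Callable[[List[int], List[int]], float],
) -> float:
    """Average the test error returned by evaluate over all seeds and folds."""
    errors = []
    for seed in seeds:
        for train, test in k_fold_indices(count, folds, seed):
            # evaluate is a hypothetical callback: train on `train`,
            # return the test error (in percent) measured on `test`.
            errors.append(evaluate(train, test))
    return sum(errors) / len(errors)


# Example split for a dataset with 178 patterns (the size of the Wine dataset).
for train, test in k_fold_indices(178, 10, seed=1):
    print(len(train), len(test))
```

Each pattern lands in exactly one test fold per seed, so every run evaluates the whole dataset once.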
The following datasets were used:
- Wine dataset: The wine recognition dataset contains data from wine chemical analysis. It contains 178 examples of 13 features each.
- Glass dataset: The dataset contains glass component analysis for glass pieces that belong to 6 classes.
- Pima dataset: The Pima Indians Diabetes dataset contains 768 examples of 8 attributes with two categories.
- Ionosphere dataset: The ionosphere dataset contains data from the Johns Hopkins Ionosphere database.
- EEG dataset: The EEG dataset described in [1] is used here. The dataset consists of five sets (denoted as Z, O, N, F and S), each containing 100 single-channel EEG segments of 23.6 sec duration.
- Spiral artificial data: This dataset contains 1000 two-dimensional examples that belong to two classes (500 examples each).
- Wisconsin diagnostic breast cancer: The Wisconsin diagnostic breast cancer dataset (WDBC) contains data for breast tumours. It contains 569 training examples of 30 features each that are classified into two categories.
- Fertility Data Set (FERT): Semen samples provided by 100 volunteers and analysed according to the WHO 2010 criteria. It contains 100 examples of 10 features each.
- Regions Data Set: The Regions dataset is created from liver biopsy images of patients with hepatitis C [2]. The dataset includes 600 samples belonging to 6 classes.
- Thyroid Data Set: Thyroid disease records [3] with 7200 patterns of 21 features each.
- Parkinsons Data Set: This dataset [4] is composed of a range of biomedical voice measurements from 31 people, 23 with Parkinson's disease.
- Abalone Data Set: A dataset to predict the age of abalone from physical measurements [5].
- Satellite image Data Set (Satimage): The database consists of the multi-spectral values of pixels in 3x3 neighbourhoods in a satellite image, and the classification associated with the central pixel in each neighbourhood. The dataset contains 6635 patterns.
- Dermatology Data Set: The aim is to determine the type of erythemato-squamous disease. The dataset contains 366 patterns with 33 features each.
The results from the experiments are displayed in the table of experimental results. The column DATASET denotes the name of the dataset. The column NEURAL stands for the average test error from the application of a neural network to the corresponding dataset. The number of weights (hidden nodes) for the neural network was set to 10 and a BFGS variant due to Powell [6] was used to train the network. The column RBF denotes the average test error from the application of a Radial Basis Function network to the dataset. The number of hidden nodes for this network was also set to 10. Finally, the column GENCLASS denotes the average test error from the application of the proposed method to the dataset. As can be deduced from the results, the proposed method can improve classification accuracy on the majority of the datasets used.
Consider the Ionosphere dataset, available from the UCI Machine Learning Repository at the following URL: http://www.ics.uci.edu/~mlearn/MLRepository.html. The ionosphere dataset contains data from the Johns Hopkins Ionosphere database. The two-class dataset contains 351 examples of 34 features each. The dataset has been divided into two files, ionosphere.train and ionosphere.test, under the directory examples. A typical run of GenClass is:
../bin/genclass -p ionosphere.train -t ionosphere.test -g 10 -o csv
The output of this command is:
1, 15.43, 19.32
2, 15.43, 19.32
3, 15.43, 19.32
4, 13.71, 17.05
5, 12.57, 15.34
6, 12.57, 15.34
7, 12.57, 15.34
8, 12, 13.64
9, 12, 13.64
FINAL OUTPUT EXPRESSION= if(!(x7<log(cos(cos(((-788.787)+
((sin(x28)/sin(cos(((-7.17)/x34))))+(-83.6))))))|x6>x13&x7<log(x5))) CLASS=0.00
else CLASS=1.00
TRAIN ERROR = 12.00%
CLASS ERROR = 13.64%
** CONFUSION MATRIX ** Number of classes: 2
102 3
21 50
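The reported error can be cross-checked against the confusion matrix above. The short Python sketch below assumes the usual convention that diagonal entries count correctly classified patterns; the function name classification_error is illustrative and not part of GenClass.

```python
from typing import List


def classification_error(matrix: List[List[int]]) -> float:
    """Return the percentage of patterns that fall off the diagonal."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))  # assumed: diagonal = correct
    return 100.0 * (total - correct) / total


# Confusion matrix printed by the run above.
matrix = [[102, 3], [21, 50]]
print("%.2f%%" % classification_error(matrix))  # prints 13.64%
```

With 24 misclassified patterns out of 176, the result matches the CLASS ERROR line of the output.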
[1] R.G. Andrzejak, K. Lehnertz, F. Mormann, C. Rieke, P. David, and C. E. Elger, Indications of nonlinear deterministic and finite-dimensional structures in time series of brain electrical activity: Dependence on recording region and brain state, Phys. Rev. E 64, pp. 1-8, 2001.
[2] Giannakeas, N., Tsipouras, M.G., Tzallas, A.T., Kyriakidi, K., Tsianou, Z.E., Manousou, P., Hall, A., Karvounis, E.C., Tsianos, V., Tsianos, E. A clustering based method for collagen proportional area extraction in liver biopsy images (2015) Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society, EMBS, 2015-November, art. no. 7319047, pp. 3097-3100.
[3] Quinlan, J.R., Compton, P.J., Horn, K.A., and Lazurus, L. (1986). Inductive knowledge acquisition: A case study. In Proceedings of the Second Australian Conference on Applications of Expert Systems. Sydney, Australia.
[4] Max A. Little, Patrick E. McSharry, Eric J. Hunter, Lorraine O. Ramig (2008), 'Suitability of dysphonia measurements for telemonitoring of Parkinson's disease', IEEE Transactions on Biomedical Engineering 56, pp. 1015-1022, 2009.
[5] Warwick J Nash, Tracy L Sellers, Simon R Talbot, Andrew J Cawthorn and Wes B Ford (1994) The Population Biology of Abalone (Haliotis species) in Tasmania. I. Blacklip Abalone (H. rubra) from the North Coast and Islands of Bass Strait, Sea Fisheries Division, Technical Report No. 48 (ISSN 1034-3288)
[6] M.J.D. Powell, A Tolerant Algorithm for Linearly Constrained Optimization Calculations, Mathematical Programming 45, pp. 547, 1989.