
JSM Computer Science and Engineering

K-NN Classification of Mass Spectra Data for Diagnosing Alzheimer’s Disease

Research Article | Open Access | Volume 1 | Issue 1

  • 1. Department of Computer Science and Engineering, Oakland University, USA
  • 2. William Beaumont Hospital, Royal Oak, USA
Corresponding Authors
Gautam Singh, Department of Computer Science and Engineering, Oakland University, Rochester, MI, 48309, USA Tel: 1248-370-2129
Abstract

Diverse algorithms and methods are needed to meet the ever-increasing need to adequately harness Mass Spectrometer generated data. The unique nature and structure of this data require a high level of expertise and rigorous algorithms to realize its full benefits. The methodology of this study covers feature selection based on direct observation of variables and their inter-relationships, the Jackknife technique for data sampling, and matrix-to-vector decomposition, and successfully classifies Alzheimer's disease patients into three disease stages: age-matched controls without any evidence of dementia, patients with mild cognitive impairment, and patients with clinical symptoms of Alzheimer's disease (AD). Our model extends the use and principle of the K-nearest neighbor (KNN) algorithm and also presents a modification of the Euclidean distance formula. Hitherto, there exists no clinical diagnostic tool for AD; in lieu of one, patients' cognitive abilities are clinically followed up over a period of time (possibly months) to make a diagnosis. This practice often leads to inconclusive diagnoses, and results obtained from it are not generalizable. This study provides a platform for immediate classification and correctly indicates test data sets predisposed to AD with 75% accuracy (giving a probability of 0.13 of committing a type II error) without corroborating clinical records.

Index Terms

Feature matrix; Point-to-point discrimination; Protein biomarkers; Jackknife

Citation

Anyaiwe DEO, Wilson GD, Geddes TJ, Singh GB (2016) K-NN Classification of Mass Spectra Data for Diagnosing Alzheimer’s Disease. Comput Sci Eng 1(1): 1004.

ABBREVIATIONS

CON: Control without evidence of AD; MCI: Mild Cognitive Impairment; tAD: Diagnosed with Alzheimer's Disease; SLA: Supervised Learning Algorithms

INTRODUCTION

The proteomics task of discovering and identifying the set of proteins expressed by individual cells with respect to time and other biochemical conditions has witnessed tremendous achievements in recent times due to the invention and introduction of high-throughput assay processes like the Protein Chip Mass Spectrometer, Surface Enhanced Laser Desorption/Ionization (SELDI) time-of-flight laboratory technique, in that protein analyses are more readily accessible and available in real time. In practice, however, harnessing the output of a SELDI experiment involves onerous tasks and demands investigators with a high level of expertise.

SELDI provides detailed analysis of the analytes with accurate results; it entails the ionization of analyte (protein) samples by subjecting an analyte to laser energy bombardment. The elucidated ions are then separated based on their mass-to-charge ratio. The features of these separated ions are recorded and presented as mass spectra showing the relative abundance of 'bio-chemical compounds' contained in the analyzed sample. Each ion in the abundance spectra (assay results) is typically characterized by the following properties: mass-to-charge ratio (m/z), time-of-flight (TOF), intensity (TOF Intensity), substance mass, ion charge, ion mass, signal-to-noise ratio and peak type.

For a protein source (e.g. serum, urine, saliva) analyte, SELDI generates hundreds of peptide peaks which are further investigated with respect to the investigator's objectives. The investigation of peptide peaks usually begins with detecting the set of peaks that are 'differentially expressed' in the mass spectra after baseline subtraction has been done using statistical methods or thresholding. Usually, for each protein source analyzed, a set of differentially expressed peaks (tens to hundreds, depending on the laser energy bombardment level and the type of Protein Chip used) is chosen; these represent the result of the assay process, a collection of which is our raw data.

To identify the protein or peptide of interest, the molecular weights and chemical properties of ions contained in the SELDI raw data (i.e., which chemical surface each binds to preferentially on the Protein Chip) are matched against public databases. Definitive identification of the peak is then carried out using other mass spectrometry methods.

Questions about detecting the bio-chemical changes in cells or tissues that are capable of causing post-translational modifications of proteins or changes in a protein's structural information are answered using identified peaks. Other uses of SELDI data are in the areas of determining molecular formulas, protein curating and identification, protein bio-marker discovery [1], personalized medicine, drug design and drug production [2,3].

In the US, from 2000 to 2013, while deaths from other diseases declined significantly, deaths from Alzheimer's disease (AD) increased by 71%. AD is one of the most expensive health conditions to treat in the world today. The estimated cost of care for AD in the US exceeded $214 billion in 2014, with nearly one in every five Medicare dollars spent on dementia. A future cost estimate from the United States Alzheimer's Association [4] predicts that by 2050 the disease will cost $1.2 trillion annually. The disease currently affects five million people in the US and is expected to affect 16 million by 2050, afflicting one in nine people over the age of 65 and one in three people over the age of 85.

The clinical practice of diagnosing AD today consists of patient follow-ups; patients' cognitive abilities (like memory) are tested over a period of time. The practice is time consuming: the follow-up can last many months and may not be conclusive, and mild cognitive impairment (MCI) cases may degenerate to full-blown dementia (tAD) during this period, causing severe and irreversible brain damage to the patient. Additionally, the results or clinical notes obtained through patient follow-ups are not generalizable.

Despite the progress in identifying and discovering several protein bio-markers for Alzheimer's disease, the outlook remains grim for its patients and caregivers due to the lack of clinical diagnostic tools. Hence the need for studies such as this.

Harnessing SELDI data involves the application of diverse, rigorous statistical or machine learning algorithms toward the investigator's goals. This study goes beyond protein curating and protein bio-marker identification (which, in most cases, are the clinical/laboratory objectives of detailed study of SELDI data) to the building of a classification model using Mass Spectrometer SELDI saliva data. The end goal is to close the gap between identified bio-markers and the diagnosis of Alzheimer's disease using an extended principle of the K-Nearest Neighbor (KNN) algorithm.

In the course of this study, we considered each output of MS analysis (basically, a collection of ions and their features (peaks) that were differentially expressed) as a matrix; see equation (1). Generally, a pictorial view of a data-set may help identify unique patterns. Consider (Figure 1), which displays four different plots: sub-figures (1a), (1b) and (1c) respectively show the data for the three stages of AD (CON, MCI and tAD), and sub-figure (1d) overlays sub-figures (1a), (1b) and (1c) on the same plot. Sub-figure (1d) puts the subject of this study into clear perspective: the task of achieving or inducing a separation line on elements of the data-sets. From sub-figure (1d), it is easy to see that the locations of peaks for all three stages overlap (i.e., given a mass value there exists an intensity value for all three stages) with no cluster, regression, or discriminative pattern.


Figure 1: Spectrograph of Stage Data-sets.

The scenario that every data point of our data-set is a matrix further complicates the aim of this study, in that the application of traditional supervised learning algorithms (SLA) fails. Traditional SLAs perform point-to-point discrimination of points/feature vectors rather than matrix-to-matrix discrimination. It is therefore imperative to first overcome these challenges if positive advancement is to be made in putting SELDI data to additional uses.

KNN is a non-parametric algorithm used for supervised learning; its discrimination of points of a data-set entails casting a net around points of the test data-set. The size of the net depends on the number (k) of points of the train data-set the investigator allows to be the neighbors of a test data point; the discrimination process at the end is by vote and favors the label that is most represented in the net. Different distance metrics can be adopted by KNN depending on how well the chosen metric function fits the data-set structure. In this study, we extend the principle of KNN and modify the Euclidean distance function in order to apply them to our data-set.
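The cast-a-net-and-vote procedure just described can be sketched in a few lines of Python. This is an illustrative toy, not the study's implementation; the training points, labels and value of k are made up:

```python
import numpy as np

def knn_classify(test_point, train_points, train_labels, k=5):
    """Label a test point by majority vote among the k training
    points nearest to it (here, by plain Euclidean distance)."""
    dists = np.linalg.norm(train_points - test_point, axis=1)
    nearest = np.argsort(dists)[:k]          # indices of the k closest points
    votes = [train_labels[i] for i in nearest]
    return max(set(votes), key=votes.count)  # the most represented label wins

# toy example: two well-separated clusters labeled with two stages
train = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])
labels = ["CON", "CON", "tAD", "tAD"]
print(knn_classify(np.array([0.05, 0.1]), train, labels, k=3))  # CON
```

With k = 3, the net around the test point captures two "CON" training points and one "tAD" point, so the vote favors "CON".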

The next section gives a literature review of works done using the KNN principle and a description of our methodology. A discussion of our results and observations is given in Section 3, and possible areas of future work and the conclusion are highlighted in Section 4.

MATERIALS AND METHODS

Literature

KNN has gained its place in diverse areas of study: the sciences, business, medicine, and in solving online and social network, speech, text and image recognition problems. Its generalization principle uses an appropriate distance function to induce measures on the locations of instances of the train data-set from a test data-point; study results have shown it is most adept for data-sets with 3-to-4 classes [5].

The efficacy of machine learning algorithms in solving any supervised or unsupervised learning problem greatly depends on the algorithm's approach and the data structure under study [5,6]. KNN is a non-parametric algorithm that is easy to implement, modify and extend.

In general, it is advantageous to conceptualize individual objects (e.g. genotypes) as elements existing in a multidimensional space; this way, geometric classification techniques can be applied to create homogeneous groups by building data from the structure of correlated groups in the multidimensional space [7]. Data structure plays a vital role in solving classification problems; sometimes it renders the data insensitive to analysis by either hiding or camouflaging important details in the data-set. Different approaches exist that can be applied to select objects (e.g. genes) from genomic data-sets; Leping et al. in [8] explored KNN with Genetic Algorithms as an approach for the generation of predictive gene subsets.

Application of dimensionality reduction or feature extraction to a data-set reduces the number of features, which in turn enhances the usage of the resulting data-set while reducing the possibility of over-fitting. In [9], multi-labeling based on identifying the KNNs of the training set in instances of the test set was presented; it further showed how such an exercise can be used to predict yeast gene functionality, assign labels to unseen images in natural scene classification problems, and solve web page automated categorization problems. A similar idea was presented in [10] for image recognition.

Similar to DNA sequence alignment, structural proteomics was studied in [11]. The study achieved grouping and prediction of new proteins based on structure alignments of the distance matrices obtained by 2D representation of proteins' tertiary structures.

Sundry studies on phylogenetic tree construction and node connections in social and biological network systems utilize different forms of distance functions [12,13]. The results of such studies can be extended for classification or prediction purposes if supplemented with the generalization principles of KNN.

This study utilized the Jackknife sampling procedure to constitute the elements of the training and corresponding test data-sets. The importance of the Jackknife technique, and why and when it can be used, was presented in [14]. The method was applied for feature selection and classification in [15].

Methodology

The data-set used for this study was obtained from the Bio Bank of Beaumont Reference Laboratory and was the output of a Surface Enhanced Laser Desorption/Ionization time-of-flight (SELDI-TOF) discovery proteomics laboratory experiment carried out on saliva. The experiment was designed to assess differential protein expression in saliva donor samples for the purpose of identifying protein biomarkers for Alzheimer's disease (AD). Three populations of patients were studied, consisting of age-matched controls without any evidence of dementia (CON), patients with mild cognitive impairment (MCI), and patients with clinical symptoms of Alzheimer's disease (tAD).

Also of note is that having so many (tens, sometimes hundreds of) observations in an experimental result, as is inherent in high-throughput assay procedures like SELDI-TOF and MALDI-TOF analysis, throws in another kind of problem for feature selection, pattern recognition and the building of classification models and tools. This is because traditional supervised/unsupervised machine learning algorithms accept feature vectors as inputs and discriminate data points on a point-to-point basis; i.e., given a data set and based on the parameters of a chosen algorithm, every instance of the test data-set is examined and subsequently labeled or added to a cluster group depending on the type of problem being solved. SELDI output data, in contrast, is made up of matrix data points.

We present a basic systematic approach for feature selection and transform the matrices contained in the data-set into collections of feature vectors. A distance metric called the exponential Euclidean distance function is also introduced. The classification model described in this study classifies and predicts test samples into the 3 stages of Alzheimer's disease using the K-nearest neighbor (KNN) classifier. This was achieved by assigning a test data set to the stage with the highest number of k-minimum distance hits in each iteration, for k = 1 and k = 5.

Data organization: The 'uniqueness' of the raw data-set is a result of the structure of each data-point (SELDI analysis result) it contains. Matrix (R) represents the 179 differentially expressed peaks selected as the result for each saliva sample analyzed. Every matrix R has two types of attributes: numerical attributes (m/z, TOF, TOF Intensity, Substance Mass, Charge, Ion Mass and Signal-to-Noise) and a categorical attribute, Peak Type, with the values first pass, second pass and estimated.

R_{k}^{S}=\begin{bmatrix} m_{1} & T_{1} & I_{1} & S_{1} & C_{1} & M_{1} & N_{1} & P_{1} \\ m_{2} & T_{2} & I_{2} & S_{2} & C_{2} & M_{2} & N_{2} & P_{2} \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ m_{n} & T_{n} & I_{n} & S_{n} & C_{n} & M_{n} & N_{n} & P_{n} \end{bmatrix}        (1)

where k and S are additional parameters we introduce: k \in \left \{ 1,2,...,20 \right \} indexes the results in each disease stage (S).

In each R, there are n = 1, ..., 179 rows of observed ions, and the elements of R are arranged in ascending order of m/z value; i.e., m_{1}< m_{2}< ...< m_{179}.

The mass spectrometer calibrates the feature values on different scales (as V1 indicates); applying principal component analysis (PCA) led to a loss of sensitivity of the data due to the features selected. The following observations \left [ V1-V5 \right ] were used to identify the interrelationships that exist between the columns of R, and were also used to carry out pre-processing of the data R as well as feature selection.

Here m_{n} is m/z (or molecular mass), T_{n} stands for time-of-flight (TOF), I_{n} denotes TOF Intensity, S_{n} is substance mass, C_{n} is ion charge, M_{n} is ion mass, N_{n} is signal-to-noise and P denotes peak type.

V1:     TOF values are small \left ( T_{i}\times 10^{-5} \right ) and approximate to 0.0000 (at 4 decimal places). Also, for all ions in R, the entry values for ion charge and ion mass are 1 \left ( C_{i}=M_{i}=1 \right ). Leaving out these features will only cause a uniform perturbation (if any) to the data-set.

V2:     As the mass (m/z) values increase in size, the TOF Intensity \left ( I_{i} \right ) values relatively decrease, i.e.

                                                           mass=\frac{1}{TOF\; Intensity}

                                                              \Rightarrow \forall i\; \exists \alpha \; s.t.\; \alpha =m_{i}\times I_{i}

A similar relation also exists between substance mass and TOF Intensity, but the value of α is not constant across the rows of R; moreover, substance mass and m/z are related as expressed by \left ( V4 \right ), thus both cannot be used in a model.

V3:     The values of Peak Type are First Pass, Second Pass and Estimated; these values indicate when, during the analysis process, the mass spectrometer recorded each peak. Some ions are more stable and travel through the mass spectrometer without further fragmentation; the peaks of such ions are registered as first pass, while the peaks of ions that result from further fragmentation are recorded as second pass peaks. Estimated peaks are average peaks assigned by the researcher: if an ion's peak is not registered in an analysis result but the ion has a peak in the pool of results, the average of the available peaks is evaluated, assigned and marked as estimated for the missing peak. The implication of this is discussed in future works.

V4:     m_{i}=S_{i}+C_{i}; for any ion i, the sum of its substance mass and charge equals its molecular mass value.

V5:     Signal-to-noise values are higher for First Pass peaks, and relatively equal and smaller for Estimated and Second Pass peaks (same thought as in V3).

Following these observations, the matrix \left ( R_{k }^{S} \right ) was reduced to a 179-by-2 matrix \left ( p_{k }^{S} \right ), shown below, having only the m/z \left ( m_{n }^{k} \right ) and TOF Intensity \left ( I_{n }^{k} \right ) features.

p_{1}^{C}=\begin{bmatrix} m_{1}^{1} & I_{1}^{1} \\ m_{2}^{1} & I_{2}^{1} \\ \vdots & \vdots \\ m_{n}^{1} & I_{n}^{1} \end{bmatrix} \qquad p_{5}^{M}=\begin{bmatrix} m_{1}^{5} & I_{1}^{5} \\ m_{2}^{5} & I_{2}^{5} \\ \vdots & \vdots \\ m_{n}^{5} & I_{n}^{5} \end{bmatrix}                (3)

Above are snapshots of two \left ( p_{k}^{S} \right ) matrices: the matrix on the left is the first in the CON data-set, while the matrix on the right is the fifth in the MCI stage data-set; k = 1, ..., 20 is the numbering of the 20 data points in each stage (S), and n in \left ( m_{n}^{k} \right ) or \left ( I_{n}^{k} \right ) is the row-wise numbering of the ions in each matrix (p). Going forward, we shall simply refer to m/z as mass, TOF Intensity as intensity, and an ion as a peak defined by the pair (mass, intensity).
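The reduction from R to p amounts to keeping two columns. A minimal NumPy sketch, assuming a hypothetical column layout for the numerical attributes of R (real SELDI exports may order the columns differently, and random numbers stand in for real peaks):

```python
import numpy as np

# Hypothetical layout of one result matrix R: 179 peaks (rows) and the seven
# numerical attributes in the order (m/z, TOF, TOF intensity, substance mass,
# charge, ion mass, signal-to-noise); the categorical peak-type attribute is
# assumed to be held separately.
rng = np.random.default_rng(0)
R = rng.random((179, 7))
R[:, 0] = np.sort(R[:, 0])        # m/z values are arranged in ascending order

MASS, INTENSITY = 0, 2            # column indices under the assumed layout
p = R[:, [MASS, INTENSITY]]       # keep only the (mass, intensity) features

print(p.shape)                    # (179, 2)
```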

The Data-Sets: The population used for the SELDI discovery proteomics was sub-typed into the CON, MCI and tAD stages based on disease severity; each stage has 20 spectra results, with each data point (p) having 179 rows (or peaks).

To proceed, we recall some basic notes about matrices and vectors;

1. A row matrix is a matrix that has only one row.

2. A column matrix is a matrix that has only one column.

3. A matrix with only one row or one column is called a vector.

More elaborate definitions and proofs were given by Wangmeng et al. in [10]. Based on the above notes, the feature matrices (p) were transformed into vectors by simply dropping the notion of a matrix and treating each row in (p) as an individual row vector, as shown in (Figure 2). Thus, each vector holds unique information about a unique peak, including a label to denote the stage the peak belongs to. The principle of Jackknifing was then used to generate the train and corresponding test data-sets.
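The flattening of feature matrices into labeled row vectors might be sketched as follows; random matrices stand in for the real 20 feature matrices per stage, and only the reshaping-and-labeling logic is the point:

```python
import numpy as np

rng = np.random.default_rng(1)
# stand-ins for the 20 (179 x 2) feature matrices p in each stage
stages = {"CON": [rng.random((179, 2)) for _ in range(20)],
          "MCI": [rng.random((179, 2)) for _ in range(20)],
          "tAD": [rng.random((179, 2)) for _ in range(20)]}

rows, labels = [], []
for stage, matrices in stages.items():
    for p in matrices:
        for peak in p:            # each row of p is one (mass, intensity) peak
            rows.append(peak)     # the matrix notion is dropped...
            labels.append(stage)  # ...but every vector keeps its stage label

rows = np.array(rows)
print(rows.shape)                 # 60 matrices x 179 peaks -> (10740, 2)
```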


Figure 2: Matrix Data to Row Vectors. (Left) is the preprocessed data set made up of matrices. (Right) is the flattened data set made up of vectors; the letters C, T and M stand for control, MCI and tAD respectively.

Definition 2.1: Jackknife Procedure: This is a re-sampling without replacement technique used to correct bias or create confidence limits for estimators, advisable in scenarios where there exist no statistical or biological models against which to test new research results. Given a sample (X) of size N, a delete-d Jackknife sample is obtained by selecting and deleting d observations from the sample. For each Jackknife sample, parameters are estimated and tested on the deleted sample; the final Jackknife estimate is then achieved by taking the aggregate of the d estimates thus generated. For instance, a delete-1 Jackknife sample will look like:

X_{\left ( a \right )}=\left \{ X_{b},X_{c},...,X_{n} \right \}                  (4)

where X_{a} is used as the test data while the terms on the right-hand side of Eq. 4 constitute the elements of the train data-set. Our raw data-set has 20 data points in each stage; thus, 20 Jackknife training data-sets (adopting the delete-1 Jackknife procedure) were generated for each stage (60 in all). In each iteration, a training data-set is learned and consequently used to test the corresponding test data-set.
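The delete-1 Jackknife split of one stage's 20 data points can be sketched as below (the p1...p20 names are placeholders for the 20 feature matrices of a stage):

```python
def delete_1_jackknife(samples):
    """Yield (test_sample, train_samples) pairs: each sample is deleted
    once in turn, and the remaining samples form the training set."""
    for i in range(len(samples)):
        yield samples[i], samples[:i] + samples[i + 1:]

stage = [f"p{k}" for k in range(1, 21)]       # 20 data points in one stage
splits = list(delete_1_jackknife(stage))

print(len(splits))        # 20 train/test splits per stage (60 over all 3)
print(splits[0][0])       # p1 is held out first as the test data
print(len(splits[0][1]))  # the other 19 points form its training set
```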

Definition 2.2: Exponential Euclidean Distance: In general, there are three cases that may exist between any two peaks: 1) their molecular mass values are (approximately) equal but their intensity values differ, 2) they have unequal molecular mass and intensity values, and 3) they have unequal mass values but equal intensity values. Case 1 is the most profitable: it allows measuring and comparing the abundance levels of peaks provided they have equal mass values (i.e., both peaks must exist in the same horizontal location). Case 3 measures equal peak intensities irrespective of mass values or horizontal locations (this case is not informative; it merely compares the obvious, molecular mass values), while case 2 exists as an alternative for the model to rely on if case 1 fails.

The concept of KNN is based on minimum values, so, concentrating on case 1, the popular Euclidean distance function, Eq. 5:

d\left ( a,b \right )=\sqrt{\left ( m_{a}-m_{b} \right )^{2}+\left ( I_{a}-I_{b} \right )^{2}}                          (5)

defined between two vectors a and b is not directly applicable here, since we need a formula that is biased towards intensity values. In particular, the terms \left ( m_{a}-m_{b} \right ) and \left ( I_{a}-I_{b} \right ) of Eq. 5 are evaluated on the same scale. We therefore introduce Eq. 6, called the exponential Euclidean distance function, and define the distance between two vectors a and b by

dist_{\left ( a,b \right )}=\sqrt{\left ( e^{\left ( m_{a} -m_{b}\right )^{2}}-1 \right )^{2}+\left ( I_{a}-I_{b} \right )^{2}}           (6)

where m and I represent the mass and intensity of row vectors a and b respectively. In Eq. 6, the term \left ( e^{\left ( m_{a} -m_{b}\right )^{2}}-1 \right ) evaluates to zero if m_{a}=m_{b}, thereby laying emphasis on \left ( I_{a}-I_{b} \right ); otherwise the value is exponentially magnified even when the difference between m_{a} and m_{b} is very small, in lieu of adding some 'small' value that would have resulted from \left ( m_{a}-m_{b} \right ) if Eq. 5 were used. These cases are further explained with Figure (3).
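Eq. 6 translates directly into code. The sketch below assumes mass values on a small, comparable scale, since for raw m/z differences in the thousands the exponential term would overflow; the numeric values are made up:

```python
import math

def exp_euclidean(a, b):
    """Eq. 6: exponential Euclidean distance between peaks
    a = (m_a, I_a) and b = (m_b, I_b). The mass term vanishes when the
    masses are equal, so the comparison reduces to the intensity gap;
    any mass mismatch is exponentially magnified instead."""
    m_a, i_a = a
    m_b, i_b = b
    mass_term = math.exp((m_a - m_b) ** 2) - 1.0
    return math.sqrt(mass_term ** 2 + (i_a - i_b) ** 2)

# case 1: equal masses, so the distance is just the intensity gap
print(exp_euclidean((10.0, 3.0), (10.0, 7.0)))              # 4.0
# case 2: a mass gap of 2 already dominates: e^(2^2) - 1 = 53.598...
print(round(exp_euclidean((10.0, 3.0), (12.0, 3.0)), 3))    # 53.598
```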


Definition 2.3: Metric Function: A metric (d) on R^{pcol\times prow} is a function

d:R^{pcol\times prow}\times R^{pcol\times prow}\rightarrow R if, for all x,y,z \epsilon R^{pcol\times prow}, the following axioms are satisfied:

N1: d\left ( x,y \right )\geq 0;\; d\left ( x,y \right )=0\Leftrightarrow x=y (positivity)           (7)

N2: d\left ( x,y \right )=d\left ( y,x \right ) (symmetry)                           (8)

N3: d\left ( x,z \right )\leq d\left ( x,y \right )+d\left ( y,z \right ) (subadditivity)          (9)

It suffices to prove these axioms to establish that Eq. 6 is a distance function.

Proof: We note that N1 and N2 can easily be verified and proceed to prove N3. First, we establish the Cauchy-Schwarz inequality for vectors, which states that inequality (10) holds for all vectors x and y of an inner product space.

\left | x.y \right |\leq \left | x \right |.\left | y \right |                  (10)

Assume that x=\left ( x_{1},x_{2},....,x_{n} \right ) and y=\left ( y_{1},y_{2},....,y_{n} \right ), and recall that the dot product of x and y is given by x.y=x_{1}y_{1}+x_{2}y_{2}+...+x_{n}y_{n}. Further, \left | x \right |=\sqrt{x.x}, and the distance between x and y in a 1-dimensional space is simply d\left ( x,y \right )=\left | x-y \right |.

Now,

\left | x-ay \right |^{2}=\left ( x-ay \right ).\left ( x-ay \right )=\left | x \right |^{2}-2a\left ( x.y \right )+a^{2}\left | y \right |^{2}         (11)

Since \left | x-ay \right |^{2}\geq 0 for every real a, the discriminant of the quadratic in a given by Eq. 11 must be non-positive:

\left ( -2x.y \right )^{2}-4\left | x \right |^{2}\left | y \right |^{2}\leq 0\Rightarrow 4\left ( x.y \right )^{2}\leq 4\left | x \right |^{2}\left | y \right |^{2}\Rightarrow \left | x.y \right |\leq \left | x \right |.\left | y \right |

which yields Eq. 10 after dividing both sides by 4 and taking square roots. We now use this to prove N3:

\left | a+b \right |^{2}=\left ( a+b \right ).\left ( a+b \right )=\left | a \right |^{2}+2\left ( a.b \right )+\left | b \right |^{2}\leq \left | a \right |^{2}+2\left | a.b \right |+\left | b \right |^{2}

\leq \left | a \right |^{2}+2\left | a \right |\left | b \right |+\left | b \right |^{2}=\left ( \left | a \right |+\left | b \right | \right )^{2} (using Eq. 10)

\Rightarrow \left | a+b \right |^{2}\leq \left ( \left | a \right |+\left | b \right | \right )^{2} and \left | a+b \right |\leq \left | a \right |+\left | b \right |

For any three points/vectors s, r, t of R_{k}^{S},

dist_{\left ( s,t \right )}=\left | s-t \right |=\left | s-r+r-t \right |\leq \left | s-r \right |+\left | r-t \right |=dist_{\left ( s,r \right )}+dist_{\left ( r,t \right )}     Q.E.D.

KNN distance hit table: Predicting a test data set entails generating a distance 'Hit' table (Table 1) using Eq. 6 and the principle of KNN. Since every data-set is a collection of peaks (vectors), we extended the KNN algorithm to accommodate this. In each iteration, we used Eq. 6 to determine the distance between all possible pairs of vectors from the test data-set and train data-set. Then, for each row vector in a test data set, we noted the stage label of the train data-set vector nearest to it by comparing the k-minimum distance values between the row vectors of the train and test data-sets. Below is an example of a typical hit table (Table 1).

Table 1: Hit Table.

         CON   MCI   tAD
TEST1     44    75    60
TEST2     49    57    73

The column titles CON, MCI and tAD hold counts of the number of rows of each stage label that have k-minimum distance values with respect to a TEST data set. At the end, a test data set is classified into the stage with the highest number of k-minimum hits; e.g., TEST1 (Table 1) is classified as MCI while TEST2 is tAD, based on majority vote.
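The hit-counting and majority-vote step might be sketched as follows; toy vectors and plain Euclidean distance stand in for the study's peaks and Eq. 6:

```python
from collections import Counter

def classify_test(test_vectors, train_vectors, train_labels, dist, k=1):
    """Build one hit-table row: for each peak vector in the test data,
    count the stage labels of its k nearest training vectors, then
    classify the test data into the stage with the most hits."""
    hits = Counter()
    for t in test_vectors:
        by_distance = sorted(range(len(train_vectors)),
                             key=lambda j: dist(t, train_vectors[j]))
        for j in by_distance[:k]:
            hits[train_labels[j]] += 1
    return hits, hits.most_common(1)[0][0]

euclid = lambda a, b: ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
train = [(1.0, 1.0), (1.1, 1.2), (9.0, 9.0)]
labels = ["MCI", "MCI", "tAD"]
hits, stage = classify_test([(1.05, 1.1), (1.0, 1.3)], train, labels, euclid)
print(stage)   # MCI: both test peaks hit MCI training vectors
```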

Table 2: Classification results for k = 1.

k=1
        CON   MCI   tAD
CON      13     3     4
MCI       5     8     7
tAD       3     5    10*

*The remaining two tAD test sets produced equal hit scores for tAD and MCI (tAD#MCI) and were not conclusively classified.

Using the Jackknife re-sampling technique, each disease stage produced 20 test data-sets. Consequently, we performed sixty KNN classification iterations with k = 1 and another set of sixty iterations with k = 5. The confusion matrix in Table 2 shows the classification performance obtained with k = 1.

In detail, KNN at k = 1 correctly classified 65% of CON data points, and correctly classified 40% and 50% of MCI and tAD data points respectively, with 10% of tAD elements not conclusively classified.

On the other hand, for the same test samples with k = 5, KNN correctly classified 85% of control (CON) samples, 50% of MCI test samples and 0% of tAD samples. Overall, 52% and 45% of instances were correctly classified using KNN at k = 1 and k = 5 respectively (Table 3).

Table 3: Classification results for k = 5.

k=5
        CON   MCI   tAD
CON      17     3     0
MCI      10    10     0
tAD      16     4     0

RESULTS AND DISCUSSION

The goal of SELDI-TOF discovery proteomics is to quantify and interpret changes in the abundances of features identified a priori/de novo in the SELDI spectra of analyzed samples by further investigating the obtained SELDI spectra data. Inconclusive classifications occur if an iteration produces equal hit scores for two stages; e.g., tAD#MCI denotes an iteration that produced equal hit values for MCI and tAD for a suspected tAD test data set (t) (Table 4).

Table 4: The default diagnosis is a diagnosis made by flipping a coin. An unbiased expected output is presented by the confusion matrix below.

Default Diagnoses
Predisposed      NO    YES
NO               10    10
YES              20    20

p(type II error) = 0.33; 50% accuracy

In this paper, we adopted the principle of KNN and introduced a 2-scale distance function to build a KNN classifier for Alzheimer's disease stages based on the molecular mass and TOF Intensity of ions contained in SELDI saliva spectra data. This study forms a basis and provides a pathway for studies on early and reliable diagnosis of AD and dementia in general.

The data structure was the first problem we had to overcome. The decomposition of the feature matrices into a collection of feature vectors, following a systematic feature selection, enabled us to solve the problem in a 2-dimensional space.

This work pinpoints an inherent pattern in the saliva SELDI data. The results of the 5-NN algorithm on tAD test data points clearly indicate a characteristic 'elusiveness' possessed by the data, which can be explained by the lack of cognition suffered by Alzheimer's disease (dementia) patients in general. On the other hand, it further attests to the reliability of the SELDI process and the reproducibility of mass spectra results as studied by Keith et al. [16].

If combined with clinical records and coupled with clinical verification, the results of this study form a basis for discriminating and diagnosing Alzheimer's disease. They can also serve as a tool to monitor AD patients' conditions, since disease severity status can easily be determined from the number of 'hit points' in the KNN distance table, knowing that the distance measure between two vectors remains the same unless there is a change in the geometric location of one or both of them.

Using saliva SELDI data was also a plus, owing to how easily saliva samples can be obtained. The presence of several equal molecular mass values with different intensity values made this KNN approach possible, in that we were able to geometrically mark the intensities of similar mass values in space and use their geometric locations for discrimination.

There are classification scenarios that need to be further explained clinically in terms of false negative predictions/classifications. For instance, if the hit point scores for two stages (e.g. MCI and tAD) are the same, what should be the result of such a classification? To a layman, this indicates a YES to the question of having the disease.

By virtue of the delete-1 Jackknife procedure, every data point in the data-set was tested in turn. Thus, we exhaustively evaluated the model's performance. This was handy since the data set is small and no prior work on saliva SELDI data-sets is available in the literature.

A further interpretation of our result is a two-way classification, viz. predisposed versus non-predisposed persons. Consider the k=1 confusion matrix again: members of the MCI stage tend to exhibit 50% of the characteristics of both CON and tAD, as is evident in the proportion of misclassified MCI instances. Owing to this, let us regroup the sampled population, with CON as non-predisposed and MCI and tAD as predisposed, and compare the probability of committing a type II error under the k=1 classification (Table 5) against the result obtained via default diagnosis of the same population.

Table 5: Predisposed (k=1) diagnoses. Under default diagnosis, the probability of committing a type II error is 0.33.

                   Diagnosed NO    Diagnosed YES
Predisposed NO          13               7
Predisposed YES          8              32

p(type II error) = 0.13; accuracy = 75%
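The summary statistics reported with Table 5 follow directly from the confusion-matrix counts:

```python
# Confusion-matrix counts from Table 5 (actual status x diagnosis).
tn, fp = 13, 7    # predisposed NO:  diagnosed NO / diagnosed YES
fn, tp = 8, 32    # predisposed YES: diagnosed NO / diagnosed YES

total = tn + fp + fn + tp        # 60 subjects
accuracy = (tn + tp) / total     # (13 + 32) / 60 = 0.75
p_type2 = fn / total             # 8 / 60 ~= 0.13 (false negatives)
```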

CONCLUSION

It is worth asking whether adding more features to the distance function would improve the results of this work. Similarly, would the results improve if only ions of a particular peak type were used, or if ions were categorized and used based on their molecular weight or signal-to-noise ratio?

The model described here was built on a SELDI saliva data set generated with CM10 (cation exchange surface) chemistry under low-energy (1800 nJ) laser bombardment. As one possible area of future work, this study could be extended to SELDI data generated under other energy conditions and/or chemistries. A sensitivity analysis of saliva SELDI data with regard to the best time of day to obtain saliva samples from donors for SELDI examination is another possible direction.

Having transformed the matrix data points into a collection of row vectors, future work could also build and evaluate models based on other learning algorithms, including ones that exploit the statistical distribution of the mass and corresponding TOF intensity values of ion molecules as expressed across stages.

In conclusion, while studies aimed at personalized medicine are currently ongoing, the focus on closing the gap between biomarker identification and the diagnosis of incurable diseases (e.g., dementia) should not be lost.

REFERENCES
  1. Issaq HJ, Veenstra TD, Conrads TP, Felschow D. The SELDI-TOF MS approach to proteomics: protein profiling and biomarker identification. Biochem Biophys Res Commun. 2002; 292: 587-597.
  2. Raghava GPS. Bioinformatics and drug discovery. 2015.
  3. Street JM, Dear JW. The application of mass spectrometry based protein biomarker discovery to theragnostics. Br J Clin Pharmacol. 2010; 69: 367-378.
  4. Alzheimer's Association.
  5. Lee JW, Lee JB, Park M, Song SH. An extensive comparison of recent classification tools applied to microarray data. Comput Stat Data Anal. 2005; 48: 869-885.
  6. Dudoit S, Fridlyand J, Speed TP. Comparison of discrimination methods for classification of tumors using gene expression data. JASA. 2002; 97: 77-87.
  7. Crossa J, Franco J. Statistical methods for classifying genotypes. Euphytica. 2004; 137: 19-87.
  8. Li L, Weinberg CR, Darden TA, Pedersen LG. Gene selection for sample classification based on gene expression data: study of sensitivity to choice of parameters of the GA/KNN method. Bioinformatics. 2001; 17: 1131-1142.
  9. Zhang ML, Zhou ZH. ML-KNN: a lazy learning approach to multi-label learning. Pattern Recogn. 2007; 40: 2038-2048.
  10. Zuo W, Zhang D, Wang K. Bidirectional PCA with assembled matrix distance metric for image recognition. Cybernetics. 2006; 36: 863-872.
  11. Holm L, Sander C. Protein structure comparison by alignment of distance matrices. J Mol Biol. 1993; 233: 123-138.
  12. Weinberger KQ, Saul LK. Distance metric learning for large margin nearest neighbor classification. JMLR. 2009; 10: 207-244.
  13. Zhang H, Berg AC, Maire M, Malik J. SVM-KNN: discriminative nearest neighbor classification for visual category recognition. IEEE. 2006; 2: 2126-2136.
  14. McIntosh AI. The Jackknife estimation method. 2016.
  15. Taylor SL, Kim K. A Jackknife and voting classifier approach to feature selection and classification. Cancer Inform. 2011; 10: 133-147.
  16. Baggerly KA, Morris JS, Coombes KR. Reproducibility of SELDI-TOF protein patterns in serum: comparing datasets from different experiments. Bioinformatics. 2004; 20: 777-785.


Received : 01 Oct 2016
Accepted : 15 Nov 2016
Published : 17 Nov 2016