Friday, January 16, 2015

The Power of Anti-Data

Introduction


Recently I came across a very interesting paper named Data Smashing. In that paper, the authors propose a new method for quantifying the similarity between sources of arbitrary data streams without the need to hand-design features that represent those sources well, a task that usually demands deep domain knowledge of the problem in question. The main idea is that, through a few special algebraic operations applied to the quantized data streams, one can measure the similarity between the stochastic processes that generated them.

The main operations are stream inversion and stream summation. When an arbitrary data stream is summed with its own inverted copy (obtained by applying the inversion operation to it), all the statistical information in the data is annihilated and what is left is just flat white noise, meaning that a data stream is measured to be perfectly similar to itself. When the same principle is applied to two data streams generated by distinct stochastic processes, the expected outcome is flat white noise plus some residual statistical information associated with the dissimilarity between those streams.
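Before going further, it helps to pin down what "flat white noise" means for a quantized stream. As a standalone sketch of my own (not code from the paper): an i.i.d. uniform symbol stream carries no statistical structure, since every symbol is equally likely and independent of its predecessors.

```r
# Flat white noise over a 3-symbol alphabet: i.i.d. uniform symbols.
set.seed(123)
fwn <- sample(0:2, 30000, replace = TRUE)

# Each symbol occurs with frequency close to 1/3 ...
prop.table(table(fwn))

# ... and consecutive symbols are independent: every (s[t], s[t+1])
# pair occurs with frequency close to 1/9.
pair_freq <- table(head(fwn, -1), tail(fwn, -1)) / (length(fwn) - 1)
max(abs(pair_freq - 1/9))  # small
```

Any remaining statistical structure after smashing two streams together is therefore a signal that their generating processes differ.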

In this post, I will apply those principles to the classification of basic hand movements measured as EMG data. I will use my own implementation (written in R) of the algorithms described in the paper, but I will not focus on their internal details; for those, I suggest you refer to the paper.


The Data


The data set to be used is the sEMG for Basic Hand Movements from the UCI Machine Learning Repository. In this domain (EMG data), classification is usually performed by applying a linear or non-linear classifier to a set of time- and frequency-domain features computed from the raw data. What I will show here, instead, is the application of the Data Smashing principles directly to the raw data.

This data set has 100 measurements for each of 6 different hand gestures, captured through two sensors. Each measurement (data stream) has 2,500 data points per sensor. The sequence of measurements was taken from a single subject, three times. Below we see a picture of those 6 hand gestures:





For the sake of simplicity, I will load and compare only the first two chunks of the data set, which correspond to the cylindrical and hook gesture types, for one of the 3 available measurement sequences. Nevertheless, the principles shown here apply just the same when comparing other portions of the data set.

Below is a plot of one of the data streams (one measurement) from the matrix of cylindrical gesture types:


The Code


We start by reading in the data, which is stored in MATLAB's .mat format, using R's R.matlab package:

library(R.matlab)
M <- readMat('male_day_1.mat')

Then we grab the chunks that will be used, keeping the data for each gesture type in a separate matrix. I will use only the data from one of the two sensors (sensor #2). Notice that readMat reads the file into a list, from which we get the matrices of interest:

cyl  <- M[[2]]  # cylindrical gestures (sensor #2)
hook <- M[[4]]  # hook gestures (sensor #2)

The first step before applying Data Smashing is to quantize the data set. Several quantization approaches are described in the paper; I will implement the suggested one, which slices the data range into intervals containing approximately equal numbers of data points and assigns a different symbol to each interval. For that, we define a simple quantize function that only handles symbol alphabets of size 2 or 3:

quantize <- function(x, sigma) {
  # Cut points that split x into sigma intervals holding roughly
  # equal numbers of data points.
  q <- quantile(x, seq_len(sigma - 1) / sigma)

  qx <- integer(length(x))
  if (sigma == 2) {
    qx[x <= q] <- 0L
    qx[x > q]  <- 1L
  }
  if (sigma == 3) {
    qx[x < q[1]]             <- 0L
    qx[x >= q[1] & x < q[2]] <- 1L
    qx[x >= q[2]]            <- 2L
  }
  qx
}


In the function above, x is the vector of data points to be quantized and sigma is the size of the symbol alphabet. For example, with an alphabet of size 3, each data point in x is mapped to one of 3 possible symbols, say '0', '1', or '2'. The interval allocated to each symbol is chosen such that the number of occurrences of each symbol is more or less the same. There is much more in the paper regarding quantization approaches, including heuristics on how to choose a proper alphabet size, which I am not going to discuss here.
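The equal-frequency property can be checked concretely. Here is a compact sketch of the same idea on simulated data, using base R's quantile() and cut() as a shortcut (this is an independent illustration, not the post's quantize function):

```r
# Equal-frequency quantization sketch: with sigma = 3 and 3000 points,
# quantile-based breaks put about 1000 points in each symbol's interval.
set.seed(1)
x <- rnorm(3000)
sigma <- 3
breaks <- quantile(x, probs = seq(0, 1, length.out = sigma + 1))
qx <- as.integer(cut(x, breaks = breaks, labels = FALSE,
                     include.lowest = TRUE)) - 1L
table(qx)  # roughly equal counts per symbol
```

Because the breaks are empirical quantiles of the data itself, each symbol ends up covering the same share of observations regardless of the underlying distribution's shape.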

We then apply the quantization to the entire data set to be analyzed at once. This is important because the data range allocated to each symbol has to have the same meaning, that is, it has to be comparable across data streams:

M <- rbind(cyl, hook)
M <- apply(M, 2, function(x) quantize(x, 3))

Now we reassemble the quantized data set into separate matrices for each hand gesture, so that we can split it into train and test sets. From the 100 available data streams for each gesture type, we randomly pick 10 for the train set; the rest is our test set:

q_cyl <- M[1:100,]
q_hook <- M[101:200,]

n <- 10
i_cyl_train <- sample(1:nrow(cyl), n)
i_hook_train <- sample(1:nrow(hook), n)
i_cyl_test <- which(!1:nrow(cyl) %in% i_cyl_train)
i_hook_test <- which(!1:nrow(hook) %in% i_hook_train)
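The indexing pattern above can be sanity-checked on its own, with toy sizes assumed here: the test indices must be the exact complement of the sampled training indices.

```r
# Toy check of the sampling pattern used for the split: train and
# test indices must partition 1:100 with no overlap.
set.seed(7)
i_train <- sample(1:100, 10)
i_test  <- which(!1:100 %in% i_train)
length(i_train)                     # 10
length(i_test)                      # 90
length(intersect(i_train, i_test))  # 0
```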

The idea is to use the train set as labeled reference data and apply Data Smashing between each data stream from the test set and those from the train set. If Data Smashing yields a high similarity (that is, a low deviation from flat white noise), there is a high probability that the stream being tested is of the same class (same gesture type) as the corresponding train stream.

To illustrate this principle, I will show the application of Data Smashing between all data streams in our train set. When plotting the heat map of the results, we clearly see the clusters naturally formed due to the similarity difference between the data corresponding to different gesture types:

smash <- function(s1, s2, sigma) {
  deviation(sigma, summation(s1, inversion(sigma, s2)))
}

# Stack the 20 selected training streams into one matrix
DS <- rbind(q_cyl[i_cyl_train,], q_hook[i_hook_train,])
s <- vector()
for(i in 1:nrow(DS)) {
  for(j in 1:nrow(DS)) {
    s <- c(s, smash(DS[i,], DS[j,], 3))
  }
}
H <- matrix(s, nrow=nrow(DS), byrow=T)
heatmap(H, Rowv=NA, Colv='Rowv')

The code above defines a function named smash, which takes as arguments two quantized data streams (s1 and s2) and the size of the alphabet used in the quantization process, sigma. The smash function uses the deviation function to measure the similarity between two data streams by comparing, through the summation function, one stream against the inverted copy of the other. That inverted copy is computed through the inversion function. As stated before, I will not detail the implementation of those functions (deviation, summation, and inversion). Their corresponding algorithms are described in detail in the paper.

The code above generates a heat map of 400 comparisons (a 20 × 20 matrix), as we have selected 10 streams measured from cylindrical hand gestures and another 10 measured from hook hand gestures:

From that heat map we can clearly see similar data streams grouped into two distinct clusters. The darker the color in a given region, the more similar the corresponding compared streams are.
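As a brief aside on what the deviation function is getting at: it quantifies how far the summed stream is from flat white noise. Purely as an illustration of that idea, and emphatically not the paper's actual algorithm, here is a crude stand-in that scores how far a quantized stream's symbol frequencies are from the uniform frequencies of flat white noise (the function name flatness_deviation is my own):

```r
# Illustration only -- NOT the paper's deviation algorithm. A crude
# proxy: distance of the empirical symbol frequencies from the
# uniform frequencies (1/sigma each) of flat white noise.
flatness_deviation <- function(s, sigma) {
  freqs <- tabulate(s + 1L, nbins = sigma) / length(s)
  max(abs(freqs - 1 / sigma))
}

set.seed(42)
fwn    <- sample(0:2, 10000, replace = TRUE)   # flat white noise
biased <- sample(0:2, 10000, replace = TRUE,
                 prob = c(0.6, 0.2, 0.2))      # stream with structure
flatness_deviation(fwn, 3)     # near zero
flatness_deviation(biased, 3)  # clearly larger
```

The paper's deviation measure is far more powerful, since it also accounts for temporal structure rather than just symbol frequencies, but the intuition is the same: the closer the smashed result is to flat white noise, the more similar the two streams.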

Finally, we apply Data Smashing between each data stream from the test set and all streams from the train set. For each test stream we compute 20 features this way, one per train stream. Note that each of those features is simply the similarity measured by applying Data Smashing between two data streams; no domain-specific knowledge about the data set was needed:

D <- NULL

for(i in i_cyl_test) {
  features <- vector()
  for(j in i_cyl_train) {
    features <- c(features, smash(q_cyl[i,], q_cyl[j,], 3))
  }
  for(j in i_hook_train) {
    features <- c(features, smash(q_cyl[i,], q_hook[j,], 3))
  }
  D <- rbind(D, features)
}

for(i in i_hook_test) {
  features <- vector()
  for(j in i_cyl_train) {
    features <- c(features, smash(q_hook[i,], q_cyl[j,], 3))
  }
  for(j in i_hook_train) {
    features <- c(features, smash(q_hook[i,], q_hook[j,], 3))
  }
  D <- rbind(D, features)
}

The last step is to apply a simple clustering algorithm, such as k-means, to the test features and see whether the test data is grouped according to the corresponding hand gesture types:

k <- kmeans(D, centers=2)

cluster_cyl <- k$cluster[1:90]
cluster_hook <- k$cluster[91:180]

By printing the cross tabulations for the clustering results we get:

xtabs(~ cluster_cyl)
cluster_cyl
 1  2 
 1 89

xtabs(~ cluster_hook)
cluster_hook
 1  2 
87  3

This shows a very good result: 98.9% of the test streams of the cylindrical gesture type and 96.7% of the hook type were correctly grouped by k-means.
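Those percentages follow directly from the cross tabulations. Since k-means cluster numbering is arbitrary, each gesture's accuracy is its share of streams falling into that gesture's majority cluster (majority_accuracy is a helper I define here for illustration):

```r
# Per-gesture accuracy from the cluster labels: score each group
# against its majority cluster, since cluster IDs are arbitrary.
majority_accuracy <- function(cl) max(table(cl)) / length(cl)

cluster_cyl  <- c(rep(2, 89), 1)              # matches the tabulation above
cluster_hook <- c(rep(1, 87), rep(2, 3))

round(100 * majority_accuracy(cluster_cyl), 1)   # 98.9
round(100 * majority_accuracy(cluster_hook), 1)  # 96.7
```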