Friday, February 6, 2015

Machine Learning with H2O on Hadoop

Introduction


Machine learning on Hadoop has evolved a lot since the debut of Apache Mahout. Back then, a data scientist had to understand at least some basics of Java programming and Linux shell scripting in order to assemble more complex MapReduce jobs and get the most out of Mahout. Mahout was also batch-oriented only, which made it difficult to use for interactive, exploratory data analysis. Now, however, machine learning at scale on top of Hadoop is getting a lot of traction with a new generation of machine learning frameworks. By embracing YARN, these frameworks are no longer tied to the MapReduce paradigm and are capable of in-memory data processing. Some examples from this new generation are Apache Spark’s MLlib and 0xdata’s H2O. Moreover, Mahout is being ported to take advantage of these new frameworks, as shown in the “Mahout News” section of its home page.

In this tutorial I will show how to install H2O on a Hadoop cluster and run some basic machine learning tasks. For the Hadoop cluster environment I will use the Hortonworks HDP 2.1 Sandbox VM. For H2O, I will use the H2O driver specific to HDP 2.1 from the latest stable release.

1. Download and Install HDP Sandbox VM


Download and install the Hortonworks Sandbox VM. Note: as of this writing, H2O does not seem to run properly on the latest HDP Sandbox VM, which is version 2.2. For this tutorial I used HDP Sandbox version 2.1, which can be downloaded from here.

2. Download and Install H2O on the HDP Sandbox VM


Log in to your HDP Sandbox VM:
ssh root@<your-hdp-sandbox-ip-address> -p 22

Create a directory for H2O download and installation:
# cd /opt
# mkdir h2o
# cd h2o

Check for the latest stable H2O release here and click on the corresponding link; this will open the download page. Copy the download link from there.

Download H2O:
# wget <your-h2o-download-link>

Unzip and launch H2O as a MapReduce process:
# unzip <your-h2o-downloaded-file>
# cd h2o-<version>/hadoop
# hadoop jar h2odriver_hdp2.1.jar water.hadoop.h2odriver -libjars ../h2o.jar -mapperXmx 1g -nodes 1 -output h2o_out

Note that the H2O job is launched with a memory limit of 1 GB per mapper. You can increase it, but first you have to configure YARN appropriately; you can do that by following the steps shown here. Also note that on a real, multi-node cluster you can specify ‘-nodes <n>’, where n is the number of map tasks instantiated on Hadoop, each task being an H2O cluster node, as in the example below.
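For example, a hypothetical launch on a three-node cluster with 4 GB per mapper could look like the line below (the memory size, node count, and output directory name are illustrative only; YARN must be configured to allow containers of that size, and the -output directory must not already exist):
# hadoop jar h2odriver_hdp2.1.jar water.hadoop.h2odriver -libjars ../h2o.jar -mapperXmx 4g -nodes 3 -output h2o_out_3nodes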

If the cluster comes up successfully, you will get a message similar to this:
H2O cluster (1 nodes) is up
(Note: Use the -disown option to exit the driver after cluster formation)
(Press Ctrl-C to kill the cluster)
Blocking until the H2O cluster shuts down...
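If you do not want the driver to keep blocking your terminal, the message above suggests the -disown option. A sketch of the same launch command with that flag appended (pointing -output at a hypothetical directory that does not exist yet) would be:
# hadoop jar h2odriver_hdp2.1.jar water.hadoop.h2odriver -libjars ../h2o.jar -mapperXmx 1g -nodes 1 -output h2o_out_disown -disown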

Verify that the H2O cluster is up and running:
Open a web browser at http://<your-hdp-sandbox-ip-address>:8000/jobbrowser (HUE Job Browser) or http://<your-hdp-sandbox-ip-address>:8088/cluster (YARN Resource Manager web interface). You should see something similar to the image below (shown for the YARN Resource Manager UI):


Notice that an H2O cluster runs as a map-only MapReduce job on top of Hadoop MapReduce v1 or Hadoop MapReduce on YARN. This is only for the purpose of starting up the cluster infrastructure: H2O tasks themselves do not run as Hadoop MapReduce jobs. If you are interested in further understanding H2O’s software architecture, I suggest you take a look here.

3. Test #1: Run a Simple Machine Learning Task through the H2O Web Interface


Upload and prepare a small dataset in Hadoop: sometimes it is necessary to pre-process a dataset in order to prepare it for a machine learning task. In this case, we will use a dataset that is tab-separated, but some of its fields are separated by more than one tab. We will use a built-in Hive function that performs regular-expression find-and-replace, so that the outcome is the same dataset with fields separated by a single comma.

The first step is to download the dataset. We will use the seeds dataset from the UCI Machine Learning Repository. Go to the seeds dataset repository here, click on “Data Folder”, and copy the download link.
Download the seeds dataset on your sandbox and put the file into HDFS. We will use the guest user’s home directory and HDFS directory for that:
# cd /home/guest
# wget <seeds-dataset-download-link>
# hdfs dfs -put seeds_dataset.txt /user/guest/seeds_dataset.txt
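Optionally, confirm that the file landed in HDFS before moving on:
# hdfs dfs -ls /user/guest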

Now prepare the seeds dataset with help from Hive, as described earlier:
# hive
hive> CREATE TABLE seeds_dataset(data STRING);
hive> LOAD DATA INPATH '/user/guest/seeds_dataset.txt' OVERWRITE INTO TABLE `seeds_dataset`;
hive> SELECT regexp_replace(data, '[\t]+', ',') FROM seeds_dataset LIMIT 10;
hive> INSERT OVERWRITE DIRECTORY '/user/guest/seeds_dataset' SELECT regexp_replace(data, '[\t]+', ',') FROM seeds_dataset;
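As a quick sanity check, you can look at the first few lines of the file Hive just wrote (000000_0 is the file name we will reference later in this tutorial) to confirm that the fields are now comma-separated:
# hdfs dfs -cat /user/guest/seeds_dataset/000000_0 | head -n 3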

Open the H2O web interface at: http://<your-hdp-sandbox-ip-address>:54321

Click on 'Data' -> 'Import Files':



Enter the file path on HDFS (hdfs://<your-hdp-sandbox-ip-address>:8020/user/guest/seeds_dataset/000000_0). 

Then click on 'Submit':



If the upload is successful, your file will appear on the next screen. Click on it:



In the next screen you have some options to parse the dataset before importing it into the H2O store. In this case, nothing else is necessary. Just click on 'Submit':



In the next screen, you can inspect the imported dataset and build machine learning models with it. Let's build a random forest classifier. Click on 'BigData Random Forest':



Now you have to enter your model parameters. In the 'destination key' field, enter a key value under which your model will be saved in the H2O store; in this case, the key used is 'RandomForest_Model'. In the 'source' field, the key value of the imported dataset should already be filled in; in this case, it is 'X000000_0.hex'. In the 'response' field, select the column that corresponds to the data class in the dataset; it is 'C8' in this case. Leave the other fields unchanged and click on 'Submit':



After running the model, you should see a screen with the results, containing a confusion matrix and some other useful model statistics. The dataset has 210 rows equally divided into 3 distinct classes, therefore 70 rows per class. From the confusion matrix, we notice that the model correctly classified 61 out of 70 rows for class 1, 68 out of 70 for class 2, and 67 out of 70 for class 3, i.e. 196 of the 210 rows (roughly 93%) overall:



You can also inspect the H2O data store to see what you have so far. Click on 'Data' -> 'View All'. You should see 2 entries: one for the imported dataset and another for your random forest model:


4. Test #2: Run a Simple Machine Learning Task through the H2O R Package


Now, instead of using H2O’s built-in web interface to drive the analysis, we will use the H2O R package. This allows a data scientist to drive a data analysis from an R client outside the Hadoop cluster. Notice that in this case there is no R code running in the cluster; R is just a front-end that exposes H2O’s native operations through the familiar (for a growing number of data scientists) R syntax.

My approach for this scenario is to install R and RStudio Server (R’s web IDE) on the HDP Sandbox VM. But it should also work with a remote, local R instance accessing the H2O cluster, as the access protocol is just a REST interface.

On the HDP Sandbox VM, the first step is to enable the EPEL repository (if it is not already enabled), which is needed for a proper R installation. You also need to install the openssl098e package required by RStudio:
# yum install epel-release
# yum install R
# yum install openssl098e

Then, download and install the RStudio server package:
# wget http://download2.rstudio.org/rstudio-server-0.98.1091-x86_64.rpm
# yum install --nogpgcheck rstudio-server-0.98.1091-x86_64.rpm
# rstudio-server verify-installation

Now, change the password for the user guest, as you will use it to log in to RStudio Server, which requires a non-system user to log in:
# passwd guest

The next step is to install the H2O R package. But before that, you need to install the curl-devel package on your HDP Sandbox VM:
# yum install curl-devel

Now, open RStudio in your browser at http://<your-hdp-sandbox-ip-address>:8787 and log in as user guest with the password you set earlier.
In the console pane of RStudio, enter the command:
> install.packages(c('RCurl', 'rjson', 'statmod'))

After installation of those packages, you should see somewhere in the output messages:
* DONE (RCurl)
* DONE (rjson)
* DONE (statmod)



Again in the console pane of RStudio, enter the command:
> install.packages('/opt/h2o/h2o-2.8.4.1/R/h2o_2.8.4.1.tar.gz', repos=NULL, type='source')

You should see somewhere in the output messages:
* DONE (h2o)



Finally, in the editor pane of RStudio, enter the following block of code:
library(h2o)
localH2O <- h2o.init(ip = '192.168.100.130', port = 54321)
path <- 'hdfs://192.168.100.130:8020/user/guest/seeds_dataset/000000_0'
seeds <- h2o.importFile(localH2O, path = path, key = "seeds.hex")
summary(seeds)
head(seeds)
index.train <- sample(1:nrow(seeds), nrow(seeds)*0.8)
seeds.train <- seeds[index.train,]
seeds.test <- seeds[-index.train,]
seeds.gbm <- h2o.gbm(y = 'C8', x = c('C1','C2','C3','C4','C5','C6','C7'), data = seeds.train)
seeds.fit <- h2o.predict(object=seeds.gbm, newdata=seeds.test)
h2o.confusionMatrix(seeds.fit[,1],seeds.test[,8])

That block of R code will do the following, in order of line execution:
- Load H2O R library in your R session
- Open a connection to your H2O cluster and store that connection in a local variable
- Define a local variable for the path of your dataset on HDFS
- Import your dataset from HDFS into H2O’s in-memory key-value store and define a local variable that points to it
- Print a statistical summary of the imported dataset
- Print the first few lines of the imported dataset
- Create a vector with 80% of the row numbers randomly sampled from the imported dataset
- Create a training dataset containing the rows indexed by the vector above (80% of the entire dataset)
- Create a test dataset with the remaining rows (20% of the entire dataset)
- Train a GBM (Gradient Boosting Machine), an ensemble of decision trees used here for classification, on the training dataset
- Score the GBM model on the test dataset
- Print a confusion matrix showing the actual versus predicted classes from the test dataset

You can run the above block of code all at once, by selecting all of it in the editor pane of RStudio and clicking on ‘Run’, or line by line, by positioning the cursor on the first line of code and clicking on ‘Run’ repeatedly:



The seeds dataset provides 7 different measurements of seeds that seem to be useful for classifying them into 3 different groups. The idea is to learn rules or combinations of those measurements (features, variables, parameters, or covariates, in machine learning jargon) that are able to classify a given seed into one of those 3 classes. What is typically done is to split the original dataset into a training set and a test set, train the model on the training set, and assess how well the model’s predictive power generalizes on the test set.
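Before looking at the confusion matrix, it can be useful to inspect the fitted model and the raw predictions interactively. A minimal, optional sketch, reusing the variable names from the block above (the exact print output depends on your h2o package version):
> seeds.gbm        # printing the model object should display its training summary
> head(seeds.fit)  # first few rows of the predictions; column 1 holds the predicted class used by h2o.confusionMatrix above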

By running the code above, we get the following output in the console pane of RStudio from the final h2o.confusionMatrix call:
        Predicted
Actual    1  2  3   Error
  1      10  2  1 0.23077
  2       0 14  0 0.00000
  3       3  0 12 0.20000
  Totals 13 16 13 0.14286

The output above shows relatively good performance of the model, considering that the entire dataset is very small, with only 210 instances. Of these, 168 instances were used to train the model and 42 to test it. For class 1, the model correctly classified 10 out of 13 seeds; for class 2, all 14 seeds were correctly classified; and for class 3, 12 out of 15 seeds were correctly classified.
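Also keep in mind that the train/test split in the code above uses sample() without a fixed seed, so your confusion matrix will likely differ slightly from the one shown. If you want a reproducible split, you can set the random seed before sampling (the value 42 below is arbitrary):
> set.seed(42)
> index.train <- sample(1:nrow(seeds), nrow(seeds)*0.8)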

It is also important to notice that this entire analysis was run on the H2O cluster. The only data that was created locally in the R session were the vector defining the row indices for creating the training data, and the variable holding the path to the dataset in HDFS. The remaining R objects in the session were mere pointers to the corresponding objects in the H2O cluster.
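You can see this from R itself: assuming your h2o package version provides them (check help(h2o.ls)), the following calls list the keys currently held in the cluster’s key-value store and, when you are done, shut the H2O cluster down:
> h2o.ls(localH2O)        # list the keys stored in the H2O cluster (dataset, model, predictions)
> h2o.shutdown(localH2O)  # when you are finished, shut the cluster down (or press Ctrl-C in the driver terminal)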