Hadoop on a VM
You can set up a virtual Hadoop cluster running in a virtual machine. You may find it easier to test and experiment with this setup than using our Cluster. Of course, this won't be a fast way to process multiple terabytes of data, but it will be enough to test your code on small data sets.
The instructions here use the Cascading Hadoop Cluster (forked from the original to add Spark and Hive support) to get things running.
The single-node setup will require about 2GB of RAM, and the four-node about 4GB (but see below for details). If you have a computer that can allocate that (and still run your OS and web browser and whatever else), then this is a good solution for you.
- Install VirtualBox. In Ubuntu, this can be done by installing the virtualbox package.
- Install Vagrant. In Ubuntu, install the vagrant package.
- Get the virtual cluster configuration code:
git clone https://github.com/gregbaker/vagrant-cascading-hadoop-cluster.git
- If you want the single-node version of the “cluster”, change to the single-node directory. If your computer can handle the four-node version, stay in the repository root directory.
- See “Customizing your VM” below: there may be some ways you want to customize your VM before you start it.
- Start the cluster:
vagrant up
This will take some time on the first run (maybe 45 minutes) and download a bunch of packages. (i.e. do it when you're plugged in and on a decent network, not tethered to your phone on the bus.)
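The installation and startup steps above can be run as a single terminal session. This is a sketch assuming the usual Ubuntu package names (virtualbox and vagrant); check them against your release:

```shell
# Install VirtualBox and Vagrant (usual Ubuntu package names; verify for your release)
sudo apt-get install virtualbox vagrant

# Get the virtual cluster configuration code
git clone https://github.com/gregbaker/vagrant-cascading-hadoop-cluster.git
cd vagrant-cascading-hadoop-cluster

# For the single-node version only:
# cd single-node

# Start the cluster (slow on the first run: it downloads many packages)
vagrant up
```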
Customizing your VM
In the Vagrantfile (in the repo root for the four-node configuration, or in single-node for the one-node version), you can set the CPU and memory given to each VM to something reasonable. If you're using the multi-node setup, remember that you are going to be hosting four VMs with these specs. [I have had bad luck with less than 1024MB of memory in the multi-node setup, or 2048MB for the single-node.]
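As a sketch, the relevant Vagrantfile lines use the standard Vagrant VirtualBox provider options (the numbers here are examples; adjust them to your machine):

```ruby
# Inside a node's definition in the Vagrantfile:
config.vm.provider "virtualbox" do |vb|
  vb.memory = 2048   # MB of RAM for this VM
  vb.cpus = 2        # number of virtual CPUs for this VM
end
```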
You can also add a shared folder so the code you're working on (on your actual computer) is available in the VM. Inside the master node config, add a synced_folder line like this:
config.vm.define :master, primary: true do |master|
  config.vm.synced_folder "/home/me/CMPT732", "/home/vagrant/CMPT732"
  ⋮
end
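Note that if the VM has already been created, Vagrantfile changes (including synced folders) don't take effect until the VM restarts; vagrant reload does the halt-and-restart in one step:

```shell
# Restart the VM so the updated Vagrantfile is re-read
vagrant reload
```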
Starting your “cluster”
First get the VM(s) running and SSH in:
vagrant up
vagrant ssh
The first time you start the cluster, you need to initialize the filesystem:
Then in the VM, start the Hadoop cluster:
sudo start-all.sh
sudo start-hbase.sh   # if you need HBase running
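As a quick sanity check that HDFS came up, you can round-trip a small file from inside the master node. This is a sketch: the file name and HDFS path are arbitrary examples, it assumes the hdfs command is on the PATH in the VM, and you may need sudo depending on how HDFS permissions are set up:

```shell
# Round-trip a small file through HDFS to confirm it is working
echo "hello hadoop" > hello.txt
hdfs dfs -mkdir -p /user/vagrant
hdfs dfs -put hello.txt /user/vagrant/hello.txt
hdfs dfs -cat /user/vagrant/hello.txt
```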
You can access the web frontends for the cluster at these URLs:
- HDFS namenode: http://master.local:50070/
- YARN application master: http://master.local:8088/
- MapReduce job history server: http://master.local:19888/
- HBase frontend: http://master.local:16010/
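From your host machine, you can check that a frontend is responding without opening a browser. This assumes master.local resolves from your host, as the cluster configuration intends:

```shell
# Print the HTTP status line from the HDFS namenode web UI
curl -sI http://master.local:50070/ | head -n 1
```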
Stopping your “cluster”
Inside the master node:
sudo stop-hbase.sh   # if you started HBase
sudo stop-all.sh
And then exit the SSH session and shut down the nodes:
vagrant halt