Hadoop on a VM
You can set up a virtual Hadoop cluster running in a virtual machine. You may find it easier to test and experiment with this setup than using our Cluster. Of course, this won't be a fast way to process multiple terabytes of data, but it will be enough to test your code on small data sets.
The instructions here use the Cascading Hadoop Cluster (forked from the original to add Spark and Hive support) to get things running.
The single-node setup will require about 2GB of RAM, and the four-node about 4GB (but see below for details). If you have a computer that can allocate that (and still run your OS and web browser and whatever else), then this is a good solution for you.
- Install VirtualBox. In Ubuntu, this can be done by installing the virtualbox package.
- Install Vagrant. In Ubuntu, install the vagrant package.
- Get the virtual cluster configuration code:
git clone https://github.com/gregbaker/vagrant-cascading-hadoop-cluster.git
- If you want the single-node version of the “cluster”, change to the single-node directory. If your computer can handle the four-node version, stay in the repository root directory.
- See “Customizing your VM” below: there may be some ways you want to customize your VM before you start it.
- Start the cluster:
vagrant up
This will take some time on the first run (maybe 45 minutes) and download a bunch of packages. (i.e. do it when you're plugged in and on a decent network, not tethered to your phone on the bus.)
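The installation and startup steps above can be run as a single terminal session. This is a sketch assuming the usual Ubuntu package names (virtualbox and vagrant); check them against your release:

```shell
# Install VirtualBox and Vagrant (usual Ubuntu package names; verify for your release)
sudo apt-get install virtualbox vagrant

# Get the virtual cluster configuration code
git clone https://github.com/gregbaker/vagrant-cascading-hadoop-cluster.git
cd vagrant-cascading-hadoop-cluster

# For the single-node version only:
# cd single-node

# Start the cluster (slow on the first run: it downloads many packages)
vagrant up
```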
Customizing your VM
In the Vagrantfile (in the repo root for the four-node configuration, or in single-node for the one-node version), you can set the CPU and memory given to each VM to something reasonable. If you're using the multi-node setup, remember that you are going to be hosting four VMs with these specs. [I have had bad luck with less than 1024MB of memory in the multi-node setup, or 2048MB for the single-node.]
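As a sketch, the relevant Vagrantfile lines use the standard Vagrant VirtualBox provider options (the numbers here are examples; adjust them to your machine):

```ruby
# Inside a node's definition in the Vagrantfile:
config.vm.provider "virtualbox" do |vb|
  vb.memory = 2048   # MB of RAM for this VM
  vb.cpus = 2        # number of virtual CPUs for this VM
end
```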
You can also add a shared folder so the code you're working on (on your actual computer) is available in the VM. Inside the master node config, add a synced_folder line like this:
config.vm.define :master, primary: true do |master|
  config.vm.synced_folder "/home/me/CMPT732", "/home/vagrant/CMPT732"
  ⋮
end
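Note that if the VM has already been created, Vagrantfile changes (including synced folders) don't take effect until the VM restarts; vagrant reload does the halt-and-restart in one step:

```shell
# Restart the VM so the updated Vagrantfile is re-read
vagrant reload
```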
Starting your “cluster”
First get the VM(s) running and SSH in:
vagrant up
vagrant ssh
The first time you start the cluster, you need to initialize the filesystem:
Then in the VM, start the Hadoop cluster:
sudo start-all.sh
sudo start-hbase.sh   # if you need HBase running
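As a quick sanity check that HDFS came up, you can round-trip a small file from inside the master node. This is a sketch: the file name and HDFS path are arbitrary examples, it assumes the hdfs command is on the PATH in the VM, and you may need sudo depending on how HDFS permissions are set up:

```shell
# Round-trip a small file through HDFS to confirm it is working
echo "hello hadoop" > hello.txt
hdfs dfs -mkdir -p /user/vagrant
hdfs dfs -put hello.txt /user/vagrant/hello.txt
hdfs dfs -cat /user/vagrant/hello.txt
```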
You can access the web frontends for the cluster at these URLs:
- HDFS namenode: http://master.local:50070/
- YARN application master: http://master.local:8088/
- MapReduce job history server: http://master.local:19888/
- HBase frontend: http://master.local:16010/
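From your host machine, you can check that a frontend is responding without opening a browser. This assumes master.local resolves from your host, as the cluster configuration intends:

```shell
# Print the HTTP status line from the HDFS namenode web UI
curl -sI http://master.local:50070/ | head -n 1
```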
Stopping your “cluster”
Inside the master node:
sudo stop-hbase.sh   # if you started HBase
sudo stop-all.sh
And then exit the SSH session and shut down the nodes:
vagrant halt