Virtualizing Hadoop: VMware’s Big Data Extensions — Under the Hood, Part 2
Welcome back to my four-part blog series on VMware’s Big Data Extensions (BDE). This segment goes deeper into the architecture and some practical benefits of BDE over other Hadoop deployment approaches. Specifically, we’re focusing on the installation steps and the “gotchas” in that process.
We’ve already identified a key problem with Hadoop: it isn’t really all that fun or simple to deploy. It’s manual, repetitive and requires mucking around with hardware on the compute, network and storage fronts. There are, in fact, other interesting challenges in using Hadoop, one of which is getting data into and out of the Hadoop Distributed File System (HDFS), but I’ll get to that in my next segment when I cover HDFS integration on EMC Isilon.
Deploying VMware’s BDE
Having touched on the topic of using VMware BDE, let’s look at how we deploy it.
The first thing to understand is that BDE is installed as a virtual appliance and requires the vCenter Web Client for fully featured installation and management through the vCenter plug-in. So, right off the bat, you’ll need to get used to using the Web Client, if you haven’t done so already. On with the installation.
1. Where to get the BDE virtual appliance: Go to the VMware product download site, log in with your credentials, visit the product download section and search for Big Data Extensions 2.0.
2. Ensure that you’ve downloaded the virtual appliance to a location that’s not too far from your vCenter Server (and therefore, your target cluster) since the appliance is 5GB in size. Once downloaded, log in to the vCenter Web UI, navigate to your target cluster, right-click and select “Deploy OVF Template.”
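If you prefer the command line, the same OVA can be deployed with VMware’s ovftool. Everything below (paths, names, the vi:// target) is hypothetical, and the sketch only assembles and prints the command rather than running it against a live vCenter.

```shell
# Sketch: deploying the BDE OVA with ovftool (all names/paths hypothetical).
# This only assembles and prints the command; run it against your own vCenter.
VCENTER="vcenter.lab.local"
TARGET="vi://administrator@${VCENTER}/Datacenter/host/Cluster1"
OVA="./VMware-BigDataExtensions-2.0.ova"

CMD="ovftool --acceptAllEulas --datastore=shared-ds01 --name=BDE-vApp ${OVA} ${TARGET}"
echo "${CMD}"
```

Running the printed command will prompt for the vCenter password and stream upload progress, just as the Web Client wizard does.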
3. The install wizard itself is straightforward, and the folks at VMware did a great job making this virtual appliance as simple to deploy as possible. One upfront note: make sure you have dedicated static IPs and that you’ve registered them in your DNS. In the graphic below, we’re configuring the SSO lookup URL on the vCenter Server.
4. When you reach the end of the wizard setup, there is an opportunity to review the configuration before committing to it. I strongly recommend that you carefully review these items as changing them later can be somewhat of a pain.
5. If you’ve reviewed the configuration and you’re confident that all is well, then proceed. You can observe the deployment from the Recent Tasks frame within the Web Client:
And the BDE vApp:
6. Once the BDE virtual appliance is initialized and installed, you’ll want to make sure to address a little “gotcha” that I found. By mousing over the vApp and looking at the vApp state link, you can observe that the BDE Serengeti Management Server name is set to some odd expression value. You’ll need to manage the vApp settings and change this to a DNS-resolvable name of your choice. Just remember that after changing this name, you’ll need to re-enable SSO from the BDE Serengeti Management Server.
Remember: to update SSO from the BDE Serengeti Management Server, open a browser and visit https://&lt;serengeti management server IP&gt;:5480.
7. So let’s see: you’ve updated the Serengeti Management Server name, you’ve updated the SSO certificate and you’ve re-enabled SSO. Now you must register the vCenter plug-in for BDE. Open a browser and visit https://&lt;serengeti management server IP&gt;:8443/register-plugin.
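For quick reference, the two management endpoints from steps 6 and 7 can be kept handy in a small shell snippet. The IP below is a placeholder; substitute your own management server address.

```shell
# The two BDE admin endpoints, built from your management server address.
# The IP is a placeholder -- substitute your own.
SERENGETI_IP="192.168.10.50"
MGMT_URL="https://${SERENGETI_IP}:5480"                    # SSO / appliance management
PLUGIN_URL="https://${SERENGETI_IP}:8443/register-plugin"  # vCenter plug-in registration
echo "${MGMT_URL}"
echo "${PLUGIN_URL}"
```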
8. Once the vCenter BDE plug-in has been registered, you’ll need to log out and log in to the vCenter Web Client. At this point you should be able to see the Big Data Extensions plug-in listed in the left frame of the Web Client.
9. Select “Big Data Extensions” and then select the link “Register Serengeti Server.” A window will open within which you can navigate the list of available VMs until you find the Serengeti Management Server. Highlight the target server and select “Test Connection.”
10. Status check — Let’s make sure you’ve made it through all the prerequisites.
- BDE installed
- DNS updated
- BDE plug-in registered
- BDE Serengeti Management Server registration added
If you’ve made it this far, then proceed. If you’ve reached a roadblock, please keep plugging away, reach out to me or go “to the Google!”
11. Deploying a BDE Cluster
First, you must define the cluster resources. There are two bits to this part: the first bit is the storage and the second is the IP range that the cluster will use.
Storage — For the storage portion, you’ll need to decide between local and shared storage resources for your Hadoop cluster Compute and Worker nodes. The trade-off between the two comes down to scale vs. performance, and both choices hinge on understanding your Hadoop use-case and performance demand. That goes into the fundamentals of Hadoop architecture and beyond the scope of this blog. That said, since this is a vSphere infrastructure, you may not have much local storage available on your hosts, so shared storage may be the only option. The good news is that BDE makes the deployment and redeployment of resources super simple and repeatable, so at this point don’t sweat the small stuff.
Network — On the network side, it’s a bit more straightforward. Here we need to define the IP pool that will be used by the Hadoop cluster components during rollout. You’ll want to select an available port group for your Hadoop cluster systems to access the appropriate network for your purposes. Give the network resource a name, select the target port group and provide the static IP address details — rather intuitive.
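The same network resource can also be defined from the Serengeti CLI on the management server. The names and addresses below are made up, and the flag names are from my recollection of the BDE 2.0 CLI, so verify them with the CLI’s built-in help before relying on them. The sketch just assembles and prints the command.

```shell
# Sketch: a Serengeti CLI 'network add' command (names/addresses hypothetical;
# verify flag names against your BDE version's CLI help before running).
NET_NAME="hadoopNet"
PORTGROUP="Hadoop-PG"
IP_RANGE="10.10.10.50-10.10.10.80"
CMD="network add --name ${NET_NAME} --portGroup ${PORTGROUP} --ip ${IP_RANGE} --dns 10.10.10.2 --gateway 10.10.10.1 --mask 255.255.255.0"
echo "${CMD}"
```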
12. Create the Big Data Cluster
This is about as simple as it gets. On the BDE plug-in menu, select “Big Data Clusters,” then select the “Create Big Data Cluster” icon in the upper left of the center frame. In the wizard, enter the cluster name (the default Hadoop distribution included with BDE is Apache), select your deployment type (here I selected Compute-only since I’ll be pointing this at Isilon for storage; more about that in a later session), provide the Data Master URL (we input the path to Isilon, but your build may differ depending on your HDFS platform) and finally define the Compute (1) and Worker (2) nodes and their resources. Make them small, since we’re just testing and want to get the cluster up and running right away.
Note: The node resource profiles are selectable as Large, Medium, Small and Custom configurations; use the Custom setting to set vCPU, vRAM and vDisk to whatever your requirements may be.
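For repeatable builds, the Serengeti tooling also accepts a cluster specification file. The fragment below is a hypothetical sketch of a small compute-only node group; the field names (nodeGroups, instanceNum, memCapacityMB, storage) follow the Serengeti spec-file convention as I recall it, so compare against the sample specs shipped with your BDE version before using it.

```shell
# Write a minimal, hypothetical cluster-spec fragment and sanity-check it.
# Field and role names should be verified against your BDE version's samples.
cat > /tmp/compute_nodegroup.json <<'EOF'
{
  "nodeGroups": [
    {
      "name": "compute",
      "roles": ["hadoop_tasktracker"],
      "instanceNum": 2,
      "cpuNum": 1,
      "memCapacityMB": 2048,
      "storage": { "type": "shared", "sizeGB": 10 }
    }
  ]
}
EOF
grep -c '"nodeGroups"' /tmp/compute_nodegroup.json   # prints 1
```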
Watch the cluster deploy
This is the beauty of BDE — once the wizard configuration is complete, you can sit back and let the Hadoop cluster deploy itself.
Next steps
At this point, the Hadoop cluster has deployed and you’re ready to proceed with your MapReduce work. You can connect to your Compute node and begin to run MapReduce jobs. Since we’re using Isilon HDFS integration for our cluster’s Name and Data node functionality, your process may differ from ours. We’ll cover the Isilon HDFS integration in more detail in a later section of this series, but here’s a sample quick-and-dirty overview of the process we went through to get going with MapReduce in our BDE on Isilon HDFS environment.
First, we created a directory in the Isilon OneFS called “hadoop” and set the appropriate permissions. Then, and here’s the cool part of using Isilon with HDFS, we can navigate to the hadoop folder we created and create our HDFS structure:
- Open Windows File Explorer and navigate to the hadoop folder
- Update Isilon HDFS protocol to use this folder (/ifs/hadoop) as the HDFS root
- Create a folder “Input”
- Create a folder “Output”
- Navigate to the “Input” folder
- Copy a sample data set — in this case we used the freely available NYC Parking Violations list for FY 2013
- Log in to your Hadoop Compute node via SSH
- List the contents of the HDFS root folder with the command:
[root@10 conf]# hadoop fs -ls /Input
Found 1 items
-rw-------   1 root wheel 1796223012 2014-10-28 18:43 /Input/NYC_Parking_Violations_Issued_FY2013.csv
Note: This is the cool part — we put a file on Isilon in a CIFS share on the OneFS file system and then listed that same file on HDFS — no coding required! Isilon for the win!
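If you’re not fronting HDFS with Isilon, the same Input/Output layout can be built with plain hadoop fs commands from the Compute node. This sketch only prints the commands (the dataset name matches the one above) so you can review and run them yourself.

```shell
# Sketch: hadoop fs equivalents of the OneFS/CIFS steps above.
# Printed rather than executed; run them on a node with HDFS access.
DATASET="NYC_Parking_Violations_Issued_FY2013.csv"
for cmd in \
  "hadoop fs -mkdir /Input" \
  "hadoop fs -mkdir /Output" \
  "hadoop fs -put ${DATASET} /Input/" \
  "hadoop fs -ls /Input"
do
  echo "${cmd}"
done
```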
13. At this point, on the Compute node, you can launch a MapReduce job such as the following:
[root@10 conf]# hadoop jar /usr/lib/hadoop-1.2.1/hadoop-examples-1.2.1.jar wordcount /Input/NYC_Parking_Violations_Issued_FY2013.csv /Output/Data
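Once the job completes, the results can be inspected from the same shell. The output location is whatever you passed as the second argument to wordcount, and the part-* file names depend on the Hadoop version, so treat this as a sketch that just prints the inspection commands.

```shell
# Sketch: commands to inspect the wordcount results (printed, not executed;
# adjust OUT_DIR to the output path you passed to the job).
OUT_DIR="/Output/Data"
echo "hadoop fs -ls ${OUT_DIR}"
echo "hadoop fs -cat ${OUT_DIR}/part-* | head -20"
```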
To recap:
- We downloaded and installed the VMware BDE virtual appliance.
- We configured Big Data Cluster resources.
- We configured our Big Data Cluster.
- We added data to our Isilon HDFS location.
My next part of this series will go into greater detail around Isilon HDFS and the benefits thereof. Please feel free to reach out with any questions or comments.