Virtualizing Hadoop – VMware’s Big Data Extensions
I’m starting this 4-part blog series to look at some interesting scenarios with VMware’s Big Data Extensions (BDE). In this segment, I’m going to cover a basic overview of VMware BDE. My next segment will go a bit further into the architecture and some practical benefits of VMware BDE over other Hadoop deployment scenarios. Following that segment, I’ll present a use case that integrates VMware BDE with HDFS-integrated storage such as EMC’s Isilon. The final segment in this series will look at data mining technologies from Informatica, such as Informatica BDE, that bring the whole story together into a total solution.
For many folks starting out with Hadoop, the typical approach is to visit the Apache.org Hadoop project site and start digging into what Hadoop is in the first place. That’s certainly not a bad approach. There are also other wonderful free resources available online that are rich with information. So, while this posting won’t attempt to delve into the whats and whys of Hadoop, it assumes you’ve already identified a need for it.
Therein lies the next problem: Hadoop isn’t really all that fun or simple to deploy – it’s manual, it’s repetitive and it requires mucking around with hardware on the compute, network and storage fronts. In order to really unlock the power of Hadoop to solve big data challenges with rapidly deployable and rapidly scalable (up and down) Hadoop cluster setups, let’s look at VMware BDE.
What is VMware’s BDE?
Well, by now, perhaps many have heard of VMware BDE. It started out as a little-known open source initiative called Serengeti, spawned in part by VMware itself. The purpose of this project was to create a new and improved way to easily deploy and elastically scale Hadoop clusters. Traditional approaches to building data mining solutions have focused on carefully architecting a relational database: structured fields, indexes, etc. In the modern era, data scientists need only dump data into Hadoop and go to work.
VMware’s BDE takes the simplification process one step further – by enabling the deployment of vertically and horizontally scalable Hadoop cluster components through the virtualization layer, it removes much of the complexity of getting started with Hadoop. This comes from how VMware BDE itself is deployed: as a virtual appliance through vCenter Server. Once it is installed, setting up a Hadoop cluster is as simple as walking through a wizard that defines the cluster’s resources and scale.
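To make the two scaling directions above a bit more concrete, here is a minimal Python sketch. It is purely illustrative and is not BDE’s actual cluster definition format or API: the `NodeGroup` class, its fields, and the `scale_out` / `scale_up` helpers are hypothetical names invented for this example. It only models the idea that a cluster definition describes groups of identical VMs, which can grow in count (horizontal) or in per-VM size (vertical).

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class NodeGroup:
    """One group of identical VMs in a hypothetical cluster definition."""
    name: str
    instances: int   # number of VMs in the group
    vcpus: int       # virtual CPUs per VM
    memory_mb: int   # memory per VM, in MB


def scale_out(group: NodeGroup, instances: int) -> NodeGroup:
    """Horizontal scaling: change how many VMs the group contains."""
    return replace(group, instances=instances)


def scale_up(group: NodeGroup, vcpus: int, memory_mb: int) -> NodeGroup:
    """Vertical scaling: change the size of each VM in the group."""
    return replace(group, vcpus=vcpus, memory_mb=memory_mb)


def total_memory_mb(groups) -> int:
    """Total memory the definition would request across all groups."""
    return sum(g.instances * g.memory_mb for g in groups)


# Hypothetical starting point: three small worker VMs.
workers = NodeGroup(name="worker", instances=3, vcpus=2, memory_mb=4096)

more_workers = scale_out(workers, instances=6)               # more VMs, same size
bigger_workers = scale_up(workers, vcpus=4, memory_mb=8192)  # same count, larger VMs
```

In a virtualized setup, either change is a matter of adjusting a definition like this rather than racking or reconfiguring physical hardware, which is the simplification described above.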
If you’re worried about choice on the Hadoop distribution front, have no fear. By default, BDE ships ready to deploy the Apache Hadoop distribution with no additional setup. Other distributions can be added either from tarballs or by creating a local YUM repository with the correct RPMs. Supported Hadoop distributions include Apache Hadoop and BigTop, Cloudera, Hortonworks, Intel, MapR and Pivotal.
That’s all for this segment. Next, we’ll look into the particulars of deploying VMware’s BDE and some of the ‘Gotchas’ that I’ve discovered and worked around.
Please feel free to add any questions or concerns in the comment box below about the information presented in this document.