The Benefits of Netapp Enterprise Storage for HDFS Workloads
In the previous blog of this series, we expressed the case for Clinical Big Data Analytics. While the benefits of this is clear, what’s not clear is how and where does one run these analytics? Enter Hadoop.
Hadoop has brought the power of data analytics to the masses by being cost effective from both a software and hardware perspective. At the heart of Hadoop is HDFS. HDFS is a feature-rich, integrated file system that offers cluster-aware data protection, fault tolerance, and the ability to balance workloads. The HDFS method of mirroring is based on a per file basis. Therefore, by default Hadoop will replicate all files on HDFS to three times.
HDFS mirroring was designed this way intentionally to ensure the successful completion of MapReduce jobs thus providing data high availability (HA). In High Performance Computing terms, HA is required to provide for uninterrupted access to data and services.
Although HDFS has all of the benefits that it does with its distributed, feature rich, and flexible nature, it does have a heavy cost associated with it. In terms of throughput and operational costs from the position of a server-based data management and ingestion model, HDFS is expensive in terms of server resources to support filesystem and data operations. In particular, redistribution of data following recovery of a failed disk or Data Node, and space utilization can be very time consuming and costly in terms of server resource utilization.
Here are some key examples of the real cost of HDFS triple replication:
- FACT: For each usable terabyte (TB) of data, 3TB of HDFS capacity is required.
- FACT: Server-based triple replication creates a significant additional IO, and therefore CPU, load on the Hadoop cluster servers.
- FACT: Server-based triple replication also creates a significant load on the network due to the interconnect replication traffic between the Hadoop nodes.
Corporate Technologies, Hadoop and Netapp
For BIO-IT Conference 2016, Corporate Technologies is showcasing the art of what’s possible for a better Hadoop platform on Netapp storage. So, what does this mean for you? Hadoop is a shared-nothing, massively parallel data-processing platform that emphasizes cost effectiveness and replication of data availability. Many enterprises have little or no experience with Hadoop. On the other hand, many enterprises have extensive experience with traditional databases where rapidly growing needs to process data and extract business value are overwhelming the current technology stack. Add to that, the costs of traditional database solutions can be so high that many customers cannot afford them. The advent of Hadoop has ushered a new era of data analytics, both from a performance and cost perspective.
NetApp has developed a reference architecture to greatly improve the performance and cost model of HDFS:
- Reduce the operational and functional costs associated with distributed software-based mirroring (block replication).
- Because it is a decoupled solution, the storage and compute independence provides flexibility in Hadoop to manage the independence of each other.
- Eliminate job impact and application disruption of drive failures.
- Provide a proven and certified solution.
- Offer linear performance scalability.
- Scale-up storage controllers on demand.
- Scale-out capacity on demand.
Benefits of NetApp Solutions for Hadoop
The Netapp E-Series/FAS platforms represent some of Netapp’s enterprise grade storage systems that support a multitude of business applications and mission critical workloads. These systems can deliver the perfect system upon which one can build a Hadoop system that will perform and scale in a way that would be impossible with local storage HDFS.
Some of the reasons why you simply must consider Netapp E-Series/FAS for your Hadoop projects:
- Ready-to-deploy Hadoop solution
- Only two copies of data required while providing high data availability
- Linear scalability for handling data and business growth
- Transparent RAID operations
- Transparent rebuild of failed media
- Fully online serviceability
- Faster deployment with network boot
- Performance for the next generation of Data Nodes (SMP) and networks (10GbE)
- Proven, tested, supported, and fully compatible system design
- Proven storage hardware based on enterprise-grade designs, offering enterprise-grade performance, uptime, and support
- Lower total cost of ownership (TCO)
Netapp has successfully augmented the share-nothing HDFS architecture in such a way that it enhances the performance efficiencies of HDFS as well as providing enhanced protection capabilities. If you or your enterprise are working with Hadoop or would like to get going, Netapp certainly has the right architecture to get your project off to a great start.
If you want to learn more about this and discuss more opportunities, please visit us at the BIO IT World Conference in Boston, April 5-7, 2016. We are at booth 125 & 127.
1.NetApp Solutions for Hadoop, Performance, Sizing, and Best Practice Guide, Karthikeyan Nagalingham, NetApp November 2015 | TR-3969