Cloudera Hadoop with Apache Hive
We hear a lot from businesses that want to leverage Hadoop but don’t know where to start, don’t understand the use cases, or have started and run into real trouble getting the project going. To provide some direction to the story of Hadoop, we’ve built a proof-of-concept solution with Cloudera Hadoop on EMC Isilon, leveraging Apache Hive with Informatica’s Big Data Edition. After showing off this solution at the BIO-IT Conference 2015 in Boston, we’ve garnered a great deal of interest in how we did what we did and why we built it this way. In this posting, we’re going to focus on the portion of the Hadoop ecosystem that makes it useful to upstream applications: Apache Hive.
The Apache Hive data warehouse software allows for the querying and managing of large datasets residing in distributed storage. Apache Hive is built right on top of Apache Hadoop, and it provides the following features:
- Tools to enable easy data extract/transform/load (ETL)
- A mechanism to impose structure on a variety of data formats
- Access to files stored either directly in Apache HDFS™ or in other data storage systems such as Apache HBase
- Query execution via MapReduce
Hive defines a simple SQL-like query language, called QL (HiveQL), that enables users familiar with SQL to query the data. QL can also be extended with custom scalar functions (UDFs), aggregations (UDAFs), and table functions (UDTFs).
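As a sketch of how familiar QL feels, here is a query against a hypothetical web-log table (the table and column names are our own illustration, not part of the solution described above):

```sql
-- Hypothetical web-log table; QL reads much like standard SQL.
-- COUNT(*) here is a built-in aggregate; a custom UDAF could be
-- substituted for domain-specific aggregation.
SELECT referrer, COUNT(*) AS hits
FROM web_logs
WHERE status = 200
GROUP BY referrer
ORDER BY hits DESC
LIMIT 10;
```

Under the covers, Hive compiles a query like this into one or more MapReduce jobs, which is why the same SQL skills carry over to datasets far larger than a traditional database would handle.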
Hive does not require that data be read or written in a single “Hive format”; there is no such thing. Hive works equally well on Thrift, control-delimited, or your own specialized data formats.
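Because Hive imposes structure at table-creation time rather than at storage time, declaring the layout of existing files is a one-statement affair. A minimal sketch, assuming tab-delimited log files (the schema below is hypothetical):

```sql
-- Hive stores no proprietary format; the ROW FORMAT clause simply
-- tells Hive how to parse the underlying files at read time.
CREATE TABLE web_logs (
  ip        STRING,
  ts        STRING,
  status    INT,
  referrer  STRING
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'  -- could equally be Ctrl-A or another delimiter
STORED AS TEXTFILE;
```

Swapping the delimiter, or pointing a SerDe at Thrift or another specialized format, changes only this declaration, not the data itself.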
Hive is not designed for OLTP workloads and does not offer real-time queries or row-level updates. It is best used for batch jobs over large sets of append-only data, such as web logs. Hive’s core strengths are scalability (scaling out as machines are added dynamically to the Hadoop cluster), extensibility (via the MapReduce framework and UDFs/UDAFs/UDTFs), fault tolerance, and loose coupling with its input formats.
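The append-only, batch-oriented pattern typically shows up as partitioned tables that are loaded periodically rather than updated in place. A hedged sketch, again with hypothetical names and paths:

```sql
-- Append-only table partitioned by day: new log data is added as a
-- new partition; existing rows are never updated in place.
CREATE TABLE web_logs_daily (
  ip     STRING,
  url    STRING,
  status INT
)
PARTITIONED BY (log_date STRING)
STORED AS TEXTFILE;

-- A nightly batch job appends the day's logs as a fresh partition.
LOAD DATA INPATH '/incoming/logs/2015-05-01'
INTO TABLE web_logs_daily PARTITION (log_date = '2015-05-01');
```

Partitioning this way also lets queries that filter on `log_date` prune whole directories of data, which matters at web-log scale.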
The various versions of Cloudera CDH include different versions of Hive Server, so be sure to verify your requirements before setting out. If you’ve worked with Hadoop, you know that it serves a fairly simple purpose despite its many layers of complexity. Fundamentally, if you’re going to leverage Hadoop to process data of any structure, you need a meaningful way to use the resultant data. As you’ve seen here, Apache Hive provides that SQL-like interface, supplying your presentation applications with the data they need in a format they understand.
If you’d like to see the solution discussed in this blog live and in person, please feel free to reach out to me and let me know.