1. Problem
Large-scale analytical data management (OLAP) on clusters of commodity hardware
2. Challenges
- Goal: performance and efficiency together with scalability, fault tolerance, and low cost
- Hadoop (MapReduce)
  - scales well: scalability, fault tolerance, and flexibility
  - open source, no license cost
  - but not designed for structured data analysis, so performance there is poor
- Parallel databases
  - deliver performance and efficiency on structured data
  - but previously deployed on tens of nodes, not thousands
  - do not scale well under failures or heterogeneous machines; performance at extreme scale untested
3. Solutions
- HadoopDB = MapReduce (Hadoop) as the coordination layer + a single-node database on each node, a hybrid of the two approaches
- Translation layer: Hive (extended into the SMS Planner below)
- Communication layer: Hadoop
- Database layer: PostgreSQL (or MySQL, or any other JDBC-compliant database)
Components
- Database Connector
  - interface between the independent databases on the nodes and the Hadoop TaskTrackers (sketch below)
  - extends Hadoop's InputFormat class and is part of the InputFormat implementations library
  - any JDBC-compliant database can be plugged in
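A minimal sketch of the connector idea, using Hadoop's stock JDBC input classes (org.apache.hadoop.mapreduce.lib.db) as a stand-in for HadoopDB's own InputFormat implementation; the sales table and its fields are hypothetical:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;

public class SalesRecord implements Writable, DBWritable {
    long saleDate;
    double revenue;

    // Deserialize one row of the node-local query result.
    public void readFields(ResultSet rs) throws SQLException {
        saleDate = rs.getLong("saleDate");
        revenue = rs.getDouble("revenue");
    }

    // Required by DBWritable; unused when the database is read-only input.
    public void write(PreparedStatement stmt) throws SQLException {
        stmt.setLong(1, saleDate);
        stmt.setDouble(2, revenue);
    }

    // Hadoop serialization, so records can move through the framework.
    public void readFields(DataInput in) throws IOException {
        saleDate = in.readLong();
        revenue = in.readDouble();
    }

    public void write(DataOutput out) throws IOException {
        out.writeLong(saleDate);
        out.writeDouble(revenue);
    }

    public static Job createJob(Configuration conf) throws IOException {
        // Connection parameters; HadoopDB would fetch these from its Catalog.
        DBConfiguration.configureDB(conf, "org.postgresql.Driver",
                "jdbc:postgresql://localhost/salesdb", "user", "password");
        Job job = Job.getInstance(conf, "db-connector-sketch");
        job.setInputFormatClass(DBInputFormat.class);
        // Each map task runs this query against its database and receives
        // the result rows as SalesRecord values.
        DBInputFormat.setInput(job, SalesRecord.class,
                "sales", null /* WHERE conditions */,
                "saleDate" /* ORDER BY */, "saleDate", "revenue");
        return job;
    }
}
```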
- Catalog
  - metainformation kept as an XML file in HDFS: connection parameters (database location, driver class, credentials) + metadata (data sets, replica locations, partitioning properties); sketch below
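The catalog's actual XML schema is not given in these notes, so this is a purely hypothetical sketch that keeps per-node connection parameters in Java's stock XML-properties format:

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.Properties;

public class CatalogReader {
    // Look up one node's JDBC connection parameters in the catalog file.
    public static Connection connectTo(String nodeId)
            throws IOException, SQLException {
        Properties catalog = new Properties();
        try (FileInputStream in = new FileInputStream("catalog.xml")) {
            catalog.loadFromXML(in); // e.g. <entry key="node1.url">...</entry>
        }
        // Metadata such as partition keys and chunk locations would be
        // stored alongside these entries in the real catalog.
        String url  = catalog.getProperty(nodeId + ".url");
        String user = catalog.getProperty(nodeId + ".user");
        String pass = catalog.getProperty(nodeId + ".password");
        return DriverManager.getConnection(url, user, pass);
    }
}
```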
- Data Loader
  - Global Hasher: a MapReduce job that repartitions the raw data in HDFS into as many partitions as there are nodes (sketch below)
  - Local Hasher: copies each node's partition out of HDFS and repartitions it into smaller chunks for loading into the node-local database
  - this partitioning/chunking gives better load balance than Hadoop's default data placement
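A minimal sketch of the Global Hasher's mechanism, assuming '|'-delimited text rows whose first field is the partitioning key (a hypothetical layout): the job runs with one reducer per node, so each reducer's output file becomes one node's partition.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class GlobalHasher {
    // Emit (partition key, full row); the shuffle's hash partitioner then
    // routes each row to the reducer (= node partition) for its key.
    public static class KeyExtractor
            extends Mapper<Object, Text, Text, Text> {
        @Override
        protected void map(Object offset, Text row, Context ctx)
                throws IOException, InterruptedException {
            String key = row.toString().split("\\|")[0];
            ctx.write(new Text(key), row);
        }
    }

    public static void main(String[] args) throws Exception {
        int numNodes = Integer.parseInt(args[2]);
        Job job = Job.getInstance(new Configuration(), "global-hasher");
        job.setJarByClass(GlobalHasher.class);
        job.setMapperClass(KeyExtractor.class);
        job.setPartitionerClass(HashPartitioner.class); // hash(key) mod N
        job.setNumReduceTasks(numNodes); // one output partition per node
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Default identity reducer: output is "key <TAB> row" per line,
        // close enough for a sketch of the repartitioning step.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```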
- SQL-to-MapReduce-to-SQL (SMS) Planner
  - extends Hive, chosen because it is open source and low cost
  - stock Hive stores each table as a separate HDFS file and assumes no collocation, so multi-table operations perform poorly; SMS intercepts the normal Hive flow at two points:
    - before any query execution: updates the MetaStore with references to the node-local database tables
    - between physical query-plan generation and its execution as MapReduce jobs: traverses the plan DAG bottom-up, retrieves the fields manipulated and the partitioning keys, and rewrites eligible subtrees into SQL pushed down to the node databases
  - currently only filter, select (project), and aggregation operators are pushed down (example below)
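To make the push-down concrete, an aggregation in the spirit of the paper's running example; the SQL here is illustrative, not the planner's literal output:

```java
// Illustration of SMS push-down for an aggregation over a sales table.
public class SmsPushdownExample {
    // HiveQL submitted by the user:
    static final String HIVE_QUERY =
        "SELECT YEAR(saleDate), SUM(revenue) FROM sales "
      + "GROUP BY YEAR(saleDate)";

    // Case 1: sales is partitioned on YEAR(saleDate). The whole
    // aggregation runs inside each node's database, and the MapReduce
    // job only concatenates the per-node results.
    static final String FULL_PUSHDOWN =
        "SELECT YEAR(saleDate), SUM(revenue) FROM sales "
      + "GROUP BY YEAR(saleDate)";

    // Case 2: sales is partitioned on some other key. Each node computes
    // a partial aggregate, and the reduce phase sums the partials per year.
    static final String PARTIAL_PUSHDOWN =
        "SELECT YEAR(saleDate) AS y, SUM(revenue) AS partial "
      + "FROM sales GROUP BY YEAR(saleDate)";
    // ...reduce side then computes SUM(partial) grouped by y.
}
```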
Experiments
- extend the benchmark of [23] with fault-tolerance and heterogeneous-node (straggler) experiments
- TODO: modify the current task scheduler so that tasks connect to a replica's database rather than the straggler node's
- run on EC2, comparing Hadoop, HadoopDB, Vertica, and DBMS-X
4. Conclusion
- HadoopDB’s performance still falls short of parallel databases because
  - PostgreSQL is neither a column store nor compression-enabled
  - Hadoop and Hive are relatively young open-source projects
- yet it approaches parallel-database performance while keeping Hadoop's fault tolerance, tolerance of heterogeneous environments, and flexibility
Related Readings
- [23] A Comparison of Approaches to Large-Scale Data Analysis
- [6] SCOPE
- [11] Hive
- [24] C-Store
- [4] What is the right way to measure scale?