Apache Drill’s Role in the Big Data Enterprise Data Architecture!
As of this writing, Drill is a very active Apache incubating project led by MapR, with six to seven
companies actively participating and more than 250 people currently on the Drill mailing list.
The goal of Drill is to create an interactive analysis platform for Big Data that uses standard
SQL across data sources such as relational database management systems (RDBMS), Hadoop, and
NoSQL implementations (including Cassandra and MongoDB).
The foundation of the Drill architecture is a set of Drillbit processes, that is, Drill executables running on Hadoop’s DataNodes to provide data locality and parallel query execution. An individual query request can be delivered to any of the Drillbits. It is first processed by a SQL query parser, which parses the incoming query and passes it to the co-located query planner.
The SQL query planner provides query optimization. The default optimizer is cost-based, but additional custom optimizers can be introduced through the open APIs provided by Drill.
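To make the idea of a cost-based optimizer with pluggable alternatives more concrete, here is a minimal Python sketch. It is not Drill's planner or API: the Plan class, the cost weights, and the function names are all hypothetical, chosen only to show how a planner might pick the cheapest candidate plan and how a custom optimizer could be swapped in.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Plan:
    """A hypothetical candidate query plan with rough size estimates."""
    name: str
    rows_scanned: int   # rows read from local storage
    rows_shuffled: int  # rows sent across the network


def estimated_cost(plan: Plan, scan_weight: float = 1.0,
                   network_weight: float = 10.0) -> float:
    # Assumed cost model: moving a row over the network is treated as
    # ten times as expensive as scanning it locally.
    return plan.rows_scanned * scan_weight + plan.rows_shuffled * network_weight


def cost_based_optimizer(candidates: List[Plan]) -> Plan:
    """Default strategy: choose the plan with the lowest estimated cost."""
    return min(candidates, key=estimated_cost)


def first_plan_optimizer(candidates: List[Plan]) -> Plan:
    """A trivial custom strategy, used here only to show pluggability."""
    return candidates[0]


def plan_query(candidates: List[Plan],
               optimizer: Callable[[List[Plan]], Plan] = cost_based_optimizer) -> Plan:
    if not candidates:
        raise ValueError("at least one candidate plan is required")
    return optimizer(candidates)


CANDIDATES = [
    Plan("scan_then_shuffle_all", rows_scanned=1_000_000, rows_shuffled=1_000_000),
    Plan("aggregate_locally_first", rows_scanned=1_000_000, rows_shuffled=1_000),
]

chosen = plan_query(CANDIDATES)
print(chosen.name)
```

Under this assumed cost model the planner prefers the plan that aggregates locally, because it moves far fewer rows across the network. Passing first_plan_optimizer as the second argument replaces the default strategy without changing plan_query.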
Once a query plan is ready, it is processed by a set of distributed executors. Query execution is spread across multiple DataNodes in order to support data locality: the part of a query that touches a particular data set runs on the node where that data is located. Additionally, to improve overall performance, the results for local data sets are aggregated locally, and only the combined results are returned to the executor that started the query.
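The local aggregation step described above can be sketched in a few lines of Python. This is an illustration of the general two-phase aggregation pattern, not Drill's executor code; the node names, the sample rows, and the function names are made up for the example.

```python
from collections import Counter

# Hypothetical data: the rows stored on each DataNode, as (region, amount) pairs.
NODE_DATA = {
    "node-1": [("east", 10), ("west", 5), ("east", 7)],
    "node-2": [("west", 3), ("east", 1)],
    "node-3": [("north", 4), ("west", 2)],
}


def local_aggregate(rows):
    """Phase 1, run on the node that holds the rows: sum amounts per region."""
    partial = Counter()
    for region, amount in rows:
        partial[region] += amount
    return partial


def combine(partials):
    """Phase 2, run on the initiating executor: merge the small partial results."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds values for matching keys
    return dict(total)


def run_query(node_data):
    # In a real cluster each local_aggregate call would execute on its own node;
    # here the calls run in one process to keep the sketch self-contained.
    partials = [local_aggregate(rows) for rows in node_data.values()]
    return combine(partials)


result = run_query(NODE_DATA)
print(result)
```

Only one small dictionary per node travels back to the initiating executor, rather than every raw row, which is the point of aggregating locally first.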
Finally, the SQL parser supports custom domain-specific SQL extensions based on User Defined Functions (UDFs), User Defined Table Functions (UDTFs), and custom operators (for example, Mahout’s k-means operator). Whereas Drill is, for the most part, still in a development state, the other specialized Hadoop query language implementation, Impala, is currently available for initial experimentation and testing. Let’s take a closer look at the details of the Impala implementation in the next blog post.
To conclude, Apache Drill lets us write SQL queries against a variety of data types, including structured data in a Hive table, semi-structured data in HBase or MapR-DB, and complex file formats such as Parquet and JSON. Apache Drill is a must-have tool in the stack of any large, data-intensive enterprise architecture.
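As a rough illustration of querying a JSON file with SQL, the Python sketch below prepares a request for Drill's REST query endpoint without sending it. It assumes a Drillbit with the web interface on localhost port 8047 and uses the employee.json sample file from the classpath (cp) storage plugin; the host, the port, and the helper name build_drill_request are assumptions to adjust for your own setup.

```python
import json
import urllib.request

# Assumed location of a running Drillbit's REST endpoint.
DRILL_QUERY_URL = "http://localhost:8047/query.json"

# Plain SQL over a JSON file; backticks quote the file name in Drill SQL.
SQL = "SELECT full_name, position_title FROM cp.`employee.json` LIMIT 5"


def build_drill_request(sql, url=DRILL_QUERY_URL):
    """Build (but do not send) a POST request carrying the SQL query."""
    payload = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
    )


request = build_drill_request(SQL)
print(request.full_url)
print(request.data.decode("utf-8"))
```

To actually run the query against a live Drillbit, pass the request to urllib.request.urlopen and parse the JSON response; that step is left out here so the sketch works without a cluster.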
References: mapr.com; Professional Hadoop Solutions, Boris Lublinsky. As always, please feel free to send suggestions or comments to firstname.lastname@example.org.
Interesting? Please subscribe to our blog at www.dataottam.com to keep up with trends in Big Data & Analytics.