I/O of the Google BigQuery Execution
Dear cloud community friends, this week I would love to share a post titled “I/O of the Google BigQuery Execution”. In this post we will discuss the internals of Google BigQuery, and how it executes queries to deliver super performance on big data problems.
If you are interested in the story of big data at Google, please do click here for the big data stack 2.0 and beyond. BigQuery became generally available in November 2011 at the Google Atmosphere conference, and from that moment it has been creating a great impact in the big data problem domain.
Before we start: what is the I/O in the title? Is it for Google’s flagship conference Input/Output, or is it for Innovation in the Open? Nope, it is just Inside Out, hence the title “I/O of the Google BigQuery Execution”.
And here is some cheering news on the Google BigQuery service: it is free for up to 1 TB of data analysed each month and 10 GB of data stored. Google BigQuery is a fast, highly scalable, cost-effective, and fully managed enterprise data warehouse for analytics at any scale, and it is a multi-regional GCP service.
Before we get into the internals and execution details of BigQuery, let us get a clear understanding of Google BigQuery vs. Dremel. Google BigQuery is an implementation of Dremel: it provides a core set of features from Dremel, which can be accessed via a REST API, a CLI, a web UI, access control, and much more. Now let us see why Google BigQuery is so fast and performant. It is because of two core technologies borrowed from Dremel: the columnar storage format on the Colossus file system, and the tree architecture.
In columnar storage, the data is stored column by column, which makes it possible to achieve a very high compression ratio and scan throughput, while the tree architecture is used for dispatching queries and aggregating results across thousands of machines in a few seconds. Because of BigQuery’s release and general availability, we outsiders of Google can utilize the power of Dremel for our big data processing requirements. Hence BigQuery and Dremel share the same underlying architecture and performance characteristics.
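To make the tree architecture concrete, here is a toy sketch (my own simplification, not Google’s actual implementation): a root node splits a query across leaf shards, each leaf scans its local slice of the data, and the root aggregates the partial results.

```python
# Toy sketch of a Dremel-style serving tree: the root dispatches a
# query to leaf nodes in parallel, each leaf scans its own shard,
# and the root aggregates the partial results on the way up.
from concurrent.futures import ThreadPoolExecutor

# Four leaf shards that together hold the values 0..999.
SHARDS = [list(range(i, 1000, 4)) for i in range(4)]

def leaf_scan(shard):
    """Leaf node: scan local data and return a partial aggregate."""
    return sum(shard)

def root_query():
    """Root node: dispatch to all leaves, then combine partials."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(leaf_scan, SHARDS))
    return sum(partials)

print(root_query())  # 499500, the same as sum(range(1000))
```

In the real system the tree has intermediate levels (mixers) between root and leaves, but the dispatch-then-aggregate shape is the same.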
For the past 50 years, all of us SQL engineers have had the mindset that we should think about both what data a query should return and how that data is obtained. This goes with the rule of thumb that even if a query runs quickly, it might still be inefficient — but Google BigQuery changes this totally. Because of the parallel architecture in BigQuery’s design, we can do complex manipulation in line with the query without a significant change in execution time.
Big thanks to all the researchers in database engineering; because of them we have a great product to use. Did you know that the Dremel project’s goal and objective was performing a table scan over 1 TB in less than 1 second — which is now a super success? As an analogy, if we were reading the same data from a single hard disk, it would take us approximately 3 hours.
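The “approximately 3 hours” figure follows from simple arithmetic, assuming a typical single-disk sequential throughput of around 100 MB/s (the throughput number is my assumption for illustration):

```python
# Back-of-the-envelope arithmetic behind the "about 3 hours" claim:
# scanning 1 TB sequentially from one disk at an assumed ~100 MB/s.
TERABYTE = 10**12           # bytes
DISK_THROUGHPUT = 100e6     # bytes/second (typical HDD, an assumption)

seconds = TERABYTE / DISK_THROUGHPUT   # 10,000 seconds
hours = seconds / 3600
print(round(hours, 1))  # 2.8
```

Dremel hits the sub-second target by spreading that same scan across thousands of disks and machines in parallel.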
The above picture (picture credits: Tino) gives us a single-pane-of-glass view of how BigQuery execution happens so performantly. And the most costly and expensive part of any query execution over big data is always the I/O.
Now let’s discuss a few pointers on BigQuery’s storage architecture, which again is the most expensive part of any big data operation. BigQuery uses two of Dremel’s technologies: the first is Colossus, a large, parallel, distributed file system developed at Google as the successor to the Google File System (GFS), and the second is the storage format called ColumnIO, which arranges the data in a manner that makes it easier to query.
Colossus is a distributed file system, which means the storage is not physically attached to the machines requesting the data; instead, the data is distributed across the network. Moreover, all the data in Colossus is stored on commodity disks, so the system must be well prepared to handle inevitable failures. Chunk servers and tail latency are two vital concepts here: the machines that contain the data disks and serve up the data are called chunk servers, and the term for having a laggard among a lot of samples is called tail latency. Colossus handles the tail latency problem via replication. Google has kept the details of Colossus confidential, but insiders and blogs say it is an enhancement of GFS that fixes a number of scalability problems.
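One common way replication tames tail latency is a hedged read: issue the same read to several replicas and take whichever answers first, so a single straggler cannot delay the caller. The sketch below is my own generic illustration of that idea (with made-up latencies), not Colossus’s actual protocol:

```python
# Hedged-read sketch: send the same chunk read to every replica and
# return the first response, hiding one slow (tail-latency) replica.
import concurrent.futures
import time

# Simulated per-replica read latencies in seconds; "replica-b" is
# the straggler. These numbers are invented for illustration.
REPLICA_LATENCY = {"replica-a": 0.05, "replica-b": 0.30, "replica-c": 0.02}

def read_chunk(replica):
    """Simulate reading the same chunk from one replica."""
    time.sleep(REPLICA_LATENCY[replica])
    return replica, b"chunk-data"

def hedged_read():
    """Race all replicas; the first completed read wins."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=3) as pool:
        futures = [pool.submit(read_chunk, r) for r in REPLICA_LATENCY]
        first_done = next(concurrent.futures.as_completed(futures))
        return first_done.result()

replica, data = hedged_read()
print(f"fastest replica: {replica}")  # the straggler never wins
```

Real systems usually hedge more conservatively (e.g. sending the backup request only after a delay) to avoid tripling the read load, but the principle is the same.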
Now let’s dive into the storage format. ColumnIO is the key file format used to store data in BigQuery, and its layout ensures very fast access for BigQuery workloads. BigQuery takes a basic approach to scanning: it reads every single row touched by a query. While a database such as MySQL can skip rows it doesn’t need, BigQuery takes the alternative approach: it can avoid reading columns it doesn’t need.
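A tiny toy table makes the row-vs-column trade-off visible. In a row layout, answering `SELECT age` forces us past every field of every row; in a column layout we touch only the `age` column. The table and field counts below are invented for illustration:

```python
# Toy comparison of row-oriented vs column-oriented reads.
rows = [
    {"name": "alice", "city": "nyc", "age": 34},
    {"name": "bob",   "city": "sfo", "age": 29},
    {"name": "carol", "city": "nyc", "age": 41},
]

# Column-oriented layout: one contiguous list per column.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# Query: SELECT age FROM table
# A row store must touch every field of every row it scans...
row_fields_read = sum(len(row) for row in rows)   # 9 fields
# ...while a column store reads only the 'age' column.
col_fields_read = len(columns["age"])             # 3 fields

print(row_fields_read, col_fields_read)  # 9 3
```

At BigQuery scale, with tables holding hundreds of columns, reading one column instead of all of them is the difference between scanning gigabytes and scanning terabytes.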
Selectivity and compression are the two prime factors that make reading from BigQuery’s ColumnIO storage format faster than record-oriented storage. Durability and availability are the two major factors behind the implementation of BigQuery, and they are top priorities for Google.
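On the compression side, the intuition is that a single column tends to hold similar, repetitive values, so even a trivial scheme like run-length encoding collapses it dramatically. This is a generic RLE sketch, not BigQuery’s actual ColumnIO encoding:

```python
# Minimal run-length encoder: a low-cardinality column collapses
# into a handful of (value, count) pairs.
from itertools import groupby

def rle_encode(column):
    """Encode consecutive runs of equal values as (value, count)."""
    return [(value, len(list(group))) for value, group in groupby(column)]

country = ["US"] * 5 + ["IN"] * 3 + ["DE"] * 2   # one column, 10 values
encoded = rle_encode(country)
print(encoded)  # [('US', 5), ('IN', 3), ('DE', 2)]
```

Ten stored values become three pairs; in a row-oriented layout the same values would be interleaved with unrelated fields and never form runs like this.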
In another blog, let’s deep dive into the BigQuery serving tree, broadcast joins, shuffled queries, materialized queries, the root server, mixers, leaf nodes, and Jupiter.
Hence, to conclude, the two reasons for BigQuery’s best-in-class performance are the ColumnIO storage format and the Colossus file system. The BigQuery service from Google Cloud Platform is a query service that allows us to run SQL-like queries against multiple terabytes of data in a matter of seconds, not hours. Coming to the cost factor, we pay just for the GBs actually scanned during query execution, and if we reuse cached results, that is free of cost. For storage, we pay per GB per month, which becomes even less expensive if our table is not modified for 90 days.
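The pay-per-scan model is easy to estimate up front. Here is a back-of-the-envelope calculator; the $5-per-TB on-demand rate is an assumption for illustration only, so please check the current GCP pricing page for real numbers:

```python
# Rough query-cost calculator for BigQuery's on-demand model.
# PRICE_PER_TB_USD is an assumed illustrative rate, not official pricing.
PRICE_PER_TB_USD = 5.00
FREE_TB_PER_MONTH = 1.0   # the free tier mentioned earlier in this post

def monthly_query_cost(tb_scanned):
    """Cost in USD for a month's worth of scanned bytes."""
    billable = max(tb_scanned - FREE_TB_PER_MONTH, 0.0)
    return billable * PRICE_PER_TB_USD

print(monthly_query_cost(0.5))  # 0.0  -> entirely inside the free tier
print(monthly_query_cost(3.0))  # 10.0 -> 2 TB billable
```

Note that cached-result reuse scans zero bytes, which is why it costs nothing, and that selecting fewer columns (as discussed above) directly shrinks the bytes scanned and hence the bill.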
Keep yourself up-to-the-minute on ABCDE of Data (Artificial Intelligence, Automation, Big Data, Blockchain, Cloud Computing, Collaborative Tech, Digital Transformation, Edge Computing), by subscribing to dataottam blog.
Reach us via firstname.lastname@example.org. Happy reading!