5 Rules & 1 Checklist for Big Data Program Success!
I picked these up from an AtScale webinar; they felt useful, so I'm happy to share them with our big data & analytics community.
It’s very clear Hadoop is budding from its batch-processing origins into a flexible, economical hub where enterprises store raw data, keep archival data active, and grow their options for data investigation, modeling, analysis, and reporting. – Ovum Research
Business intelligence has served business and IT for decades. Just as business users were starting to get real insights out of their BI tool stacks, Hadoop emerged and changed the rules of data warehousing. With Hadoop adoption on the rise, organizations need to work out how to bring their business intelligence capabilities to bear on the volume, variety, and velocity of big data, and to deliver value and insights on demand or as a service. Existing BI tools struggle to meet these demands: they suffer from very slow response times and an inability to support contemporary data types.
Legacy BI Tools for Hadoop:
In the early days of Hadoop, new visualization platforms attempted to solve these problems and provide basic visibility to business users. Yet many of these tools failed to make Hadoop work with legacy BI. They were slow, inflexible, and required IT to move data into dedicated business-view or server layers. Worse, they required business users to adopt new visualization tools and SQL data-access languages to analyze data in Hadoop. Meanwhile, Tier-1, Tier-2, and blue-chip companies began moving their data into Hadoop clusters, so-called Data Lakes or Business Data Lakes, and many SQL-on-Hadoop technologies emerged to query this data. As an outcome, the “BI on Hadoop” movement has matured rapidly. New approaches and technologies appeared, giving way to a new generation of tools and applications that are fast and pluggable.
RULE 1: Ingress or ingestion of data is not the hard part
Inherent in the value proposition of Hadoop is a multi-structure, multi-workload environment for distributed parallel processing of huge data sets; Hadoop scales horizontally as workloads and data volumes grow. However, traditional approaches to BI required data to be pre-aggregated and pre-formatted before consumption by visualization and analysis tools like Tableau, QlikView, and Microsoft Excel. IT organizations, while using Hadoop as a batch-processing environment for ETL, often continue to egress data off of the Hadoop cluster and into data structures pre-built for consumption by a specific BI tool. The result is a pile of accumulated code, a multiplication of data extracts and data marts, and a loss of BI agility. Modern approaches to delivering BI on Hadoop instead have a design goal of accessing data as it is written, directly on the Hadoop cluster, rather than egressing data out of Hadoop into a separate system just for access and consumption. The benefits of this “query-in-place” approach are vital: BI flexibility is significantly enhanced, operational cost and complexity are reduced, and, most importantly, data freshness and time-to-insight are dramatically improved.
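As a minimal sketch of what query-in-place can look like (the HDFS path, table, and column names here are hypothetical, and Spark is just one of several engines that support this pattern), the following runs a BI-style aggregation directly against the data where it already sits on the cluster:

```python
# Query-in-place sketch with PySpark (hypothetical path and columns).
# The data is queried where it lives; nothing is egressed to a data mart.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("query-in-place").getOrCreate()

# Read the raw data directly from HDFS (path is an assumption).
orders = spark.read.parquet("hdfs:///datalake/sales/orders")
orders.createOrReplaceTempView("orders")

# A BI-style aggregation runs in place on the Hadoop cluster.
spark.sql("""
    SELECT region, SUM(amount) AS total_sales
    FROM orders
    GROUP BY region
""").show()
```

The point of the sketch is what is absent: no extract job, no data mart, and no second copy of the data outside Hadoop.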
RULE 2: Availability without data governance destroys value
The practice of data visualization and analysis for business intelligence has been around for decades, and some clear winners have come to dominate the desktops of today’s enterprise business users. Tools like Tableau, QlikView, Spotfire, and even Microsoft Excel are the de facto tools business analysts use to access, manipulate, and analyze business data. It’s critical for any BI-on-Hadoop initiative to recognize the utility and prevalence of these tools and ensure users are productive in their analysis environments, even when accessing new types and sources of data. As enterprises look to expose their Hadoop data to existing BI users, they face a challenge: giving end users direct access to the processing power and diversity of Hadoop data while at the same time preventing a data “free-for-all” as individual users create multiple versions of what should be standard measures and dimensions. Historically, this type of governance problem was solved through the creation of business-friendly “cubes”, where a cube represents a set of standard measures and dimensions. The cube interface supports easy consumption by business users and integrates nicely with their existing BI tools while also providing a layer of governance to ensure standardization of business logic across data consumers.
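One lightweight way to picture this governance layer (a sketch only; the orders table, its columns, and the measure definitions are assumptions) is a shared view that defines the standard measures and dimensions exactly once, so every analyst and BI tool consumes the same business logic:

```python
# Sketch of a governed, cube-like semantic layer as a shared Spark SQL view.
# Table and column names are hypothetical; the point is that a measure such
# as net_revenue is defined once, not re-invented by each analyst.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("governed-cube").getOrCreate()

# Assume the raw orders data already lives on the cluster.
spark.read.parquet("hdfs:///datalake/sales/orders") \
     .createOrReplaceTempView("orders")

spark.sql("""
    CREATE OR REPLACE TEMPORARY VIEW sales_cube AS
    SELECT order_date,                             -- standard dimension
           region,                                 -- standard dimension
           SUM(amount - discount) AS net_revenue,  -- standard measure
           COUNT(*)               AS order_count   -- standard measure
    FROM orders
    GROUP BY order_date, region
""")
```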
RULE 3: Demand Speed from Hadoop
Hadoop is an extremely scalable data platform. The Hadoop Distributed File System, or HDFS, provides a low-cost, scalable, and redundant storage substrate that is used today in the most demanding big data environments. Additionally, parallel computing frameworks like MapReduce support horizontally scalable batch-processing jobs that outperform traditional relational systems at a much lower cost. And now, with the emergence of in-memory systems like Spark and the SQL-on-Hadoop engines, interactive data processing on this same scale-out architecture is a reality.
With Hadoop’s support for both big and fast data processing workloads, it’s now feasible to bring BI workloads directly onto the Hadoop cluster, instead of bringing Hadoop data into traditional multi-component stacks (ETL engines, data marts, and cubes) or into separate purpose-built analysis clusters.
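As a small illustration of this interactive, in-memory style (again with hypothetical paths and columns), Spark can pin a data set in cluster memory once and then serve repeated ad-hoc BI queries from it:

```python
# Sketch: repeated BI-style queries served from cluster memory with Spark.
# The dataset path and columns are assumptions; cache() keeps the data in
# memory so follow-up queries avoid re-reading HDFS.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("interactive-bi").getOrCreate()

events = spark.read.parquet("hdfs:///datalake/web/events").cache()

# The first action materializes the cache; later queries run from memory.
events.groupBy("page").count().show()
events.groupBy("country").agg(F.countDistinct("user_id")).show()
```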
RULE 4: Schema on Demand = Agility
In addition to supporting the volume and velocity of big data, Hadoop provides rich support for the variety of data formats and types that are found in modern data sets – data from sensors, security log files, and the “Internet of Things”. Traditional approaches to business intelligence require that these data sets first be transformed into rows and columns before they can be accessed and analyzed by BI tools. This schema-on-load approach greatly reduces the ability of BI teams to respond to end-user requests for new data types: first, ETL jobs must be modified to extract new data elements from unstructured data fields; relational data structures (DDLs) need to be updated and loaded with columns representing the new elements; aggregates need to be re-defined and created; and BI tool metadata must be modified. For a well-functioning enterprise BI organization, this process of exposing a new data element can take weeks, or even months.
Contrast this approach with the schema-on-demand approach that Hadoop supports: data can be written to Hadoop in its “native” format: maps, arrays, and JSON for example. The Hadoop query engines, through native language support or the use of SerDes (Serializer/Deserializers), can extract specific elements from these unstructured fields at query time. This capability eliminates the need to create pre-defined relational structures for each new data element requested. A modern approach to BI on Hadoop takes advantage of this great capability and can directly expose measures and dimensions to BI users based on this innovative schema-on-demand technique. The result is that new data elements can be delivered to business analysts in minutes instead of weeks or months.
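As a rough sketch of the schema-on-demand pattern (the file path and field names are hypothetical), an engine like Spark can read JSON exactly as it was written and pull a newly requested nested element out at query time, with no ETL change, no DDL update, and no reload:

```python
# Schema-on-demand sketch: query nested JSON fields directly, with no
# pre-built relational structure. Path and field names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-on-demand").getOrCreate()

# The schema is inferred from the raw data exactly as it was written.
sensors = spark.read.json("hdfs:///datalake/iot/sensors")

# A newly requested element is just a nested field reference at query time.
sensors.select("device.id", "reading.temp", "ts") \
       .where("reading.temp > 30") \
       .show()
```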
RULE 5: Must Count Uniques. Fast!
In the world of business intelligence, calculating distinct counts is a critical function – whether it is a count of customers, website visitors, RFID sensors, or financial transactions – enterprises need to understand how these counts grow, change, and behave over time. Historically, distinct counts have been the bane of any OLAP system that depends on aggregations for performance, because distinct counts are not additive.
Because distinct counts are not additive measures, traditional OLAP systems need to pre-aggregate and store every single combination of dimensions to deliver high-performance, interactive distinct-count metrics. Clearly, this approach does not scale.
Modern, Hadoop-native business intelligence stacks can use statistical sketching approaches to estimate distinct counts with algorithms like HyperLogLog. This approach not only speeds up the original computation of distinct counts but also (and more importantly) produces additive aggregate results. This means that an OLAP approach to data analysis is achieved without sacrificing distinct-count metrics.
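To make this concrete (the dataset path and column names are hypothetical), Spark’s approx_count_distinct function is a HyperLogLog-style estimator: each partition builds a small sketch, and the sketches merge additively – exactly the property that classic pre-aggregated distinct counts lack:

```python
# Sketch: fast approximate distinct counts via a HyperLogLog-style estimator.
# Per-partition sketches merge additively, so the count scales out cleanly.
# Dataset path and column names are assumptions; rsd is the target relative
# standard deviation of the estimate (here ~2%).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("fast-uniques").getOrCreate()

events = spark.read.parquet("hdfs:///datalake/web/events")

events.groupBy("country").agg(
    F.approx_count_distinct("user_id", rsd=0.02).alias("approx_unique_users")
).show()
```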
The Secret is Out!
We hope this document provided you with some useful tips and insights to consider as you begin your own BI on Hadoop initiative. The checklist below contains a quick synthesis of some important criteria to keep in mind as you begin your journey.
- Don’t move data out of your Hadoop cluster. It will cost you time, money, and agility, and it will limit your options.
- Focus on interactivity for end users. Don’t take them out of the tools they already know and love. Remember that power users also need a business-friendly interface into Hadoop; banish command-line tools for them too. You have options.
- Your users will demand fresh and complete data. Avoid cube or additional processing steps just to query data, and avoid in-memory-only systems that don’t scale horizontally.
- The only constant is change. Use a schema-on-demand model to increase agility.
- Don’t discount unique counts. With the right approach, they can be as fast as your regular counts.
– Analytics & Big Data Open Source Community