Apache Spark is a fast, general engine for big data processing, with libraries for SQL, streaming, and advanced analytics. The RDD, which stands for Resilient Distributed Dataset, is a great abstraction for data sets: an immutable, distributed collection of data. In Spark, all work is expressed as one of the following: creating new RDDs, transforming existing RDDs, or calling operations on RDDs (e.g. val […]
Spark SQL is Spark's interface for both structured and semi-structured data. It loads data from a variety of structured sources such as Hive tables, JSON, and Parquet columnar storage. Spark SQL lets you query data using SQL, both internal and external to the Spark core engine. It provides robust integration between SQL and Python/Java/Scala code. Spark SQL […]
Spark Streaming is Spark's module for applications that benefit from acting on data as soon as it arrives from various sources, e.g. tracking page views in real time, training a machine learning model, or automatically detecting anomalies. Developers can use an API that is very similar to the one for batch jobs, so we can reuse the same API skills and […]
Team, this time I go with the title “Top 3 methods of skipping big data’s bad data using Hadoop !”, which describes how to get corrupt records out of large data sets that contain data in different formats. While doing our analysis, if the corrupt records are a small percentage we can ignore or […]
Team, thanks for reading & engaging! This time I plan to share with you my learning on Hadoop schedulers, titled “Simplified Hadoop Schedulers Overview !” By choosing a suitable scheduler, we can make response times faster for all smaller jobs, and all production jobs are guaranteed their SLAs (Service […]
Thank you for your valuable time; it’s much appreciated. This time I’d like to share the blog called “Quick Card On – Apache Hive Joins !”, a handy Apache Hive joins reference card or cheat sheet. An SQL JOIN clause is used to combine rows from two or more tables, based on a common […]
Hadoop compression techniques bring benefits to Hadoop I/O operations, such as space savings and faster processing. We have a lot of compression formats and algorithms, each with pros and cons. Nothing new is added here; the material is just consolidated so it is handy for production implementations. All techniques exhibit a space/time trade-off. We have options from […]
This time I go with a blog called “The 10 Distributed SQL Query Engine for Big Data!” Many thanks for your time; it’s truly appreciated! Data…Data…Data… Yep, it’s everywhere, from software to salt stores, and all of it is tagged as Big Data. But who is the friend who can help us get the insights/value from the data […]
Tons of thanks for your valuable time. This time we’d like to share the details of how data movement happens in the big data ecosystem. It’s named “The Data Movement in Big Data Ecosystem”. Ingesting data into Hadoop is vital, from systems like RDBMS, mainframes, logs, machine-generated data, event data […]
Many thanks for your cherished time. This time we’d like to share the details of the 3 S’s of Spark. As we all know, the 3 V’s of Big Data are Volume, Variety & Velocity, sometimes extended with further V’s like Veracity & Value. Big Data is defined as a collection of […]
Spark began life in 2009 as a project within the AMPLab at the University of California, Berkeley. More specifically, it was born out of the necessity to prove out the concept of Mesos, which was also created in the AMPLab. Spark was first discussed in the Mesos white paper titled Mesos: A Platform for Fine-Grained Resource […]
The tips below were not written by me (Kumar Chinnakali). They were learnt from mammothdata.com, and I felt they could help our big data community, where Apache Spark is currently changing the world of Analytics & Big Data. Mammothdata team, tons of thanks for sharing with us. Spark is written in Scala, so new features […]