Self-Learn Yourself Apache Spark in 21 Blogs – #7 Key Concepts of Resilient Distributed Datasets (RDDs) and more… In this blog, we look at how to create RDDs and what operations we can perform on them. Have a quick read of the other blogs in this learning series. In simple terms, RDD (Resilient Distributed Dataset); if data in […]
What are RDDs, Actions, and Transformations? In Blog 6, we will look at the RDD and RDD input, with hands-on examples. Click to have a quick read of the other blogs in this learning series. Hey, my dear friends. Before we dive deeper, let’s have a look at who are the Spark Core Maintainers […]
In Blog 5, we will look at the Apache Spark languages with basic hands-on examples. Click to have a quick read of the other Apache Spark blogs in this learning series. With our cloud setup of Apache Spark, we are now ready to develop big data Spark applications. And before getting started with building Spark applications, let’s […]
In Blog 4, we will see what Apache Spark Core and its ecosystem are, and look at Apache Spark on AWS Cloud. Click to have a quick read of blog 1, blog 2, and blog 3 in this learning series. Apache Spark has many components, including Spark Core, which is responsible for Task Scheduling, Memory Management, Fault Recovery, […]
In this Blog 3, we will look at Apache Spark’s history and its unified platform for big data; you may also like to have a quick read of blog 1 and blog 2. Spark was initially started by Matei Zaharia at UC Berkeley AMPLab in 2009, and open sourced in 2010 under a BSD license. In 2013, the […]
In this blog we share the titles for learning Apache Spark, the basics of Hadoop, which is one of the big data tools, and the motivations for Apache Spark, which is not a replacement for Apache Hadoop but its friend in big data. Blog 1 – Introduction to Big Data Blog 2 – Hadoop, Spark’s Motivations Blog […]
In this new year 2016, we are excited that the Apache Spark community has released and announced the availability of Apache Spark 1.6, which is the 7th release on the 1.x line. Committers – The number of contributors to Spark has crossed 1000, having doubled. Patches – Apache Spark 1.6 includes 1000 patches. Run […]
We have received many requests from friends who regularly read our blogs to provide a complete guide to sparkle in Apache Spark. So here we have come up with a learning initiative called “Self-Learn Yourself Apache Spark in 21 Blogs”. We have drilled down into various sources and archives to provide a perfect learning path […]
Best wishes to you this holiday, and Happy New Year, from all of us at dataottam. This blog introduces Spark’s core abstraction for working with data, the RDD (Resilient Distributed Dataset). An RDD is simply a distributed collection of elements or objects (Java, Scala, or Python objects, including user-defined classes) spread across the Spark cluster. In Spark […]
SPORK Below is a snapshot from the actual post written by Mayur Rustagi of Sigmoid. As a data mashing platform, the first key initiative is to bring the power & simplicity of Apache Pig on Apache Spark, making existing ETL pipelines 100x faster than before. SPORK enables you to use Pig on Spark. We choose Apache Spark […]
RDD – Resilient Distributed Dataset The only way to share data between MR jobs is to store the data on disk. When performing more than one transformation on a huge set of data, the output of every transformation has to be stored on disk and replicated for fault tolerance. And if we had to […]
BlinkDB, a project being developed at the University of California, Berkeley, where the evolution of Spark started, is a massively parallel interactive query engine that processes tens of TB of data with a response time of just the blink of an eye. BlinkDB allows users to trade off query accuracy for response time, enabling interactive queries over massive data by […]