Pig is Flying: Apache Pig on Apache Spark (SPORK)
Below is a snapshot from the original post written by Mayur Rustagi of Sigmoid.
As a data mashing platform, our first key initiative is to bring the power & simplicity of Apache Pig to Apache Spark, making existing ETL pipelines 100x faster than before. SPORK enables you to run Pig on Spark.
We chose Apache Spark as the underlying infrastructure. For the uninitiated, Apache Spark is an open-source big data engine that enables distributed, fault-tolerant, in-memory computation. As the kernel for distributed computation, it empowers developers to write testable, readable & powerful big data applications in a number of languages, such as Python, Java & Scala.
Pig operates in a manner similar to most big data frameworks like Hive & Cascading. It has a query language quite akin to SQL, which allows analysts & developers to design & write data flows. A query is translated into a logical plan, which is further translated into a physical plan made up of operators. Those operators are then run on the designated execution engine (MR, Tez & now Spark). There is a whole bunch of detail around tracking progress, handling errors etc. which I will skip here.
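Under the hood, the Spark port hinges on a set of converters, one per physical operator, each turning the operator's input RDDs into an output RDD. The sketch below shows the general shape of that pattern; the trait & type names here are illustrative stand-ins, not Spork's actual interfaces.

```scala
import org.apache.spark.rdd.RDD

// Stand-ins for Pig's physical-plan types (illustrative only).
trait PhysicalOperator
case class POLoad(path: String) extends PhysicalOperator
case class POStore(path: String) extends PhysicalOperator

// One converter per physical operator: it takes the RDDs produced by the
// operator's predecessors & returns the RDD for this step of the plan.
trait POConverter[T <: PhysicalOperator] {
  def convert(predecessors: List[RDD[Seq[Any]]], op: T): RDD[Seq[Any]]
}
```

The engine then walks the physical plan, invoking the matching converter for each operator, so an entire script becomes one chained RDD pipeline. The key converters: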
- Load: LoadConverter creates an RDD for the input data, which subsequent transformations can then use. It loads data from HDFS through the Spark API, with parameters initialized from the POLoad operator (see the load & store sketch after this list).
- Store: The store operation saves the end results, or intermediate data, whenever required. StoreConverter saves data to HDFS with parameters taken from the POStore operator.
- Local Rearrange: LocalRearrangeConverter passes each input directly through the POLocalRearrange operator, which in turn transforms the data into the required format; this happens through Spark's map API. The local rearrange operator is part of the co-group implementation: it has an embedded physical plan that generates tuples of the form (grpKey, (indexed input tuple)). A sketch follows the list.
- Global Rearrange: GlobalRearrangeConverter is used for a groupBy or a join operation; its convert method uses Spark's groupBy & map APIs to achieve these. In case of a groupBy, results are converted into the form (key, Iterator(values)); in case of a cogroup, results take the form (index, key, value). Again, see the sketch below.
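To make the load & store steps concrete, here is a hedged sketch of what the two converters boil down to: reading HDFS data into an RDD & writing results back out. The path & tab-delimited parsing are assumptions for illustration; the real converters drive Pig's LoadFunc/StoreFunc with the parameters carried by POLoad & POStore.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LoadStoreSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("spork-load-store").setMaster("local[*]"))

    // LoadConverter, conceptually: build an RDD of tuples from HDFS.
    val tuples = sc.textFile("hdfs:///data/input.tsv") // illustrative path
      .map(_.split("\t").toSeq)

    // ...RDD transformations produced by the other converters run here...

    // StoreConverter, conceptually: persist the final RDD back to HDFS.
    tuples.map(_.mkString("\t")).saveAsTextFile("hdfs:///data/output")

    sc.stop()
  }
}
```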
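Local rearrange, in the same hedged spirit: each tuple is tagged with its group key & an input index via Spark's map API. Extracting the key by field position is an assumption of this sketch; the real operator evaluates its embedded physical plan to compute the key.

```scala
import org.apache.spark.rdd.RDD

object LocalRearrangeSketch {
  // Produce (grpKey, (index, tuple)) pairs, mirroring the local
  // rearrange output described above. Key extraction by field
  // position is an assumption of this sketch.
  def localRearrange(input: RDD[Seq[Any]], keyField: Int, inputIndex: Int)
      : RDD[(Any, (Int, Seq[Any]))] =
    input.map(t => (t(keyField), (inputIndex, t)))
}
```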
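And global rearrange: for a groupBy, the keyed tuples are grouped so each key maps to its collection of values; for a cogroup, Spark's cogroup API aligns the inputs by key, & the index recorded during local rearrange tells Pig which input each value came from. Again a sketch under the same assumed tuple types, not Spork's actual code.

```scala
import org.apache.spark.rdd.RDD

object GlobalRearrangeSketch {
  // groupBy case: collapse (key, (index, tuple)) pairs into (key, values).
  def groupBySketch(rearranged: RDD[(Any, (Int, Seq[Any]))])
      : RDD[(Any, Iterable[Seq[Any]])] =
    rearranged.groupByKey().map { case (k, vs) => (k, vs.map(_._2)) }

  // cogroup case: align two keyed inputs on their key via Spark's cogroup
  // API; the index attached during local rearrange identifies the source.
  def coGroupSketch(left: RDD[(Any, Seq[Any])], right: RDD[(Any, Seq[Any])])
      : RDD[(Any, (Iterable[Seq[Any]], Iterable[Seq[Any]]))] =
    left.cogroup(right)
}
```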
For more information, please see Pig is Flying: Apache Pig on Apache Spark by Mayur Rustagi.