9 Tips for best practices with Apache Spark
The tips below were not written by me (Kumar Chinnakali).
They were learned from mammothdata.com, and I felt they could help our big data community, where Apache Spark is currently changing the world of Analytics & Big Data.
Mammoth Data team, tons of thanks for sharing these with us.
- Spark is written in Scala, so new features are available first in Scala (and Java). The Python and R bindings can lag behind on new features from version to version.
- If you are developing Spark applications in Java, use Java 8 or above if possible. Java 8's lambda functionality removes a lot of the verbosity otherwise required in your code.
- The fundamentals of software development still apply in the Big Data world – Spark code can be unit-tested and integration-tested, and code should be reused between streaming and batch jobs wherever possible (one way to share transformation logic is sketched after this list).
- When Spark transfers data over the network, it needs to serialize objects into a binary form. This can affect performance during shuffles and other operations that move large amounts of data. To ameliorate this, first make sure your code is written in a way that minimizes shuffling (e.g. only use groupByKey as a last resort, preferring operations like reduceByKey, which perform as much of the aggregation as possible within each partition before the shuffle). Second, consider using Kryo instead of java.io.Serializable for your objects: it has a more compact binary representation than the standard Java serializer and is also faster to serialize and deserialize. For further gains, especially when dealing with billions of objects, you can register classes with the Kryo serializer at start-up, saving more precious bytes (see the serialization sketch after this list).
- Use connection pools instead of creating dedicated connections when connecting to external data sources – e.g. if you are writing elements from an RDD into a Redis cluster, you might be unpleasantly surprised when it attempts to open 10 million connections to Redis under production traffic. The foreachPartition sketch after this list shows the usual pattern.
- Use Spark’s checkpointing features when running streaming applications to ensure recovery from failures. Spark can save checkpoints to local files, HDFS, or S3 (a getOrCreate sketch follows the list).
- With larger datasets (>200 GB), garbage collection on the JVMs Spark runs on may become a performance issue. In general, switching from the default parallel collector to the G1 GC will ultimately be more performant, although some tuning will be required according to the details of your dataset and application (a process that Mammoth Data can easily assist with). A minimal configuration sketch follows this list.
- If possible, use DataFrames rather than RDDs when developing new applications. While DataFrames live in the Spark SQL package rather than Spark Core, Databricks has indicated that it will be dedicating significant resources to improving the Catalyst optimizer that turns DataFrame queries into RDD code. By adopting the DataFrame approach, your application is likely to benefit from any optimizer improvements in upcoming development cycles (see the DataFrame sketch after this list).
- Remember, Spark Streaming is not a pure streaming architecture; it processes data in micro-batches. If those micro-batches do not provide low enough latency for your use case, you may need to consider a different framework, e.g. Storm, Samza, or Flink.
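
On the testing and code-reuse tip, one way to keep transformation logic unit-testable and shared between batch and streaming jobs is to factor it into a plain function over RDDs. This is only a minimal sketch; the object and function names, and the word-count logic itself, are illustrative rather than anything prescribed above.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

// The core transformation lives in one place, takes and returns RDDs, and
// can be unit-tested against a local SparkContext with a small fixture RDD.
object WordCountLogic {
  def countWords(lines: RDD[String]): RDD[(String, Int)] =
    lines.flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
}

// The batch job calls the shared function directly...
def runBatch(lines: RDD[String]): RDD[(String, Int)] =
  WordCountLogic.countWords(lines)

// ...and the streaming job reuses it on every micro-batch via transform.
def runStreaming(lines: DStream[String]): DStream[(String, Int)] =
  lines.transform(rdd => WordCountLogic.countWords(rdd))
```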
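
For the serialization tip, here is a rough sketch of both points: preferring reduceByKey over groupByKey so aggregation happens before the shuffle, and switching to Kryo with class registration. The Event case class and the application name are made-up examples.

```scala
import org.apache.spark.{SparkConf, SparkContext}

case class Event(userId: String, bytes: Long)  // illustrative shuffled type

val conf = new SparkConf()
  .setAppName("kryo-example")
  // Use Kryo instead of the default Java serializer...
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // ...and register the classes you shuffle most, so Kryo need not write
  // the full class name alongside every object.
  .registerKryoClasses(Array(classOf[Event]))

val sc = new SparkContext(conf)
val events = sc.parallelize(Seq(Event("a", 10L), Event("b", 5L), Event("a", 7L)))
val pairs  = events.map(e => (e.userId, e.bytes))

// Preferred: reduceByKey combines values within each partition before the
// shuffle, so far less data crosses the network.
val totals = pairs.reduceByKey(_ + _)

// Last resort: groupByKey ships every value across the network before any
// aggregation happens.
val grouped = pairs.groupByKey().mapValues(_.sum)
```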
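
The connection-pool tip usually comes down to foreachPartition: borrow one connection per partition instead of opening one per element. The KeyValueClient and ConnectionPool below are hypothetical stand-ins for your real client library (e.g. a Jedis pool for Redis), not an API from Spark or Redis.

```scala
import org.apache.spark.rdd.RDD

// Hypothetical pooled client; replace with your actual Redis/JDBC/etc. pool.
class KeyValueClient { def set(key: String, value: String): Unit = () } // no-op stub
object ConnectionPool {
  def borrow(): KeyValueClient = new KeyValueClient   // a real pool would reuse connections
  def release(client: KeyValueClient): Unit = ()
}

def writeToStore(records: RDD[(String, String)]): Unit =
  // One borrowed connection per partition rather than one per record:
  // 10 million records over, say, 200 partitions means 200 connections
  // instead of 10 million.
  records.foreachPartition { partition =>
    val client = ConnectionPool.borrow()
    try partition.foreach { case (k, v) => client.set(k, v) }
    finally ConnectionPool.release(client)
  }
```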
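
For the checkpointing tip, a common pattern is StreamingContext.getOrCreate, which rebuilds the context from the checkpoint directory after a driver failure instead of starting from scratch. The checkpoint path, batch interval, and socket source below are placeholders.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Placeholder path; in production this would typically live on HDFS or S3,
// e.g. "hdfs:///checkpoints/my-app" or "s3a://my-bucket/checkpoints".
val checkpointDir = "/tmp/spark-checkpoints"

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("checkpointed-stream")
  val ssc  = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir)                         // where metadata and state are saved
  val lines = ssc.socketTextStream("localhost", 9999)   // placeholder source
  lines.count().print()
  ssc
}

// On a clean start this calls createContext(); after a crash it restores the
// DStream graph and pending state from the checkpoint directory instead.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
```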
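
On the GC tip, switching the executors to G1 is a small configuration change, shown below as a SparkConf setting (it can equally go in spark-defaults.conf or on the spark-submit command line). The pause-time target is an assumed starting value to tune, not a recommendation for your workload.

```scala
import org.apache.spark.SparkConf

// Ask the executor JVMs to use the G1 collector instead of the default
// parallel collector; further flags depend on your dataset and application.
val conf = new SparkConf()
  .setAppName("g1gc-example")
  .set("spark.executor.extraJavaOptions",
       "-XX:+UseG1GC -XX:MaxGCPauseMillis=200")  // pause target is an assumption
```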
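
Finally, for the DataFrame tip, the same aggregation expressed through the DataFrame API is planned by Catalyst, so later optimizer improvements apply without code changes. The file path and column names are illustrative, and the sketch uses the Spark 1.x SQLContext entry point from the Spark SQL package mentioned above.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("dataframe-example"))
val sqlContext = new SQLContext(sc)

// Illustrative dataset with "country" and "revenue" columns.
val events = sqlContext.read.parquet("hdfs:///data/events.parquet")

// Declarative version: Catalyst plans the aggregation, so improvements such
// as better column pruning or code generation arrive with Spark upgrades.
val revenueByCountry = events.groupBy("country").sum("revenue")

revenueByCountry.show()
```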