Apache Spark: RDD, DataFrame and Dataset – API Comparison and Performance Benchmark

Apache Spark is one of the most popular in-memory computation engines in the big data space. It includes SQL support and a rich machine learning library, which makes it a favourite choice for analytics processing.

Apache Spark computations are performed on distributed object collections. The Resilient Distributed Dataset (RDD) was the first type of distributed object collection, available in Apache Spark since its initial version. Since then, Apache Spark has been extended to include DataFrames (since version 1.3) and Datasets (since version 1.6).

Our recent white paper compares the API and performance of these three distributed collections (RDD, DataFrame and Dataset) when performing common non-trivial data manipulation tasks such as filter, sort and join.

Data scientists and Spark developers, who can sometimes spend as much as 80% of their time on data preparation, will find the insights in this white paper valuable for optimizing Spark operations in both Spark 2.x and Spark 1.6.
