The Apache Spark RDD Data Structure
Overview
Spark applications run as independent sets of processes on a cluster, coordinated by a driver program through a SparkContext object. Below is a diagram depicting the Spark architecture.
At the core of Apache Spark is the Resilient Distributed Dataset (RDD): an immutable, fault-tolerant, distributed collection of elements that can be operated on in parallel. RDDs are statically typed (e.g. RDD[String]); the element type is declared explicitly and checked at compile time. Rather than replicating data across executor nodes by default, Spark achieves fault tolerance through lineage: each RDD records the sequence of transformations that produced it, so a lost partition can be recomputed on demand (replicated storage levels are available as an opt-in through persist). RDDs are partitioned across the cluster, and the number of partitions can be specified when an RDD is created or repartitioned.
Using the SparkContext object, an RDD can be created in several ways: from a text file, by parallelizing an existing collection in the driver program, or by transforming another RDD.
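As a minimal sketch, the creation methods above look like the following in Scala (the application name, master URL, and file path data/input.txt are illustrative assumptions, not values from this article):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Build a local SparkContext for illustration.
val conf = new SparkConf().setAppName("rdd-demo").setMaster("local[*]")
val sc = new SparkContext(conf)

// From a text file: one String element per line, giving an RDD[String].
val lines = sc.textFile("data/input.txt")

// From an in-memory collection, with an explicit partition count.
val nums = sc.parallelize(1 to 100, numSlices = 4)

// From another RDD, via a transformation.
val squares = nums.map(n => n * n)
```

Note that the partition count passed to parallelize is the same knob mentioned above: it controls how the data is split across the cluster for parallel processing.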
Spark RDDs support two kinds of operations: transformations and actions. Transformations are lazy; they describe a new RDD without computing it, while actions trigger execution and return a result to the driver. An example of a transformation is reduceByKey, which shuffles the RDD by key and merges the values of records that share the same key using an associative function, producing an RDD of (key, reduced value) pairs. An example of an action is count, which returns the number of elements in the dataset.
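A short sketch of the reduceByKey transformation and the count action, assuming a SparkContext named sc is already available (as in the creation example):

```scala
// Pair RDD of (word, 1) tuples.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))

// Transformation: lazily declares the merge of values that
// share a key, using the given associative function.
val counts = pairs.reduceByKey(_ + _) // ("a", 2), ("b", 1)

// Action: triggers execution and returns a result to the driver.
val n = counts.count() // 2
```

Nothing is computed until count is called; reduceByKey alone only extends the lineage graph.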
RDDs are one of three core data abstractions Apache Spark currently supports; the other two are DataFrames and Datasets.
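To show how the three abstractions relate, the sketch below lifts an RDD into the DataFrame and Dataset APIs (the application name, master URL, and column names are illustrative assumptions):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("df-demo")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Start from a pair RDD.
val rdd = spark.sparkContext.parallelize(Seq(("a", 1), ("b", 2)))

// DataFrame: untyped rows with named columns.
val df = rdd.toDF("key", "value")

// Dataset: the same data with a compile-time element type.
val ds = df.as[(String, Int)]
```

The Dataset API recovers the compile-time typing that RDDs offer, while adding the query optimization of DataFrames.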