Apache Spark RDD is the basic abstraction in Spark. RDD stands for Resilient Distributed Dataset and represents an immutable, partitioned collection of elements that can be operated on in parallel.
Spark RDD lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. RDD handles iterative algorithms and interactive data applications efficiently, workloads that earlier computing frameworks struggle with, and one of the main reasons is that an RDD can keep its data in memory.
The following are some of the main features and properties of Apache Spark RDD.
Apache Spark RDD Properties
Spark RDD is described by the following five properties.
- A list of partitions.
- A function for computing each split of the RDD.
- A list of dependencies on other RDDs.
- Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned).
- Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file).
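To make these properties concrete, here is a minimal Scala sketch (the app name and local master are illustrative) showing how several of them surface in the public RDD API:

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("rdd-properties").setMaster("local[2]"))

// A key-value RDD with an explicit hash partitioner
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)), numSlices = 4)
  .partitionBy(new HashPartitioner(4))

println(pairs.partitions.length)                       // the list of partitions
println(pairs.partitioner)                             // Some(HashPartitioner) for this key-value RDD
println(pairs.dependencies)                            // dependencies on the parent RDD
println(pairs.preferredLocations(pairs.partitions(0))) // preferred locations (empty for in-memory data)
// The function for computing each split is used internally by the scheduler.
```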
Apache Spark RDD Features
The following are some of the features of Spark RDD.
1. Lazy Evaluation
All transformations in Spark are lazy: when a transformation such as map(), filter(), or flatMap() is applied to an RDD, nothing is executed immediately. Spark simply records the operation and waits; only when an action such as collect(), take(), or foreach() is invoked does it actually run the transformations and compute the result.
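For instance, a minimal sketch (assuming a SparkContext named sc is available):

```scala
val nums = sc.parallelize(1 to 1000000)
val evens = nums.filter(_ % 2 == 0)  // transformation: only recorded, nothing runs yet
val doubled = evens.map(_ * 2)       // transformation: still nothing runs
val first5 = doubled.take(5)         // action: Spark now schedules and executes the job
println(first5.mkString(", "))       // 4, 8, 12, 16, 20
```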
2. In-memory Computation
The major advantage of Spark is that RDDs can be kept in memory rather than on disk, so computations performed in-memory boost the performance of Spark applications. This is an important advantage over Hadoop MapReduce: MapReduce programs store their intermediate results on disk, which increases I/O, whereas Spark keeps them in RAM, which is also why Spark clusters need generously sized RAM.
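A short sketch of the effect (the HDFS path is hypothetical; assumes a SparkContext sc):

```scala
val parsed = sc.textFile("hdfs:///data/events.log") // hypothetical input path
  .map(_.split(","))
parsed.cache()                               // keep the parsed records in RAM

println(parsed.count())                      // first action: reads from disk, fills the cache
println(parsed.filter(_.length > 3).count()) // subsequent actions are served from memory
```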
3. Fault Tolerance
Spark internally builds a Lineage Graph that records each transformation applied to an RDD. If a node fails or a transformation fails, the affected RDD partitions are simply recomputed from the Lineage Graph. The RDD's data is not replicated anywhere for recovery; it is rebuilt from the transformations that produced it, and this is one of the most important traits of RDDs.
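You can inspect the lineage that fault tolerance relies on with toDebugString (assuming a SparkContext sc):

```scala
val counts = sc.parallelize(Seq("spark rdd", "lineage graph", "spark rdd"))
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// Prints the chain of RDDs Spark would replay to rebuild lost partitions
println(counts.toDebugString)
```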
4. Immutability
Spark RDD is immutable: once an RDD is created from source data, it cannot be changed. Instead, a new RDD is derived from an existing one by applying functions such as map() or flatMap().
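A quick sketch (assuming a SparkContext sc):

```scala
val base = sc.parallelize(Seq(1, 2, 3))
val squared = base.map(n => n * n)        // produces a NEW RDD; base is untouched
println(base.collect().mkString(","))     // 1,2,3 -- the original is unchanged
println(squared.collect().mkString(","))  // 1,4,9
```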
5. Persistence
Spark RDD provides an important feature called persistence, through which a dataset can be persisted in memory or on disk. Once a dataset is persisted in memory, it can be reused multiple times, making future actions much faster. Keeping a dataset in memory or on disk is especially helpful for applications that run iterative algorithms or fast interactive queries.
We can persist an RDD by calling either of two functions: persist() or cache().
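For example (the input path is hypothetical; assumes a SparkContext sc):

```scala
val logs = sc.textFile("hdfs:///data/app.log") // hypothetical input path
logs.cache()           // shorthand for persist(StorageLevel.MEMORY_ONLY)

println(logs.count())  // first action materializes the RDD and caches it
println(logs.count())  // reuses the cached partitions instead of re-reading the file
```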
The following is the list of Storage levels.
A. MEMORY_ONLY
In this option, the RDD is stored in memory as deserialized Java objects. If some partitions do not fit in memory, they are not cached and are recomputed on the fly each time they are needed. This is the default storage level.
B. MEMORY_AND_DISK
In this storage level, the RDD is stored in memory as deserialized Java objects, and partitions that do not fit in memory are spilled to disk and read from there when required.
C. MEMORY_ONLY_SER (Java and Scala)
In this storage level, the RDD is stored as serialized Java objects (one byte array per partition). This level is very space-efficient, especially with a fast serializer, but it is more CPU-intensive to read.
D. MEMORY_AND_DISK_SER (Java and Scala)
It is similar to MEMORY_ONLY_SER, but partitions that do not fit in memory are spilled to disk instead of being recomputed on the fly each time they are needed.
E. DISK_ONLY
In this storage level, the RDD partitions are stored only on disk.
F. MEMORY_ONLY_2, MEMORY_AND_DISK_2
It is the same as the levels above but replicates each partition on two cluster nodes.
G. OFF_HEAP
This option is comparable to MEMORY_ONLY_SER, but the data is stored in off-heap memory, which requires off-heap memory to be enabled (via spark.memory.offHeap.enabled and spark.memory.offHeap.size).
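To use a non-default level, pass it to persist() explicitly, for example (the path is hypothetical; assumes a SparkContext sc):

```scala
import org.apache.spark.storage.StorageLevel

val features = sc.textFile("hdfs:///data/features.csv") // hypothetical input path
  .map(_.split(",").map(_.toDouble))

// Serialized in memory, spilling to disk when partitions do not fit
features.persist(StorageLevel.MEMORY_AND_DISK_SER)

// ... run several iterations over features ...

features.unpersist() // release the cached partitions when done
```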
6. Partitioned
Spark RDD is a dataset partitioned across multiple machines, and those partitions are immutable. Spark provides operations that work directly on the partitioned dataset. The main benefit of partitioning is that the dataset is logically split over a cluster of nodes, so any action performed on it runs in parallel across the partitions, which in turn gives better performance.
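A small sketch of controlling partitioning (assuming a SparkContext sc):

```scala
val data = sc.parallelize(1 to 100, numSlices = 8) // request 8 partitions
println(data.getNumPartitions)                     // 8

// Each partition is processed by a separate task, in parallel
val partialSums = data.mapPartitions(iter => Iterator(iter.sum))
println(partialSums.collect().mkString(", "))      // one partial sum per partition
```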