RDD creation in Apache Spark
There are three ways to create RDDs:
- Using parallelized collection
- From external datasets
- From existing Apache Spark RDDs
1. Using Parallelized collection (parallelizing)
In the initial stages, RDDs are generally created from a parallelized collection, i.e. by taking an existing collection in the program and passing it to SparkContext’s parallelize() method. This method is most useful while learning Spark, since it lets us quickly create our own RDDs in the Spark shell and perform operations on them. It is rarely used outside testing and prototyping, however, because it requires the entire dataset to be available on one machine.
Eg:
// Create an RDD of (subject, marks) pairs from a local Scala collection
val data = spark.sparkContext.parallelize(Seq(("maths", 52), ("english", 75), ("science", 82), ("computer", 65), ("maths", 85)))
data.collect()
2. From external datasets
In Spark, a distributed dataset can be formed from any storage source supported by Hadoop, including the local file system, HDFS, Cassandra, HBase, etc. Here the data is loaded from an external dataset. To create a text file RDD, we can use SparkContext’s textFile method. It takes the URL of the file and reads it as a collection of lines. The URL can be a local path on the machine or an hdfs://, s3n://, etc. URI.
The point to note is that if a local file system path is used, the file must be accessible at the same path on the worker nodes. Either copy the file to all worker nodes or use a network-mounted shared file system.
Eg:
// How many unique professions do we have in the data file?
val userFile = sc.textFile("/FileStore/tables/user.txt")         // read the file as an RDD of lines
val fileData = userFile.map(line => line.split(","))              // split each line on commas
val fileDataOpr = fileData.map(recd => (recd(0), recd(1), recd(2), recd(3), recd(4)))
val dataMapLogic = fileDataOpr.map { case (id, age, sex, profession, pincode) => profession }
// dataMapLogic.distinct.collect().foreach(println)               // optionally print the distinct professions
println(dataMapLogic.distinct.count())
3. From existing Apache Spark RDDs
A transformation turns one RDD into another RDD without modifying the original, so applying a transformation is the way to create an RDD from an already existing RDD.
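For example, a minimal sketch reusing the data RDD of (subject, marks) pairs created with parallelize above:
// Each transformation returns a new RDD; data itself is unchanged
val marksOnly = data.map { case (subject, marks) => marks }   // RDD of marks
val highMarks = marksOnly.filter(_ > 70)                      // keep only marks above 70
highMarks.collect()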
RDD Persistence and Caching in Spark?
Spark RDD persistence is an optimization technique that saves the result of RDD evaluation. We can persist an RDD through the cache() and persist() methods. When we use the cache() method, the RDD is stored in memory and can be reused efficiently across parallel operations. The difference between cache() and persist() is that cache() always uses the default storage level MEMORY_ONLY, while persist() lets us choose among various storage levels. When the RDD is computed for the first time, it is kept in memory on the nodes. Spark's cache is fault tolerant: whenever any partition of a cached RDD is lost, it can be recovered by re-running the transformations that originally created it.
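A minimal sketch of both calls, assuming the user.txt path from the earlier example (the storage level shown for persist() is just one possible choice):
import org.apache.spark.storage.StorageLevel
val lines = sc.textFile("/FileStore/tables/user.txt")
val lengths = lines.map(_.length)
lengths.cache()                                   // same as persist(StorageLevel.MEMORY_ONLY)
// lengths.persist(StorageLevel.MEMORY_AND_DISK)  // persist() lets us pick another storage level instead
lengths.count()                                   // the first action computes and caches the RDD
lengths.count()                                   // subsequent actions reuse the cached partitions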
Benefits of RDD Persistence in Spark
Some advantages of the RDD caching and persistence mechanism in Spark are:
- Time efficient
- Cost efficient
- Lessens the execution time, since repeated computation is avoided.
Storage levels of persisted RDDs:
Using persist(), we can choose among several storage levels for persisted RDDs in Apache Spark.
1) MEMORY_ONLY
In this storage level, the RDD is stored as deserialized Java objects in the JVM. If the size of the RDD is greater than the available memory, some partitions are not cached and are recomputed whenever they are needed. At this level the space used for storage is very high, the CPU computation time is low, and the data is stored in memory only. It does not make use of the disk.
2) MEMORY_AND_DISK
In this level, the RDD is stored as deserialized Java objects in the JVM. When the size of the RDD is greater than the available memory, the excess partitions are stored on disk and retrieved from there whenever required. At this level the space used for storage is high, the CPU computation time is medium, and it makes use of both in-memory and on-disk storage.
3) MEMORY_ONLY_SER
This level stores the RDD as serialized Java objects (one byte array per partition). It is more space efficient than storing deserialized objects, especially when a fast serializer is used, but it puts more load on the CPU. At this level the storage space is low, the CPU computation time is high, and the data is stored in memory. It does not make use of the disk.
4) MEMORY_AND_DISK_SER
It is similar to MEMORY_ONLY_SER, but it spills the partitions that do not fit into memory to disk rather than recomputing them each time they are needed. At this storage level, the space used for storage is low, the CPU computation time is high, and it makes use of both in-memory and on-disk storage.
5) DISK_ONLY
In this storage level, the RDD is stored only on disk. The space used for storage is low, the CPU computation time is high, and it makes use of on-disk storage only.
Unpersist RDD in Spark?
Spark monitors cache usage on each node automatically and drops old data partitions in LRU (least recently used) fashion. LRU is an eviction policy that spills the least recently used data out of the cache first. We can also remove cached data manually using the RDD.unpersist() method.
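A minimal self-contained sketch of the manual route:
val cached = sc.parallelize(1 to 1000).cache()   // mark the RDD for caching
cached.count()                                    // the action materialises and caches it
cached.unpersist()                                // explicitly drop it from the cache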
Features of Spark RDD
There are several advantages of using RDDs. Some of them are:
1. In-memory computation
The data inside an RDD can be kept in memory for as long as we want to store it. Keeping the data in memory improves performance by an order of magnitude.
2. Lazy Evaluation
The data inside RDDs is not evaluated on the go. Transformations are computed only after an action is triggered. Thus, Spark limits how much work it has to do.
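A small sketch of this behaviour, assuming the same user.txt path as above:
val lines = sc.textFile("/FileStore/tables/user.txt")   // nothing is read yet
val nonEmpty = lines.filter(_.nonEmpty)                  // transformation: still nothing is computed
nonEmpty.count()                                         // count() is an action: only now is the file read and filtered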
3. Fault Tolerance
Upon the failure of a worker node, the lost partitions of an RDD can be recomputed from the original data using the lineage of operations. Thus, we can easily recover the lost data.
4. Immutability
RDDs are immutable in nature, meaning that once we create an RDD we cannot modify it; any transformation creates a new RDD instead. We achieve consistency through immutability.
5. Persistence
We can store frequently used RDDs in memory and retrieve them directly from memory without going to disk, which speeds up execution. To perform multiple operations on the same data, we store it explicitly in memory by calling the persist() or cache() method.
6. Partitioning
An RDD partitions the records logically and distributes the data across the various nodes of the cluster. The divisions are logical and used only for processing; the data itself is not physically split by the RDD. This partitioning is what provides parallelism.
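A minimal illustration (the partition count of 4 is arbitrary):
val nums = sc.parallelize(1 to 100, 4)   // ask for 4 partitions explicitly
nums.getNumPartitions                     // 4
nums.partitions.length                    // the same information via the partitions array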
7. Parallel
RDDs process the data in parallel across the cluster.
8. Location-Stickiness
RDDs are capable of defining placement preferences for computing their partitions. A placement preference is information about the preferred location of an RDD partition. The DAGScheduler places the tasks so that each task is as close to its data as possible, which speeds up computation.
9. Coarse-grained Operation
We apply coarse-grained transformations to an RDD. Coarse-grained means the operation applies to the whole dataset, not to individual elements of the RDD.
10. No limitation
We can have any number of RDDs; there is no hard limit on their count. The practical limit depends on the size of the available disk and memory.
Limitations of RDD in Apache Spark?
1. No input optimization engine
There is no provision in an RDD for automatic optimization. It cannot make use of Spark's advanced optimizers such as the Catalyst optimizer and the Tungsten execution engine; we have to optimize each RDD manually. This limitation is overcome with Dataset and DataFrame, both of which use Catalyst to generate an optimized logical and physical query plan.
2. Runtime type safety
An RDD does not give us a statically typed, compile-time-checked view of the data it processes, so such errors are not caught until runtime. Dataset provides compile-time type safety to build complex data workflows. Compile-time type safety means that if you try to add an element of the wrong type to a typed collection, you get a compile-time error. It helps detect errors at compile time and makes your code safer.
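A hedged sketch of what that compile-time checking looks like with a Dataset (assumes a SparkSession named spark, as in the Spark shell; the Score case class is illustrative):
import spark.implicits._
case class Score(subject: String, marks: Int)
val ds = Seq(Score("maths", 52), Score("english", 75)).toDS()
val marks = ds.map(_.marks)   // type-checked at compile time: marks is known to be an Int
// ds.map(_.grade)            // would not compile: Score has no field named grade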
3. Degrade when not enough memory
An RDD degrades when there is not enough memory to store it in memory or on disk; a storage issue arises when memory runs short. The partitions that overflow from RAM can be stored on disk and will still provide the same level of performance. Increasing the size of RAM and disk makes it possible to overcome this issue.
4. Performance limitation & Overhead of serialization & garbage collection
Since RDDs are in-memory JVM objects, they involve the overhead of garbage collection and Java serialization, which becomes expensive as the data grows. Because the cost of garbage collection is proportional to the number of Java objects, using data structures with fewer objects lowers the cost, or we can persist the objects in serialized form.
5. Handling structured data
An RDD does not provide a schema view of the data and has no provision for handling structured data.
Dataset and DataFrame provide the schema view of the data: a DataFrame is a distributed collection of data organized into named columns.
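A small sketch of that schema view, reusing the (subject, marks) pairs from the parallelize example (column names are illustrative and spark.implicits._ is assumed, as in the Spark shell):
import spark.implicits._
val df = Seq(("maths", 52), ("english", 75), ("science", 82)).toDF("subject", "marks")
df.printSchema()                              // shows the named, typed columns
df.groupBy("subject").sum("marks").show()     // structured, column-based operations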
So, these were the limitations of RDD in Apache Spark.
RDD Lineage
An RDD is lazy in nature: a series of transformations can be declared on an RDD without any of them being evaluated immediately. When we create a new RDD from an existing Spark RDD, the new RDD carries a pointer to its parent RDD. In this way all the dependencies between RDDs are logged in a graph rather than in the actual data; this graph is what we call the lineage graph.
RDD lineage is nothing but the graph of all the parent RDDs of an RDD. We also call it the RDD operator graph or RDD dependency graph. To be very specific, it is the output of applying transformations in Spark, from which Spark then creates a logical execution plan.
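A quick way to inspect a lineage graph in the shell is the toDebugString method; a minimal sketch with illustrative RDD names:
val base = sc.parallelize(1 to 10)
val doubled = base.map(_ * 2)
val filtered = doubled.filter(_ > 5)
println(filtered.toDebugString)   // prints the chain of parent RDDs, i.e. the lineage graph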