Tuesday, 18 June 2019

Spark DataFrame

DataFrame in Spark

A Spark DataFrame is a distributed collection of data organized into named columns. It provides operations to filter, group, and compute aggregates, and it can be queried with Spark SQL. DataFrames can be constructed from structured data files, existing RDDs, Hive tables, or external databases.

We need DataFrames for:
Multiple Programming languages
One of the best properties of DataFrames in Spark is their support for multiple languages, which makes them easier to use for programmers from different programming backgrounds.
DataFrames in Spark support R, Python, Scala, and Java.
Multiple data sources
DataFrames in Spark support a large variety of data sources, several of which are discussed in the use cases covered later in this article.
Processing Structured and Semi-Structured Data
The core requirement for which DataFrames were introduced is to process Big Data with ease. A Spark DataFrame stores data in a table format, together with a schema describing the data it holds.
Slicing and Dicing the data
The DataFrame API supports slicing and dicing the data: it can perform operations such as select and filter on rows and columns. Real-world data is prone to missing values, range violations, and irrelevant values, and DataFrames let the user handle missing data explicitly.

Spark DataFrame Features
Scalability: It allows processing petabytes of data at once.
Flexibility: It supports a broad array of data formats (CSV, Elasticsearch, Avro, etc.) and storage systems (HDFS, Hive tables, etc.)
Custom Memory Management: Data is stored off-heap in a binary format, which saves memory and reduces garbage-collection overhead. Java serialization is also avoided, since the schema is already known.
Optimized Execution Plans: The Catalyst optimizer builds and optimizes query plans, which are ultimately executed as operations on RDDs.
  1. A DataFrame is a distributed collection of data organized into named columns. It is equivalent to a table in an RDBMS.
  2. It can deal with both structured and semi-structured data formats, for example Avro, CSV, Elasticsearch, and Cassandra. It also works with storage systems such as HDFS, Hive tables, MySQL, etc.
  3. Catalyst supports optimization. It has general libraries for representing trees. DataFrames use Catalyst tree transformation in four phases: a) analyzing the logical plan to resolve references, b) logical plan optimization, c) physical planning, and d) code generation to compile parts of a query to Java bytecode.
  4. The DataFrame APIs are available in various programming languages, for example Java, Scala, Python, and R.
  5. It provides Hive compatibility. We can run unmodified Hive queries on existing Hive warehouses.
  6. It can scale from kilobytes of data on a single laptop to petabytes of data on a large cluster.
  7. DataFrames provide easy integration with big-data tools and frameworks via Spark Core.
Creating DataFrames in Apache Spark
The SparkSession class is the entry point to all of Spark's functionality. To create a basic SparkSession, use SparkSession.builder(). Using a SparkSession, an application can create a DataFrame from an existing RDD, a Hive table, or Spark data sources. Spark SQL can operate on a variety of data sources through the DataFrame interface. A DataFrame can also be registered as a temporary view, against which we can run SQL queries over its data.


Limitations of Spark DataFrames
Despite their many benefits, no technology is without drawbacks. Notable limitations of Spark DataFrames are as follows:
  • The compiler cannot catch errors in code that refers to data attribute (column) names. Such errors are detected only at run time, when the query plans are created and analyzed.
  • The API works best with Scala and offers only limited support in Java.
  • Domain objects cannot be regenerated from it.

