In this part, the main way we will be working with Python and Spark is through the DataFrame syntax. If you have worked with pandas in Python, R, SQL, or even Excel, a DataFrame will feel very familiar!
- Spark DataFrames hold data in a column and row format
- Each column represents some feature or variable
- Each row represents an individual data point
Spark DataFrames are able to input and output data from a wide variety of sources. We can then use these DataFrames to apply various transformations to the data. At the end of these transformation calls, we can either show or collect the results to display them or perform some final processing.
How to create a DataFrame?
A DataFrame in Spark can be created in multiple ways:
- Loading data from different file formats, for example JSON or CSV
- Loading data from an existing RDD
- Programmatically specifying a schema (see the sketch after this list)
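To make the last option concrete, here is a minimal sketch of programmatically specifying a schema. The column names and sample rows are made up for illustration, and it assumes a SparkSession named spark like the one we create in the next section:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Hypothetical schema: a string name column and an integer age column
schema = StructType([
    StructField('name', StringType(), True),
    StructField('age', IntegerType(), True),
])

# Build a DataFrame from an in-memory list of rows using that schema
df_manual = spark.createDataFrame([('Owen', 22), ('Florence', 38)], schema=schema)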
DataFrame basics for PySpark
Let’s jump to a Jupyter Notebook and start. Before we can work with Spark, we have to start the session. If you are running this command for the first time, it may take a minute or two to run.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('Basics').getOrCreate()
The next thing we want to do is read a dataset. There are many options for inputs and outputs, but for now we are going to use a JSON file.
df = spark.read.json('people.json')
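Reading other formats follows the same pattern. Here is a minimal sketch that assumes a hypothetical people.csv file:
# Read a CSV file instead of JSON; header=True treats the first row as column
# names and inferSchema=True lets Spark guess the column types
df_csv = spark.read.csv('people.csv', header=True, inferSchema=True)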
Now we have our DataFrame. Let’s actually show a little bit of this data; any missing values will be displayed as null.
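For example, this prints the first rows in a tabular format:
# Display the first 20 rows of the DataFrame
df.show()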
If you want to know the schema of the DataFrame, use the printSchema() method.
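For example:
# Print the schema as a readable tree
df.printSchema()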
If you want to see only the column names, use the columns attribute.
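For example:
# Return the column names as a plain Python list
df.columns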
Basic DataFrame verbs in Spark
Select
I can select a subset of columns. The method select()
takes either a list of column names or an unpacked list of names.
cols1 = ['PassengerId', 'Name']
df1.select(cols1).show()
+-----------+--------+
|PassengerId| Name|
+-----------+--------+
| 1| Owen|
| 2|Florence|
| 3| Laina|
| 4| Lily|
| 5| William|
+-----------+--------+
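The same columns can also be selected by passing the names unpacked, which is an equivalent alternative:
# Equivalent: pass the column names directly instead of a list
df1.select('PassengerId', 'Name').show()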
Filter
You can filter a subset of rows. The method filter() takes column expressions or SQL expressions.
df1.filter(df1.Sex == 'female').show()
+-----------+--------+------+--------+
|PassengerId| Name| Sex|Survived|
+-----------+--------+------+--------+
| 2|Florence|female| 1|
| 3| Laina|female| 1|
| 4| Lily|female| 1|
+-----------+--------+------+--------+
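The same rows can also be selected with a SQL expression string, which filter() accepts as well:
# Equivalent filter using a SQL expression string
df1.filter("Sex = 'female'").show()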
Create new columns
You can create new columns in Spark using .withColumn(). I have yet to find a convenient way to create multiple columns at once without chaining multiple .withColumn() calls.
df2.withColumn('AgeTimesFare', df2.Age*df2.Fare).show()
+-----------+---+----+------+------------+
|PassengerId|Age|Fare|Pclass|AgeTimesFare|
+-----------+---+----+------+------------+
| 1| 22| 7.3| 3| 160.6|
| 2| 38|71.3| 1| 2709.4|
| 3| 26| 7.9| 3| 205.4|
| 4| 35|53.1| 1| 1858.5|
| 5| 35| 8.0| 3| 280.0|
+-----------+---+----+------+------------+
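If you do need several derived columns, the calls can simply be chained. A minimal sketch, where FarePerClass is a made-up column just for illustration:
# Chain withColumn calls to derive several columns in one statement
df2.withColumn('AgeTimesFare', df2.Age * df2.Fare) \
   .withColumn('FarePerClass', df2.Fare / df2.Pclass) \
   .show()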
Set up Apache Spark
In order to follow these DataFrame operations, you first need to set up Apache Spark on your machine. For that, you can check out our previous article, which will guide you through setting up Apache Spark on Ubuntu.
DataFrames support a wide range of operations that are very useful when working with data. In this section, we will show you some basic operations on DataFrames.
The first step in any Apache Spark program is to create a SparkContext. A SparkContext is required when we want to execute operations on a cluster: it tells Spark how and where to access the cluster. We can create the SparkContext by importing it, initializing it, and providing the configuration settings.
from pyspark import SparkContext

sc = SparkContext()
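As a quick sanity check that the context works, here is a minimal sketch using a small made-up list of numbers:
# Distribute a small Python list across the cluster and run a simple action
rdd = sc.parallelize([1, 2, 3, 4, 5])
print(rdd.count())  # prints 5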