So if you already know Pandas, why should you learn Apache Spark?
- Tabular data (and here Pandas offers more dataframe features than Spark)
- Pandas can handle up to millions of rows
- Limited to a single machine

Pandas is not a distributed system.
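Because pandas keeps the entire dataset in one machine's RAM, it helps to check how much memory a dataframe actually occupies before deciding whether pandas is viable. A minimal sketch (the dataframe contents below are made up for illustration):

```python
import pandas as pd

# Build a small dataframe; pandas holds all of it in local memory.
df = pd.DataFrame({
    "id": range(1_000),
    "name": [f"row-{i}" for i in range(1_000)],
})

# deep=True also counts the memory behind the Python string objects.
total_bytes = int(df.memory_usage(deep=True).sum())
print(f"DataFrame occupies about {total_bytes / 1024:.1f} KiB in RAM")
```

Spark, in contrast, splits the data into partitions across the cluster's executors, so no single machine ever has to hold the full dataset.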
Spark vs. Dask

| | Spark | Dask |
|---|---|---|
| Language | Scala, Java, Python, R, SQL | Python |
| Scale | 1–1000 machine cluster | 1–1000 machine cluster |
| Ecosystem | All-in-one project | Part of the Python ecosystem |
| Dataframes | Spark API / SQL | Pandas API, no SQL |
| Streaming | Great performance | More complex |
| Machine learning | Common | Part of the Python ecosystem |
| Graph processing | GraphX library | None |
Does it make sense to use pandas with PySpark?

Yes, absolutely! We use it in our current project, where a mix of PySpark and pandas dataframes processes files larger than 500 GB.

pandas is used for the smaller datasets and PySpark for the larger ones. pandas returns results faster than PySpark on data that fits in memory, so depending on the available memory and the data size you can switch between the two to gain performance. When the data to be processed fits into memory, prefer pandas over PySpark; when it does not, use PySpark.
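This rule of thumb can be made explicit in code. The helper below is a hypothetical sketch (the function name, threshold, and safety factor are our own, not part of pandas or PySpark): it picks pandas when the input file fits comfortably inside the memory you are willing to give it, and PySpark otherwise.

```python
import os

def choose_engine(path: str, memory_budget_bytes: int, safety_factor: float = 0.5) -> str:
    """Pick a processing engine based on file size vs. available memory.

    A pandas dataframe typically needs several times the on-disk size in
    RAM, so pandas is only chosen when the file fits well inside the
    budget (here: safety_factor of it).
    """
    size = os.path.getsize(path)
    return "pandas" if size <= memory_budget_bytes * safety_factor else "pyspark"

# Reading the data then looks roughly like this (sketch):
#   if choose_engine(path, budget) == "pandas":
#       df = pd.read_csv(path)                   # fast, single machine
#   else:
#       df = spark.read.csv(path, header=True)   # distributed
```

The safety factor is deliberately conservative; tune it to your own workload, since the in-memory footprint of a dataframe depends heavily on column types.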
If you want to learn more about Apache Spark, please follow our blog for the Apache Spark training or contact us.
Fellow Consulting AG
If you have any questions, please contact me through these channels: