So if you already know Pandas, why should you learn Apache Spark?
- Tabular data (and here Pandas offers more dataframe features than Spark)
- Pandas can handle up to millions of rows
- Limited to a single machine

Pandas is not a distributed system.
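Because pandas keeps the entire dataset in one machine's RAM, it helps to check how much memory a dataframe actually occupies before deciding whether pandas is viable. A minimal sketch (the dataframe contents below are made up for illustration):

```python
import pandas as pd

# Build a small dataframe; pandas holds all of it in local memory.
df = pd.DataFrame({
    "id": range(1_000),
    "name": [f"row-{i}" for i in range(1_000)],
})

# deep=True also counts the memory behind the Python string objects.
total_bytes = int(df.memory_usage(deep=True).sum())
print(f"DataFrame occupies about {total_bytes / 1024:.1f} KiB in RAM")
```

Spark, in contrast, splits the data into partitions across the cluster's executors, so no single machine ever has to hold the full dataset.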
Spark vs. Dask

| | Spark | Dask |
|---|---|---|
| Language | Scala, Java, Python, R, SQL | Python |
| Scale | 1–1000 machine cluster | 1–1000 machine cluster |
| Ecosystem | All-in-one project | Part of the Python ecosystem |
| Dataframes | Spark API / SQL | Pandas API, no SQL |
| Streaming | Great performance | More complex |
| Machine learning | Common | Part of the Python ecosystem |
| Graph processing | GraphX library | None |
Does it make sense to use pandas with PySpark?

Yes, absolutely! We use it in our current project, where a mix of PySpark and pandas dataframes processes files larger than 500 GB.

pandas is used for the smaller datasets and PySpark for the larger ones. pandas returns results faster than PySpark on data that fits in memory, so depending on the available memory and the data size you can switch between the two to gain performance. When the data to be processed fits into memory, prefer pandas over PySpark; when it does not, use PySpark.
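This rule of thumb can be made explicit in code. The helper below is a hypothetical sketch (the function name, threshold, and safety factor are our own, not part of pandas or PySpark): it picks pandas when the input file fits comfortably inside the memory you are willing to give it, and PySpark otherwise.

```python
import os

def choose_engine(path: str, memory_budget_bytes: int, safety_factor: float = 0.5) -> str:
    """Pick a processing engine based on file size vs. available memory.

    A pandas dataframe typically needs several times the on-disk size in
    RAM, so pandas is only chosen when the file fits well inside the
    budget (here: safety_factor of it).
    """
    size = os.path.getsize(path)
    return "pandas" if size <= memory_budget_bytes * safety_factor else "pyspark"

# Reading the data then looks roughly like this (sketch):
#   if choose_engine(path, budget) == "pandas":
#       df = pd.read_csv(path)                   # fast, single machine
#   else:
#       df = spark.read.csv(path, header=True)   # distributed
```

The safety factor is deliberately conservative; tune it to your own workload, since the in-memory footprint of a dataframe depends heavily on column types.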
If you want to learn more about Apache Spark, please follow our blog for the Apache Spark training or contact us.
Fellow Consulting AG
If you have any questions, please contact me through these channels: