This article compares three Python libraries for data analysis: Pandas is suitable for limited data volumes, PySpark handles large volumes thanks to parallelization, and Koalas eases the transition from Pandas to Spark.

Introduction

Data science projects require specific tools depending on the phase of the project. This article presents an overview of three Python libraries often used in data analysis and compares their uses.

Pandas, the basics

If you have already done data analysis in Python, you cannot have missed the Pandas library!
Indeed, this library is one of the most widely used for data manipulation. It has become unmissable thanks to its ease of handling, its practicality and the wealth of functionality it offers for manipulating data.

The basic object of this library, the DataFrame (a table, see below), makes it a very intuitive tool because it is very easy to understand when working with structured data. Pandas makes it very easy to load or write data from a CSV file, an Excel file or an SQL database. In addition, Pandas contains a multitude of optimized functions for manipulating this tabular data: the critical parts of the code are written in Cython or C to increase execution speed.

These various advantages make it the perfect choice for data processing in Python when working with a limited volume of data.
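As an illustration, here is a minimal sketch of loading and manipulating tabular data with Pandas (the file name and column names are hypothetical):

import pandas as pd

# Load tabular data from a CSV file (hypothetical file and columns)
df = pd.read_csv("sales.csv")

# Derive a new column and aggregate using optimized DataFrame operations
df["revenue"] = df["quantity"] * df["unit_price"]
print(df.groupby("region")["revenue"].sum())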

For more performance: PySpark

The use of a framework like Spark usually comes after the proof of concept (PoC) phase. As a reminder, a PoC is a very short-term project aimed at exploring a data science problem. The PoC is used to prove the feasibility and added value of the approach on a small scale (with a reasonable amount of data). In this context, data scientists have every interest in using their favorite tools in order to move forward quickly.

When, in a second step, we want to industrialize this PoC, that is to say apply it on a larger scale, we generally run into performance problems. Indeed, the volume of data may explode when we move from the initial sample to the complete perimeter, which can create large latencies in the execution of calculations, or even prevent some from completing at all.

To remedy this, it is necessary to use frameworks that can parallelize calculations, hence the use of Spark with its Python interface, PySpark. One of the advantages of PySpark is also the distinction between lazy operations (transformations) and actions: Spark only executes the code, and therefore launches calculations, when an action actually requires a result.
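A minimal sketch of this behavior, assuming a SparkSession named spark and a Spark DataFrame df with a numeric column col1 already exist:

# Transformations are lazy: no computation happens on these two lines
filtered = df.filter(df.col1 > 1)
enriched = filtered.withColumn('col1_doubled', filtered.col1 * 2)

# Only an action such as count() or show() triggers the actual computation
print(enriched.count())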

Here is an example of code in PySpark compared to Pandas. It should be noted that the syntax differs significantly and that some adaptation is necessary to move from Pandas to PySpark:

Pandas

import pandas as pd
df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4], 'col3': [5, 6]})
df['col4'] = df.col1 * df.col1

PySpark

# Read a CSV file, inferring column types and treating the first row as a header
df = spark.read.option("inferSchema", "true").option("header", "true").csv("my_data.csv")
df = df.toDF('col1', 'col2', 'col3')
df = df.withColumn('col4', df.col1 * df.col1)

For a smooth transition: Koalas?

Koalas is a very recent library (late 2019) that aims to let you write Spark programs with Pandas syntax. It makes it possible to unify experimentation and industrialization code under the same tool, while benefiting from the flexibility of Pandas and the distributed performance of Spark.

It is therefore a very relevant intermediate solution, particularly suitable for data scientists who already master Pandas and who wish to move towards larger volumes of data while keeping familiar tools, and therefore without having to learn an entirely new language. However, it should be noted that not all of Pandas' features are available in the Koalas library yet.

For example, using the same code as before on Koalas:

import databricks.koalas as ks
df = ks.DataFrame({'col1': [1, 2], 'col2': [3, 4], 'col3': [5, 6]})
df['col4'] = df.col1 * df.col1

You can clearly see the similarity with the Pandas version: only the name of the library has changed, while the code has remained exactly the same.

Should we abandon PySpark altogether? Not really.

Subtleties still exist between the two environments, and PySpark remains the reference Big Data framework in the data engineering community. Data engineers are often already familiar with Spark and have very little incentive to swap it for Koalas. Moreover, Koalas is a layer on top of the Spark DataFrame API designed to be closer to Pandas, so the underlying engine remains Spark; if there is a need for additional performance, it may be necessary to drop back to Spark without the overlay. Finally, Spark is designed to integrate easily with many other tools, given its popularity.
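In practice, dropping down to Spark is straightforward; a minimal sketch, assuming the Koalas DataFrame df from the example above:

# Access the underlying Spark DataFrame when native Spark is needed
sdf = df.to_spark()

# ... apply native Spark operations, then come back to Koalas if desired
df = sdf.to_koalas()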

In the graph below (produced by Databricks), we can see that PySpark still has better performance than Koalas, even if Koalas already performs very well compared to Pandas. It is also interesting to note that for small data sets, Pandas is more efficient (due to the initialization and data transfer overhead of distributed solutions) and is therefore more suitable.

Koalas can also be used as a way to learn Spark gradually, but in any case it is necessary to practice in the Spark environment.

It is obviously possible to switch from one environment to the other using functions such as to_pandas(), but it is recommended to use them as little as possible: they are very expensive operations because they change the way the data is stored, collecting the distributed data onto a single machine.
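For example, a sketch assuming the Koalas DataFrame df defined above:

# Collects all the distributed data onto one machine: expensive on large data
pdf = df.to_pandas()

# Going back the other way redistributes the data across the cluster
kdf = ks.from_pandas(pdf)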

And at Aqsone in all of this?

Our Data Scientists use (Py)Spark and Koalas to process calculations on large volumes of data. If you want to learn more about this technology and implement it in your Big Data projects, contact us!
