Horizontal Parallelism with PySpark

This post is part of my Spark deep-dive series.

Spark is a distributed, parallel computation framework, but some workflows still run their Spark actions one at a time; those can be parallelized from the driver with Python's multiprocessing module.

Let us see the following steps in detail.

Sequential execution of a PySpark function

There are a lot of patterns that leave executors idle. For example,

let us consider a simple function that computes the duplicate count for a single column.

We will use a sample Titanic dataset.
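A minimal setup might look like this (the app name and CSV path are assumptions for illustration):

```python
from pyspark.sql import SparkSession

# Assumed setup: the app name and CSV path are placeholders for illustration.
spark = SparkSession.builder.appName("horizontal-parallelism").getOrCreate()
df = spark.read.csv("titanic.csv", header=True, inferSchema=True)
```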

The function takes a column name, computes the duplicate count for that column, and stores the result in the global list opt. I have also recorded a timestamp so we can see when each call runs.
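A sketch of such a function, assuming the df loaded above; taking "duplicate count" to mean the number of distinct values that occur more than once is my reading of the original:

```python
import time

opt = []  # global list collecting (column, duplicate count) results

def get_dup_count(col_name):
    # One reading of "duplicate count": how many distinct values occur more than once.
    dup_count = df.groupBy(col_name).count().filter("count > 1").count()
    opt.append((col_name, dup_count))
    # Print a timestamp so we can see when each call runs.
    print(col_name, dup_count, time.strftime("%H:%M:%S"))
    return dup_count
```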

Let us call the function for each column.
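Calling it in a plain loop runs one Spark job at a time:

```python
# Sequential execution: each Spark job must finish before the next one starts.
for c in df.columns:
    get_dup_count(c)
```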

Looking at the result above, we can see that the columns are processed one after another, sequentially, even though we have spare executor memory and cores.

Note: since the dataset is small, we do not see a large time difference here.

Horizontal Parallelism with ThreadPool

To overcome this, we will use a thread pool from Python's multiprocessing module and execute the same function.
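A minimal sketch, using ThreadPool from multiprocessing with the pool size of 2 from the text:

```python
from multiprocessing.pool import ThreadPool

# Two driver threads submit two Spark jobs at the same time;
# the executors run their tasks concurrently if cores are free.
pool = ThreadPool(processes=2)
pool.map(get_dup_count, df.columns)
pool.close()
pool.join()
```

Using ThreadPool (threads) rather than a process Pool matters here: the SparkSession and the dataframe live in the driver process and cannot be pickled into child processes.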

Here we used a ThreadPool from Python multiprocessing with processes=2, and from the last two digits of the timestamps we can see that the function now executes for two columns at a time.

Note: the small difference between paired timestamps is, I suspect, a side effect of the print function.

How does it work?

As soon as we call the function through the pool, multiple jobs are submitted from the PySpark driver to the Spark executors at the same time, and the executors run their tasks in parallel, provided we have enough cores.

Note: this helps only if we have enough executor cores to execute the parallel tasks.

For example, suppose we have 100 executor cores (num-executors = 50 with executor-cores = 2 gives 50 × 2) and each job uses 50 partitions. Using this method with a thread pool of 2 processes cuts the time roughly in half. On the other hand, a thread pool of 3 gives the same performance: with only 100 executor cores, only 2 jobs' worth of tasks can run at once, so even though three jobs are submitted from the driver, the third is picked up by the executors only once the first two complete.
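For reference, that sizing corresponds to configuration like the following (the values are just the ones from the example):

```python
# Assumed sizing matching the example: 50 executors x 2 cores = 100 executor cores.
spark = (SparkSession.builder
         .config("spark.executor.instances", "50")
         .config("spark.executor.cores", "2")
         .getOrCreate())
```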

In the function above, for example, most executors would otherwise sit idle, because each call works on only a single column.

To improve performance, we can increase the number of processes to the number of cores on the driver, since the submission of these jobs happens from the driver machine, as shown below.
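For example, with os.cpu_count() as a stand-in for the driver's core count:

```python
import os

# Match the pool size to the driver's core count, since submission happens on the driver.
pool = ThreadPool(processes=os.cpu_count())
pool.map(get_dup_count, df.columns)
pool.close()
pool.join()
```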

We can see a subtle decrease in wall time, to 3.35 seconds.

Since these threads do not do any heavy computation themselves, we can increase the number of processes even further.
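For instance, doubling the pool beyond the core count (the factor of 2 is an arbitrary choice for illustration):

```python
# The driver threads mostly wait on Spark, so the pool can exceed the core count.
pool = ThreadPool(processes=2 * os.cpu_count())
pool.map(get_dup_count, df.columns)
pool.close()
pool.join()
```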

We see a further decrease in wall time, to 2.85 seconds.

Thus the speed-up depends on 2 factors: the number of threads in the driver's pool, and the number of executor cores free to run the submitted tasks in parallel.

Use Case: Leveraging Horizontal Parallelism

We can use this pattern whenever one job breaks down into many independent Spark actions, for example profiling every column of a wide dataframe, as in the duplicate-count function above.
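Putting it together, a sketch of profiling all columns concurrently:

```python
# Profile every column concurrently and collect the results in column order.
with ThreadPool(processes=os.cpu_count()) as pool:
    dup_counts = dict(zip(df.columns, pool.map(get_dup_count, df.columns)))
print(dup_counts)
```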

That’s all for the day !! :)

Note: there are other multiprocessing primitives, like Pool and Process, which can also be tried for parallelizing from Python.

Please comment with Spark topics you would like me to cover, and with suggestions for improving my writing :)

Learn and let others Learn!!
