This article is part of my Spark deep-dive series.
Spark is a distributed parallel computation framework, but there are still some workloads that can be parallelized further with Python's multiprocessing module.
Let us see the following steps in detail.
Sequential execution of a PySpark function
Many functions leave most executors idle. For example,
let us consider a simple function that computes the duplicate count for a column.
We will use a sample Titanic dataset.
The function takes a column, computes that column's duplicate count, and appends the result to a global list opt. I have added timing to measure each call.
Let us call the function for each column.
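The code screenshots from the original post are not reproduced here, so the following is a minimal stand-in sketch. The real Spark expression would be roughly `df.count() - df.select(col).distinct().count()`; here a plain-Python computation replaces it so the block is self-contained. The tiny dataset, the `get_dup_count` name, and the timing output are assumptions modeled on the article's description:

```python
import time

# Stand-in for the Titanic DataFrame: in PySpark this would be a DataFrame,
# and the duplicate count would be roughly
#   df.count() - df.select(col).distinct().count()
titanic = {
    "Sex":      ["male", "female", "male", "male"],
    "Embarked": ["S", "C", "S", "Q"],
}

opt = []  # global list that collects (column, duplicate_count) results

def get_dup_count(col):
    start = time.time()
    values = titanic[col]
    dup_count = len(values) - len(set(values))  # rows minus distinct rows
    opt.append((col, dup_count))
    print(col, dup_count, "took", round(time.time() - start, 4), "s")

# Sequential execution: one column after another
for col in titanic:
    get_dup_count(col)
```

Each call blocks until the previous one finishes, which is exactly the sequential behavior discussed below.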
From the result above, we can see that the columns are processed one after another, sequentially, despite the fact that we have spare executor memory and cores.
Note: since the dataset is small, the time difference is not very noticeable.
Horizontal Parallelism with ThreadPool
To overcome this, we will use Python multiprocessing to execute the same function.
Here we have used a ThreadPool from Python's multiprocessing module with the number of processes set to 2, and by looking at the last two digits of the timestamps we can see that the function executes in pairs, two columns at a time.
Note: the small differences I suspect may be side effects of the print function.
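A minimal sketch of the ThreadPool pattern, with the Spark action replaced by a `time.sleep` stand-in so the concurrency is visible without a cluster (the column names and the pool size of 2 follow the article; everything else is illustrative):

```python
import time
from multiprocessing.pool import ThreadPool

columns = ["Sex", "Embarked", "Pclass", "Survived"]

def get_dup_count(col):
    # Stand-in for the Spark action; pretend each column takes ~0.2 s.
    time.sleep(0.2)
    return col

start = time.time()
with ThreadPool(processes=2) as pool:  # at most 2 columns in flight at once
    results = pool.map(get_dup_count, columns)
elapsed = time.time() - start
print(results, round(elapsed, 2))
```

With two threads, the four columns run in two rounds of two, so the wall time is roughly half of the sequential version.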
How does it work?
As soon as we call the function, multiple tasks are submitted in parallel from the PySpark driver to the Spark executors at the same time, and the executors run them in parallel, provided we have enough cores.
Note that this works only if we have enough executor cores to execute the parallel tasks.
For example, suppose we have 100 executor cores (num_executors = 50 and cores = 2, so 50 × 2 = 100) and each job has 50 partitions. With a thread pool of 2 processes, this method reduces the time roughly by half. On the other hand, if we specify a thread pool of 3, we get the same performance: with only 100 executor cores, at most 2 such jobs can run at once, so even though three jobs are submitted from the driver, only two run, and the third is picked up by the executors only upon completion of the first two.
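The arithmetic above can be captured in a small helper (a hypothetical function, not part of any Spark API):

```python
def concurrent_jobs(total_cores, partitions_per_job, pool_size):
    # How many submitted jobs can actually run at the same time:
    # bounded both by cluster capacity and by the thread-pool size.
    capacity = total_cores // partitions_per_job
    return min(pool_size, capacity)

# 100 executor cores (50 executors x 2 cores), 50 partitions per job:
print(concurrent_jobs(100, 50, 2))  # 2 -> both submitted jobs run together
print(concurrent_jobs(100, 50, 3))  # 2 -> the third job has to wait
```

This makes it clear why growing the pool beyond the cluster's capacity buys nothing.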
For example, in the function above most of the executors are idle because we are working on a single column at a time.
To improve performance, we can increase the number of processes to the number of cores on the driver, since these tasks are submitted from the driver machine.
We can see a subtle decrease in wall time, to 3.35 seconds.
Since these threads don't do any heavy computational work, we can increase the number of processes further.
We then see a further decrease in wall time, to 2.85 seconds.
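Because each pool thread spends nearly all its time blocked waiting on a Spark action rather than using the CPU, a pool larger than the driver's core count is reasonable. A sketch of that idea, again with a sleep standing in for the Spark call (the column list and pool sizing are illustrative assumptions):

```python
import os
import time
from multiprocessing.pool import ThreadPool

columns = [f"col_{i}" for i in range(12)]

def submit_job(col):
    # The thread spends almost all its time blocked on the Spark action,
    # not burning CPU; sleep stands in for that wait.
    time.sleep(0.1)
    return col

# I/O-bound submission threads: oversubscribe relative to driver cores.
pool_size = 2 * (os.cpu_count() or 2)
start = time.time()
with ThreadPool(processes=pool_size) as pool:
    results = pool.map(submit_job, columns)
elapsed = time.time() - start
print(pool_size, round(elapsed, 2))
```

Sequentially these 12 calls would take about 1.2 seconds; with the oversized pool the wall time drops well below that.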
Thus, the achievable speedup depends on two factors: the number of executor cores available and the size of the thread pool on the driver.
Use case: Leveraging Horizontal Parallelism
We can apply this approach whenever many independent Spark jobs are launched from the driver, such as computing per-column statistics across a wide dataset.
That’s all for the day !! :)
Note: there are other multiprocessing interfaces, such as Pool and Process, which can also be tried out for parallelizing through Python.
Please suggest Spark topics you would like me to cover, and share any suggestions for improving my writing :)
Learn and let others Learn!!