PySpark: Comparing VMware, WSL2 and Native Windows
PySpark is the Python API for Apache Spark. Spark is an execution framework that handles distributed workloads. Written in Scala, it uses in-memory caching and optimized query execution, and supports batch processing. Spark in turn can use Hadoop, an open-source Java storage and processing framework that provides a distributed file system (HDFS), YARN (Yet Another Resource Negotiator) for resource management, MapReduce for parallel computing, and a common set of Java libraries.
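As a rough illustration of what the API looks like, here is a minimal self-contained PySpark session (a sketch only; the app name and sample data are placeholders, not part of the benchmark below):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

# Start a local Spark session; "local[*]" uses all available CPU cores.
spark = SparkSession.builder.master('local[*]').appName('demo').getOrCreate()

# Build a tiny DataFrame and run a simple aggregation on it.
df = spark.createDataFrame([(1, 'a'), (2, 'a'), (3, 'b')], ['value', 'group'])
df.groupBy('group').agg(F.sum('value').alias('total')).show()

spark.stop()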
Jupyter notebooks are great for developing and running Python scripts. You control the notebook from inside a web browser, which lets you experiment with a script, save it, and share it. Notebooks also support visual output such as graphs. Because Jupyter notebooks use the IPython kernel, you can reset the Python session and re-run all your code, all from within a web page.
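For example, two IPython magics that turn up in the benchmark script later in this post (a minimal sketch; the summed range is just a throwaway workload):

%matplotlib inline                    # render matplotlib plots inside the notebook
%time total = sum(range(10_000_000))  # report the wall time of a single statement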
I wanted to use PySpark to process some large Parquet files, so first I needed to see which setup would run fastest under Windows: WSL2 with Ubuntu, VMware with Ubuntu, or native Windows.
Rather than ramble on about how I measured the times, here are the results.
All results are from the same PC, running the same version of Windows 10 Pro.
PC Specs
AMD Ryzen 5950x
64 GB DDR4-3200
PCIe Gen 4 NVMe SSD
RTX 3090
Tools and Installation
Spark 3.1.2
Python 3.9.5
OpenJDK 11.0.11
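A minimal sketch for confirming these versions from inside a notebook on each platform (it assumes java is on the PATH):

import sys
import subprocess
import pyspark

print('Python :', sys.version.split()[0])
print('PySpark:', pyspark.__version__)
subprocess.run(['java', '-version'])  # OpenJDK prints its version to stderr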
Guest OS
Ubuntu 20.04.2
Windows VMware host running Ubuntu guest, Firefox: 12.7 seconds
Windows WSL2 host running Ubuntu, Microsoft Edge in native Windows: 15.7 seconds
Host OS
Windows 10 Pro, Version 10.0.19043 (Build 19043)
Browsers
Windows 10 native with Firefox: 15.79 seconds
Windows 10 native with Microsoft Edge: 15.6 seconds
Summary
VMware running on Windows 10 Pro with an Ubuntu guest (using the Linux versions of the tools) comes out around 25% faster.
Running the native Windows versions of the tools, or running them via WSL2, is about 25% slower.
So the best option is a virtual machine if you can use one: VMware Player 16 is free, with Ubuntu 20.04 LTS as the guest.
I ran many types of scripts and they all point to VMware being faster than native Windows.
Example script and Results
import pyspark
import pyspark.sql.functions as F
from pyspark.sql.functions import col, expr
from pyspark.context import SparkContext
from pyspark.sql import SparkSession
from pyspark import SparkFiles
import numpy as np
import matplotlib.pyplot as plt
import time

start_time = time.time()

# Reuse the SparkContext if the notebook session already has one.
try:
    sc
except NameError:
    sc = SparkContext('local')
spark = SparkSession(sc)

%matplotlib inline

# One of the Uchuu simulation halo catalogue Parquet files.
filename = 'Q:/Downloads/UchuuData/sim_tmp/halolist_z0p00_0.h5.0.parquet'
print(SparkFiles.get(filename))
halos = spark.read.parquet(SparkFiles.get(filename))

# Host halos have pid == -1 (they are not sub-halos of anything).
%time hosts = halos.where(col('pid') == -1)
%time n_hosts = hosts.count()
print('Total # halos =', n_hosts)

# Show the ten most massive host halos.
%time hosts.orderBy('Mvir', ascending=False).select('Mvir', 'Vmax', 'Rvir', 'rs').show(n=10)

# Build the cumulative halo mass function from approximate quantiles of Mvir.
%time quantiles = list(1 - np.logspace(-6.5, 0, 100))
%time Mvir_quantiles = hosts.approxQuantile('Mvir', quantiles, 0)
%time Mvir_quantiles = np.array(Mvir_quantiles)
# Number of halos above each mass quantile.
%time N_vals = n_hosts + 1 - np.array(quantiles)*n_hosts

plt.plot(Mvir_quantiles, N_vals)
plt.xlabel('$M_{vir}$')
plt.ylabel('N(> $M_{vir}$)')
plt.loglog()
np.savetxt('halo_mass_fn.csv', np.array([Mvir_quantiles, N_vals]), delimiter=',')
plt.show()

print("--- %s seconds ---" % (time.time() - start_time))
Native Windows 10 Pro with Firefox
--- 15.79399824142456 seconds ---
Native Windows 10 Pro with Microsoft Edge
--- 15.594499349594116 seconds ---
WSL 2 with Microsoft Edge running in Native Windows 10 Pro
--- 15.728499412536621 seconds ---
VMware running in Windows 10 with an Ubuntu guest running Firefox
--- 12.69355845451355 seconds ---