PySpark: Comparing VMware, WSL2 and Native Windows

Thefuturebymob
3 min read · Oct 4, 2021

PySpark is the Python API for Apache Spark. Spark is an execution framework that handles distributed workloads. Written in Scala, it uses in-memory caching and optimized execution, and supports batch processing. Spark in turn builds on Hadoop, an open source Java storage and processing framework that provides a distributed file system (HDFS), YARN (Yet Another Resource Negotiator), MapReduce (parallel computing) and a common set of Java libraries.
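To make the MapReduce idea concrete, here is a toy, single-machine sketch of the model in plain Python: a map phase emits key/value pairs from each input partition, and a reduce phase combines the values per key. This is only illustrative; real Hadoop and Spark run these phases in parallel across a cluster.

```python
from itertools import chain

def map_phase(partition):
    # Map: emit a (word, 1) pair for every word in every line of a partition.
    return [(word, 1) for line in partition for word in line.split()]

def reduce_phase(pairs):
    # Reduce: combine all values that share a key (here, sum the counts).
    counts = {}
    for word, n in pairs:
        counts[word] = counts.get(word, 0) + n
    return counts

# Two "partitions" of input, as a distributed file system would split a file.
partitions = [["spark uses hadoop", "hadoop stores data"],
              ["spark caches data in memory"]]

mapped = chain.from_iterable(map_phase(p) for p in partitions)
word_counts = reduce_phase(mapped)
print(word_counts)  # e.g. 'spark' and 'hadoop' each appear twice
```

The key property is that each partition can be mapped independently, which is what makes the model easy to distribute.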

Jupyter notebooks are great for developing and running Python scripts. You control the notebook from inside a web browser, which lets you experiment with a script, save it and share it. Notebooks also support visual outputs like graphs. Because Jupyter notebooks use the IPython kernel, you can reset the Python session and re-run all your code, all from within a web page.

I wanted to use PySpark to process some large Parquet files, so first I wanted to see which setup would run fastest on Windows: WSL2 with Ubuntu, VMware with Ubuntu, or native Windows.

Rather than ramble on about how I measured the times, here are the results.

All results are from the same PC, running the same version of Windows 10 Pro.

PC Specs

AMD Ryzen 5950x

64GB DDR 3200

PCIe Gen 4 NVMe SSD

RTX 3090

Tools and Installation

Spark 3.1.2

Python 3.9.5

OpenJDK 11.0.11

Guest OS

Ubuntu 20.04.2

Windows VMware host running Ubuntu guest with Firefox: 12.7 seconds

Windows WSL2 host running Ubuntu, browser in native Windows: 15.7 seconds

Host OS

Windows 10 Pro Version 10.0.19043 Build 19043

Browsers

Windows 10 native with Firefox: 15.79 seconds

Windows 10 native with Microsoft Edge: 15.6 seconds

Summary

VMware running on Windows 10 Pro with an Ubuntu guest (using the Linux versions of the tools) was the fastest setup by a clear margin.

Running the native Windows versions of the tools, or running them via WSL2, took roughly 24% longer (about 15.7 seconds versus 12.7 seconds).

So the best option is a virtual machine if you can use one. VMware Player 16 is free, and Ubuntu 20.04 LTS works well as the guest.

I ran many types of scripts and they all point to VMware being faster than native Windows.
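The relative difference can be checked directly from the measured wall-clock times (12.69 s for the VMware guest versus 15.79 s for native Windows with Firefox):

```python
# Measured wall-clock times in seconds.
vmware = 12.69
native = 15.79

# Native Windows takes ~24% longer than the VMware guest;
# equivalently, the VMware guest is ~20% faster than native.
slowdown = (native - vmware) / vmware * 100
speedup = (native - vmware) / native * 100
print(f"native is {slowdown:.0f}% slower; VMware is {speedup:.0f}% faster")
```

Which baseline you divide by is why the headline number can be quoted as anywhere from 20% to 25%.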

Example script and Results

import pyspark.sql.functions as F
from pyspark.sql.functions import col, expr
from pyspark.context import SparkContext
from pyspark.sql import SparkSession
from pyspark import SparkFiles
import numpy as np
import matplotlib.pyplot as plt
import pyspark

import time
start_time = time.time()

# Create the Spark context and session only if they don't already exist.
try:
    sc
except NameError:
    sc = SparkContext('local')
    spark = SparkSession(sc)

%matplotlib inline

filename = 'Q:/Downloads/UchuuData/sim_tmp/halolist_z0p00_0.h5.0.parquet'
print(SparkFiles.get(filename))
halos = spark.read.parquet(SparkFiles.get(filename))
%time hosts = halos.where(col('pid') == -1)

%time n_hosts = hosts.count()
print('Total # halos =', n_hosts)

%time hosts.orderBy('Mvir', ascending=False).select('Mvir', 'Vmax', 'Rvir', 'rs').show(n=10)

%time quantiles = list(1 - np.logspace(-6.5, 0, 100))

%time Mvir_quantiles = hosts.approxQuantile('Mvir', quantiles, 0)

%time Mvir_quantiles = np.array(Mvir_quantiles)

%time N_vals = n_hosts + 1 - np.array(quantiles) * n_hosts

plt.plot(Mvir_quantiles, N_vals)
plt.xlabel('$M_{vir}$')
plt.ylabel('N(> $M_{vir}$)')
plt.loglog()
np.savetxt('halo_mass_fn.csv', np.array([Mvir_quantiles, N_vals]), delimiter=',')

plt.show()

print("--- %s seconds ---" % (time.time() - start_time))
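The heavy lifting in the script is `approxQuantile`, which estimates quantiles of the `Mvir` column without pulling the whole dataset into the driver. As a rough in-memory analogue, NumPy's exact `np.quantile` does the same computation; the sketch below uses hypothetical random masses in place of the real Parquet data:

```python
import numpy as np

# Hypothetical halo masses standing in for the Mvir column of the Parquet file.
rng = np.random.default_rng(0)
mvir = rng.lognormal(mean=25, sigma=1.5, size=100_000)
n_hosts = mvir.size

# Same quantile grid as the script: densely sampled near the high-mass tail.
quantiles = 1 - np.logspace(-6.5, 0, 100)

# np.quantile is the exact, in-memory analogue of Spark's approxQuantile
# (whose third argument, relativeError=0, also requests exact quantiles).
mvir_quantiles = np.quantile(mvir, quantiles)

# N(> Mvir): number of halos expected above each quantile mass.
n_vals = n_hosts + 1 - quantiles * n_hosts

# The last grid point is quantile 0 (1 - 10**0), i.e. the minimum mass.
print(mvir_quantiles[-1] == mvir.min())
```

On a real cluster a nonzero relative error lets Spark use a far cheaper streaming approximation, which is worth trying if the exact run is slow.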

Native Windows 10 Pro with Firefox

--- 15.79399824142456 seconds ---

Native Windows 10 Pro with Microsoft Edge

--- 15.594499349594116 seconds ---

WSL 2 with Microsoft Edge running in Native Windows 10 Pro

--- 15.728499412536621 seconds ---

VMware running in Windows 10 with an Ubuntu guest running Firefox

--- 12.69355845451355 seconds ---
