Recently, I have been working with the Python API for Spark to use distributed computing techniques to perform analytics at scale. When you write Spark code in Scala or Java, you can bundle your dependencies in the jar file that you submit to Spark. When writing Spark code in Python, however, dependency management becomes more difficult because each of the Spark executor nodes performing computations needs to have all of the Python dependencies installed locally.
Typically, Python deals with dependencies using pip and virtualenv. Even if you follow this convention, though, you will still need to install your Spark code's dependencies on each Spark executor machine in the cluster.
A way around this is to bundle the dependencies in a zip file and pass them to Spark when you submit your job using the --py-files flag. The command will look something like this:
$ /path/to/spark-submit --py-files deps.zip my_spark_job.py
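For reference, a minimal my_spark_job.py might look something like the sketch below. The requests import is just a stand-in for whatever third-party package you bundle into deps.zip, and the URLs are placeholder data:

from pyspark import SparkContext

import requests  # stand-in for a dependency shipped to executors via deps.zip


def fetch_status(url):
    # This function runs on the executors, so requests must be importable
    # there -- which is exactly what --py-files deps.zip provides.
    return requests.get(url).status_code


if __name__ == '__main__':
    sc = SparkContext(appName='deps_zip_example')
    urls = ['https://example.com', 'https://www.python.org']
    print(sc.parallelize(urls).map(fetch_status).collect())
    sc.stop()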
Building the deps.zip is easiest if you use virtualenvwrapper. If you don’t have virtualenvwrapper set up already, I like this guide to get started. When you install dependencies within a virtualenv via pip, they are placed in the folder $VIRTUAL_ENV/lib/python2.7/site-packages, where $VIRTUAL_ENV is ~/Envs/<env_name>. $VIRTUAL_ENV also becomes available as a bash variable if you are using virtualenvwrapper and have used workon <env_name>. To create the deps.zip file, cd into the site-packages folder and run:
$ zip -r /target/path/for/deps.zip .
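Putting the pieces together, the whole build looks roughly like this (a sketch assuming a virtualenvwrapper environment named my_env on Python 2.7, with requests standing in for your actual dependencies):

$ workon my_env
$ pip install requests
$ cd $VIRTUAL_ENV/lib/python2.7/site-packages
$ zip -r ~/deps.zip .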
Zipping from inside site-packages puts all of the files and folders at the top level of deps.zip. This distinction is worth noting because the files and folders must appear at the top level when deps.zip is unzipped. To make sure this works, create the zip file from inside the site-packages folder – do not zip a folder containing all of the files and folders. If you run the zip command properly, you will see
$ zip -r ../deps.zip .
adding: __init__.py (stored 0%)
adding: _markerlib/ (stored 0%)
adding: _markerlib/__init__.py (deflated 55%)
adding: _markerlib/__init__.pyc (deflated 60%)
adding: _markerlib/markers.py (deflated 63%)
adding: _markerlib/markers.pyc (deflated 61%)
adding: aniso8601/ (stored 0%)
...
rather than
$ zip -r deps.zip folder_name/*
adding: folder_name/__init__.py (stored 0%)
adding: folder_name/_markerlib/ (stored 0%)
adding: folder_name/_markerlib/__init__.py (deflated 55%)
adding: folder_name/_markerlib/__init__.pyc (deflated 60%)
adding: folder_name/_markerlib/markers.py (deflated 63%)
adding: folder_name/_markerlib/markers.pyc (deflated 61%)
adding: folder_name/aniso8601/ (stored 0%)
...
Note: OSX obfuscates this distinction when you unzip deps.zip in Finder. In the former case, OSX will unzip all of the files and folders into a new folder with the same name as the zip file. For example, if your zip file is named my_deps.zip, OSX will create a folder named my_deps and unzip the contents of my_deps.zip into that folder. In the latter case, also unzipping with Finder, OSX will unzip the contents as they were zipped, yielding a folder named folder_name. The results look similar, but only the former case will work when zipping dependencies for Spark. The distinction becomes more obvious if you use zip and unzip from the command line: the former case will extract all files and folders to the current working directory, while the latter case will extract a folder containing those same files and folders into the current working directory.
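If you want to double-check an archive without relying on Finder, listing its contents works on any platform; the entries should begin with the package names themselves (as in the first listing above), not with a wrapping folder_name/ prefix:

$ unzip -l deps.zip | head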
You should be ready to run PySpark jobs in a “jarified” way.
Afternote: I’ve run into issues getting boto3 to run on a remote Spark cluster using this method.