In an effort to find the best IDE for developing Python code for Spark, I came across PyDev for Eclipse. Unlike some of the other tools for Python development, PyDev provides code completion, debugging, and a host of other features (see pydev.org).
I followed the Anaconda integration guide and these instructions from prossbald for the configuration. But it did not seem to work with some of my existing code. While running the Python code for Spark in the IDE, I would get an error when calling a map function on an RDD.
After some research, I discovered this may have something to do with differences between the Python 2.x and 3.x versions. I’m using Python 3.5 (Anaconda) on my Mac, but there are also other, older versions of Python on the system. Somehow, PyDev (or Spark) was picking up the wrong version of Python.
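A quick way to confirm which interpreter is actually running your code is to print it from the script itself (a minimal check, no Spark required):

```python
import sys

# Show exactly which interpreter executed this script.
print(sys.executable)
print(sys.version)

# Guard against the 2.x/3.x mismatch described above.
assert sys.version_info.major >= 3, "Expected Python 3.x, got 2.x"
```

Running this from inside Eclipse will tell you immediately whether PyDev is using the interpreter you think it is.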
After trial and error and reviewing the Apache Spark programming guide for Python, the solution that worked for me was to add a new environment variable in Eclipse –> Preferences –> PyDev –> Python Interpreter. Click the Environment button and add the following: PYSPARK_PYTHON=<path to python 3.x folder>/bin/python.
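If you prefer not to rely on the Eclipse preference, the same variable can be set in code before the SparkContext is created (a sketch, assuming you want the workers to use the same interpreter as the driver script):

```python
import os
import sys

# Point PySpark's workers at the interpreter running this script,
# equivalent to setting PYSPARK_PYTHON in the PyDev preferences.
os.environ["PYSPARK_PYTHON"] = sys.executable

# The variable must be set BEFORE the SparkContext is created, e.g.:
#   from pyspark import SparkContext
#   sc = SparkContext("local", "example")
```

This keeps the configuration with the project, which can be handy if the code is run outside Eclipse as well.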
Results after adding PYSPARK_PYTHON variable:
Note: You may get a warning stating “Please install psutil to have better support with spilling” when you run operations that “shuffle” data. I fixed this issue by installing Xcode and then executing the following in a terminal:
$ conda install psutil
My next task will be to update the syntax differences between Python 2.x and 3.x within my existing code.
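For reference, these are a few of the 2.x-to-3.x changes most likely to surface in PySpark code (a short, non-exhaustive sketch):

```python
# 1. print is a function in Python 3, not a statement.
print("hello")

# 2. / is true division in Python 3; use // for floor division.
assert 7 / 2 == 3.5
assert 7 // 2 == 3

# 3. dict.iteritems() is gone; items() returns a view object.
d = {"a": 1}
assert list(d.items()) == [("a", 1)]

# 4. map() returns a lazy iterator in Python 3, not a list --
#    a common surprise when porting code that indexed the result.
squares = map(lambda x: x * x, [1, 2, 3])
assert list(squares) == [1, 4, 9]
```

The map() change in particular is worth noting here, since Spark code tends to lean heavily on map-style operations.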