Koalas requires PySpark, so make sure PySpark is installed and available before you begin.
To install Koalas, you can use:
Conda
PyPI
Installation from source
To install PySpark, you can use:
Installation with the official release channel
Koalas officially supports Python 3.5 and above.
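A quick way to confirm that an interpreter meets this requirement is a version check. The following is a minimal sketch; the helper name python_is_supported is hypothetical and is not part of Koalas.

```python
import sys

# Minimum version taken from the statement above (Python 3.5 and above).
MIN_PYTHON = (3, 5)


def python_is_supported(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True if the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    print("Python", sys.version.split()[0], "supported:", python_is_supported())
```

Called with no arguments, the helper checks the interpreter that is currently running, which is useful after activating a new environment.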
First, you will need Conda to be installed. After that, create a new conda environment. A conda environment is similar to a virtualenv: it lets you specify a particular version of Python and a set of libraries. Run the following command from a terminal window:
conda create --name koalas-dev-env python
This will create a minimal environment with only Python installed in it. To enter this environment, run:
conda activate koalas-dev-env
The final step is to install Koalas, which can be done with the following command:
conda install -c conda-forge koalas
To install a specific Koalas version:
conda install -c conda-forge koalas=0.19.0
Koalas can be installed via pip from PyPI:
pip install koalas
See the Contribution Guide for complete instructions.
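Whichever installation method is used, it can help to confirm that the packages are importable from the active environment. The sketch below assumes the usual import names, pyspark and databricks.koalas; the helper can_import is hypothetical and only catches ImportError, so other failures raised at import time will still propagate.

```python
import importlib


def can_import(module_name):
    """Return True if `module_name` can be imported in this environment."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        return False
    return True


if __name__ == "__main__":
    # Koalas depends on PySpark, so PySpark is checked first.
    for name in ("pyspark", "databricks.koalas"):
        print(name, "importable:", can_import(name))
```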
You can install PySpark by downloading a release from the official release channel. Once you have downloaded the release, extract the archive as shown below:
tar xzvf spark-2.4.4-bin-hadoop2.7.tgz
After that, make sure to set the SPARK_HOME environment variable to the directory you extracted:
cd spark-2.4.4-bin-hadoop2.7
export SPARK_HOME=`pwd`
Also, make sure your PYTHONPATH includes the PySpark and Py4J archives under $SPARK_HOME/python/lib:
export PYTHONPATH=$(ZIPS=("$SPARK_HOME"/python/lib/*.zip); IFS=:; echo "${ZIPS[*]}"):$PYTHONPATH
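The same lookup can be done from inside Python when editing the shell environment is inconvenient. This is a rough sketch rather than an official PySpark mechanism: it assumes SPARK_HOME is already set as described above, and the function names pyspark_zip_paths and add_pyspark_to_sys_path are hypothetical.

```python
import glob
import os
import sys


def pyspark_zip_paths(spark_home):
    """Return the sorted .zip archives under <spark_home>/python/lib."""
    pattern = os.path.join(spark_home, "python", "lib", "*.zip")
    return sorted(glob.glob(pattern))


def add_pyspark_to_sys_path(spark_home):
    """Prepend the bundled PySpark and Py4J archives to sys.path."""
    zips = pyspark_zip_paths(spark_home)
    # Insert in reverse so the archives keep their sorted order at the front.
    for path in reversed(zips):
        if path not in sys.path:
            sys.path.insert(0, path)
    return zips


if __name__ == "__main__":
    spark_home = os.environ.get("SPARK_HOME")
    if spark_home:
        print(add_pyspark_to_sys_path(spark_home))
    else:
        print("SPARK_HOME is not set")
```

Unlike the export command, this only affects the current Python process, so it has to run before the first import of pyspark.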
PySpark can be installed via Conda:
conda install -c conda-forge pyspark
PySpark can be installed via pip from PyPI:
pip install pyspark
To install PySpark from source, refer to Building Spark.
Likewise, make sure you set the SPARK_HOME environment variable to the git-cloned directory, and that your PYTHONPATH environment variable can find PySpark and Py4J under $SPARK_HOME/python/lib, in the same way as for the official release channel.
Package       Required version
pandas        >=0.23.2
pyspark       >=2.4.0
pyarrow       >=0.10
matplotlib    >=3.0.0,<3.3.0
numpy         >=1.14,<1.19.0
mlflow        >=1.0
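To compare an installed version against these constraints, a small checker can be written with the standard library alone. The sketch below is an illustration under simplifying assumptions: it handles only the >= and < operators that appear in the table, compares numeric components only (so pre-release suffixes such as rc1 are ignored), and the names REQUIREMENTS, parse_version, and satisfies are hypothetical.

```python
import re

# Version constraints copied from the table above.
REQUIREMENTS = {
    "pandas": ">=0.23.2",
    "pyspark": ">=2.4.0",
    "pyarrow": ">=0.10",
    "matplotlib": ">=3.0.0,<3.3.0",
    "numpy": ">=1.14,<1.19.0",
    "mlflow": ">=1.0",
}


def parse_version(text):
    """Turn '1.18.5' into (1, 18, 5), reading the leading digits of each part."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def satisfies(version, spec):
    """Check a version string against a spec such as '>=1.14,<1.19.0'."""
    have = parse_version(version)
    for clause in spec.split(","):
        op, bound = re.match(r"\s*(>=|<)\s*(\S+)", clause).groups()
        want = parse_version(bound)
        # Pad with zeros so that '0.10' and '0.10.0' compare as equal.
        width = max(len(have), len(want))
        a = have + (0,) * (width - len(have))
        b = want + (0,) * (width - len(want))
        if op == ">=" and not a >= b:
            return False
        if op == "<" and not a < b:
            return False
    return True
```

For instance, with pandas installed, satisfies(pandas.__version__, REQUIREMENTS["pandas"]) reports whether the installed pandas meets the minimum listed above.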