Koalas requires PySpark, so make sure PySpark is installed and available before you use Koalas.
To install Koalas, you can use:
Conda
PyPI
Installation from source
To install PySpark, you can use:
Installation with the official release channel
Conda
PyPI
Installation from source
Koalas officially supports Python 3.5 and above.
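The Python version requirement can be checked from the interpreter itself. The following is a minimal sketch; `MIN_PYTHON` and `python_is_supported` are hypothetical names introduced here for illustration and are not part of Koalas.

```python
import sys

# Minimum Python version stated in this guide for Koalas.
MIN_PYTHON = (3, 5)


def python_is_supported(version_info=None):
    """Return True if the given (or the running) interpreter is at least MIN_PYTHON."""
    if version_info is None:
        version_info = sys.version_info
    # Compare only (major, minor); tuples compare element by element.
    return tuple(version_info[:2]) >= MIN_PYTHON


if __name__ == "__main__":
    print("Python", sys.version.split()[0], "supported:", python_is_supported())
```

Running the file with the interpreter you plan to use for Koalas prints whether that interpreter satisfies the requirement.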
First, you need to have Conda installed. After that, create a new conda environment. A conda environment is similar to a virtualenv: it lets you specify a particular version of Python and a set of libraries. Run the following command from a terminal window:
conda create --name koalas-dev-env python
This creates a minimal environment with only Python installed in it. To activate the environment, run:
conda activate koalas-dev-env
The final step required is to install Koalas. This can be done with the following command:
conda install -c conda-forge koalas
To install a specific Koalas version:
conda install -c conda-forge koalas=0.19.0
Koalas can be installed via pip from PyPI:
pip install koalas
To install Koalas from source, see the Contribution Guide for complete instructions.
You can install PySpark by downloading a release from the official release channel. Once you have downloaded the release, extract it as shown below:
tar xzvf spark-2.4.4-bin-hadoop2.7.tgz
After that, set the SPARK_HOME environment variable to the directory you extracted:
cd spark-2.4.4-bin-hadoop2.7
export SPARK_HOME=`pwd`
Also, make sure your PYTHONPATH includes the PySpark and Py4J archives under $SPARK_HOME/python/lib:
export PYTHONPATH=$(ZIPS=("$SPARK_HOME"/python/lib/*.zip); IFS=:; echo "${ZIPS[*]}"):$PYTHONPATH
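The same effect can be approximated from inside a Python process, which may help when you cannot edit the shell environment. The sketch below assumes the layout described above (`.zip` archives under `$SPARK_HOME/python/lib`); `pyspark_zip_paths` and `add_pyspark_to_sys_path` are hypothetical helper names, and the change only applies to the current process.

```python
import glob
import os
import sys


def pyspark_zip_paths(spark_home):
    """Return the sorted .zip archives under <spark_home>/python/lib."""
    pattern = os.path.join(spark_home, "python", "lib", "*.zip")
    return sorted(glob.glob(pattern))


def add_pyspark_to_sys_path(spark_home=None):
    """Prepend the archives to sys.path and return the paths that were added."""
    if spark_home is None:
        # Fall back to the SPARK_HOME environment variable set earlier.
        spark_home = os.environ.get("SPARK_HOME")
    if not spark_home:
        raise RuntimeError("SPARK_HOME is not set")
    # Skip archives that are already on sys.path to avoid duplicates.
    added = [p for p in pyspark_zip_paths(spark_home) if p not in sys.path]
    sys.path[:0] = added
    return added
```

Calling `add_pyspark_to_sys_path()` before importing `pyspark` plays the role of the `export PYTHONPATH=...` line for that one interpreter session. Unlike the shell export, it does not affect child processes.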
PySpark can be installed via Conda:
conda install -c conda-forge pyspark
PySpark can be installed via pip from PyPI:
pip install pyspark
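After installing, it can be useful to confirm that both packages are visible to the interpreter you intend to use. The following is a minimal sketch rather than an official check: `is_importable` and `report_missing` are hypothetical helpers, and the import name `databricks.koalas` is an assumption that this section does not state.

```python
import importlib.util


def is_importable(module_name):
    """Return True if `module_name` can be located without importing the module itself."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # find_spec imports parent packages; a missing parent raises ImportError.
        return False


# Assumed import names: PySpark as `pyspark`, Koalas as `databricks.koalas`.
REQUIRED_MODULES = ("pyspark", "databricks.koalas")


def report_missing(modules=REQUIRED_MODULES):
    """Return the subset of `modules` that cannot be found in this environment."""
    return [name for name in modules if not is_importable(name)]


if __name__ == "__main__":
    missing = report_missing()
    print("All found" if not missing else "Missing: " + ", ".join(missing))
```

An empty result from `report_missing()` only shows that the modules can be located; it does not prove that Spark itself starts correctly.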
To install PySpark from source, refer to Building Spark.
Likewise, make sure you set the SPARK_HOME environment variable to the git-cloned directory and that your PYTHONPATH environment variable includes the PySpark and Py4J archives under $SPARK_HOME/python/lib, in the same way as for the official release above.
Package       Minimum supported version
pandas        0.23
PySpark       2.4
pyarrow       0.10
matplotlib    3.0.0
mlflow        1.0
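The minimum versions listed above can be compared against what is installed in the current environment. The sketch below is illustrative only: all function names are hypothetical, it relies on `importlib.metadata` (Python 3.8 or later, which is newer than the minimum Python version this guide requires), and it compares numeric components only, so a pre-release such as `1.0rc1` is treated like `1.0`.

```python
from importlib import metadata  # Python 3.8+

# Minimum versions copied from the list of dependencies in this guide.
MINIMUM_VERSIONS = {
    "pandas": "0.23",
    "pyspark": "2.4",
    "pyarrow": "0.10",
    "matplotlib": "3.0.0",
    "mlflow": "1.0",
}


def version_tuple(version):
    """Turn '0.23.4' into (0, 23, 4); stop at the first part without leading digits."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Return True if `installed` is at least `minimum`, comparing numeric parts."""
    a, b = version_tuple(installed), version_tuple(minimum)
    # Pad the shorter tuple with zeros so '3.0' compares equal to '3.0.0'.
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_installed(minimums=MINIMUM_VERSIONS):
    """Map each package to True or False, or to None when it is not installed."""
    results = {}
    for name, minimum in minimums.items():
        try:
            results[name] = meets_minimum(metadata.version(name), minimum)
        except metadata.PackageNotFoundError:
            results[name] = None
    return results
```

`check_installed()` returns a dictionary keyed by package name. A value of `None` means the package was not found, which this sketch reports without treating it as an error.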