Version 0.31.0

PyArrow>=0.15 support is back

We added back support for PyArrow>=0.15 (#1110).

Note that when working with pyarrow>=0.15 and pyspark<3.0, Koalas sets the environment variable ARROW_PRE_0_15_IPC_FORMAT=1 if it is not already set, following the instructions in SPARK-29367. This does NOT work, however, if a Spark context has already been launched; in that case, you have to manage the environment variable yourself.
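If you launch the context yourself, a minimal sketch of setting the variable up front in a fresh Python process (the session-creation lines are illustrative; depending on your deployment you may also need to propagate the variable to executors, see SPARK-29367):

>>> import os
>>> os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"  # set BEFORE any Spark context is launched
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.getOrCreate()  # launch the context only after the variable is set
>>> import databricks.koalas as ks  # Koalas reuses the already-launched context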

Spark-specific improvements

Broadcast hint

We added a broadcast function in namespace.py (#1360).

It can be used with merge, join, and update, all of which invoke a join operation in Spark. When you know one of the DataFrames is small enough to fit in memory, broadcasting it lets Spark use a broadcast join, which is usually much more performant than a shuffle-based join.

For example,

>>> merged = df1.merge(ks.broadcast(df2), left_on='lkey', right_on='rkey')
>>> merged.explain()
== Physical Plan ==
...
...BroadcastHashJoin...
...
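The same hint works with join, which joins on the index. A self-contained sketch (the example frames are made up for illustration; the plan should again show a BroadcastHashJoin, as above):

>>> import databricks.koalas as ks
>>> df1 = ks.DataFrame({'key': ['foo', 'bar'], 'value': [1, 2]}).set_index('key')
>>> df2 = ks.DataFrame({'key': ['foo', 'baz'], 'other': [3, 4]}).set_index('key')
>>> joined = df1.join(ks.broadcast(df2), how='inner')
>>> joined.explain()
== Physical Plan ==
...
...BroadcastHashJoin...
...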

persist function and storage level

We added a persist function to specify the storage level when caching (#1381), as well as a storage_level property to check the current storage level (#1385).

>>> with df.cache() as cached_df:
...     print(cached_df.storage_level)
...
Disk Memory Deserialized 1x Replicated

>>> with df.persist(pyspark.StorageLevel.MEMORY_ONLY) as cached_df:
...     print(cached_df.storage_level)
...
Memory Serialized 1x Replicated
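Outside a with block you can keep the returned frame and release the cache explicitly when you are done; a minimal sketch (the data is illustrative, and the unpersist call is our assumption about the cached frame's API):

>>> import pyspark
>>> import databricks.koalas as ks
>>> df = ks.DataFrame({'a': [1, 2, 3]})  # illustrative data
>>> cached_df = df.persist(pyspark.StorageLevel.MEMORY_AND_DISK)
>>> # ... run queries against cached_df while it stays cached ...
>>> cached_df.unpersist()  # assumption: releases the cached data when you are done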

Other new features and improvements

We added the following new features:

DataFrame:

Series:

Other improvements

  • Add a way to specify index column in I/O APIs (#1379); see the sketch after this list

  • Fix iloc.__setitem__ with another Series from the same DataFrame. (#1388)

  • Add support for Series from different DataFrames in loc/iloc.__setitem__. (#1391)

  • Refine __setitem__ for loc/iloc with DataFrame. (#1394)

  • Improve handling of a misused options argument. (#1402)

  • Add blog posts in Koalas documentation (#1406)

  • Fix mod & rmod to match pandas. (#1399)
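For the index column support in I/O APIs mentioned above, a minimal sketch; the index_col parameter name and the output path are assumptions for illustration, not confirmed by this note:

>>> import databricks.koalas as ks
>>> df = ks.DataFrame({'id': [1, 2], 'value': ['a', 'b']})
>>> df.to_parquet('/tmp/koalas_io_example')  # path is illustrative
>>> restored = ks.read_parquet('/tmp/koalas_io_example', index_col='id')  # assumed parameter name
>>> restored.index.name
'id'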