Version 0.31.0
PyArrow>=0.15 support is back
We added PyArrow>=0.15 support back (#1110).
Note that, when working with pyarrow>=0.15 and pyspark<3.0, Koalas sets the
environment variable ARROW_PRE_0_15_IPC_FORMAT=1 if it is not already set, as
per the instructions in SPARK-29367. This will NOT work if a Spark context has
already been launched. In that case, you have to manage the environment
variable yourself.
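If you need to manage the variable yourself, one option is to set it before any Spark context is created. The following is a minimal sketch, not what Koalas does internally: the helper name `ensure_arrow_legacy_ipc` is hypothetical, and it only affects the current Python process. On a cluster, the variable may also need to be set for the executors (for example via `conf/spark-env.sh`).

```python
import os


def ensure_arrow_legacy_ipc(env=os.environ):
    """Set ARROW_PRE_0_15_IPC_FORMAT=1 unless the variable already exists.

    Hypothetical helper for illustration. `env` is any mutable mapping and
    defaults to the environment of the current process.
    """
    # setdefault keeps a value the user has already chosen.
    env.setdefault("ARROW_PRE_0_15_IPC_FORMAT", "1")
    return env["ARROW_PRE_0_15_IPC_FORMAT"]


# Call this before the first Spark context or session is launched;
# setting the variable afterwards has no effect on a running context.
ensure_arrow_legacy_ipc()
```

Because the helper uses `setdefault`, an explicit value that is already present in the environment is left untouched.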
Spark specific improvements
Broadcast hint
We added the broadcast function in namespace.py (#1360).
It can be used with merge, join, and update, which invoke a join operation in
Spark. When you know one of the DataFrames is small enough to fit in memory,
marking it with broadcast hints Spark to use a broadcast join, which can be
much more performant than shuffle-based joins.
For example,
>>> merged = df1.merge(ks.broadcast(df2), left_on='lkey', right_on='rkey')
>>> merged.explain()
== Physical Plan ==
...
...BroadcastHashJoin...
...
persist function and storage level
We added the persist function to specify the storage level when caching
(#1381), and we also added the storage_level property to check the current
storage level (#1385).
>>> import pyspark
>>> with df.cache() as cached_df:
...     print(cached_df.storage_level)
...
Disk Memory Deserialized 1x Replicated
>>> with df.persist(pyspark.StorageLevel.MEMORY_ONLY) as cached_df:
...     print(cached_df.storage_level)
...
Memory Serialized 1x Replicated
Other improvements
Add a way to specify index column in I/O APIs (#1379)
Fix iloc.__setitem__ with the other Series from the same DataFrame. (#1388)
Add support Series from different DataFrames for loc/iloc.__setitem__. (#1391)
Refine __setitem__ for loc/iloc with DataFrame. (#1394)
Help misuse of options argument. (#1402)
Add blog posts in Koalas documentation (#1406)
Fix mod & rmod for matching with pandas. (#1399)