We added PyArrow>=0.15 support back (#1110).
Note that, when working with pyarrow>=0.15 and pyspark<3.0,
Koalas will set an environment variable ARROW_PRE_0_15_IPC_FORMAT=1
if it does not exist, as per the instruction in
it will NOT work if there is a Spark context already launched. In that
case, you have to manage the environment variable yourself.
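If you need to manage the variable yourself, it has to be set before the Spark context is launched. A minimal sketch, assuming you control process startup; the helper name `ensure_arrow_legacy_ipc` is hypothetical and not part of Koalas:

```python
import os


def ensure_arrow_legacy_ipc(env=os.environ):
    """Set ARROW_PRE_0_15_IPC_FORMAT=1 unless the variable already exists.

    Hypothetical helper, not part of Koalas. Call it before any Spark
    context is launched; a value that is already set is left unchanged.
    """
    env.setdefault("ARROW_PRE_0_15_IPC_FORMAT", "1")
    return env["ARROW_PRE_0_15_IPC_FORMAT"]


# Run this first, then create the Spark session and import Koalas afterwards.
ensure_arrow_legacy_ipc()
```

Using `setdefault` mirrors the "if it does not exist" behaviour described above, so a value you exported deliberately is not overwritten.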
We added a broadcast function in namespace.py (#1360).
You can use it with merge, join, and update, which invoke a join
operation in Spark, when you know one of the DataFrames is small enough to
fit in memory. In that case, it can be expected to perform much better than shuffle-based joins.
>>> merged = df1.merge(ks.broadcast(df2), left_on='lkey', right_on='rkey')
== Physical Plan ==
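To illustrate why broadcasting helps, here is a conceptual pure-Python sketch of a broadcast hash join: the small side is turned into an in-memory lookup table, and the large side is streamed against it without being regrouped by key. This is not Spark's or Koalas' implementation; `broadcast_hash_join` and the sample rows are hypothetical and exist only for illustration.

```python
from collections import defaultdict


def broadcast_hash_join(large_rows, small_rows, left_on, right_on):
    """Inner-join two lists of dicts by hashing the small side."""
    # Build the lookup table once from the small side (the "broadcast" copy).
    lookup = defaultdict(list)
    for row in small_rows:
        lookup[row[right_on]].append(row)

    # Stream the large side; each row only needs a dictionary lookup.
    joined = []
    for row in large_rows:
        for match in lookup.get(row[left_on], []):
            joined.append({**row, **match})
    return joined


# Hypothetical sample data standing in for df1 (large) and df2 (small).
df1_rows = [
    {"lkey": "foo", "value": 1},
    {"lkey": "bar", "value": 2},
    {"lkey": "baz", "value": 3},
]
df2_rows = [
    {"rkey": "foo", "other": 5},
    {"rkey": "bar", "other": 6},
]

result = broadcast_hash_join(df1_rows, df2_rows, left_on="lkey", right_on="rkey")
```

The sketch only works when the small side fits in memory, which is the same condition stated above for using broadcast.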
We added a persist function to specify the storage level when caching
(#1381), and a storage_level property to check the
current storage level (#1385).
>>> with df.cache() as cached_df:
...     print(cached_df.storage_level)
...
Disk Memory Deserialized 1x Replicated
>>> with df.persist(pyspark.StorageLevel.MEMORY_ONLY) as cached_df:
...     print(cached_df.storage_level)
...
Memory Serialized 1x Replicated
We added the following new feature:
Add a way to specify index column in I/O APIs (#1379)
Fix iloc.__setitem__ with the other Series from the same
Add support Series from different DataFrames for
Refine __setitem__ for loc/iloc with DataFrame. (#1394)
Help misuse of options argument. (#1402)
Add blog posts in Koalas documentation (#1406)
Fix mod & rmod for matching with pandas. (#1399)