PyArrow>=0.15 support is back
We added PyArrow>=0.15 support back (#1110).
Note that, when working with pyarrow>=0.15 and pyspark<3.0,
Koalas will set the environment variable ARROW_PRE_0_15_IPC_FORMAT=1
if it does not exist, as per the instruction in SPARK-29367, but
it will NOT work if there is a Spark context already launched. In that
case, you have to manage the environment variable yourselves.
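Managing the variable yourself can be sketched as below; ARROW_PRE_0_15_IPC_FORMAT=1 is the compatibility flag described in SPARK-29367, and the key point is that it must be set before any Spark context is launched:

```python
import os

# Set the Arrow IPC compatibility flag only if it is not already set,
# mirroring Koalas' behavior. This must run before a Spark context
# starts, or the executors will not pick the setting up.
os.environ.setdefault("ARROW_PRE_0_15_IPC_FORMAT", "1")

print(os.environ["ARROW_PRE_0_15_IPC_FORMAT"])
```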
Spark-specific improvements
Koalas now supports the broadcast function in namespace.py (#1360).
We can use it with join operations such as merge, join, and
update, which invoke a join
operation in Spark, when you know one of the DataFrames is small enough to
fit in memory; a broadcast join can be expected to be much more performant than a shuffle-based join.
>>> merged = df1.merge(ks.broadcast(df2), left_on='lkey', right_on='rkey')
>>> merged.explain()
== Physical Plan ==
...
...BroadcastHashJoin...
...
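Conceptually, a broadcast hash join ships the small table to every executor and builds an in-memory hash map there, so only the large side is streamed and no shuffle is needed. A minimal pure-Python sketch of that idea (illustrative only, not Spark's implementation; the frames are plain lists of dicts):

```python
def broadcast_hash_join(large, small, left_key, right_key):
    # Build a hash map over the small side once (the "broadcast" table).
    index = {}
    for row in small:
        index.setdefault(row[right_key], []).append(row)
    # Stream the large side and probe the map; the large side is never
    # repartitioned or shuffled.
    joined = []
    for row in large:
        for match in index.get(row[left_key], []):
            joined.append({**row, **match})
    return joined


df1 = [{"lkey": "a", "x": 1}, {"lkey": "b", "x": 2}]
df2 = [{"rkey": "a", "y": 10}]
print(broadcast_hash_join(df1, df2, "lkey", "rkey"))
```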
persist function and storage level
>>> with df.cache() as cached_df:
...     print(cached_df.storage_level)
...
Disk Memory Deserialized 1x Replicated

>>> with df.persist(pyspark.StorageLevel.MEMORY_ONLY) as cached_df:
...     print(cached_df.storage_level)
...
Memory Serialized 1x Replicated
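The context-manager pattern shown above, where the cached frame is automatically unpersisted when the with block ends, can be sketched in plain Python. The class and attribute names here are illustrative stand-ins, not the Koalas implementation:

```python
class CachedFrame:
    """Illustrative stand-in for a cached DataFrame (not the Koalas class)."""

    def __init__(self, storage_level):
        self.storage_level = storage_level
        self.persisted = True  # pretend the data is materialized

    def __enter__(self):
        # Hand the cached frame to the `with` block.
        return self

    def __exit__(self, exc_type, exc, tb):
        # Release the cache when the block exits, mirroring the behavior
        # of cache()/persist() used as context managers above.
        self.persisted = False
        return False


with CachedFrame("Memory Serialized 1x Replicated") as cached_df:
    print(cached_df.storage_level)

print(cached_df.persisted)
```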
Other new features and improvements
We added the following new features and improvements:
Add a way to specify index column in I/O APIs (#1379)
Support iloc.__setitem__ with the other Series from the same DataFrame. (#1388)
Add support for Series from different DataFrames in
__setitem__ for loc/iloc with a DataFrame. (#1394)
Help prevent misuse of the options argument. (#1402)
Add blog posts in Koalas documentation (#1406)
Fix mod and rmod to match pandas. (#1399)