We restored support for PyArrow>=0.15 (#1110).
Note that when working with pyarrow>=0.15 and pyspark<3.0, Koalas sets the environment variable ARROW_PRE_0_15_IPC_FORMAT=1 if it is not already set, following the instructions in SPARK-29367. This does NOT work if a Spark context has already been launched; in that case, you have to manage the environment variable yourself.
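If you need to manage the variable yourself, one option is to set it in the driver process before any Spark context is created. The following is a minimal sketch, not Koalas' actual code: the helper name `ensure_arrow_legacy_ipc` is hypothetical, and it only covers the driver-side environment. On a cluster, executors may need the variable as well, for example through your deployment's environment configuration.

```python
import os


def ensure_arrow_legacy_ipc(environ=os.environ):
    """Set ARROW_PRE_0_15_IPC_FORMAT=1 unless it is already defined.

    Hypothetical helper for illustration. Returns the effective value,
    which is the pre-existing value if the variable was already set.
    """
    return environ.setdefault("ARROW_PRE_0_15_IPC_FORMAT", "1")


# Call this BEFORE creating any SparkContext or SparkSession;
# changing the variable after the context is launched has no effect on it.
effective = ensure_arrow_legacy_ipc()
print("ARROW_PRE_0_15_IPC_FORMAT =", effective)
```

Using `setdefault` mirrors the "set it only if it does not exist" behavior described above, so a value you exported deliberately is left untouched.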
We added a broadcast function in namespace.py (#1360).
It can be used with merge, join, and update, which invoke a join operation in Spark. When you know one of the DataFrames is small enough to fit in memory, a broadcast join can be expected to perform much better than a shuffle-based join.
For example,
>>> merged = df1.merge(ks.broadcast(df2), left_on='lkey', right_on='rkey')
>>> merged.explain()
== Physical Plan ==
...
...BroadcastHashJoin...
...
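To give some intuition for why a broadcast join avoids a shuffle, here is a conceptual sketch in plain Python. It is not Spark's or Koalas' implementation: the function `broadcast_hash_join` and the sample rows are hypothetical. The idea is that the small side is turned into an in-memory hash index (in Spark, a copy is sent to every executor), and the large side is then streamed through it without being repartitioned.

```python
from collections import defaultdict


def broadcast_hash_join(large_rows, small_rows, left_on, right_on):
    """Inner-join two lists of dict rows using a hash index on the small side."""
    # Build phase: index the small side by its join key.
    index = defaultdict(list)
    for row in small_rows:
        index[row[right_on]].append(row)

    # Probe phase: stream the large side and look up matches in the index.
    joined = []
    for row in large_rows:
        for match in index.get(row[left_on], []):
            joined.append({**row, **match})
    return joined


# Hypothetical sample data standing in for df1 (large) and df2 (small).
df1_rows = [
    {"lkey": "foo", "value_x": 1},
    {"lkey": "bar", "value_x": 2},
    {"lkey": "baz", "value_x": 3},
]
df2_rows = [
    {"rkey": "foo", "value_y": 5},
    {"rkey": "bar", "value_y": 6},
]

result = broadcast_hash_join(df1_rows, df2_rows, "lkey", "rkey")
for row in result:
    print(row)
```

The large side is read exactly once and never reordered, which is why the approach only pays off when the other side genuinely fits in memory.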
We added a persist function to specify the storage level when caching (#1381), and a storage_level property to check the current storage level (#1385).
>>> with df.cache() as cached_df:
...     print(cached_df.storage_level)
...
Disk Memory Deserialized 1x Replicated
>>> with df.persist(pyspark.StorageLevel.MEMORY_ONLY) as cached_df:
...     print(cached_df.storage_level)
...
Memory Serialized 1x Replicated
We added the following new features:
DataFrame:
to_markdown (#1377)
squeeze (#1389)
Series:
asof (#1366)
Add a way to specify the index column in I/O APIs (#1379)
Fix iloc.__setitem__ with another Series from the same DataFrame. (#1388)
Add support for Series from different DataFrames in loc/iloc.__setitem__. (#1391)
Refine __setitem__ for loc/iloc with DataFrame. (#1394)
Help catch misuse of the options argument. (#1402)
Add blog posts in Koalas documentation (#1406)
Fix mod & rmod to match pandas behavior. (#1399)
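As background on the mod & rmod item, the sketch below illustrates one well-known way modulo results can differ between engines. It assumes, without confirmation from these notes, that the mismatch is of this kind. Python and pandas use floored modulo (the result takes the sign of the divisor), while Java-style `%`, which Spark SQL follows, uses truncated modulo (the result takes the sign of the dividend). The helper names are hypothetical and this is not the code from the PR.

```python
def truncated_mod(a, b):
    """Java-style remainder: the result has the sign of the dividend."""
    return a - b * int(a / b)  # int() truncates toward zero


def floored_mod_from_truncated(a, b):
    """Recover Python/pandas-style modulo using only truncated remainders."""
    return truncated_mod(truncated_mod(a, b) + b, b)


# Compare both definitions with Python's built-in % on small integers.
for a, b in [(7, 3), (-7, 3), (7, -3), (-7, -3)]:
    print(a, b, truncated_mod(a, b), floored_mod_from_truncated(a, b), a % b)
```

For operands with the same sign the two definitions agree; they differ only when the signs differ and the remainder is nonzero.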