Firstly, we introduced a new mode that enables operations on different DataFrames (#633). Enable it by setting the OPS_ON_DIFF_FRAMES environment variable to true, as below:
>>> import databricks.koalas as ks
>>>
>>> kdf1 = ks.range(5)
>>> kdf2 = ks.DataFrame({'id': [5, 4, 3]})
>>> (kdf1 - kdf2).sort_index()
    id
0 -5.0
1 -3.0
2 -1.0
3  NaN
4  NaN
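The subtraction above aligns the two frames on their index before computing, which is why rows 3 and 4 come out as NaN. As a rough illustration only, the same alignment can be modeled in plain Python with dictionaries keyed by index label. The helper `subtract_aligned` is hypothetical and is not part of Koalas; it is a sketch of the semantics, not of the Spark-based implementation.

```python
import math

def subtract_aligned(left, right):
    """Subtract two index->value mappings, aligning on the union of index labels.

    Labels present on only one side yield NaN, mirroring the output shown above.
    """
    result = {}
    for label in sorted(set(left) | set(right)):
        if label in left and label in right:
            result[label] = float(left[label] - right[label])
        else:
            result[label] = math.nan
    return result

# Stand-ins for ks.range(5) and ks.DataFrame({'id': [5, 4, 3]}):
# index labels 0..4 mapped to 'id' values.
left = dict(enumerate(range(5)))
right = dict(enumerate([5, 4, 3]))

diff = subtract_aligned(left, right)
for label, value in diff.items():
    print(label, value)
```

Labels 0 to 2 exist in both inputs and are subtracted; labels 3 and 4 exist only on the left, so the result has no value for them.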
>>> import databricks.koalas as ks
>>>
>>> kdf = ks.range(5)
>>> kdf['new_col'] = ks.Series([1, 2, 3, 4])
>>> kdf
   id  new_col
0   0      1.0
1   1      2.0
3   3      4.0
2   2      3.0
4   4      NaN
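If exporting the variable in the shell is inconvenient, one option is to set it from Python itself. The sketch below assumes that Koalas reads OPS_ON_DIFF_FRAMES from the process environment of the driver, so it sets the variable before any Koalas code runs. The helper `flag_enabled` is hypothetical and exists only to check the value; the Koalas calls are left as comments so the snippet runs without a Spark installation.

```python
import os

# Assumption: the variable has to be present in the driver's environment
# before the cross-DataFrame operation runs. Setting it before importing
# Koalas is the conservative choice.
os.environ["OPS_ON_DIFF_FRAMES"] = "true"

def flag_enabled(name, environ=os.environ):
    """Hypothetical helper: report whether an environment flag is set to 'true'."""
    return environ.get(name, "false").strip().lower() == "true"

enabled = flag_enabled("OPS_ON_DIFF_FRAMES")
print(enabled)

# With the flag set, the Koalas session would follow, for example:
# import databricks.koalas as ks
# kdf = ks.range(5)
# kdf['new_col'] = ks.Series([1, 2, 3, 4])
```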
Secondly, we introduced a default index and, internally, disallowed Koalas DataFrames with no index (#639)(#655). For example, if you create a Koalas DataFrame from a Spark DataFrame, the default index is used. The default index implementation can be configured by setting DEFAULT_INDEX to one of three types:
(default) one-by-one: It implements a one-by-one sequence using a Window function without specifying a partition. This index type should be avoided when the data is large.
>>> ks.range(3)
   id
0   0
1   1
2   2
distributed-one-by-one: It implements a one-by-one sequence using a group-by and group-map approach. It still generates a one-by-one sequential index globally. If the default index must be a one-by-one sequence in a large dataset, this index can be used.
distributed: It implements a monotonically increasing sequence simply by using Spark’s monotonically_increasing_id function. If the index does not have to be a one-by-one sequence, this index can be used. Performance-wise, this index has almost no penalty compared to the other index types.
>>> ks.range(3)
             id
25769803776   0
60129542144   1
94489280512   2
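The large index values in the distributed example follow from how Spark's monotonically_increasing_id lays out its result: the partition ID goes in the upper 31 bits and the row position within the partition in the lower 33 bits. The plain-Python sketch below models that layout and, for contrast, one plausible way to build a global one-by-one sequence from per-partition counts and offsets. All function names are hypothetical, the partition numbers 3, 7 and 11 are inferred from the values shown above, and the offset approach is only an illustration of the idea, not the actual Koalas implementation.

```python
from itertools import accumulate

def monotonic_id(partition_id, row_in_partition):
    """Model of the ID layout: partition ID in the upper bits, row position in the lower 33 bits."""
    return (partition_id << 33) + row_in_partition

def distributed_index(partition_sizes):
    """Model of the 'distributed' index: IDs are increasing but not consecutive."""
    return [
        [monotonic_id(p, r) for r in range(size)]
        for p, size in enumerate(partition_sizes)
    ]

def distributed_one_by_one_index(partition_sizes):
    """Model of a global 0..n-1 sequence: count rows per partition, then offset each partition."""
    offsets = [0] + list(accumulate(partition_sizes))[:-1]
    return [
        [offset + r for r in range(size)]
        for offset, size in zip(offsets, partition_sizes)
    ]

# Assumption: the three rows of ks.range(3) landed in partitions 3, 7 and 11.
example_ids = [monotonic_id(p, 0) for p in (3, 7, 11)]
print(example_ids)  # [25769803776, 60129542144, 94489280512]

sizes = [2, 0, 3]  # hypothetical partition sizes
print(distributed_index(sizes))             # [[0, 1], [], [17179869184, 17179869185, 17179869186]]
print(distributed_one_by_one_index(sizes))  # [[0, 1], [], [2, 3, 4]]
```

The first model needs no coordination between partitions, which is consistent with its low cost. The second needs the row count of every partition before any index can be assigned.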
Thirdly, we implemented the following plot APIs in Series:
plot.pie() (#669)
plot.area() (#670)
plot.line() (#671)
plot.barh() (#673)
See the example below:
import databricks.koalas as ks

ks.range(10).to_pandas().id.plot.pie()
Fourthly, we continued to rapidly improve support for multi-index columns. Multi-index columns are now supported in the following APIs:
DataFrame.sort_index() (#637)
GroupBy.diff() (#653)
GroupBy.rank() (#653)
Series.any() (#652)
Series.all() (#652)
DataFrame.any() (#652)
DataFrame.all() (#652)
DataFrame.assign() (#657)
DataFrame.drop() (#658)
DataFrame.reindex() (#659)
Series.quantile() (#663)
Series.transform() (#663)
DataFrame.select_dtypes() (#662)
DataFrame.transpose() (#664)
Lastly, we added new functionality over the past weeks, much of it groupby-related. We added the following features:
koalas.DataFrame:
duplicated() (#569)
fillna() (#640)
bfill() (#640)
pad() (#640)
ffill() (#640)
koalas.groupby.GroupBy:
diff() (#622)
nunique() (#617)
nlargest() (#654)
nsmallest() (#654)
idxmax() (#649)
idxmin() (#649)
Along with the following improvements:
Add a basic infrastructure for configurations. (#645)
Always use column_index. (#648)
Allow omitting type hints in GroupBy.transform, filter, and apply. (#646)