Version 0.16.0
Firstly, we introduced a new mode that enables operations on different
DataFrames (#633). This mode can be enabled by setting the
OPS_ON_DIFF_FRAMES environment variable to true, as shown below:
>>> import databricks.koalas as ks
>>>
>>> kdf1 = ks.range(5)
>>> kdf2 = ks.DataFrame({'id': [5, 4, 3]})
>>> (kdf1 - kdf2).sort_index()
    id
0 -5.0
1 -3.0
2 -1.0
3  NaN
4  NaN
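The NaN rows in the output above come from index alignment: labels 3 and 4 exist only in kdf1, so there is nothing to subtract from them. A minimal sketch of that alignment rule in plain Python follows. The helper name subtract_aligned is hypothetical, the frames are modeled as dictionaries keyed by index label, and this only illustrates the semantics; it is not how Koalas evaluates the expression on Spark.

```python
import math


def subtract_aligned(left, right):
    """Subtract two {index_label: value} mappings, aligning on the labels.

    Hypothetical helper for illustration only. Labels present on just one
    side produce NaN, which is also why the result is floating point.
    """
    result = {}
    for label in sorted(set(left) | set(right)):
        if label in left and label in right:
            result[label] = float(left[label] - right[label])
        else:
            result[label] = math.nan
    return result


kdf1_id = {i: i for i in range(5)}   # mirrors the 'id' column of ks.range(5)
kdf2_id = {0: 5, 1: 4, 2: 3}         # mirrors ks.DataFrame({'id': [5, 4, 3]})

aligned = subtract_aligned(kdf1_id, kdf2_id)
for label, value in aligned.items():
    print(label, value)
```

The union of labels is used here, which matches the five rows shown in the example output.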
>>> import databricks.koalas as ks
>>>
>>> kdf = ks.range(5)
>>> kdf['new_col'] = ks.Series([1, 2, 3, 4])
>>> kdf
   id  new_col
0   0      1.0
1   1      2.0
3   3      4.0
2   2      3.0
4   4      NaN
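The examples above assume the mode is already enabled. A minimal sketch of enabling it from Python is shown here. Both function names are hypothetical, and setting the variable before Koalas is imported is an assumption made to be safe, not something these notes state. The second function needs Koalas and a working Spark setup, so it is defined but not called.

```python
import os


def enable_ops_on_diff_frames():
    """Hypothetical helper: set the flag for the current process."""
    os.environ["OPS_ON_DIFF_FRAMES"] = "true"


def subtract_different_frames():
    """Repeat the first example above; requires Koalas and Spark."""
    # Imported here so that the flag is already set (an assumption; the
    # notes do not say when the variable is read).
    import databricks.koalas as ks

    kdf1 = ks.range(5)
    kdf2 = ks.DataFrame({'id': [5, 4, 3]})
    return (kdf1 - kdf2).sort_index()


enable_ops_on_diff_frames()
print(os.environ["OPS_ON_DIFF_FRAMES"])
```

Exporting the variable in the shell before starting Python should have the same effect, since child processes inherit the environment.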
Secondly, we introduced a default index and disallowed Koalas
DataFrames with no index internally (#639)(#655). For example, if you
create a Koalas DataFrame from a Spark DataFrame, the default index is used.
The default index implementation can be configured by setting
DEFAULT_INDEX to one of three types:
(default) one-by-one: It implements a one-by-one sequence using a Window function without specifying a partition. This index type should be avoided when the data is large.
>>> ks.range(3)
   id
0   0
1   1
2   2
distributed-one-by-one: It implements a one-by-one sequence using a group-by and group-map approach. It still generates a one-by-one sequential index globally. If the default index must be a one-by-one sequence in a large dataset, this index can be used.
>>> ks.range(3)
   id
0   0
1   1
2   2
distributed: It implements a monotonically increasing sequence simply by using Spark’s monotonically_increasing_id function. If the index does not have to be a one-by-one sequence, this index can be used. Performance-wise, this index has almost no penalty compared with the other index types.
>>> ks.range(3)
             id
25769803776   0
60129542144   1
94489280512   2
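The large index values in the distributed example follow from the layout Spark documents for monotonically_increasing_id: the partition ID is placed in the upper 31 bits and the record number within the partition in the lower 33 bits. A minimal sketch in plain Python that decodes the three values shown above follows. The helper name split_id is hypothetical and nothing here calls Spark.

```python
# Spark documents this layout for monotonically_increasing_id: partition ID
# in the upper 31 bits, per-partition record number in the lower 33 bits.
RECORD_BITS = 33


def split_id(value):
    """Hypothetical helper: return (partition_id, record_number)."""
    record_mask = (1 << RECORD_BITS) - 1
    return value >> RECORD_BITS, value & record_mask


index_values = [25769803776, 60129542144, 94489280512]
decoded = [split_id(value) for value in index_values]

for value, (partition_id, record_number) in zip(index_values, decoded):
    print(value, "-> partition", partition_id, "record", record_number)
```

Each value decodes to record 0 of a different partition (3, 7 and 11), which is why the index is increasing but not consecutive.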
Thirdly, we implemented many plot APIs in Series. See the example below:
import databricks.koalas as ks
ks.range(10).to_pandas().id.plot.pie()
Fourthly, we continued to rapidly improve multi-index column support. Multi-index columns are now supported in the following APIs:
DataFrame.sort_index() (#637)
GroupBy.diff() (#653)
GroupBy.rank() (#653)
Series.any() (#652)
Series.all() (#652)
DataFrame.any() (#652)
DataFrame.all() (#652)
DataFrame.assign() (#657)
DataFrame.drop() (#658)
DataFrame.reindex() (#659)
Series.quantile() (#663)
Series.transform() (#663)
DataFrame.select_dtypes() (#662)
DataFrame.transpose() (#664)
Lastly, we added new functionality, especially groupby-related functionality, in the past weeks. We added the following features:
koalas.DataFrame:
koalas.groupby.GroupBy:
Along with the following improvements: