Version 1.8.0

Koalas 1.8.0 is the last minor release of Koalas, because Koalas will be officially included in PySpark in the upcoming Apache Spark 3.2. With Apache Spark 3.2 or later, please use Apache Spark directly.

Categorical type and ExtensionDtype

We added support for pandas’ categorical type (#2064, #2106).

>>> s = ks.Series(list("abbccc"), dtype="category")
>>> s
0    a
1    b
2    b
3    c
4    c
5    c
dtype: category
Categories (3, object): ['a', 'b', 'c']
>>> s.cat.categories
Index(['a', 'b', 'c'], dtype='object')
>>> s.cat.codes
0    0
1    1
2    1
3    2
4    2
5    2
dtype: int8
>>> idx = ks.CategoricalIndex(list("abbccc"))
>>> idx
CategoricalIndex(['a', 'b', 'b', 'c', 'c', 'c'],
                 categories=['a', 'b', 'c'], ordered=False, dtype='category')

>>> idx.codes
Int64Index([0, 1, 1, 2, 2, 2], dtype='int64')
>>> idx.categories
Index(['a', 'b', 'c'], dtype='object')

We also added support for ExtensionDtype as a type argument when annotating return types (#2120, #2123, #2132, #2127, #2126, #2125, #2124):

def func() -> ks.Series[pd.Int32Dtype()]:
    ...
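For readers new to categorical data, the relationship between `categories` and `codes` in the session earlier in this section can be sketched in plain Python. This only illustrates the encoding and is not how Koalas or pandas implement it. The helper name `encode_categories` is hypothetical, and the sketch assumes the values are hashable and sortable.

```python
def encode_categories(values):
    """Return (categories, codes) for a list of hashable, sortable values.

    Hypothetical helper for illustration only; not part of Koalas or pandas.
    """
    # Categories are the sorted distinct values, as in the session above.
    categories = sorted(set(values))
    # Map each category to its integer position in the categories list.
    position = {cat: i for i, cat in enumerate(categories)}
    # Each value is stored as the position of its category.
    codes = [position[v] for v in values]
    return categories, codes


categories, codes = encode_categories(list("abbccc"))
print(categories)  # ['a', 'b', 'c']
print(codes)       # [0, 1, 1, 2, 2, 2]

# Decoding looks each code up in the categories list.
decoded = [categories[c] for c in codes]
print(decoded == list("abbccc"))  # True
```

Storing small integer codes plus one copy of each distinct value is what makes the categorical type compact when a column has few distinct values.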

Other new features, improvements and bug fixes

We added the following new features:

DataFrame:

Series:

DatetimeIndex:

  • indexer_between_time (#2104)

  • indexer_at_time (#2109)

  • between_time (#2111)
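As a rough guide to what the time-of-day indexers listed above compute, here is a plain-Python sketch. It assumes the semantics of the pandas methods of the same names (inclusive endpoints by default, and a window that wraps past midnight when the start time is later than the end time). The function names `positions_between_time` and `positions_at_time` are hypothetical, and the library methods return their own array or index types rather than Python lists.

```python
from datetime import datetime, time


def positions_between_time(timestamps, start, end,
                           include_start=True, include_end=True):
    """Return positions whose time of day lies between start and end.

    Hypothetical helper for illustration only; not the Koalas implementation.
    """
    def after_start(t):
        return t >= start if include_start else t > start

    def before_end(t):
        return t <= end if include_end else t < end

    positions = []
    for i, ts in enumerate(timestamps):
        t = ts.time()
        if start <= end:
            keep = after_start(t) and before_end(t)
        else:
            # The window wraps past midnight, e.g. 23:00 to 01:00.
            keep = after_start(t) or before_end(t)
        if keep:
            positions.append(i)
    return positions


def positions_at_time(timestamps, at):
    """Return positions whose time of day equals `at` exactly."""
    return [i for i, ts in enumerate(timestamps) if ts.time() == at]


index = [
    datetime(2021, 1, 1, 0, 0),
    datetime(2021, 1, 1, 0, 20),
    datetime(2021, 1, 1, 0, 40),
    datetime(2021, 1, 1, 1, 0),
]
print(positions_between_time(index, time(0, 15), time(0, 45)))  # [1, 2]
print(positions_at_time(index, time(0, 40)))                    # [2]
```

The result is a list of integer positions rather than the matching rows, so it can be used to select from any data aligned with the index.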

This release also includes the following fixes and improvements:

  • Support tuple to (DataFrame|Series).replace() (#2095)

  • Check index_dtype and data_dtypes more strictly (#2100)

  • Return actual values via toPandas (#2077)

  • Add lines and orient to read_json and to_json to improve error message (#2110)

  • Fix isin to accept numpy array (#2103)

  • Allow multi-index column names for inferring return type schema with names (#2117)

  • Add a short JDBC user guide (#2148)

  • Remove upper bound pandas 1.2 (#2141)

  • Standardize exceptions of arithmetic operations on Datetime-like data (#2101)