Version 1.4.0

Better type support

We improved the type mapping between pandas and Koalas (#1870, #1903). We added more types and string expressions that can be used to specify a data type, and fixed mismatches between pandas and Koalas.

Here are some examples:

Added np.float32 and "float32" (matched to FloatType)

>>> import numpy as np
>>> import pandas as pd
>>> import databricks.koalas as ks
>>> ks.Series([10]).astype(np.float32)
0    10.0
dtype: float32

>>> ks.Series([10]).astype("float32")
0    10.0
dtype: float32

Added np.datetime64 and "datetime64[ns]" (matched to TimestampType)

>>> ks.Series(["2020-10-26"]).astype(np.datetime64)
0   2020-10-26
dtype: datetime64[ns]

>>> ks.Series(["2020-10-26"]).astype("datetime64[ns]")
0   2020-10-26
dtype: datetime64[ns]

Fixed np.int to match LongType, not IntegerType.

>>> pd.Series([100]).astype(np.int)
0    100
dtype: int64

>>> ks.Series([100]).astype(np.int)
0    100
dtype: int32  # This is fixed to `int64` now.

Fixed np.float to match DoubleType, not FloatType.

>>> pd.Series([100]).astype(np.float)
0    100.0
dtype: float64

>>> ks.Series([100]).astype(np.float)
0    100.0
dtype: float32  # This is fixed to `float64` now.

We also added a document that describes which pandas data types are supported or unsupported, and how pandas data types map to PySpark data types. See: Type Support In Koalas.
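The mappings named in the examples above can be summarized as a small lookup table. This is only an illustrative sketch: the dictionary `SPARK_TYPE_BY_DTYPE` and the helper `spark_type_name` are hypothetical names, not part of Koalas, and the table covers only the four cases listed in these notes.

```python
# Illustrative lookup only. The entries come from the examples above;
# the dict and the helper are hypothetical, not Koalas internals.
SPARK_TYPE_BY_DTYPE = {
    "float32": "FloatType",
    "datetime64[ns]": "TimestampType",
    int: "LongType",      # np.int was an alias of the builtin int
    float: "DoubleType",  # np.float was an alias of the builtin float
}


def spark_type_name(dtype):
    """Return the Spark type name listed above for a dtype spec."""
    try:
        return SPARK_TYPE_BY_DTYPE[dtype]
    except KeyError:
        raise TypeError(f"dtype {dtype!r} is not covered by this sketch") from None


print(spark_type_name("float32"))         # FloatType
print(spark_type_name("datetime64[ns]"))  # TimestampType
print(spark_type_name(int))               # LongType
print(spark_type_name(float))             # DoubleType
```

For the full and authoritative mapping, refer to the Type Support In Koalas document rather than this table.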

Return type annotations for major Koalas objects

To improve Koalas' auto-completion in various editors and avoid misuse of APIs, we added return type annotations to major Koalas objects. These objects include DataFrame, Series, Index, GroupBy, Window objects, etc. (#1852, #1857, #1859, #1863, #1871, #1882, #1884, #1889, #1892, #1894, #1898, #1899, #1900, #1902).

The return type annotations help auto-completion libraries, such as Jedi, to infer the actual data type and provide proper suggestions:

[Image: Before]

[Image: After]

The annotations also let mypy run static analysis over the method bodies.
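To see why a return type annotation matters to such tools, here is a minimal plain-Python sketch. `ToyFrame` and `ToySeries` are hypothetical stand-ins, not Koalas classes; the point is only that the annotation on `column` is machine-readable, so a tool can tell what the method returns without running it.

```python
from typing import get_type_hints


class ToySeries:
    """Hypothetical stand-in for a Series-like object."""

    def __init__(self, values: list) -> None:
        self.values = values

    def head(self, n: int = 5) -> "ToySeries":
        return ToySeries(self.values[:n])


class ToyFrame:
    """Hypothetical stand-in for a DataFrame-like object."""

    def __init__(self, columns: dict) -> None:
        self.columns = columns

    def column(self, name: str) -> ToySeries:
        # The annotation tells tools that the result is a ToySeries,
        # so completion after `frame.column("a").` can offer `head`.
        return ToySeries(self.columns[name])


# The declared return type can be read without calling the method.
hints = get_type_hints(ToyFrame.column)
print(hints["return"].__name__)  # ToySeries

frame = ToyFrame({"a": [1, 2, 3]})
print(frame.column("a").head(2).values)  # [1, 2]
```

Without the `-> ToySeries` annotation, a completion engine would have to infer the return type from the method body, which is not always possible.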

pandas 1.1.4 support

We verified Koalas against the behavior of pandas 1.1.4.

pandas 1.1.4 introduced a behavior change in MultiIndex.is_monotonic (MultiIndex.is_monotonic_increasing) and MultiIndex.is_monotonic_decreasing (pandas-dev/pandas#37220), so Koalas now follows the new behavior (#1881).
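As a reminder of what these properties check, here is a plain-Python sketch that assumes multi-level index entries are compared as tuples in lexicographic order. The function names mirror the property names but are hypothetical helpers. The sketch does not reproduce the specific change in pandas-dev/pandas#37220, and it ignores missing values.

```python
from typing import Sequence, Tuple


def is_monotonic_increasing(entries: Sequence[Tuple]) -> bool:
    """True if every tuple is >= the one before it (lexicographic order)."""
    return all(a <= b for a, b in zip(entries, entries[1:]))


def is_monotonic_decreasing(entries: Sequence[Tuple]) -> bool:
    """True if every tuple is <= the one before it (lexicographic order)."""
    return all(a >= b for a, b in zip(entries, entries[1:]))


print(is_monotonic_increasing([("a", 1), ("a", 2), ("b", 1)]))  # True
print(is_monotonic_increasing([("a", 2), ("a", 1), ("b", 1)]))  # False
print(is_monotonic_decreasing([("b", 2), ("b", 1), ("a", 5)]))  # True
```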

Other new features and improvements

We added the following new features:

DataFrame:

__neg__ (#1847)
rename_axis (#1843)
spark.repartition (#1864)
spark.coalesce (#1873)
spark.checkpoint (#1877)
spark.local_checkpoint (#1878)
reindex_like (#1880)

Series:

rename_axis (#1843)
compare (#1802)
reindex_like (#1880)

Index:

intersection (#1747)

MultiIndex:

intersection (#1747)

Other improvements and bug fixes

Use SF.repeat in series.str.repeat (#1844)
Remove warning when using cache in the context manager (#1848)
Support a non-string name in Series’ boxplot (#1849)
Calculate fliers correctly in Series.plot.box (#1846)
Show type name rather than type class in error messages (#1851)
Fix DataFrame.spark.hint to reflect internal changes. (#1865)
DataFrame.reindex supports named columns index (#1876)
Separate InternalFrame.index_map into index_spark_column_names and index_names. (#1879)
Fix DataFrame.xs to handle internal changes properly. (#1896)
Explicitly disallow empty list as index_spark_column_names and index_names. (#1895)
Use nullable inferred schema in function apply (#1897)
Introduce InternalFrame.index_level. (#1890)
Remove InternalFrame.index_map. (#1901)
Force the use of Spark's system default precision and scale when the inferred data type contains DecimalType. (#1904)
Upgrade PyArrow from 1.0.1 to 2.0.0 in CI (#1860)
Fix read_excel to support squeeze argument. (#1905)
Fix to_csv to avoid duplicated option ‘path’ for DataFrameWriter. (#1912)
