Version 1.4.0

Better type support

We improved the type mapping between pandas and Koalas (#1870, #1903): we added more types and string expressions for specifying a data type, and fixed mismatches between pandas and Koalas.

Here are some examples:

  • Added np.float32 and "float32" (matched to FloatType)

    >>> ks.Series([10]).astype(np.float32)
    0    10.0
    dtype: float32
    
    >>> ks.Series([10]).astype("float32")
    0    10.0
    dtype: float32
    
  • Added np.datetime64 and "datetime64[ns]" (matched to TimestampType)

    >>> ks.Series(["2020-10-26"]).astype(np.datetime64)
    0   2020-10-26
    dtype: datetime64[ns]
    
    >>> ks.Series(["2020-10-26"]).astype("datetime64[ns]")
    0   2020-10-26
    dtype: datetime64[ns]
    
  • Fixed np.int to match LongType, not IntegerType.

    >>> pd.Series([100]).astype(np.int)
    0    100
    dtype: int64
    
    >>> ks.Series([100]).astype(np.int)
    0    100
    dtype: int32  # Before the fix; this is now `int64`.
    
  • Fixed np.float to match DoubleType, not FloatType.

    >>> pd.Series([100]).astype(np.float)
    0    100.0
    dtype: float64
    
    >>> ks.Series([100]).astype(np.float)
    0    100.0
    dtype: float32  # Before the fix; this is now `float64`.
    

We also added a document that describes which pandas data types are supported or unsupported, and how pandas data types map to PySpark data types. See: Type Support In Koalas.
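The mappings named in the examples above can be summarized as a small lookup table. The sketch below is plain Python and only restates those four mappings; the names `SPARK_TYPE_BY_DTYPE` and `spark_type_name` are hypothetical and are not part of Koalas or its internal implementation.

```python
# Illustrative only: a lookup table restating the mappings from these notes.
# This is NOT how Koalas resolves types internally; all names are hypothetical.
SPARK_TYPE_BY_DTYPE = {
    "float32": "FloatType",
    "datetime64[ns]": "TimestampType",
    "int64": "LongType",      # np.int now resolves here (previously IntegerType)
    "float64": "DoubleType",  # np.float now resolves here (previously FloatType)
}


def spark_type_name(dtype_name: str) -> str:
    """Return the Spark type name for a dtype name covered by the table."""
    try:
        return SPARK_TYPE_BY_DTYPE[dtype_name]
    except KeyError:
        # Anything outside the four documented mappings is out of scope here.
        raise TypeError(f"dtype {dtype_name!r} is not covered by this sketch") from None


for name in SPARK_TYPE_BY_DTYPE:
    print(f"{name} -> {spark_type_name(name)}")
```

For the full list of supported and unsupported types, the Type Support In Koalas document remains the reference; this table covers only the cases mentioned in this release.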

Return type annotations for major Koalas objects

To improve Koalas’ auto-completion in various editors and to avoid misuse of APIs, we added return type annotations to major Koalas objects, including DataFrame, Series, Index, GroupBy, and Window objects (#1852, #1857, #1859, #1863, #1871, #1882, #1884, #1889, #1892, #1894, #1898, #1899, #1900, #1902).

The return type annotations help auto-completion libraries, such as Jedi, to infer the actual data type and provide proper suggestions:

  • Before: [screenshot]

  • After: [screenshot]

The annotations also let mypy perform static analysis over the method bodies.
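As a rough illustration of why return type annotations matter to such tools, the sketch below uses two hypothetical stand-in classes, `MiniFrame` and `MiniSeries` (not Koalas classes), and reads the declared return type with the standard library's `typing.get_type_hints`. Completion libraries and type checkers rely on the same declared information to work out what a chained call returns.

```python
from typing import get_type_hints


class MiniSeries:
    """Hypothetical stand-in for a Series-like object."""

    def __init__(self, values: list) -> None:
        self.values = values

    def head(self, n: int = 5) -> "MiniSeries":
        # Returns a new MiniSeries holding the first n values.
        return MiniSeries(self.values[:n])


class MiniFrame:
    """Hypothetical stand-in for a DataFrame-like object."""

    def __init__(self, columns: dict) -> None:
        self.columns = columns

    def column(self, name: str) -> MiniSeries:
        # The declared return type tells tools the result has MiniSeries methods.
        return MiniSeries(self.columns[name])


# The declared return type can be read without calling the method.
hints = get_type_hints(MiniFrame.column)
print(hints["return"].__name__)  # prints MiniSeries

# A chained call that a completer or type checker can follow via annotations.
first_two = MiniFrame({"a": [1, 2, 3]}).column("a").head(2)
print(first_two.values)  # prints [1, 2]
```

Without the `-> MiniSeries` annotation, a tool would have to infer the result of `column(...)` from the method body, which is often less reliable for suggesting `.head` on the result.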

pandas 1.1.4 support

We verified the behaviors of pandas 1.1.4 in Koalas.

pandas 1.1.4 introduced a behavior change in MultiIndex.is_monotonic (MultiIndex.is_monotonic_increasing) and MultiIndex.is_monotonic_decreasing (pandas-dev/pandas#37220), so Koalas changed its behavior to match (#1881).

Other new features and improvements

We added the following new features:

DataFrame:

Series:

Index:

MultiIndex:

Other improvements and bug fixes

  • Use SF.repeat in series.str.repeat (#1844)

  • Remove the warning when using cache in a context manager (#1848)

  • Support a non-string name in Series’ boxplot (#1849)

  • Calculate fliers correctly in Series.plot.box (#1846)

  • Show type name rather than type class in error messages (#1851)

  • Fix DataFrame.spark.hint to reflect internal changes. (#1865)

  • DataFrame.reindex supports named columns index (#1876)

  • Separate InternalFrame.index_map into index_spark_column_names and index_names. (#1879)

  • Fix DataFrame.xs to handle internal changes properly. (#1896)

  • Explicitly disallow an empty list as index_spark_column_names and index_names. (#1895)

  • Use nullable inferred schema in function apply (#1897)

  • Introduce InternalFrame.index_level. (#1890)

  • Remove InternalFrame.index_map. (#1901)

  • Force the use of Spark’s system default precision and scale when the inferred data type contains DecimalType. (#1904)

  • Upgrade PyArrow from 1.0.1 to 2.0.0 in CI (#1860)

  • Fix read_excel to support squeeze argument. (#1905)

  • Fix to_csv to avoid duplicated option ‘path’ for DataFrameWriter. (#1912)