Version 1.4.0
Better type support
We improved the type mapping between pandas and Koalas (#1870, #1903). We added more types and string expressions for specifying data types, and fixed mismatches between pandas and Koalas.
Here are some examples:
Added np.float32 and "float32" (matched to FloatType).

    >>> ks.Series([10]).astype(np.float32)
    0    10.0
    dtype: float32
    >>> ks.Series([10]).astype("float32")
    0    10.0
    dtype: float32
Added np.datetime64 and "datetime64[ns]" (matched to TimestampType).

    >>> ks.Series(["2020-10-26"]).astype(np.datetime64)
    0   2020-10-26
    dtype: datetime64[ns]
    >>> ks.Series(["2020-10-26"]).astype("datetime64[ns]")
    0   2020-10-26
    dtype: datetime64[ns]
Fixed np.int to match LongType, not IntegerType.

    >>> pd.Series([100]).astype(np.int)
    0    100
    dtype: int64
    >>> ks.Series([100]).astype(np.int)
    0    100
    dtype: int32  # This is fixed to `int64` now.
Fixed np.float to match DoubleType, not FloatType.

    >>> pd.Series([100]).astype(np.float)
    0    100.0
    dtype: float64
    >>> ks.Series([100]).astype(np.float)
    0    100.0
    dtype: float32  # This is fixed to `float64` now.
We also added a document that describes which pandas data types are supported or unsupported, and how pandas data types map to PySpark data types. See: Type Support In Koalas.
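The mappings in this section can be summarized as a small lookup table. This is only an illustrative sketch: the names `DTYPE_TO_SPARK_TYPE` and `spark_type_name` are hypothetical, the keys "int" and "float" stand in for np.int and np.float, and the table is not how Koalas implements its type mapping.

```python
# Illustrative lookup table only; not the Koalas implementation.
# Keys are dtype spellings from this section; values are the names of
# the PySpark types they are matched to as of this release.
DTYPE_TO_SPARK_TYPE = {
    "float32": "FloatType",
    "datetime64[ns]": "TimestampType",
    "int": "LongType",      # stands in for np.int; previously IntegerType
    "float": "DoubleType",  # stands in for np.float; previously FloatType
}


def spark_type_name(dtype: str) -> str:
    """Return the PySpark type name recorded for a dtype spelling."""
    try:
        return DTYPE_TO_SPARK_TYPE[dtype]
    except KeyError:
        raise TypeError(f"no mapping recorded for dtype {dtype!r}") from None


for name in DTYPE_TO_SPARK_TYPE:
    print(f"{name:>15} -> {spark_type_name(name)}")
```

The table covers only the four cases listed here; the Type Support In Koalas document is the place to look for the full mapping.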
Return type annotations for major Koalas objects
To improve Koalas' auto-completion in various editors and avoid misuse of APIs, we added return type annotations to major Koalas objects. These objects include DataFrame, Series, Index, GroupBy, and Window objects, among others (#1852, #1857, #1859, #1863, #1871, #1882, #1884, #1889, #1892, #1894, #1898, #1899, #1900, #1902).
The return type annotations help auto-completion libraries, such as Jedi, to infer the actual data type and provide proper suggestions:
[Figure: editor auto-completion suggestions, Before and After]
The annotations also allow mypy to perform static analysis over the method bodies.
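As a rough illustration of why return annotations matter to tooling, the sketch below uses two hypothetical toy classes (`ToySeries` and `ToyFrame`, which are not Koalas classes) and reads the annotation back with `typing.get_type_hints`, much as a completion library or type checker can.

```python
from typing import Dict, List, get_type_hints


class ToySeries:
    """Hypothetical stand-in for a Series-like object."""

    def __init__(self, values: List[int]) -> None:
        self.values = values

    def total(self) -> int:
        return sum(self.values)


class ToyFrame:
    """Hypothetical stand-in for a DataFrame-like object."""

    def __init__(self, data: Dict[str, List[int]]) -> None:
        self.data = data

    # The return annotation tells editors and type checkers that the
    # result is a ToySeries, so ToySeries methods can be suggested on it.
    def column(self, name: str) -> ToySeries:
        return ToySeries(self.data[name])


# Tools can discover the declared return type without calling the method.
return_type = get_type_hints(ToyFrame.column)["return"]
print(return_type.__name__)  # ToySeries

frame = ToyFrame({"a": [1, 2, 3]})
print(frame.column("a").total())  # 6
```

Without the `-> ToySeries` annotation, a tool would have to infer the type from the method body, which is not always possible.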
pandas 1.1.4 support
We verified the behaviors of pandas 1.1.4 in Koalas.
As pandas 1.1.4 introduced a behavior change related to MultiIndex.is_monotonic (MultiIndex.is_monotonic_increasing) and MultiIndex.is_monotonic_decreasing (pandas-dev/pandas#37220), Koalas also changes the behavior (#1881).
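For readers unfamiliar with these properties, the sketch below shows the general idea of a monotonicity check over multi-level entries, treating each entry as a tuple compared lexicographically. It is a plain-Python approximation with hypothetical function names; it does not reproduce the pandas or Koalas implementation, and it does not model the specific behavior change referenced above.

```python
from typing import Sequence, Tuple


def is_monotonic_increasing(entries: Sequence[Tuple]) -> bool:
    """True if every tuple is >= the previous one in lexicographic order."""
    return all(prev <= cur for prev, cur in zip(entries, entries[1:]))


def is_monotonic_decreasing(entries: Sequence[Tuple]) -> bool:
    """True if every tuple is <= the previous one in lexicographic order."""
    return all(prev >= cur for prev, cur in zip(entries, entries[1:]))


print(is_monotonic_increasing([("a", 1), ("a", 2), ("b", 1)]))  # True
print(is_monotonic_increasing([("a", 2), ("a", 1), ("b", 1)]))  # False
print(is_monotonic_decreasing([("b", 1), ("a", 2), ("a", 1)]))  # True
```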
Other new features and improvements
We added the following new features:
DataFrame:
__neg__ (#1847)
rename_axis (#1843)
spark.repartition (#1864)
spark.coalesce (#1873)
spark.checkpoint (#1877)
spark.local_checkpoint (#1878)
reindex_like (#1880)
Series:
Index:
intersection (#1747)
MultiIndex:
intersection (#1747)
Other improvements and bug fixes
Use SF.repeat in series.str.repeat (#1844)
Remove the warning when using cache in a context manager (#1848)
Support a non-string name in Series’ boxplot (#1849)
Calculate fliers correctly in Series.plot.box (#1846)
Show type name rather than type class in error messages (#1851)
Fix DataFrame.spark.hint to reflect internal changes. (#1865)
DataFrame.reindex supports named columns index (#1876)
Separate InternalFrame.index_map into index_spark_column_names and index_names. (#1879)
Fix DataFrame.xs to handle internal changes properly. (#1896)
Explicitly disallow empty list as index_spark_column_names and index_names. (#1895)
Use nullable inferred schema in function apply (#1897)
Introduce InternalFrame.index_level. (#1890)
Remove InternalFrame.index_map. (#1901)
Force the use of Spark’s system default precision and scale when the inferred data type contains DecimalType. (#1904)
Upgrade PyArrow from 1.0.1 to 2.0.0 in CI (#1860)
Fix read_excel to support squeeze argument. (#1905)
Fix to_csv to avoid duplicated option ‘path’ for DataFrameWriter. (#1912)