We improved the type mapping between pandas and Koalas (#1870, #1903): we added more types and string expressions for specifying data types, and fixed mismatches between pandas and Koalas.
Here are some examples:
Added np.float32 and "float32" (matched to FloatType)
>>> ks.Series([10]).astype(np.float32)
0    10.0
dtype: float32
>>> ks.Series([10]).astype("float32")
0    10.0
dtype: float32
Added np.datetime64 and "datetime64[ns]" (matched to TimestampType)
>>> ks.Series(["2020-10-26"]).astype(np.datetime64)
0   2020-10-26
dtype: datetime64[ns]
>>> ks.Series(["2020-10-26"]).astype("datetime64[ns]")
0   2020-10-26
dtype: datetime64[ns]
Fixed np.int to match LongType, not IntegerType.
>>> pd.Series([100]).astype(np.int)
0    100
dtype: int64
>>> ks.Series([100]).astype(np.int)
0    100
dtype: int32  # This is fixed to `int64` now.
Fixed np.float to match DoubleType, not FloatType.
>>> pd.Series([100]).astype(np.float)
0    100.0
dtype: float64
>>> ks.Series([100]).astype(np.float)
0    100.0
dtype: float32  # This is fixed to `float64` now.
We also added a document that describes which pandas data types are supported or unsupported, and how pandas data types map to PySpark data types. See: Type Support In Koalas.
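As a compact summary of the mapping changes in these notes, here is a minimal sketch that uses only the standard library. The `SPARK_TYPE_NAMES` table and the `spark_type_name` helper are hypothetical names chosen for illustration and are not Koalas APIs. The sketch assumes that `np.int` and `np.float` are plain aliases of the built-in `int` and `float`, as they were in the NumPy versions current at the time.

```python
# Hypothetical lookup table summarizing the mappings described in these notes.
# It is an illustration only, not how Koalas implements the conversion.
SPARK_TYPE_NAMES = {
    "float32": "FloatType",             # newly accepted, also as np.float32
    "datetime64[ns]": "TimestampType",  # newly accepted, also as np.datetime64
    int: "LongType",                    # np.int was an alias of built-in int
    float: "DoubleType",                # np.float was an alias of built-in float
}


def spark_type_name(dtype):
    """Return the Spark type name for a dtype key covered by this sketch."""
    try:
        return SPARK_TYPE_NAMES[dtype]
    except KeyError:
        raise TypeError(f"Type {dtype!r} is not covered by this sketch") from None


print(spark_type_name("float32"))  # FloatType
print(spark_type_name(int))        # LongType
```

The linked Type Support In Koalas document remains the authoritative list; this table covers only the four cases mentioned here.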
To improve Koalas' auto-completion in various editors and to avoid misuse of APIs, we added return type annotations to major Koalas objects. These objects include DataFrame, Series, Index, GroupBy, Window objects, etc. (#1852, #1857, #1859, #1863, #1871, #1882, #1884, #1889, #1892, #1894, #1898, #1899, #1900, #1902).
The return type annotations help auto-completion libraries, such as Jedi, to infer the actual data type and provide proper suggestions:
[Figure: Before]
[Figure: After]
The annotations also enable mypy to run static analysis over the method bodies.
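To illustrate why return type annotations matter to such tools, here is a minimal sketch built around a hypothetical `Frame` class (a stand-in for illustration, not a Koalas object). With the annotation present, a tool can read the declared return type directly instead of inferring it from the method body.

```python
class Frame:
    """Hypothetical stand-in for a DataFrame-like class."""

    def __init__(self, rows):
        self.rows = list(rows)

    # The "-> 'Frame'" annotation declares the return type explicitly,
    # so a completion engine can suggest Frame methods after .head().
    def head(self, n: int = 5) -> "Frame":
        return Frame(self.rows[:n])

    # Declares that count() returns an int.
    def count(self) -> int:
        return len(self.rows)


# The declared return types are stored on the functions themselves.
print(Frame.head.__annotations__["return"])   # Frame
print(Frame.count.__annotations__["return"])  # <class 'int'>

# A chained call can be resolved one step at a time from the annotations.
print(Frame(range(10)).head(2).count())  # 2
```

By default mypy skips the bodies of functions that have no annotations at all, which is why adding them also widens what static analysis covers.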
We verified the behaviors of pandas 1.1.4 in Koalas.
Because pandas 1.1.4 introduced a behavior change in MultiIndex.is_monotonic (MultiIndex.is_monotonic_increasing) and MultiIndex.is_monotonic_decreasing (pandas-dev/pandas#37220), Koalas changes its behavior to match (#1881).
We added the following new features:
DataFrame:
__neg__ (#1847)
rename_axis (#1843)
spark.repartition (#1864)
spark.coalesce (#1873)
spark.checkpoint (#1877)
spark.local_checkpoint (#1878)
reindex_like (#1880)
Series:
compare (#1802)
Index:
intersection (#1747)
MultiIndex:
Use SF.repeat in series.str.repeat (#1844)
Remove warning when using cache in the context manager (#1848)
Support a non-string name in Series’ boxplot (#1849)
Calculate fliers correctly in Series.plot.box (#1846)
Show type name rather than type class in error messages (#1851)
Fix DataFrame.spark.hint to reflect internal changes. (#1865)
DataFrame.reindex supports named columns index (#1876)
Separate InternalFrame.index_map into index_spark_column_names and index_names. (#1879)
Fix DataFrame.xs to handle internal changes properly. (#1896)
Explicitly disallow empty list as index_spark_column_names and index_names. (#1895)
Use nullable inferred schema in function apply (#1897)
Introduce InternalFrame.index_level. (#1890)
Remove InternalFrame.index_map. (#1901)
Force the use of Spark’s system default precision and scale when the inferred data type contains DecimalType. (#1904)
Upgrade PyArrow from 1.0.1 to 2.0.0 in CI (#1860)
Fix read_excel to support squeeze argument. (#1905)
Fix to_csv to avoid duplicated option ‘path’ for DataFrameWriter. (#1912)