We improved plotting support by implementing pie, histogram, and box plots with the Plotly plot backend. Koalas can now plot data with Plotly via:
DataFrame.plot.pie and Series.plot.pie (#1971)
DataFrame.plot.hist and Series.plot.hist (#1999)
Series.plot.box (#2007)
In addition, we optimized histogram calculation to run as a single pass over the DataFrame (#1997), instead of launching a separate job for each Series in the DataFrame.
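The single-pass idea can be illustrated with a small plain-Python sketch. This is not the Koalas implementation, which runs on Spark; the function name `histogram_single_pass`, the row-dictionary layout, and the shared bin edges are assumptions made only for illustration.

```python
from bisect import bisect_right


def histogram_single_pass(rows, columns, edges):
    """Count the values of every column into shared bins in one pass over rows.

    Hypothetical helper for illustration; not part of Koalas.
    """
    n_bins = len(edges) - 1
    counts = {col: [0] * n_bins for col in columns}
    for row in rows:  # the data is traversed exactly once
        for col in columns:
            value = row.get(col)
            # Skip missing values and values outside the bin range.
            if value is None or not (edges[0] <= value <= edges[-1]):
                continue
            # Bins are half-open [lo, hi), except the last one, which is closed.
            idx = min(bisect_right(edges, value) - 1, n_bins - 1)
            counts[col][idx] += 1
    return counts


rows = [{"a": 1, "b": 4}, {"a": 2, "b": 5}, {"a": 9, "b": None}]
hist = histogram_single_pass(rows, ["a", "b"], edges=[0, 5, 10])
```

The alternative, one scan per column, reads the same rows once for every column, which is the cost the optimization avoids.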
Operations between Series and Index are now supported, as shown below (#1996):
>>> kser = ks.Series([1, 2, 3, 4, 5, 6, 7])
>>> kidx = ks.Index([0, 1, 2, 3, 4, 5, 6])
>>> (kser + 1 + 10 * kidx).sort_index()
0     2
1    13
2    24
3    35
4    46
5    57
6    68
dtype: int64
>>> (kidx + 1 + 10 * kser).sort_index()
0    11
1    22
2    33
3    44
4    55
5    66
6    77
dtype: int64
We added support for setting a column via attribute assignment in DataFrame (#1989).
>>> kdf = ks.DataFrame({'A': [1, 2, 3, None]})
>>> kdf.A = kdf.A.fillna(kdf.A.median())
>>> kdf
     A
0  1.0
1  2.0
2  3.0
3  2.0
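To show how attribute assignment can be routed to a column, here is a minimal plain-Python sketch. `TinyFrame` is a hypothetical toy class, not the Koalas DataFrame. The sketch assumes the pandas-style convention that assignment updates a column only when a column of that name already exists, and otherwise sets an ordinary Python attribute through `object.__setattr__`.

```python
class TinyFrame:
    """Hypothetical toy container for illustration; not the Koalas DataFrame."""

    def __init__(self, data):
        # Bypass our own __setattr__ while setting up internal state.
        object.__setattr__(self, "_columns", dict(data))

    def __getattr__(self, name):
        # Called only when normal lookup fails, so real attributes win.
        columns = object.__getattribute__(self, "_columns")
        if name in columns:
            return columns[name]
        raise AttributeError(name)

    def __setattr__(self, name, value):
        if name in self._columns:
            # The name matches an existing column: replace its values.
            self._columns[name] = list(value)
        else:
            # Otherwise fall back to a plain Python attribute.
            object.__setattr__(self, name, value)


frame = TinyFrame({"A": [1.0, 2.0, 3.0, None]})
frame.A = [2.0 if v is None else v for v in frame.A]  # updates column "A"
frame.note = "not a column"                           # ordinary attribute
```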
We added the following new features:
Series:
factorize (#1972)
sem (#1993)
DataFrame:
insert (#1983)
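For readers unfamiliar with `factorize` and `sem`, the following plain-Python sketch shows the pandas-style semantics these methods follow. It is an illustration under stated assumptions, not the Koalas code: missing values are modelled as `None`, missing entries receive the sentinel code `-1`, and the standard error uses the sample standard deviation (ddof of 1).

```python
import math
import statistics


def factorize(values, na_sentinel=-1):
    """Encode values as integer codes plus the list of distinct values."""
    uniques, codes, seen = [], [], {}
    for v in values:
        if v is None:
            codes.append(na_sentinel)  # missing values get the sentinel code
            continue
        if v not in seen:
            seen[v] = len(uniques)     # next free code, in order of appearance
            uniques.append(v)
        codes.append(seen[v])
    return codes, uniques


def sem(values):
    """Standard error of the mean: sample std (ddof=1) divided by sqrt(n)."""
    present = [v for v in values if v is not None]
    return statistics.stdev(present) / math.sqrt(len(present))


codes, uniques = factorize(["b", None, "a", "c", "b"])
standard_error = sem([1, 2, 3, None])
```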
In addition, we implemented new parameters:
Add min_count parameter for Frame.sum. (#1978)
Added ddof parameter for GroupBy.std() and GroupBy.var() (#1994)
Support ddof parameter for std and var. (#1986)
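The meaning of the two parameters can be sketched in plain Python, following the pandas conventions that Koalas mirrors. The helper names below are hypothetical and missing values are modelled as `None`. `min_count` is the number of non-missing values required before a sum is reported, and `ddof` is subtracted from the sample size in the variance denominator.

```python
import math


def sum_min_count(values, min_count=0):
    """Sum non-missing values; return NaN if fewer than min_count are present."""
    present = [v for v in values if v is not None]
    if len(present) < min_count:
        return float("nan")
    return sum(present)


def var(values, ddof=1):
    """Variance with denominator n - ddof (ddof=1 sample, ddof=0 population)."""
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present)
    return sum((v - mean) ** 2 for v in present) / (len(present) - ddof)


def std(values, ddof=1):
    """Standard deviation as the square root of var with the same ddof."""
    return math.sqrt(var(values, ddof=ddof))


all_missing_default = sum_min_count([None, None])             # 0
all_missing_strict = sum_min_count([None, None], min_count=1)  # NaN
sample_var = var([1, 2, 3, 4], ddof=1)
population_var = var([1, 2, 3, 4], ddof=0)
```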
Along with the following fixes:
Fix stat functions with no numeric columns. (#1967)
Fix DataFrame.replace with NaN/None values (#1962)
Fix cumsum and cumprod. (#1982)
Use Python type name instead of Spark’s in error messages. (#1985)
Use object.__setattr__ in Series. (#1991)
Adjust Series.mode to match pandas Series.mode (#1995)
Adjust data when all the values in a column are nulls. (#2004)
Fix as_spark_type to not support “bigint”. (#2011)