We switched the default plotting backend from Matplotlib to Plotly (#2029, #2033). In addition, we added more Plotly methods such as DataFrame.plot.kde and Series.plot.kde (#2028).
import databricks.koalas as ks

kdf = ks.DataFrame({
    'a': [1, 2, 2.5, 3, 3.5, 4, 5],
    'b': [1, 2, 3, 4, 5, 6, 7],
    'c': [0.5, 1, 1.5, 2, 2.5, 3, 3.5]})
kdf.plot.hist()
[Figure: Koalas Plotly histogram plot]
The plotting backend can be switched back to Matplotlib by setting ks.options.plotting.backend to "matplotlib".
ks.options.plotting.backend = "matplotlib"
We added more types of Index, such as Int64Index, Float64Index and DatetimeIndex (#2025, #2066).
Previously, creating an index always returned an Index instance regardless of the data type.
Now an Int64Index, Float64Index or DatetimeIndex is returned, depending on the data type of the index.
>>> import datetime
>>> type(ks.Index([1, 2, 3]))
<class 'databricks.koalas.indexes.numeric.Int64Index'>
>>> type(ks.Index([1.1, 2.5, 3.0]))
<class 'databricks.koalas.indexes.numeric.Float64Index'>
>>> type(ks.Index([datetime.datetime(2021, 3, 9)]))
<class 'databricks.koalas.indexes.datetimes.DatetimeIndex'>
In addition, we added many properties for DatetimeIndex such as year, month, day, hour, minute, second, etc. (#2074) and added APIs for DatetimeIndex such as round(), floor(), ceil(), normalize(), strftime(), month_name() and day_name() (#2082, #2086, #2089).
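To make the semantics of a few of these DatetimeIndex methods concrete without a Spark session, the following minimal sketch reproduces what they compute for a single timestamp using only Python's standard library. It is not Koalas code: the helper names (floor_to_hour, ceil_to_hour, normalize) are hypothetical, and it assumes the Koalas methods follow the pandas behaviour they mirror, applied element-wise over the index.

```python
import calendar
from datetime import datetime, timedelta

# One sample timestamp standing in for a single element of a DatetimeIndex.
ts = datetime(2021, 3, 9, 14, 37, 52)


def floor_to_hour(t):
    """Hypothetical helper: round down to the start of the hour."""
    return t.replace(minute=0, second=0, microsecond=0)


def ceil_to_hour(t):
    """Hypothetical helper: round up to the next hour unless already on one."""
    floored = floor_to_hour(t)
    return floored if floored == t else floored + timedelta(hours=1)


def normalize(t):
    """Hypothetical helper: reset the time of day to midnight."""
    return t.replace(hour=0, minute=0, second=0, microsecond=0)


# Field accessors correspond to the year/month/day/hour/minute/second properties.
fields = (ts.year, ts.month, ts.day, ts.hour, ts.minute, ts.second)

floored = floor_to_hour(ts)
ceiled = ceil_to_hour(ts)
midnight = normalize(ts)
formatted = ts.strftime("%Y-%m-%d")

# Month and weekday names; these are English under the default locale.
month = calendar.month_name[ts.month]
weekday = calendar.day_name[ts.weekday()]
```

On a real DatetimeIndex the same operations would be requested with a frequency string (for example, flooring to the hour) and return a new index rather than a single value.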
An Index can now be created from a Series or Index object (#2071).
>>> kser = ks.Series([1, 2, 3], name="a", index=[10, 20, 30])
>>> ks.Index(kser)
Int64Index([1, 2, 3], dtype='int64', name='a')
>>> ks.Int64Index(kser)
Int64Index([1, 2, 3], dtype='int64', name='a')
>>> ks.Float64Index(kser)
Float64Index([1.0, 2.0, 3.0], dtype='float64', name='a')
>>> from datetime import datetime
>>> kser = ks.Series([datetime(2021, 3, 1), datetime(2021, 3, 2)], index=[10, 20])
>>> ks.Index(kser)
DatetimeIndex(['2021-03-01', '2021-03-02'], dtype='datetime64[ns]', freq=None)
>>> ks.DatetimeIndex(kser)
DatetimeIndex(['2021-03-01', '2021-03-02'], dtype='datetime64[ns]', freq=None)
We added basic extension dtypes support (#2039).
>>> kdf = ks.DataFrame(
...     {
...         "a": [1, 2, None, 3],
...         "b": [4.5, 5.2, 6.1, None],
...         "c": ["A", "B", "C", None],
...         "d": [False, None, True, False],
...     }
... ).astype({"a": "Int32", "b": "Float64", "c": "string", "d": "boolean"})
>>> kdf
      a    b     c      d
0     1  4.5     A  False
1     2  5.2     B   <NA>
2  <NA>  6.1     C   True
3     3  NaN  <NA>  False
>>> kdf.dtypes
a      Int32
b    float64
c     string
d    boolean
dtype: object
The following types are supported, depending on the installed pandas version:
pandas >= 0.24
Int8Dtype
Int16Dtype
Int32Dtype
Int64Dtype
pandas >= 1.0
BooleanDtype
StringDtype
pandas >= 1.2
Float32Dtype
Float64Dtype
Binary operations and type casting are supported:
>>> kdf.a + kdf.b
0       5
1       7
2    <NA>
3    <NA>
dtype: Int64
>>> kdf + kdf
      a     b
0     2     8
1     4    10
2  <NA>    12
3     6  <NA>
>>> kdf.a.astype('Float64')
0     1.0
1     2.0
2    <NA>
3     3.0
Name: a, dtype: Float64
We added the following new features:
General:
ks.date_range (#2081)
ks.read_orc (#2017)
Series:
align (#2019)
DataFrame:
to_orc (#2024)
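As an illustration of what the new Series align computes, the sketch below aligns two label-to-value mappings on the union of their labels, filling the gaps with a missing value. It is plain Python rather than Koalas code: align_outer is a hypothetical helper, plain dicts stand in for Series, and it assumes align follows the pandas default of an outer join.

```python
def align_outer(left, right, fill_value=None):
    """Hypothetical helper: align two label->value mappings on the union of labels.

    Labels missing from one side are filled with fill_value, so both results
    end up with exactly the same labels in the same (sorted) order.
    """
    labels = sorted(set(left) | set(right))
    aligned_left = {label: left.get(label, fill_value) for label in labels}
    aligned_right = {label: right.get(label, fill_value) for label in labels}
    return aligned_left, aligned_right


# Two "series" whose indexes only partially overlap.
left = {10: 1, 20: 2, 30: 3}
right = {20: 200, 30: 300, 40: 400}

aligned_left, aligned_right = align_outer(left, right)
```

After alignment both mappings share the labels 10, 20, 30 and 40, which is what makes subsequent element-wise operations between the two well defined.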
Along with the following fixes:
PySpark 3.1.1 Support
Preserve index for statistical functions with axis==1 (#2036)
Use iloc to make sure it retrieves the first element (#2037)
Fix numeric_only to follow pandas (#2035)
Fix DataFrame.merge to work properly (#2060)
Fix astype(str) for some data types (#2040)
Fix binary operations Index by Series (#2046)
Fix bug on pow and rpow (#2047)
Support bool list-like column selection for loc indexer (#2057)
Fix window functions to resolve (#2090)
Refresh GitHub workflow matrix (#2083)
Restructure the hierarchy of Index unit tests (#2080)
Fix to delegate dtypes (#2061)