Version 1.7.0
Switch the default plotting backend to Plotly
We switched the default plotting backend from Matplotlib to Plotly (#2029, #2033). In addition, we added more Plotly methods such as DataFrame.plot.kde and Series.plot.kde (#2028).
import databricks.koalas as ks
kdf = ks.DataFrame({
    'a': [1, 2, 2.5, 3, 3.5, 4, 5],
    'b': [1, 2, 3, 4, 5, 6, 7],
    'c': [0.5, 1, 1.5, 2, 2.5, 3, 3.5]})
kdf.plot.hist()
The plotting backend can be switched back to Matplotlib by setting ks.options.plotting.backend to matplotlib.
ks.options.plotting.backend = "matplotlib"
Add Int64Index, Float64Index, DatetimeIndex
We added more types of Index such as Int64Index, Float64Index and DatetimeIndex (#2025, #2066).
Previously, creating an index always returned an Index instance regardless of the data type. Now Int64Index, Float64Index or DatetimeIndex is returned depending on the data type of the index.
>>> import datetime
>>> type(ks.Index([1, 2, 3]))
<class 'databricks.koalas.indexes.numeric.Int64Index'>
>>> type(ks.Index([1.1, 2.5, 3.0]))
<class 'databricks.koalas.indexes.numeric.Float64Index'>
>>> type(ks.Index([datetime.datetime(2021, 3, 9)]))
<class 'databricks.koalas.indexes.datetimes.DatetimeIndex'>
In addition, we added many properties for DatetimeIndex such as year, month, day, hour, minute, second, etc. (#2074) and added APIs for DatetimeIndex such as round(), floor(), ceil(), normalize(), strftime(), month_name() and day_name() (#2082, #2086, #2089).
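The properties and methods listed above share their names with the pandas DatetimeIndex API. As a rough illustration, the sketch below uses plain pandas rather than Koalas, on the assumption that the Koalas versions follow the same semantics. The sample timestamp and the "D" (calendar day) frequency are arbitrary choices made for this example.

```python
import datetime

import pandas as pd

# One sample timestamp: Tuesday, 2021-03-09 14:45:30.
idx = pd.DatetimeIndex([datetime.datetime(2021, 3, 9, 14, 45, 30)])

# Component properties return one value per element of the index.
year, month, day = idx.year[0], idx.month[0], idx.day[0]
hour, minute, second = idx.hour[0], idx.minute[0], idx.second[0]

# Rounding helpers take a frequency string; "D" means calendar day.
floored = idx.floor("D")    # start of the same day
ceiled = idx.ceil("D")      # start of the next day
rounded = idx.round("D")    # 14:45 is past noon, so this rounds up
midnight = idx.normalize()  # time component reset to 00:00:00

# String-producing helpers return an Index of strings.
formatted = idx.strftime("%Y-%m-%d")
month_names = idx.month_name()
day_names = idx.day_name()
```

With Koalas installed, replacing pd.DatetimeIndex with ks.DatetimeIndex should exercise the newly added APIs, although the results then come back as Koalas objects rather than pandas ones.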
Create Index from Series or Index objects
An Index can now be created from Series or Index objects (#2071).
>>> kser = ks.Series([1, 2, 3], name="a", index=[10, 20, 30])
>>> ks.Index(kser)
Int64Index([1, 2, 3], dtype='int64', name='a')
>>> ks.Int64Index(kser)
Int64Index([1, 2, 3], dtype='int64', name='a')
>>> ks.Float64Index(kser)
Float64Index([1.0, 2.0, 3.0], dtype='float64', name='a')
>>> from datetime import datetime
>>> kser = ks.Series([datetime(2021, 3, 1), datetime(2021, 3, 2)], index=[10, 20])
>>> ks.Index(kser)
DatetimeIndex(['2021-03-01', '2021-03-02'], dtype='datetime64[ns]', freq=None)
>>> ks.DatetimeIndex(kser)
DatetimeIndex(['2021-03-01', '2021-03-02'], dtype='datetime64[ns]', freq=None)
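In the outputs above, the new Index takes its values and name from the Series, while the Series' own labels (10, 20, 30) do not appear. The pandas counterpart behaves the same way, which the following sketch illustrates. It uses plain pandas as a stand-in, so the variable names pser and pidx are illustrative and the printed class names may differ from the Koalas output.

```python
import pandas as pd

pser = pd.Series([1, 2, 3], name="a", index=[10, 20, 30])

# Values and name come from the Series; its labels (10, 20, 30) are not carried over.
pidx = pd.Index(pser)
values = pidx.tolist()
name = pidx.name

# Passing an existing Index gives back an Index with the same values and name.
again = pd.Index(pidx)
```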
Extension dtypes support
We added basic extension dtypes support (#2039).
>>> kdf = ks.DataFrame(
...     {
...         "a": [1, 2, None, 3],
...         "b": [4.5, 5.2, 6.1, None],
...         "c": ["A", "B", "C", None],
...         "d": [False, None, True, False],
...     }
... ).astype({"a": "Int32", "b": "Float64", "c": "string", "d": "boolean"})
>>> kdf
      a     b     c      d
0     1   4.5     A  False
1     2   5.2     B   <NA>
2  <NA>   6.1     C   True
3     3  <NA>  <NA>  False
>>> kdf.dtypes
a      Int32
b    Float64
c     string
d    boolean
dtype: object
The following types are supported, depending on the installed pandas version:
pandas >= 0.24: Int8Dtype, Int16Dtype, Int32Dtype, Int64Dtype
pandas >= 1.0: BooleanDtype, StringDtype
pandas >= 1.2: Float32Dtype, Float64Dtype
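Because support depends on the installed pandas, it can help to check which of these dtype classes the local pandas actually exposes. The sketch below does this with simple feature detection. It is not how Koalas itself decides what to support, and the names CANDIDATES, available and missing are hypothetical, introduced only for this example.

```python
import pandas as pd

# Extension dtype class names from the list above.
CANDIDATES = [
    "Int8Dtype", "Int16Dtype", "Int32Dtype", "Int64Dtype",  # pandas >= 0.24
    "BooleanDtype", "StringDtype",                          # pandas >= 1.0
    "Float32Dtype", "Float64Dtype",                         # pandas >= 1.2
]

# Checking for the attribute avoids parsing pandas.__version__ by hand.
available = [name for name in CANDIDATES if hasattr(pd, name)]
missing = [name for name in CANDIDATES if not hasattr(pd, name)]

print("available:", available)
print("missing:", missing)
```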
Binary operations and type casting are supported:
>>> kdf.a + kdf.b
0       5
1       7
2    <NA>
3    <NA>
dtype: Int64
>>> kdf + kdf
      a     b
0     2     8
1     4    10
2  <NA>    12
3     6  <NA>
>>> kdf.a.astype('Float64')
0     1.0
1     2.0
2    <NA>
3     3.0
Name: a, dtype: Float64
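The same two behaviours, missing values propagating through arithmetic and casts between extension dtypes keeping <NA> in place, can be reproduced in plain pandas. The sketch below is a pandas-only illustration with made-up integer data, on the assumption that Koalas mirrors these semantics. The Float64 cast needs pandas >= 1.2.

```python
import pandas as pd

# Nullable integer columns: None becomes <NA> instead of forcing a float column.
pdf = pd.DataFrame({"a": [1, 2, None, 3], "b": [4, 5, 6, None]}).astype(
    {"a": "Int32", "b": "Int32"}
)

# Arithmetic keeps a nullable integer dtype and propagates missing values.
total = pdf.a + pdf.b

# Casting between extension dtypes keeps <NA> where the value was missing.
as_float = pdf.a.astype("Float64")
```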
Other new features, improvements and bug fixes
We added the following new features:
koalas:
Series:
align (#2019)
DataFrame:
Along with the following fixes:
PySpark 3.1.1 Support
Preserve index for statistical functions with axis==1 (#2036)
Use iloc to make sure it retrieves the first element (#2037)
Fix numeric_only to follow pandas (#2035)
Fix DataFrame.merge to work properly (#2060)
Fix astype(str) for some data types (#2040)
Fix binary operations Index by Series (#2046)
Fix bug on pow and rpow (#2047)
Support bool list-like column selection for loc indexer (#2057)
Fix window functions to resolve (#2090)
Refresh GitHub workflow matrix (#2083)
Restructure the hierarchy of Index unit tests (#2080)
Fix to delegate dtypes (#2061)