Version 1.1.0

API extensions

We added support for API extensions (#1617).

You can register your custom accessors to DataFrame, Seires, and Index.

For example, in your library code:

from databricks.koalas.extensions import register_dataframe_accessor

@register_dataframe_accessor("geo")
class GeoAccessor:

    def __init__(self, koalas_obj):
        self._obj = koalas_obj
        # other constructor logic

    @property
    def center(self):
        # return the geographic center point of this DataFrame
        lat = self._obj.latitude
        lon = self._obj.longitude
        return (float(lon.mean()), float(lat.mean()))

    def plot(self):
        # plot this array's data on a map
        pass
    ...

Then, in a session:

>>> from my_ext_lib import GeoAccessor
>>> kdf = ks.DataFrame({"longitude": np.linspace(0,10),
...                     "latitude": np.linspace(0, 20)})
>>> kdf.geo.center
    (5.0, 10.0)

>>> kdf.geo.plot()
...

See also: https://koalas.readthedocs.io/en/latest/reference/extensions.html

Plotting backend

We introduced plotting.backend configuration (#1639).

Plotly (>=4.8) or other libraries that pandas supports can be used as a plotting backend if they are installed in the environment.

>>> kdf = ks.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]], columns=["A", "B", "C", "D"])
>>> kdf.plot(title="Example Figure")  # defaults to backend="matplotlib"
image

image

>>> fig = kdf.plot(backend="plotly", title="Example Figure", height=500, width=500)
>>> ## same as:
>>> # ks.options.plotting.backend = "plotly"
>>> # fig = kdf.plot(title="Example Figure", height=500, width=500)
>>> fig.show()
image

image

Each backend returns the figure in their own format, allowing for further editing or customization if required.

>>> fig.update_layout(template="plotly_dark")
>>> fig.show()
image

image

Koalas accessor

We introduced koalas accessor and some methods specific to Koalas (#1613, #1628).

DataFrame.apply_batch, DataFrame.transform_batch, and Series.transform_batch are deprecated and moved to koalas accessor.

>>> kdf = ks.DataFrame({'a': [1,2,3], 'b':[4,5,6]})
>>> def pandas_plus(pdf):
...     return pdf + 1  # should always return the same length as input.
...
>>> kdf.koalas.transform_batch(pandas_plus)
   a  b
0  2  5
1  3  6
2  4  7
>>> kdf = ks.DataFrame({'a': [1,2,3], 'b':[4,5,6]})
>>> def pandas_filter(pdf):
...     return pdf[pdf.a > 1]  # allow arbitrary length
...
>>> kdf.koalas.apply_batch(pandas_filter)
   a  b
1  2  5
2  3  6

or

>>> kdf = ks.DataFrame({'a': [1,2,3], 'b':[4,5,6]})
>>> def pandas_plus(pser):
...     return pser + 1  # should always return the same length as input.
...
>>> kdf.a.koalas.transform_batch(pandas_plus)
0    2
1    3
2    4
Name: a, dtype: int64

See also: https://koalas.readthedocs.io/en/latest/user_guide/transform_apply.html

Other new features and improvements

We added the following new features:

DataFrame:

Series:

Other improvements

  • Simplify Series.to_frame. (#1624)

  • Make Window functions create a new DataFrame. (#1623)

  • Fix Series._with_new_scol to use alias. (#1634)

  • Refine concat to handle the same anchor DataFrames properly. (#1627)

  • Add sort parameter to concat. (#1636)

  • Enable to assign list. (#1644)

  • Use SPARK_INDEX_NAME_FORMAT in combine_frames to avoid ambiguity. (#1650)

  • Rename spark columns only when index=False. (#1649)

  • read_csv: Implement reading of number of rows (#1656)

  • Fixed ks.Index.to_series() to work properly with name paramter (#1643)

  • Fix fillna to handle “ffill” and “bfill” properly. (#1654)