databricks.koalas.groupby.GroupBy.apply

GroupBy.apply(func)

Apply function func group-wise and combine the results together.

The function passed to apply must take a DataFrame as its first argument and return a DataFrame. apply will then take care of combining the results back together into a single dataframe. apply is therefore a highly flexible grouping method.
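The split, apply, and combine steps described above can be sketched in plain Python. This is only a conceptual illustration, not how Koalas implements apply; the helper names group_apply and add_group_max_b are hypothetical, and rows are modelled as dictionaries rather than DataFrames.

```python
from collections import defaultdict


def group_apply(rows, key, func):
    """Split rows by key, apply func to each group, and concatenate the results."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)  # split: collect rows per group key
    combined = []
    for group_rows in groups.values():
        combined.extend(func(group_rows))  # apply func, then combine
    return combined


def add_group_max_b(group_rows):
    """Hypothetical per-group function: add the group's maximum B to each B."""
    max_b = max(r["B"] for r in group_rows)
    return [{**r, "B": r["B"] + max_b} for r in group_rows]


rows = [
    {"A": "a", "B": 1},
    {"A": "a", "B": 2},
    {"A": "b", "B": 3},
]
result = group_apply(rows, "A", add_group_max_b)
print(result)
```

The function only ever sees the rows of one group at a time, which is why a group-wise statistic such as the maximum differs between groups.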

While apply is very flexible, it can be considerably slower than more specific methods such as agg or transform. Koalas offers a wide range of methods that are much faster than apply for their specific purposes, so try them before reaching for apply.

Note

This API executes the function once to infer the return type, which is potentially expensive, for instance when the dataset is created after aggregations or sorting.

To avoid this, specify the return type in func, for instance:

>>> def pandas_div_sum(x) -> ks.DataFrame[float, float]:
...    return x[['B', 'C']] / x[['B', 'C']].sum()

If the return type is specified, the output column names become c0, c1, c2 … cn. These names are positionally mapped to the returned DataFrame in func. See examples below.
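As a rough illustration of what "positionally mapped" means, the plain-Python sketch below pairs generated names with the columns a function returned. The helper positional_names is hypothetical and is not part of Koalas; it only mirrors the naming rule stated above.

```python
def positional_names(returned_columns):
    """Pair generated names c0, c1, ... with the returned columns, by position."""
    return {f"c{i}": name for i, name in enumerate(returned_columns)}


# A function returning columns B and C would surface them as c0 and c1.
mapping = positional_names(["B", "C"])
print(mapping)
```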

Note

The DataFrame passed to func is actually a pandas DataFrame. Therefore, any pandas API may be used within this function.
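Because func receives a pandas DataFrame, one way to check its logic is to call it directly on a single group built with plain pandas. The sketch below assumes pandas is installed and uses a pd.DataFrame annotation purely for illustration; with Koalas, the return annotation would use the ks.DataFrame[...] form shown in this page.

```python
import pandas as pd


def pandas_div_sum(x: pd.DataFrame) -> pd.DataFrame:
    # Divide each value by its column total within the group.
    return x[["B", "C"]] / x[["B", "C"]].sum()


pdf = pd.DataFrame({"A": ["a", "a", "b"], "B": [1, 2, 3], "C": [4, 6, 5]})

# Select the rows of group "a" by hand and call the function on them.
group_a = pdf[pdf["A"] == "a"]
out = pandas_div_sum(group_a)
print(out)
```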

Parameters
func : callable

A callable that takes a DataFrame as its first argument and returns a DataFrame.

Returns
applied : DataFrame

See also

aggregate

Apply aggregate function to the GroupBy object.

Series.apply

Apply a function to a Series.

Examples

>>> df = ks.DataFrame({'A': 'a a b'.split(),
...                    'B': [1, 2, 3],
...                    'C': [4, 6, 5]}, columns=['A', 'B', 'C'])
>>> g = df.groupby('A')

Notice that g has two groups, a and b. Calling apply in various ways, we can get different grouping results:

Below, the function passed to apply takes a DataFrame as its argument and returns a DataFrame. apply combines the results for each group into a new DataFrame:

>>> def pandas_div_sum(x) -> ks.DataFrame[float, float]:
...    return x[['B', 'C']] / x[['B', 'C']].sum()
>>> g.apply(pandas_div_sum)  # doctest: +NORMALIZE_WHITESPACE
         c0   c1
0  1.000000  1.0
1  0.333333  0.4
2  0.666667  0.6
>>> def plus_max(x) -> ks.DataFrame[str, np.int, np.int]:
...    return x + x.max()
>>> g.apply(plus_max)  # doctest: +NORMALIZE_WHITESPACE
   c0  c1  c2
0  bb   6  10
1  aa   3  10
2  aa   4  12

You can omit the type hint and let Koalas infer its type.

>>> def plus_min(x):
...    return x + x.min()
>>> g.apply(plus_min)  # doctest: +NORMALIZE_WHITESPACE
    A  B   C
0  aa  2   8
1  aa  3  10
2  bb  6  10

For a Series, it works as follows:

>>> def plus_max(x) -> ks.Series[np.int]:
...    return x + x.max()
>>> df.B.groupby(df.A).apply(plus_max)
0    6
1    3
2    4
Name: B, dtype: int32
>>> def plus_min(x):
...    return x + x.min()
>>> df.B.groupby(df.A).apply(plus_min)
0    2
1    3
2    6
Name: B, dtype: int64