databricks.koalas.groupby.DataFrameGroupBy.agg

DataFrameGroupBy.agg(func_or_funcs=None, *args, **kwargs) → databricks.koalas.frame.DataFrame

Aggregate using one or more operations over the specified axis.

Parameters
func_or_funcs : dict, str or list

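A dict mapping from column name (string) to aggregate functions (string or list of strings). A single function name (string) or a list of function names is applied to every non-grouping column.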

Returns
Series or DataFrame

The return can be:

  • Series : when DataFrame.agg is called with a single function

  • DataFrame : when DataFrame.agg is called with several functions

Notes

agg is an alias for aggregate. Use the alias.

Examples

>>> df = ks.DataFrame({'A': [1, 1, 2, 2],
...                    'B': [1, 2, 3, 4],
...                    'C': [0.362, 0.227, 1.267, -0.562]},
...                   columns=['A', 'B', 'C'])
>>> df
   A  B      C
0  1  1  0.362
1  1  2  0.227
2  2  3  1.267
3  2  4 -0.562

Different aggregations per column

>>> aggregated = df.groupby('A').agg({'B': 'min', 'C': 'sum'})
>>> aggregated[['B', 'C']].sort_index()  
   B      C
A
1  1  0.589
2  3  0.705
>>> aggregated = df.groupby('A').agg({'B': ['min', 'max']})
>>> aggregated.sort_index()  
     B
   min  max
A
1    1    2
2    3    4
>>> aggregated = df.groupby('A').agg('min')
>>> aggregated.sort_index()  
     B      C
A
1    1  0.227
2    3 -0.562
>>> aggregated = df.groupby('A').agg(['min', 'max'])
>>> aggregated.sort_index()  
     B           C
   min  max    min    max
A
1    1    2  0.227  0.362
2    3    4 -0.562  1.267
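
The dict form used above can be read as "group the rows by A, then reduce each listed column with the named function(s)". The following is a minimal plain-Python sketch of that reading, not how Koalas implements it (Koalas runs the aggregation on Spark). The names `group_agg` and `AGG_FUNCS` are hypothetical helpers introduced only for illustration.

```python
from collections import defaultdict

# Same rows as the `df` in the examples above, as plain dicts.
rows = [
    {"A": 1, "B": 1, "C": 0.362},
    {"A": 1, "B": 2, "C": 0.227},
    {"A": 2, "B": 3, "C": 1.267},
    {"A": 2, "B": 4, "C": -0.562},
]

# Hypothetical lookup from aggregate-function name to a Python callable.
AGG_FUNCS = {"min": min, "max": max, "sum": sum}


def group_agg(rows, key, spec):
    """Group `rows` by `key` and apply `spec`, a dict of column -> function name(s)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    result = {}
    for group_key, members in sorted(groups.items()):
        out = {}
        for column, funcs in spec.items():
            values = [m[column] for m in members]
            if isinstance(funcs, str):
                # Single function: one output keyed by the input column name.
                out[column] = AGG_FUNCS[funcs](values)
            else:
                # List of functions: one output per (column, function) pair,
                # mirroring the two-level column header in the examples.
                for name in funcs:
                    out[(column, name)] = AGG_FUNCS[name](values)
        result[group_key] = out
    return result


per_column = group_agg(rows, "A", {"B": "min", "C": "sum"})
multi = group_agg(rows, "A", {"B": ["min", "max"]})
```

Here `per_column` corresponds to the first example (minimum of B and sum of C per group), and `multi` to the second, where a list of functions produces one output per (column, function) pair.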

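To control the output names with different aggregations per column, Koalas also supports ‘named aggregation’ or nested renaming in .agg. Each keyword argument names an output column, and its value is either a ks.NamedAgg or a (column, aggfunc) tuple. It can also be used when applying multiple aggregation functions to specific columns.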

>>> aggregated = df.groupby('A').agg(b_max=ks.NamedAgg(column='B', aggfunc='max'))
>>> aggregated.sort_index()  
     b_max
A
1        2
2        4
>>> aggregated = df.groupby('A').agg(b_max=('B', 'max'), b_min=('B', 'min'))
>>> aggregated.sort_index()  
     b_max   b_min
A
1        2       1
2        4       3
>>> aggregated = df.groupby('A').agg(b_max=('B', 'max'), c_min=('C', 'min'))
>>> aggregated.sort_index()  
     b_max   c_min
A
1        2   0.227
2        4  -0.562
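
As a rough mental model of named aggregation, each keyword can be treated as an output column name mapped to a (column, aggfunc) pair. The plain-Python sketch below illustrates that mapping under this assumption; it does not reflect Koalas internals, and `named_agg` and `AGG_FUNCS` are hypothetical names used only for this example.

```python
from collections import defaultdict

# Same rows as the `df` in the examples above, as plain dicts.
rows = [
    {"A": 1, "B": 1, "C": 0.362},
    {"A": 1, "B": 2, "C": 0.227},
    {"A": 2, "B": 3, "C": 1.267},
    {"A": 2, "B": 4, "C": -0.562},
]

# Hypothetical lookup from aggregate-function name to a Python callable.
AGG_FUNCS = {"min": min, "max": max, "sum": sum}


def named_agg(rows, key, **named):
    """Each keyword is an output name; each value is a (column, aggfunc) pair."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    return {
        group_key: {
            out_name: AGG_FUNCS[func]([m[column] for m in members])
            for out_name, (column, func) in named.items()
        }
        for group_key, members in sorted(groups.items())
    }


named = named_agg(rows, "A", b_max=("B", "max"), c_min=("C", "min"))
```

The result is keyed by group and then by the chosen output names (`b_max`, `c_min`), which matches the column labels in the last example above.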