Version 0.33.0

apply and transform Improvements

We added support for positional and keyword arguments in apply, apply_batch, transform, and transform_batch for DataFrame, Series, and GroupBy. (#1484, #1485, #1486)

>>> ks.range(10).apply(lambda a, b, c: a + b + c, args=(1,), c=3)
   id
0   4
1   5
2   6
3   7
4   8
5   9
6  10
7  11
8  12
9  13
>>> ks.range(10).transform_batch(lambda pdf, a, b, c: pdf.id + a + b + c, 1, 2, c=3)
0     6
1     7
2     8
3     9
4    10
5    11
6    12
7    13
8    14
9    15
Name: id, dtype: int64
>>> kdf = ks.DataFrame(
...    {"a": [1, 2, 3, 4, 5, 6], "b": [1, 1, 2, 3, 5, 8], "c": [1, 4, 9, 16, 25, 36]},
...    columns=["a", "b", "c"])
>>> kdf.groupby(["a", "b"]).apply(lambda x, y, z: x + x.min() + y + z, 1, z=2)
    a   b   c
0   5   5   5
1   7   5  11
2   9   7  21
3  11   9  35
4  13  13  53
5  15  19  75
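
The same convention also applies to Series and the batch variants. Below is a minimal sketch assuming Series.apply keeps the pandas-style args=() and keyword-argument convention described above:

>>> kser = ks.Series([1, 2, 3])
>>> kser.apply(lambda x, a, b: x + a + b, args=(10,), b=100)
0    111
1    112
2    113
dtype: int64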

Spark Schema

We added spark_schema and print_schema to inspect the underlying Spark schema. (#1446)

>>> import numpy as np
>>> import pandas as pd
>>> kdf = ks.DataFrame({'a': list('abc'),
...                     'b': list(range(1, 4)),
...                     'c': np.arange(3, 6).astype('i1'),
...                     'd': np.arange(4.0, 7.0, dtype='float64'),
...                     'e': [True, False, True],
...                     'f': pd.date_range('20130101', periods=3)},
...                    columns=['a', 'b', 'c', 'd', 'e', 'f'])

>>> # Print the schema out as Spark's DDL-formatted string
>>> kdf.spark_schema().simpleString()
'struct<a:string,b:bigint,c:tinyint,d:double,e:boolean,f:timestamp>'
>>> kdf.spark_schema(index_col='index').simpleString()
'struct<index:bigint,a:string,b:bigint,c:tinyint,d:double,e:boolean,f:timestamp>'
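
Since spark_schema returns a regular PySpark StructType, individual fields can also be inspected with plain PySpark APIs (a small sketch; the field access below is standard PySpark, not a Koalas-specific feature):

>>> kdf.spark_schema()["b"].dataType.simpleString()
'bigint'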

>>> # Print out the schema, the same as Spark's DataFrame.printSchema()
>>> kdf.print_schema()
root
 |-- a: string (nullable = false)
 |-- b: long (nullable = false)
 |-- c: byte (nullable = false)
 |-- d: double (nullable = false)
 |-- e: boolean (nullable = false)
 |-- f: timestamp (nullable = false)

>>> kdf.print_schema(index_col='index')
root
 |-- index: long (nullable = false)
 |-- a: string (nullable = false)
 |-- b: long (nullable = false)
 |-- c: byte (nullable = false)
 |-- d: double (nullable = false)
 |-- e: boolean (nullable = false)
 |-- f: timestamp (nullable = false)

GroupBy Improvements

We fixed many GroupBy bugs, as listed below.

  • Fix groupby when as_index=False (see the sketch after this list). (#1457)

  • Make groupby.apply in pandas<0.25 run the function only once per group. (#1462)

  • Fix Series.groupby on the Series from different DataFrames. (#1460)

  • Fix GroupBy.head to recognize agg_columns. (#1474)

  • Fix GroupBy.filter to follow complex group keys. (#1471)

  • Fix GroupBy.transform to follow complex group keys. (#1472)

  • Fix GroupBy.apply to follow complex group keys. (#1473)

  • Fix GroupBy.fillna to use GroupBy._apply_series_op. (#1481)

  • Fix GroupBy.filter and apply to handle agg_columns. (#1480)

  • Fix GroupBy apply, filter, and head to ignore temp columns when operating on different DataFrames. (#1488)

  • Fix GroupBy functions that require natural orderings to follow the order when operating on different DataFrames. (#1490)
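
For example, with as_index=False (covered by the first fix above) the group keys are kept as regular columns instead of becoming the index. A minimal sketch, showing only the resulting columns since row order of the aggregated values is not guaranteed:

>>> kdf = ks.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5]})
>>> kdf.groupby("a", as_index=False).sum().columns
Index(['a', 'b'], dtype='object')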

Other new features and improvements

We added the following new feature:

SeriesGroupBy:

Other improvements

  • dtype for DateType should be np.dtype("object"). (#1447)

  • Make reset_index disallow the same name but allow it when drop=True. (#1455)

  • Fix named aggregation for MultiIndex (#1435)

  • Raise ValueError in cases where it was previously not raised (#1461)

  • Fix get_dummies when the prefix parameter is a dict (see the sketch after this list) (#1478)

  • Simplify DataFrame.columns setter. (#1489)
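
As a sketch of the get_dummies fix above, passing a dict as prefix maps each column to its own prefix (only the resulting columns are shown; the toy data is for illustration only):

>>> kdf = ks.DataFrame({"x": list("ab"), "y": list("cd")})
>>> ks.get_dummies(kdf, prefix={"x": "X", "y": "Y"}).columns
Index(['X_a', 'X_b', 'Y_c', 'Y_d'], dtype='object')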