Version 0.33.0

`apply` and `transform` Improvements

We added support for passing positional and keyword arguments to apply, apply_batch, transform, and transform_batch in DataFrame, Series, and GroupBy. (#1484, #1485, #1486)

>>> ks.range(10).apply(lambda a, b, c: a + b + c, args=(1,), c=3)
   id
0   4
1   5
2   6
3   7
4   8
5   9
6  10
7  11
8  12
9  13

>>> ks.range(10).transform_batch(lambda pdf, a, b, c: pdf.id + a + b + c, 1, 2, c=3)
0     6
1     7
2     8
3     9
4    10
5    11
6    12
7    13
8    14
9    15
Name: id, dtype: int64

>>> kdf = ks.DataFrame(
...    {"a": [1, 2, 3, 4, 5, 6], "b": [1, 1, 2, 3, 5, 8], "c": [1, 4, 9, 16, 25, 36]},
...    columns=["a", "b", "c"])
>>> kdf.groupby(["a", "b"]).apply(lambda x, y, z: x + x.min() + y + z, 1, z=2)
    a   b   c
0   5   5   5
1   7   5  11
2   9   7  21
3  11   9  35
4  13  13  53
5  15  19  75
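Note that the calling conventions differ: DataFrame.apply takes extra positional arguments through args=, while transform_batch and GroupBy.apply accept them directly after the function. As a rough illustration, the first example can be mirrored in plain pandas, whose DataFrame.apply has the same args/keyword signature. This is only a sketch that assumes Koalas matches pandas here; add_offsets is a made-up helper name.

```python
import pandas as pd


def add_offsets(column, b, c):
    # `column` is one column of the frame; b and c are the forwarded extras.
    return column + b + c


pdf = pd.DataFrame({"id": range(10)})

# b is forwarded positionally through args=, c as a keyword argument.
result = pdf.apply(add_offsets, args=(1,), c=3)
print(result["id"].tolist())  # [4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
```

Each value is shifted by b + c = 4, which matches the Koalas output shown above.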

Spark Schema

We added spark_schema and print_schema to inspect the underlying Spark schema. (#1446)

>>> kdf = ks.DataFrame({'a': list('abc'),
...                     'b': list(range(1, 4)),
...                     'c': np.arange(3, 6).astype('i1'),
...                     'd': np.arange(4.0, 7.0, dtype='float64'),
...                     'e': [True, False, True],
...                     'f': pd.date_range('20130101', periods=3)},
...                    columns=['a', 'b', 'c', 'd', 'e', 'f'])

>>> # Show the schema as a Spark DDL-formatted string
>>> kdf.spark_schema().simpleString()
'struct<a:string,b:bigint,c:tinyint,d:double,e:boolean,f:timestamp>'
>>> kdf.spark_schema(index_col='index').simpleString()
'struct<index:bigint,a:string,b:bigint,c:tinyint,d:double,e:boolean,f:timestamp>'

>>> # Print the schema in the same format as Spark's DataFrame.printSchema()
>>> kdf.print_schema()
root
 |-- a: string (nullable = false)
 |-- b: long (nullable = false)
 |-- c: byte (nullable = false)
 |-- d: double (nullable = false)
 |-- e: boolean (nullable = false)
 |-- f: timestamp (nullable = false)

>>> kdf.print_schema(index_col='index')
root
 |-- index: long (nullable = false)
 |-- a: string (nullable = false)
 |-- b: long (nullable = false)
 |-- c: byte (nullable = false)
 |-- d: double (nullable = false)
 |-- e: boolean (nullable = false)
 |-- f: timestamp (nullable = false)
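To make the effect of index_col concrete, the sketch below rebuilds the two simpleString() results from plain (name, type) pairs. It is pure Python and does not call Koalas or PySpark; simple_string is a hypothetical helper, and treating the index as a leading bigint field is an assumption taken from the output above rather than a general rule.

```python
def simple_string(fields, index_col=None):
    """Format (name, spark_type) pairs like the simpleString() output above.

    Hypothetical helper for illustration; not part of Koalas or PySpark.
    """
    fields = list(fields)
    if index_col is not None:
        # Assumption: the default index shows up as a leading bigint field.
        fields = [(index_col, "bigint")] + fields
    body = ",".join(f"{name}:{spark_type}" for name, spark_type in fields)
    return f"struct<{body}>"


columns = [
    ("a", "string"),
    ("b", "bigint"),
    ("c", "tinyint"),
    ("d", "double"),
    ("e", "boolean"),
    ("f", "timestamp"),
]

without_index = simple_string(columns)
with_index = simple_string(columns, index_col="index")
print(without_index)
print(with_index)
```

The only difference between the two strings is the extra index field at the front, which is what index_col='index' adds in both spark_schema and print_schema.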

GroupBy Improvements

We fixed many GroupBy bugs, as listed below.

Fix groupby when as_index=False. (#1457)
Make groupby.apply in pandas<0.25 run the function only once per group. (#1462)
Fix Series.groupby on the Series from different DataFrames. (#1460)
Fix GroupBy.head to recognize agg_columns. (#1474)
Fix GroupBy.filter to follow complex group keys. (#1471)
Fix GroupBy.transform to follow complex group keys. (#1472)
Fix GroupBy.apply to follow complex group keys. (#1473)
Fix GroupBy.fillna to use GroupBy._apply_series_op. (#1481)
Fix GroupBy.filter and apply to handle agg_columns. (#1480)
Fix GroupBy apply, filter, and head to ignore temp columns when operating on different DataFrames. (#1488)
Fix GroupBy functions that need natural ordering to follow that order when operating on different DataFrames. (#1490)
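For readers unfamiliar with as_index=False, the first fix in the list concerns the following pandas behavior. The example is written in plain pandas and only sketches the semantics that Koalas is presumed to follow; the column names and data are made up.

```python
import pandas as pd

pdf = pd.DataFrame({"a": [1, 1, 2], "b": [10, 20, 30]})

# Default: the group key becomes the index of the result.
keyed = pdf.groupby("a").sum()

# as_index=False: the group key stays a regular column.
flat = pdf.groupby("a", as_index=False).sum()

print(list(keyed.columns), keyed.index.name)
print(list(flat.columns))
```

With as_index=False the result keeps a as an ordinary column, so it can be used without a later reset_index call.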

Other new features and improvements

We added the following new feature:

SeriesGroupBy:

filter (#1483)
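As a hedged illustration of what SeriesGroupBy.filter does, here is the pandas equivalent. It assumes the Koalas method mirrors pandas; the data is invented for the example.

```python
import pandas as pd

pdf = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar", "foo", "bar"],
    "B": [1, 2, 3, 4, 5, 6],
})

# Keep only the elements of B whose group (by A) has a mean above 3.
# "foo" has mean 3.0 and is dropped; "bar" has mean 4.0 and is kept.
kept = pdf.groupby("A").B.filter(lambda s: s.mean() > 3.0)
print(kept.tolist())  # [2, 4, 6]
```

Unlike an aggregation, filter returns the original elements of the surviving groups, not one value per group.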

Other improvements

dtype for DateType should be np.dtype("object"). (#1447)
Make reset_index disallow the same name but allow it when drop=True. (#1455)
Fix named aggregation for MultiIndex (#1435)
Raise ValueError in cases where it was previously not raised. (#1461)
Fix get_dummies when the prefix parameter is a dict. (#1478)
Simplify DataFrame.columns setter. (#1489)
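Two of the items above, the reset_index name check and the dict-valued prefix for get_dummies, are easiest to see in terms of the pandas behavior they track. The sketch below uses plain pandas and assumes Koalas now behaves the same way; the data is invented.

```python
import pandas as pd

# reset_index: an index named like an existing column cannot be inserted...
pdf = pd.DataFrame({"a": [1, 2, 3]}, index=pd.Index([10, 20, 30], name="a"))
try:
    pdf.reset_index()
    conflict_raised = False
except ValueError:
    conflict_raised = True

# ...but with drop=True the index is discarded, so there is no conflict.
dropped = pdf.reset_index(drop=True)

# get_dummies: a dict maps each encoded column to its own prefix.
raw = pd.DataFrame({"A": ["a", "b"], "B": ["x", "y"]})
dummies = pd.get_dummies(raw, prefix={"A": "col1", "B": "col2"})

print(conflict_raised, dropped.index.tolist(), list(dummies.columns))
```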
