Version 0.33.0
apply and transform Improvements
We added support for positional and keyword arguments in apply, apply_batch, transform, and transform_batch on DataFrame, Series, and GroupBy. (#1484, #1485, #1486)
>>> import databricks.koalas as ks
>>> ks.range(10).apply(lambda a, b, c: a + b + c, args=(1,), c=3)
id
0 4
1 5
2 6
3 7
4 8
5 9
6 10
7 11
8 12
9 13
>>> ks.range(10).transform_batch(lambda pdf, a, b, c: pdf.id + a + b + c, 1, 2, c=3)
0 6
1 7
2 8
3 9
4 10
5 11
6 12
7 13
8 14
9 15
Name: id, dtype: int64
>>> kdf = ks.DataFrame(
... {"a": [1, 2, 3, 4, 5, 6], "b": [1, 1, 2, 3, 5, 8], "c": [1, 4, 9, 16, 25, 36]},
... columns=["a", "b", "c"])
>>> kdf.groupby(["a", "b"]).apply(lambda x, y, z: x + x.min() + y + z, 1, z=2)
a b c
0 5 5 5
1 7 5 11
2 9 7 21
3 11 9 35
4 13 13 53
5 15 19 75
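The argument forwarding shown above can be sketched in plain Python, without Spark. This is only an illustration of the calling convention, not the Koalas implementation: `apply_each` and `transform_batch` below are hypothetical helpers, and the "batch" is a plain list standing in for a pandas object.

```python
from functools import partial


def apply_each(values, func, args=(), **kwds):
    """Hypothetical element-wise apply: calls func(value, *args, **kwds)."""
    bound = partial(func, **kwds)  # bind the keyword arguments once
    return [bound(value, *args) for value in values]


def transform_batch(batch, func, *args, **kwargs):
    """Hypothetical batch transform: calls func(batch, *args, **kwargs) once."""
    return func(batch, *args, **kwargs)


ids = list(range(10))

# Mirrors apply(func, args=(1,), c=3): b=1 comes from args, c=3 from keywords.
applied = apply_each(ids, lambda a, b, c: a + b + c, args=(1,), c=3)

# Mirrors transform_batch(func, 1, 2, c=3): extra positionals follow the batch.
transformed = transform_batch(
    ids, lambda batch, a, b, c: [x + a + b + c for x in batch], 1, 2, c=3
)

print(applied)      # values 4 through 13, as in the first example
print(transformed)  # values 6 through 15, as in the second example
```

Note the difference in convention that the examples rely on: apply takes extra positionals through the args tuple, while the batch variant accepts them directly after the function.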
Spark Schema
We added spark_schema and print_schema to inspect the underlying Spark schema. (#1446)
>>> import numpy as np
>>> import pandas as pd
>>> kdf = ks.DataFrame({'a': list('abc'),
... 'b': list(range(1, 4)),
... 'c': np.arange(3, 6).astype('i1'),
... 'd': np.arange(4.0, 7.0, dtype='float64'),
... 'e': [True, False, True],
... 'f': pd.date_range('20130101', periods=3)},
... columns=['a', 'b', 'c', 'd', 'e', 'f'])
>>> # Print the schema as a Spark DDL-formatted string
>>> kdf.spark_schema().simpleString()
'struct<a:string,b:bigint,c:tinyint,d:double,e:boolean,f:timestamp>'
>>> kdf.spark_schema(index_col='index').simpleString()
'struct<index:bigint,a:string,b:bigint,c:tinyint,d:double,e:boolean,f:timestamp>'
>>> # Print the schema in the same format as Spark's DataFrame.printSchema()
>>> kdf.print_schema()
root
|-- a: string (nullable = false)
|-- b: long (nullable = false)
|-- c: byte (nullable = false)
|-- d: double (nullable = false)
|-- e: boolean (nullable = false)
|-- f: timestamp (nullable = false)
>>> kdf.print_schema(index_col='index')
root
|-- index: long (nullable = false)
|-- a: string (nullable = false)
|-- b: long (nullable = false)
|-- c: byte (nullable = false)
|-- d: double (nullable = false)
|-- e: boolean (nullable = false)
|-- f: timestamp (nullable = false)
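To make the shape of the simpleString() output concrete, here is a minimal plain-Python sketch that rebuilds the two strings shown above. The dtype-to-type table is an assumption read off this example only (in particular, "object" is treated as a string column), and `simple_string` is a hypothetical helper, not part of Koalas or PySpark.

```python
# Assumed mapping from pandas dtype names to Spark type names,
# taken from the example output above rather than from the library.
SIMPLE_NAMES = {
    "object": "string",  # assumes the object column holds strings
    "int64": "bigint",
    "int8": "tinyint",
    "float64": "double",
    "bool": "boolean",
    "datetime64[ns]": "timestamp",
}


def simple_string(columns, index_col=None):
    """Hypothetical helper: format (name, dtype) pairs as struct<name:type,...>."""
    fields = list(columns)
    if index_col is not None:
        # Assumes a default integer index, which appears first as bigint.
        fields.insert(0, (index_col, "int64"))
    body = ",".join(f"{name}:{SIMPLE_NAMES[dtype]}" for name, dtype in fields)
    return f"struct<{body}>"


columns = [
    ("a", "object"),
    ("b", "int64"),
    ("c", "int8"),
    ("d", "float64"),
    ("e", "bool"),
    ("f", "datetime64[ns]"),
]

without_index = simple_string(columns)
with_index = simple_string(columns, index_col="index")
print(without_index)
print(with_index)
```

As in the example above, passing index_col only prepends the index as an extra leading field; the data columns are unchanged.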
GroupBy Improvements
We fixed many bugs in GroupBy, as listed below.
Fix groupby when as_index=False. (#1457)
Make groupby.apply in pandas<0.25 run the function only once per group. (#1462)
Fix Series.groupby on the Series from different DataFrames. (#1460)
Fix GroupBy.head to recognize agg_columns. (#1474)
Fix GroupBy.filter to follow complex group keys. (#1471)
Fix GroupBy.transform to follow complex group keys. (#1472)
Fix GroupBy.apply to follow complex group keys. (#1473)
Fix GroupBy.fillna to use GroupBy._apply_series_op. (#1481)
Fix GroupBy.filter and apply to handle agg_columns. (#1480)
Fix GroupBy apply, filter, and head to ignore temp columns when operating on different DataFrames. (#1488)
Fix GroupBy functions that need natural ordering to follow that order when operating on different DataFrames. (#1490)
Other new features and improvements
We added the following new feature:
SeriesGroupBy:
filter (#1483)
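For readers unfamiliar with the operation, a group-wise filter keeps every element whose group satisfies a predicate and drops the rest, preserving the original order. The sketch below shows that behavior in plain Python; `groupby_filter` is a hypothetical helper and says nothing about how Koalas implements it on Spark.

```python
from collections import defaultdict
from statistics import mean


def groupby_filter(values, keys, predicate):
    """Hypothetical helper: keep (position, value) pairs whose group passes predicate."""
    groups = defaultdict(list)
    for key, value in zip(keys, values):
        groups[key].append(value)
    # Evaluate the predicate once per group, on the group's full list of values.
    kept = {key for key, members in groups.items() if predicate(members)}
    return [
        (position, value)
        for position, (key, value) in enumerate(zip(keys, values))
        if key in kept
    ]


values = [1, 2, 3, 4, 5, 6]
keys = ["x", "x", "y", "y", "z", "z"]

# Keep only the groups whose mean exceeds 2; group "x" (mean 1.5) is dropped.
filtered = groupby_filter(values, keys, lambda members: mean(members) > 2)
print(filtered)
```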
Other improvements
dtype for DateType should be np.dtype("object"). (#1447)
Make reset_index disallow the same name but allow it when drop=True. (#1455)
Fix named aggregation for MultiIndex (#1435)
Raise a ValueError that was previously not raised (#1461)
Fix get_dummies when the prefix parameter is a dict (#1478)
Simplify DataFrame.columns setter. (#1489)