We added support for positional/keyword arguments in `apply`, `apply_batch`, `transform`, and `transform_batch` in DataFrame, Series, and GroupBy. (#1484, #1485, #1486)
```python
>>> ks.range(10).apply(lambda a, b, c: a + b + c, args=(1,), c=3)
   id
0   4
1   5
2   6
3   7
4   8
5   9
6  10
7  11
8  12
9  13
```

```python
>>> ks.range(10).transform_batch(lambda pdf, a, b, c: pdf.id + a + b + c, 1, 2, c=3)
0     6
1     7
2     8
3     9
4    10
5    11
6    12
7    13
8    14
9    15
Name: id, dtype: int64
```

```python
>>> kdf = ks.DataFrame(
...     {"a": [1, 2, 3, 4, 5, 6], "b": [1, 1, 2, 3, 5, 8], "c": [1, 4, 9, 16, 25, 36]},
...     columns=["a", "b", "c"])
>>> kdf.groupby(["a", "b"]).apply(lambda x, y, z: x + x.min() + y + z, 1, z=2)
    a   b   c
0   5   5   5
1   7   5  11
2   9   7  21
3  11   9  35
4  13  13  53
5  15  19  75
```
We added `spark_schema` and `print_schema` to inspect the underlying Spark schema. (#1446)
```python
>>> kdf = ks.DataFrame({'a': list('abc'),
...                     'b': list(range(1, 4)),
...                     'c': np.arange(3, 6).astype('i1'),
...                     'd': np.arange(4.0, 7.0, dtype='float64'),
...                     'e': [True, False, True],
...                     'f': pd.date_range('20130101', periods=3)},
...                    columns=['a', 'b', 'c', 'd', 'e', 'f'])

>>> # Print the schema out in Spark's DDL formatted string
>>> kdf.spark_schema().simpleString()
'struct<a:string,b:bigint,c:tinyint,d:double,e:boolean,f:timestamp>'
>>> kdf.spark_schema(index_col='index').simpleString()
'struct<index:bigint,a:string,b:bigint,c:tinyint,d:double,e:boolean,f:timestamp>'

>>> # Print out the schema as same as DataFrame.printSchema()
>>> kdf.print_schema()
root
 |-- a: string (nullable = false)
 |-- b: long (nullable = false)
 |-- c: byte (nullable = false)
 |-- d: double (nullable = false)
 |-- e: boolean (nullable = false)
 |-- f: timestamp (nullable = false)

>>> kdf.print_schema(index_col='index')
root
 |-- index: long (nullable = false)
 |-- a: string (nullable = false)
 |-- b: long (nullable = false)
 |-- c: byte (nullable = false)
 |-- d: double (nullable = false)
 |-- e: boolean (nullable = false)
 |-- f: timestamp (nullable = false)
```
We fixed many bugs in GroupBy, as listed below:
- Fix `groupby` when `as_index=False`. (#1457)
- Make `groupby.apply` in pandas<0.25 run the function only once per group. (#1462)
- Fix `Series.groupby` on a Series from different DataFrames. (#1460)
- Fix `GroupBy.head` to recognize `agg_columns`. (#1474)
- Fix `GroupBy.filter` to follow complex group keys. (#1471)
- Fix `GroupBy.transform` to follow complex group keys. (#1472)
- Fix `GroupBy.apply` to follow complex group keys. (#1473)
- Fix `GroupBy.fillna` to use `GroupBy._apply_series_op`. (#1481)
- Fix `GroupBy.filter` and `apply` to handle `agg_columns`. (#1480)
- Fix GroupBy `apply`, `filter`, and `head` to ignore temporary columns when the operations involve different DataFrames. (#1488)
- Fix GroupBy functions that need natural ordering to follow the order when the operations involve different DataFrames. (#1490)
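Koalas follows pandas semantics for these GroupBy fixes. For instance, the `as_index=False` behavior targeted by #1457 can be sketched in plain pandas (a minimal illustration of the expected semantics, not Koalas code):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5]})

# With as_index=False the group keys stay as regular columns
# instead of becoming the index of the result.
out = df.groupby("a", as_index=False).sum()
print(out)
#    a  b
# 0  1  7
# 1  2  5
```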
We added the following new feature:

SeriesGroupBy:

- `filter` (#1483)
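`SeriesGroupBy.filter` mirrors the pandas API; a minimal pandas sketch of the semantics (illustrative only, not Koalas code):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2, 2, 3], "b": [1, 2, 3, 4, 5]})

# filter keeps the rows of groups whose aggregate satisfies the
# predicate and drops the other groups entirely.
# Here group a=1 has mean 1.5 and is dropped; a=2 and a=3 are kept.
kept = df.groupby("a")["b"].filter(lambda s: s.mean() > 2)
print(kept.tolist())  # [3, 4, 5]
```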
- `dtype` for DateType should be `np.dtype("object")`. (#1447)
- Make `reset_index` disallow the same name, but allow it when `drop=True`. (#1455)
- Fix named aggregation for MultiIndex. (#1435)
- Raise a `ValueError` where it was previously not raised. (#1461)
- Fix `get_dummies` when using the `prefix` parameter with a dict. (#1478)
- Simplify the `DataFrame.columns` setter. (#1489)
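The `reset_index` change (#1455) follows pandas, where resetting a named index into an existing column name raises an error unless `drop=True` discards the index first. A small pandas sketch of that behavior (illustrative only, not Koalas code):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2]}, index=pd.Index([10, 20], name="a"))

# Resetting would insert a second column named "a" -> disallowed.
try:
    df.reset_index()
    raised = False
except ValueError:
    raised = True

# With drop=True the index is discarded, so no name clash occurs.
out = df.reset_index(drop=True)
```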