Koalas now supports non-named Series (#1712). Previously, Koalas automatically named a Series "0" if no name was specified or if the name was set to None, whereas pandas allows a Series without a name.
For example:
>>> ks.__version__
'1.1.0'
>>> kser = ks.Series([1, 2, 3])
>>> kser
0    1
1    2
2    3
Name: 0, dtype: int64
>>> kser.name = None
>>> kser
0    1
1    2
2    3
Name: 0, dtype: int64
As of 1.2.0, the Series stays non-named:
>>> ks.__version__
'1.2.0'
>>> ks.Series([1, 2, 3])
0    1
1    2
2    3
dtype: int64
>>> kser = ks.Series([1, 2, 3], name="a")
>>> kser.name = None
>>> kser
0    1
1    2
2    3
dtype: int64
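For comparison, here is a minimal sketch of the pandas behavior that Koalas 1.2.0 now follows. It uses plain pandas rather than Koalas so that it runs without a Spark session; the variable names are illustrative only.

```python
import pandas as pd

# A Series created without a name has name None in pandas.
pser = pd.Series([1, 2, 3])
print(pser.name)  # None

# Resetting the name of a named Series to None makes it non-named again.
pser_named = pd.Series([1, 2, 3], name="a")
pser_named.name = None
print(pser_named.name)  # None
```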
Previously, the “distributed-sequence” default index sometimes produced wrong values or even raised an exception. For example, the code below:
>>> from databricks import koalas as ks
>>> ks.options.compute.default_index_type = 'distributed-sequence'
>>> ks.range(10).reset_index()
failed with the following error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  ...
pyspark.sql.utils.PythonException:
  An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
  ...
  File "/.../koalas/databricks/koalas/internal.py", line 620, in offset
    current_partition_offset = sums[id.iloc[0]]
KeyError: 103
We investigated the issue and made this default index type more stable (#1701). Such failures are now unlikely, and the index type is stable enough.
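The traceback above shows a per-partition offset being looked up by partition id. As a rough illustration of that general technique (not the actual Koalas implementation), the pure-Python sketch below counts the rows in each partition, turns the counts into starting offsets with an exclusive prefix sum, and adds each row's local position to its partition's offset. The function names and the dict-based "partitions" are hypothetical stand-ins for Spark partitions.

```python
from itertools import accumulate
from typing import Dict, List, Sequence, Tuple


def partition_offsets(counts: Dict[int, int]) -> Dict[int, int]:
    """Map each partition id to the number of rows in all earlier partitions."""
    ids = sorted(counts)
    # Exclusive prefix sum: the first partition starts at offset 0.
    running = list(accumulate(counts[i] for i in ids))
    starts = [0] + running[:-1]
    return dict(zip(ids, starts))


def distributed_sequence(
    partitions: Dict[int, Sequence[str]]
) -> List[Tuple[int, str]]:
    """Attach a global 0..N-1 index to rows that are split across partitions."""
    # Step 1: count rows per partition (hypothetical stand-in for a Spark job).
    counts = {pid: len(rows) for pid, rows in partitions.items()}
    offsets = partition_offsets(counts)
    # Step 2: global index = partition offset + position within the partition.
    indexed = []
    for pid in sorted(partitions):
        for local_pos, row in enumerate(partitions[pid]):
            indexed.append((offsets[pid] + local_pos, row))
    return indexed


# Partition ids do not have to be contiguous.
example = {0: ["a", "b"], 3: ["c"], 7: ["d", "e"]}
result = distributed_sequence(example)
print(result)
```

In a scheme like this, `offsets[pid]` raises `KeyError` whenever the partition id seen in the second step was not among the ids counted in the first step, which resembles the error shown above. These notes do not say whether that was the actual root cause.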
We changed the testing infrastructure to use pandas’ testing utilities for exact checks (#1722). The tests now also compare index/column types and names, so that we can follow pandas behavior more strictly.
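As a minimal sketch of what such a strict comparison looks like, the example below calls `pandas.testing.assert_series_equal` directly. How the Koalas test suite wraps these utilities is not shown in these notes, so the variable names and structure here are illustrative only.

```python
import pandas as pd
from pandas.testing import assert_series_equal

expected = pd.Series([1, 2, 3], name="a")
actual_same = pd.Series([1, 2, 3], name="a")
actual_renamed = pd.Series([1, 2, 3], name="b")

# Passes: values, dtype, index and name all match.
assert_series_equal(actual_same, expected, check_exact=True)

# Fails: the values are equal, but the names differ.
try:
    assert_series_equal(actual_renamed, expected, check_exact=True)
    name_mismatch_detected = False
except AssertionError:
    name_mismatch_detected = True

print(name_mismatch_detected)  # True
```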
We added the following new features:
DataFrame:
last_valid_index (#1705)
Series:
product (#1677)
GroupBy:
cumcount (#1702)
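The pandas equivalents of these three methods behave as sketched below. The example assumes the new Koalas methods mirror pandas, and it uses plain pandas so that it runs without Spark; the sample data is made up for illustration.

```python
import pandas as pd

# DataFrame.last_valid_index: label of the last row with any non-NA value.
pdf = pd.DataFrame({"a": [1.0, None, None], "b": [None, 2.0, None]})
last_valid = pdf.last_valid_index()
print(last_valid)  # 1

# Series.product: product of all values.
prod = pd.Series([2, 3, 4]).product()
print(prod)  # 24

# GroupBy.cumcount: number each row within its group, starting at 0.
keys = pd.DataFrame({"k": ["x", "x", "y", "x"]})
counts = keys.groupby("k").cumcount().tolist()
print(counts)  # [0, 1, 0, 2]
```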
Refine Spark I/O. (#1667)
Set partitionBy explicitly in to_parquet.
Add mode and partition_cols to to_csv and to_json.
Fix type hints to use Optional.
Make read_excel read from DFS if the underlying Spark is 3.0.0 or above. (#1678, #1693, #1694, #1692)
Support callable instances to apply as a function, and fix groupby.apply to keep the index when possible (#1686)
Fix hasnans when the data type is not DoubleType. (#1681)
Support axis=1 for DataFrame.dropna(). (#1689)
Allow assigning index as a column (#1696)
Try to read pandas metadata in read_parquet if index_col is None. (#1695)
Include pandas Index object in dataframe indexing options (#1698)
Unified PlotAccessor for DataFrame and Series (#1662)
Fix SeriesGroupBy.nsmallest/nlargest. (#1713)
Fix DataFrame.size to consider its number of columns. (#1715)
Fix first_valid_index() for Empty object (#1704)
Fix index name when groupby.apply returns a single row. (#1719)
Support subtraction of date/timestamp with literals. (#1721)
Fix DataFrame.reindex(fill_value) so that it does not fill existing NaN values (#1723)
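A few of the items above concern parity with pandas. The sketch below shows the corresponding pandas behavior for `DataFrame.size`, `DataFrame.dropna(axis=1)` and `DataFrame.reindex(fill_value=...)`, assuming Koalas aims to match it. It uses plain pandas with made-up sample data.

```python
import pandas as pd

pdf = pd.DataFrame({"a": [1.0, None], "b": [3.0, 4.0]})

# size counts every cell: rows times columns.
size = pdf.size
print(size)  # 4

# dropna(axis=1) drops columns, not rows, that contain missing values.
remaining_columns = list(pdf.dropna(axis=1).columns)
print(remaining_columns)  # ['b']

# reindex(fill_value=...) fills only the newly introduced row (label 2);
# the NaN that already existed at label 1 is left untouched.
reindexed = pdf.reindex([0, 1, 2], fill_value=0.0)
print(reindexed)
```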