Type Hints In Koalas

By default, Koalas infers the schema by sampling the first records of the output. This happens in particular when you use APIs that apply a user-defined function to a Koalas DataFrame or Series, such as DataFrame.transform(), DataFrame.apply(), DataFrame.koalas.apply_batch(), Series.koalas.apply_batch(), etc.

However, this is potentially expensive. If the upstream execution plan contains expensive operations such as a shuffle, Koalas ends up executing the Spark job twice: once for schema inference, and once for processing the actual data with the schema.
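
The cost of the double execution can be pictured with a toy model in plain Python. This is only a sketch under stated assumptions: `expensive_upstream`, `infer_schema`, and `run` are hypothetical names, no Spark or Koalas is involved, and the counter merely stands in for a job that recomputes its upstream plan every time it is triggered.

```python
from itertools import islice

calls = {"upstream": 0}


def expensive_upstream():
    """Hypothetical stand-in for an upstream plan, e.g. one with a shuffle."""
    calls["upstream"] += 1
    return ({"id": i, "A": i + 1} for i in range(1000))


def infer_schema(sample_size=10):
    # Evaluate the plan once just to look at the first few records.
    sample = list(islice(expensive_upstream(), sample_size))
    return {name: type(value).__name__ for name, value in sample[0].items()}


def run(schema=None):
    if schema is None:
        schema = infer_schema()        # first evaluation: schema inference
    rows = list(expensive_upstream())  # second evaluation: the actual data
    return schema, len(rows)


# Without a schema, the upstream plan is evaluated twice.
calls["upstream"] = 0
run()
runs_with_inference = calls["upstream"]

# With a schema supplied up front, it is evaluated only once.
calls["upstream"] = 0
run(schema={"id": "int", "A": "int"})
runs_with_schema = calls["upstream"]
```

In this model, supplying the schema plays the role that a return type hint plays in Koalas: the sampling pass is skipped entirely.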

To avoid this cost, Koalas has its own type hinting style for specifying the schema so that schema inference is skipped. Koalas reads the type hints given in the return type and converts them into a Spark schema for the pandas UDFs it uses internally. The type hinting style has evolved over time.
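
The idea of turning a return type hint into a schema can be sketched with the standard library alone. This is not Koalas' implementation: `SPARK_TYPE_NAMES` and `schema_from_hints` are hypothetical, `Tuple[float, float]` merely stands in for `DataFrame[float, float]`, and the generated `c0`, `c1` names are an assumption about how unnamed columns could be labelled.

```python
from typing import Tuple, get_args, get_type_hints

# Hypothetical mapping from Python types to Spark SQL type names.
SPARK_TYPE_NAMES = {int: "bigint", float: "double", str: "string", bool: "boolean"}


def schema_from_hints(func):
    """Build a DDL-style schema string from the function's return annotation."""
    return_hint = get_type_hints(func)["return"]
    column_types = get_args(return_hint)
    return ", ".join(
        f"c{i} {SPARK_TYPE_NAMES[tp]}" for i, tp in enumerate(column_types)
    )


# Tuple[float, float] stands in for DataFrame[float, float] in this sketch.
def pandas_div(pdf) -> Tuple[float, float]:
    return pdf


schema = schema_from_hints(pandas_div)
print(schema)
```

The function body is never called here; the schema is derived from the annotation alone, which is why no sample of the output is needed.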

This chapter covers the recommended way and the other supported ways in detail.

Note

The variadic generics support is experimental and unstable in Koalas. The way of typing can change between minor releases without a warning. See also PEP 646 for variadic generics in Python.

Koalas DataFrame and Pandas DataFrame

Early versions of Koalas introduced the ability to specify a type hint on the function so that it is used as the Spark schema. For example, you can specify the return type hint as below by using a Koalas DataFrame.

>>> import databricks.koalas as ks
>>> def pandas_div(pdf) -> ks.DataFrame[float, float]:
...     # pdf is a pandas DataFrame.
...     return pdf[['B', 'C']] / pdf[['B', 'C']]
...
>>> df = ks.DataFrame({'A': ['a', 'a', 'b'], 'B': [1, 2, 3], 'C': [4, 6, 5]})
>>> df.groupby('A').apply(pandas_div)

The function pandas_div actually takes and returns a pandas DataFrame, not a Koalas DataFrame. Nevertheless, this style forces you to write a type hint that does not match the actual type.

From Koalas 1.0 with Python 3.7+, you can specify the type hints by using pandas instances.

>>> import pandas as pd
>>> def pandas_div(pdf) -> pd.DataFrame[float, float]:
...     # pdf is a pandas DataFrame.
...     return pdf[['B', 'C']] / pdf[['B', 'C']]
...
>>> df = ks.DataFrame({'A': ['a', 'a', 'b'], 'B': [1, 2, 3], 'C': [4, 6, 5]})
>>> df.groupby('A').apply(pandas_div)

Likewise, a pandas Series can also be used as a type hint:

>>> import numpy as np
>>> def sqrt(x) -> pd.Series[float]:
...     return np.sqrt(x)
...
>>> df = ks.DataFrame([[4, 9]] * 3, columns=['A', 'B'])
>>> df.apply(sqrt, axis=0)

Currently, both Koalas and pandas instances can be used to specify the type hints; however, Koalas plans to move gradually towards using pandas instances only as the stability becomes proven.

Type Hinting with Names

Koalas 1.0 introduced a new style of type hinting to overcome the limitations of the existing style, especially for DataFrame. When you use a DataFrame as the return type hint, for example DataFrame[int, int], there is no way to specify the name of each Series. In the old style, Koalas simply generates the column names as c# (c0, c1, and so on), which makes it easy to lose or forget which Series is which. See the example below:

>>> def transform(pdf) -> pd.DataFrame[int, int]:
...     pdf['A'] = pdf.id + 1
...     return pdf
...
>>> ks.range(5).koalas.apply_batch(transform)

The new style of type hinting in Koalas is similar to regular Python type hints for variables. The Series name is specified as a string, and the type follows after a colon. The following example shows a simple case with two Series named id and A, both of type int.

>>> def transform(pdf) -> pd.DataFrame["id": int, "A": int]:
...     pdf['A'] = pdf.id + 1
...     return pdf
...
>>> ks.range(5).koalas.apply_batch(transform)
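
The named syntax is valid Python because a `name: type` pair inside square brackets is parsed as a slice. The sketch below shows how a class could read such hints through `__class_getitem__`. `Frame` is a hypothetical stand-in, not the Koalas or pandas class, and the fallback to `c0`, `c1` for unnamed entries is an assumption made for illustration.

```python
class Frame:
    """Hypothetical stand-in for a DataFrame class that accepts type hints."""

    def __class_getitem__(cls, params):
        # A single item arrives bare; several items arrive as a tuple.
        if not isinstance(params, tuple):
            params = (params,)
        fields = []
        for i, param in enumerate(params):
            if isinstance(param, slice):
                # "id": int is parsed as slice("id", int, None).
                fields.append((param.start, param.stop))
            else:
                # No name given: fall back to a generated one.
                fields.append((f"c{i}", param))
        return fields


named = Frame["id": int, "A": int]
unnamed = Frame[int, int]
```

Here `named` carries both the names and the types, while `unnamed` only carries types and has to rely on generated names.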

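In addition, Koalas dynamically supports pandas dtype instances and column indexes, so that users can generate the return type and schema programmatically. Note that the pdf named in the return type hint must refer to a pandas DataFrame that already exists when the function is defined; it is not the function parameter of the same name.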

>>> def transform(pdf) -> pd.DataFrame[zip(pdf.columns, pdf.dtypes)]:
...     return pdf + 1
...
>>> kdf.koalas.apply_batch(transform)
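
A point that is easy to miss with programmatically generated hints is where the names in the annotation come from. The following sketch uses a hypothetical `Frame` class and hypothetical `sample_columns` and `sample_dtypes` lists in place of a real pandas DataFrame. It assumes the module does not use `from __future__ import annotations`, which would turn the hint into an unevaluated string.

```python
class Frame:
    """Hypothetical stand-in that records whatever is passed in the subscript."""

    def __class_getitem__(cls, params):
        # Materialize iterables such as zip(...) into a list of pairs.
        return list(params)


# Hypothetical sample describing column names and types, defined up front.
sample_columns = ["id", "A"]
sample_dtypes = [int, float]


def transform(pdf) -> Frame[zip(sample_columns, sample_dtypes)]:
    # The parameter `pdf` is unrelated to the names used in the annotation.
    return pdf


# Names in the annotation resolve in the enclosing scope, not to parameters.
resolved = transform.__annotations__["return"]

# Rebinding the sample afterwards does not change the stored hint.
sample_columns = ["x", "y"]
still_resolved = transform.__annotations__["return"]
```

The hint is therefore a snapshot of the sample's columns and types, so the sample has to describe the output the function will actually return.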

Likewise, the dtype instances of a pandas DataFrame can be used alone, letting Koalas generate the column names.

>>> def transform(pdf) -> pd.DataFrame[pdf.dtypes]:
...     return pdf + 1
...
>>> kdf.koalas.apply_batch(transform)
