databricks.koalas.DataFrame.koalas.attach_id_column

koalas.attach_id_column(id_type: str, column: Union[Any, Tuple]) → DataFrame

Attach a column to be used as a row identifier, similar to the default index.

See also Default Index type.

Parameters

id_type : string

The id type.

‘sequence’ : a sequence that increases one by one.

Note

This uses Spark’s Window without a partition specification, which moves all data into a single partition on a single machine and can cause serious performance degradation. Avoid this method on very large datasets.

‘distributed-sequence’ : a sequence that increases one by one, computed with a group-by and group-map approach in a distributed manner.

‘distributed’ : a monotonically increasing sequence generated by PySpark’s monotonically_increasing_id function in a fully distributed manner. The values are unique but not consecutive.

column : string or tuple of string

The column name.

Returns

DataFrame: The DataFrame with the id column attached.

Examples

>>> df = ks.DataFrame({"x": ['a', 'b', 'c']})
>>> df.koalas.attach_id_column(id_type="sequence", column="id")
   x  id
0  a   0
1  b   1
2  c   2

>>> df.koalas.attach_id_column(id_type="distributed-sequence", column=0)
   x  0
0  a  0
1  b  1
2  c  2
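
The "distributed-sequence" ids above are consecutive even though no single machine sees all rows. One way this could work is sketched below in plain Python. It is an illustration only, not the Koalas implementation: the function name `distributed_sequence_ids` and the list-of-lists partition layout are assumptions made for the example. Each partition reports its row count, an exclusive prefix sum turns the counts into starting offsets, and each partition then numbers its own rows from its offset.

```python
from itertools import accumulate
from typing import List, Sequence


def distributed_sequence_ids(partitions: Sequence[Sequence[object]]) -> List[List[int]]:
    """Assign consecutive ids across partitions without merging them.

    Hypothetical helper for illustration; partitions are modelled as lists.
    """
    # Step 1: each partition reports only its row count (a cheap aggregate).
    counts = [len(part) for part in partitions]
    # Step 2: an exclusive prefix sum gives each partition its starting id.
    offsets = [0] + list(accumulate(counts))[:-1]
    # Step 3: each partition numbers its own rows, starting at its offset.
    return [
        [offset + i for i in range(count)]
        for offset, count in zip(offsets, counts)
    ]


# Assumed layout: the three rows 'a', 'b', 'c' split over two partitions.
ids = distributed_sequence_ids([["a", "b"], ["c"]])
print(ids)
```

Only the per-partition counts need to be gathered in one place, which is why this style of id avoids the single-partition bottleneck described for "sequence".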

>>> df.koalas.attach_id_column(id_type="distributed", column=0.0)
... 
   x  0.0
0  a  ...
1  b  ...
2  c  ...
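
The id values in the "distributed" example are elided because they depend on how the data is partitioned. PySpark documents that the current implementation of monotonically_increasing_id puts the partition ID in the upper 31 bits and the record number within each partition in the lower 33 bits. The plain-Python sketch below imitates that layout. The function name `monotonic_ids` and the list-of-lists partitions are assumptions made for the example, not part of Koalas or PySpark.

```python
from typing import List, Sequence

# Lower 33 bits hold the row number within a partition (per the PySpark docs).
RECORD_BITS = 33


def monotonic_ids(partitions: Sequence[Sequence[object]]) -> List[List[int]]:
    """Imitate the id layout: partition index in the upper bits, row number below.

    Hypothetical helper for illustration; partitions are modelled as lists.
    """
    return [
        [(partition_index << RECORD_BITS) + row for row in range(len(part))]
        for partition_index, part in enumerate(partitions)
    ]


# Assumed layout: the three rows 'a', 'b', 'c' split over two partitions.
ids = monotonic_ids([["a", "b"], ["c"]])
print(ids)
```

Under this layout the ids increase and never collide, but the first row of each later partition jumps ahead by a multiple of 2**33, so the sequence has gaps.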

For multi-index columns:

>>> df = ks.DataFrame({("x", "y"): ['a', 'b', 'c']})
>>> df.koalas.attach_id_column(id_type="sequence", column=("id-x", "id-y"))
   x id-x
   y id-y
0  a    0
1  b    1
2  c    2

>>> df.koalas.attach_id_column(id_type="distributed-sequence", column=(0, 1.0))
   x   0
   y 1.0
0  a   0
1  b   1
2  c   2
