databricks.koalas.DataFrame.to_spark

DataFrame.to_spark(index_col: Union[str, List[str], None] = None)[source]

Return the current DataFrame as a Spark DataFrame.

Parameters
index_col: str or list of str, optional, default: None

Column names to be used in Spark to represent the Koalas index. The index name in Koalas is ignored. By default, the index is lost.

Examples

By default, this method loses the index, as shown below.

>>> df = ks.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
>>> df.to_spark().show()  
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|  4|  7|
|  2|  5|  8|
|  3|  6|  9|
+---+---+---+

If index_col is set, the index is kept as a column with the given name.

>>> df.to_spark(index_col="index").show()  
+-----+---+---+---+
|index|  a|  b|  c|
+-----+---+---+---+
|    0|  1|  4|  7|
|    1|  2|  5|  8|
|    2|  3|  6|  9|
+-----+---+---+---+

Keeping the index column is useful when you want to call some Spark APIs and then convert the result back to a Koalas DataFrame without creating a default index, which can hurt performance.

>>> spark_df = df.to_spark(index_col="index")
>>> spark_df = spark_df.filter("a == 2")
>>> spark_df.to_koalas(index_col="index")  
       a  b  c
index
1      2  5  8

For a multi-index, pass a list of names to index_col.

>>> new_df = df.set_index("a", append=True)
>>> new_spark_df = new_df.to_spark(index_col=["index_1", "index_2"])
>>> new_spark_df.show()  
+-------+-------+---+---+
|index_1|index_2|  b|  c|
+-------+-------+---+---+
|      0|      1|  4|  7|
|      1|      2|  5|  8|
|      2|      3|  6|  9|
+-------+-------+---+---+

Likewise, it can be converted back to a Koalas DataFrame.

>>> new_spark_df.to_koalas(
...     index_col=["index_1", "index_2"])  
                 b  c
index_1 index_2
0       1        4  7
1       2        5  8
2       3        6  9