databricks.koalas.sql¶

databricks.koalas.sql(query: str, globals=None, locals=None, **kwargs) → databricks.koalas.frame.DataFrame[source]¶

Execute a SQL query and return the result as a Koalas DataFrame.

This function also supports embedding Python variables (locals, globals, and parameters) in the SQL statement by wrapping them in curly braces. See examples section for details.

In addition to the locals, globals and parameters, the function will also attempt to determine if the program currently runs in an IPython (or Jupyter) environment and to import the variables from this environment. The variables have the same precedence as globals.

The following variable types are supported:

string

int

float

list, tuple, range of above types

Koalas DataFrame

Koalas Series

pandas DataFrame

Parameters

querystr: the SQL query
globalsdict, optional: the dictionary of global variables, if explicitly set by the user
localsdict, optional: the dictionary of local variables, if explicitly set by the user
kwargs: other variables that the user may want to set manually that can be referenced in the query

Returns

Koalas DataFrame

Examples

Calling a built-in SQL function.

>>> ks.sql("select * from range(10) where id > 7")
   id
0   8
1   9

A query can also reference a local variable or parameter by wrapping them in curly braces:

>>> bound1 = 7
>>> ks.sql("select * from range(10) where id > {bound1} and id < {bound2}", bound2=9)
   id
0   8

You can also wrap a DataFrame with curly braces to query it directly. Note that when you do that, the indexes, if any, automatically become top level columns.

>>> mydf = ks.range(10)
>>> x = range(4)
>>> ks.sql("SELECT * from {mydf} WHERE id IN {x}").sort_index()
   id
0   0
1   1
2   2
3   3

Queries can also be arbitrarily nested in functions:

>>> def statement():
...     mydf2 = ks.DataFrame({"x": range(2)})
...     return ks.sql("SELECT * from {mydf2}").sort_index()
>>> statement()
   x
0  0
1  1

Mixing Koalas and pandas DataFrames in a join operation. Note that the index is dropped.

>>> ks.sql('''
...   SELECT m1.a, m2.b
...   FROM {table1} m1 INNER JOIN {table2} m2
...   ON m1.key = m2.key
...   ORDER BY m1.a, m2.b''',
...   table1=ks.DataFrame({"a": [1,2], "key": ["a", "b"]}),
...   table2=pd.DataFrame({"b": [3,4,5], "key": ["a", "b", "b"]})).sort_index()
   a  b
0  1  3
1  2  4
2  2  5

Also, it is possible to query using Series.

>>> myser = ks.Series({'a': [1.0, 2.0, 3.0], 'b': [15.0, 30.0, 45.0]})
>>> ks.sql("SELECT * from {myser}").sort_index()
                    0
0     [1.0, 2.0, 3.0]
1  [15.0, 30.0, 45.0]

databricks.koalas.concat databricks.koalas.broadcast