-
Notifications
You must be signed in to change notification settings - Fork 79
SQL Modules and Variable Expansion
Note: the %sql
magic is being deprecated as of the v1.0 release. Please see the migration guide page here.
The %sql
cell magic serves two purposes:
- executing a SQL query statement immediately (if the
-m/--module
argument is omitted); - defining a SQL module for later execution (if the
-m/--module
argument is used).
Variable substitution is done immediately in the first case using the IPython notebook execution environment.
In the second case a Python module is created and immediately imported, using the name supplied in the -m/--module
argument. This module will have several variables defined in its namespace:
-
_sql_module_arg_parser
is a Pythonargparse
argument parser that is created from any Python variable definitions at the start of the cell; -
_sql_module_main
is the main query statement (datalab.data.SqlStatement
object) to be executed for the cell; i.e. the one with noDEFINE
prefix; -
_sql_module_last
is the last query statement defined in the cell; this is frequently the same as_sql_module_main
- for each SQL query that starts with a
DEFINE <name>
prefix, `<name>' will be bound to that query statement.
The argparser is built by finding the start of the first query statement, and treating all the code preceding it as Python code. This code is executed in new environment that has a few helper functions predefined, and then the resulting environment is introspected to create an argument parser. For example:
x = 3
will result in a -x
argument being defined with a default value of 3.
The special predefined helper functions datestring
and source
provide a way of generating arguments that map to table names or tables that can be bound to variables in the query. datestring
takes a format
argument and an offset
argument (both strings); it gets a datetime, applies the offset
, and uses that to generate a string based on the format
, assigning the result to the variable. When binding to such a variable you would actually supply the initial datetime formatted with "%Y%m%d", or use one of the special strings "today", "yesterday", or "now".
For example:
when = datestring("%Y-%m-%d", "-1d")
would create a --when
argument in the parser that could be applied using:
--when 20160228
to get:
"2016-02-27"
Similarly:
--when=yesterday
would result in a "%Y-%m-%d" formatted date for the day before yesterday.
Offsets can be specified with + or -, followed by a number, and a unit, like 'd' for days, 'h' for hours, etc. The offset can also use a comma-separated list of offsets that will be applied in order.
The logic for handling this is in datalab.data._sql.py
in the function _date
.
The source
function is similar except that after resolving to a string it will then further try to resolve
to a BigQuery table with that name. This makes it useful for writing query statements that act on log files,
for example (e.g. we might use something like:
yesterdays_logs = source("myproject:mydataset.logfile-%Y%m%d", "-1d")
Arguments can be passed to queries as YAML in cells like %bigquery execute
, or as named arguments or a dictionary named values
when instantiating a datalab.bigquery.Query
object. Variable expansion is handled by
datalab.data.SqlModule.expand
(note UDF references are handled separately; see the UDF section in the wiki
more on that). When expanding the variables any values specified explicitly (via YAML etc) are turned into a
pseudo command line that is parsed by the argparser for the SQL module; this will handle any source
or
datestring
arguments as well as supply default values for the arguments not passed explicitly. The resulting
dictionary is then used for variable expansion, including expanding any nested query references.
When using nested queries only the outermost argparser is called which is why an outer query cell needs to redefine variables used in a nested query reference that is defined in a different cell; the outer argparser will not know about those arguments otherwise and will report them as invalid.