This post is intended to familiarize newbies with SQLGlot's abstract syntax trees, how to traverse them, and how to mutate them.
## The tree
SQLGlot parses SQL into an abstract syntax tree (AST).
```python
from sqlglot import parse_one
ast = parse_one("SELECT a FROM (SELECT a FROM x) AS x")
```
An AST is a data structure that represents a SQL statement. The best way to glean the structure of a particular AST is Python's builtin `repr` function. Each node references its children in `args` and its parent in `parent`:
```python
assert ast.args["expressions"][0].args["this"].parent.parent is ast
```
Children can either be:
1. An Expression instance
2. A list of Expression instances
3. Another Python object, such as str or bool. This will always be a leaf node in the tree.
Navigating this tree requires an understanding of the different Expression types. The best way to browse Expression types is directly in the code at [expressions.py](../sqlglot/expressions.py). Let's look at a simplified version of one Expression type:
`arg_types` is a class attribute that specifies the possible children. The `args` keys of an Expression instance correspond to the `arg_types` keys of its class. The values of the `arg_types` dict are `True` if the key is required.
There are some common `arg_types` keys:
- "this": This is typically used for the primary child. In `Column`, "this" is the identifier for the column's name.
- "expression": This is typically used for the secondary child
- "expressions": This is typically used for a primary list of children
There aren't strict rules for when these keys are used, but they help with some of the convenience methods available on all Expression types:
- `Expression.this`: shorthand for `self.args.get("this")`
- `Expression.expression`: similarly, shorthand for the expression arg
- `Expression.expressions`: similarly, shorthand for the expressions list arg
- `Expression.name`: text name for whatever `this` is
`arg_types` don't specify the possible Expression types of children. This can be a challenge when you are writing code to traverse a particular AST and you don't know what to expect. A common trick is to parse an example query and print out the `repr`.
You can traverse an AST using just args, but there are some higher-order functions for programmatic traversal.
> SQLGlot can parse and generate SQL for many different dialects. However, there is only a single set of Expression types for all dialects. We like to say that the AST can represent the _superset_ of all dialects.
> SQLGlot tries to converge dialects on a standard AST. This means you can often write one piece of code that handles multiple dialects.
## Traversing the AST
Analyzing a SQL statement requires traversing this data structure. There are a few ways to do this:
### Args
If you know the structure of an AST, you can use `Expression.args` just like above. However, this can be very limited if you're dealing with arbitrary SQL.
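> At first glance, searching an AST for every `Table` node (for example, with `Expression.find_all`) seems like a great way to find all tables in a query. However, `Table` instances are not always tables in your database. For example, a reference to a common table expression is also parsed as a `Table`.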
`build_scope` returns an instance of the `Scope` class. `Scope` has numerous methods for inspecting a query. The best way to browse these methods is directly in the code at [scope.py](../sqlglot/optimizer/scope.py). You can also look for examples of how Scope is used throughout SQLGlot's [optimizer](../sqlglot/optimizer) module.
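> Some queries are ambiguous without the database schema. For example, in a query that joins `x` and `y`, an unqualified column `a` might come from either table. In these cases, you must pass the `schema` into `qualify`.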
## Mutating the tree
You can also modify an AST or build one from scratch. There are a few ways to do this.
### High-level builder methods
SQLGlot has methods for programmatically building up expressions similar to how you might in an ORM:
```python
from sqlglot import exp

ast = (
    exp
    .select("a", "b")
    .from_("x")
    .where("b < 4")
    .limit(10)
)
```
> [!WARNING]
> High-level builder methods will attempt to parse string arguments into Expressions. This can be very convenient, but make sure to keep in mind the dialect of the string. If it's written in a specific dialect, you need to set the `dialect` argument.
The best place to browse all the available high-level builder methods and their parameters is, as always, directly in the code at [expressions.py](../sqlglot/expressions.py).
High-level builder methods don't account for all possible expressions you might want to build. In the case where a particular high-level method is missing, use the low-level methods. Here are some examples:
> In general, you should use `Expression.set` and `Expression.append` instead of mutating `Expression.args` directly. `set` and `append` take care to update node references like `parent`.
You can also instantiate AST nodes directly:
```python
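from sqlglot import exp, parse_one

node = parse_one("SELECT a, b FROM x")  # example statement to modify
col = exp.Column(
    this=exp.to_identifier("c")
)
node.append("expressions", col)  # adds `c` to the SELECT list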
```
> [!WARNING]
> Because SQLGlot doesn't verify the types of args, it's easy to instantiate an invalid AST Node that won't generate to SQL properly. Take extra care to inspect the expected types of a node using the methods described above.
> As with the walk methods, `transform` doesn't manage scope. For safely transforming the columns and tables in complex expressions, you should probably use Scope.
1. **high-level builder methods** - use these whenever one exists for what you're trying to build.
2. **low-level builder methods** - use this only when high-level builder methods don't exist for what you're trying to build.
3. **transform** - use this for simple transformations on arbitrary statements.
And, of course, these mechanisms can be mixed and matched. For example, maybe you need to use scope to traverse an arbitrary AST and the high-level builder methods to mutate it in-place.
Still need help? [Get in touch!](../README.md#get-in-touch)