2025-02-13 06:15:54 +01:00
# SQLGlot
2025-02-13 14:42:49 +01:00
SQLGlot is a no dependency Python SQL parser, transpiler, and optimizer. It can be used to format SQL or translate between different dialects like [DuckDB ](https://duckdb.org/ ), [Presto ](https://prestodb.io/ ), [Spark ](https://spark.apache.org/ ), [Snowflake ](https://www.snowflake.com/en/ ), and [BigQuery ](https://cloud.google.com/bigquery/ ). It aims to read a wide variety of SQL inputs and output syntactically correct SQL in the targeted dialects.
2025-02-13 06:15:54 +01:00
It is a very comprehensive generic SQL parser with a robust [test suite ](tests ). It is also quite [performant ](#benchmarks ) while being written purely in Python.
You can easily [customize ](#custom-dialects ) the parser, [analyze ](#metadata ) queries, traverse expression trees, and programmatically [build ](#build-and-modify-sql ) SQL.
Syntax [errors ](#parser-errors ) are highlighted and dialect incompatibilities can warn or raise depending on configurations.
2025-02-13 14:51:09 +01:00
Contributions are very welcome in SQLGlot; read the [contribution guide ](https://github.com/tobymao/sqlglot/blob/main/CONTRIBUTING.md ) to get started!
## Table of Contents
* [Install ](#install )
* [Documentation ](#documentation )
2025-02-13 14:52:26 +01:00
* [Run Tests and Lint ](#run-tests-and-lint )
2025-02-13 14:51:09 +01:00
* [Examples ](#examples )
* [Formatting and Transpiling ](#formatting-and-transpiling )
* [Metadata ](#metadata )
* [Parser Errors ](#parser-errors )
* [Unsupported Errors ](#unsupported-errors )
* [Build and Modify SQL ](#build-and-modify-sql )
* [SQL Optimizer ](#sql-optimizer )
* [AST Introspection ](#ast-introspection )
* [AST Diff ](#ast-diff )
* [Custom Dialects ](#custom-dialects )
* [Benchmarks ](#benchmarks )
* [Optional Dependencies ](#optional-dependencies )
2025-02-13 06:15:54 +01:00
## Install
2025-02-13 14:51:09 +01:00
From PyPI:
2025-02-13 06:15:54 +01:00
```
pip3 install sqlglot
```
2025-02-13 14:51:09 +01:00
Or with a local checkout:
2025-02-13 06:15:54 +01:00
```
pip3 install -e .
```
2025-02-13 14:51:09 +01:00
Requirements for development (optional):
```
pip3 install -r dev-requirements.txt
```
## Documentation
2025-02-13 14:52:26 +01:00
SQLGlot uses [pdocs ](https://pdoc.dev/ ) to serve its API documentation:
2025-02-13 14:51:09 +01:00
```
pdoc sqlglot --docformat google
```
## Run Tests and Lint
```
# set `SKIP_INTEGRATION=1` to skip integration tests
./run_checks.sh
```
2025-02-13 06:15:54 +01:00
## Examples
2025-02-13 14:51:09 +01:00
### Formatting and Transpiling
Easily translate from one dialect to another. For example, date/time functions vary from dialects and can be hard to deal with:
2025-02-13 06:15:54 +01:00
```python
import sqlglot
2025-02-13 14:51:09 +01:00
sqlglot.transpile("SELECT EPOCH_MS(1618088028295)", read="duckdb", write="hive")[0]
2025-02-13 06:15:54 +01:00
```
```sql
2025-02-13 14:51:09 +01:00
'SELECT FROM_UNIXTIME(1618088028295 / 1000)'
2025-02-13 06:15:54 +01:00
```
2025-02-13 14:51:09 +01:00
SQLGlot can even translate custom time formats:
2025-02-13 06:15:54 +01:00
```python
import sqlglot
2025-02-13 14:51:09 +01:00
sqlglot.transpile("SELECT STRFTIME(x, '%y-%-m-%S')", read="duckdb", write="hive")[0]
2025-02-13 06:15:54 +01:00
```
```sql
2025-02-13 14:51:09 +01:00
"SELECT DATE_FORMAT(x, 'yy-M-ss')"
2025-02-13 06:15:54 +01:00
```
2025-02-13 14:53:43 +01:00
As another example, let's suppose that we want to read in a SQL query that contains a CTE and a cast to `REAL` , and then transpile it to Spark, which uses backticks for identifiers and `FLOAT` instead of `REAL` :
2025-02-13 06:15:54 +01:00
```python
import sqlglot
sql = """WITH baz AS (SELECT a, c FROM foo WHERE a = 1) SELECT f.a, b.b, baz.c, CAST("b"."a" AS REAL) d FROM foo f JOIN bar b ON f.a = b.a LEFT JOIN baz ON f.a = baz.a"""
2025-02-13 14:51:09 +01:00
print(sqlglot.transpile(sql, write="spark", identify=True, pretty=True)[0])
2025-02-13 06:15:54 +01:00
```
```sql
WITH `baz` AS (
SELECT
`a` ,
`c`
FROM `foo`
WHERE
`a` = 1
)
SELECT
`f` .`a` ,
`b` .`b` ,
`baz` .`c` ,
CAST(`b` .`a` AS FLOAT) AS `d`
FROM `foo` AS `f`
JOIN `bar` AS `b`
ON `f` .`a` = `b` .`a`
LEFT JOIN `baz`
ON `f` .`a` = `baz` .`a`
```
2025-02-13 14:52:26 +01:00
Comments are also preserved in a best-effort basis when transpiling SQL code:
```python
sql = """
/* multi
line
comment
*/
SELECT
tbl.cola /* comment 1 */ + tbl.colb /* comment 2 */,
CAST(x AS INT), # comment 3
y -- comment 4
FROM
bar /* comment 5 */,
tbl # comment 6
"""
print(sqlglot.transpile(sql, read='mysql', pretty=True)[0])
```
```sql
/* multi
line
comment
*/
SELECT
tbl.cola /* comment 1 */ + tbl.colb /* comment 2 */,
CAST(x AS INT), -- comment 3
y -- comment 4
FROM bar /* comment 5 */, tbl /* comment 6*/
```
2025-02-13 14:51:09 +01:00
### Metadata
2025-02-13 06:15:54 +01:00
2025-02-13 14:51:09 +01:00
You can explore SQL with expression helpers to do things like find columns and tables:
2025-02-13 06:15:54 +01:00
```python
from sqlglot import parse_one, exp
# print all column references (a and b)
for column in parse_one("SELECT a, b + 1 AS c FROM d").find_all(exp.Column):
print(column.alias_or_name)
# find all projections in select statements (a and c)
for select in parse_one("SELECT a, b + 1 AS c FROM d").find_all(exp.Select):
for projection in select.expressions:
print(projection.alias_or_name)
# find all tables (x, y, z)
for table in parse_one("SELECT * FROM x JOIN y JOIN z").find_all(exp.Table):
print(table.name)
```
2025-02-13 14:51:09 +01:00
### Parser Errors
A syntax error will result in a parser error:
2025-02-13 06:15:54 +01:00
```python
2025-02-13 14:51:09 +01:00
import sqlglot
sqlglot.transpile("SELECT foo( FROM bar")
2025-02-13 06:15:54 +01:00
```
2025-02-13 14:51:09 +01:00
```
2025-02-13 06:15:54 +01:00
sqlglot.errors.ParseError: Expecting ). Line 1, Col: 13.
2025-02-13 14:51:09 +01:00
select foo( FROM bar
~~~~
```
2025-02-13 06:15:54 +01:00
2025-02-13 14:51:09 +01:00
### Unsupported Errors
Presto `APPROX_DISTINCT` supports the accuracy argument which is not supported in Hive:
2025-02-13 06:15:54 +01:00
```python
2025-02-13 14:51:09 +01:00
import sqlglot
sqlglot.transpile("SELECT APPROX_DISTINCT(a, 0.1) FROM foo", read="presto", write="hive")
2025-02-13 06:15:54 +01:00
```
```sql
2025-02-13 14:51:09 +01:00
APPROX_COUNT_DISTINCT does not support accuracy
'SELECT APPROX_COUNT_DISTINCT(a) FROM foo'
2025-02-13 06:15:54 +01:00
```
2025-02-13 14:51:09 +01:00
### Build and Modify SQL
SQLGlot supports incrementally building sql expressions:
2025-02-13 06:15:54 +01:00
```python
from sqlglot import select, condition
where = condition("x=1").and_("y=1")
select("*").from_("y").where(where).sql()
```
2025-02-13 14:51:09 +01:00
2025-02-13 06:15:54 +01:00
```sql
2025-02-13 14:51:09 +01:00
'SELECT * FROM y WHERE x = 1 AND y = 1'
2025-02-13 06:15:54 +01:00
```
You can also modify a parsed tree:
```python
from sqlglot import parse_one
parse_one("SELECT x FROM y").from_("z").sql()
```
2025-02-13 14:51:09 +01:00
2025-02-13 06:15:54 +01:00
```sql
2025-02-13 14:51:09 +01:00
'SELECT x FROM y, z'
2025-02-13 06:15:54 +01:00
```
There is also a way to recursively transform the parsed tree by applying a mapping function to each tree node:
```python
from sqlglot import exp, parse_one
expression_tree = parse_one("SELECT a FROM x")
def transformer(node):
if isinstance(node, exp.Column) and node.name == "a":
return parse_one("FUN(a)")
return node
transformed_tree = expression_tree.transform(transformer)
transformed_tree.sql()
```
2025-02-13 14:51:09 +01:00
2025-02-13 06:15:54 +01:00
```sql
2025-02-13 14:51:09 +01:00
'SELECT FUN(a) FROM x'
2025-02-13 06:15:54 +01:00
```
2025-02-13 14:51:09 +01:00
### SQL Optimizer
2025-02-13 06:15:54 +01:00
2025-02-13 14:51:09 +01:00
SQLGlot can rewrite queries into an "optimized" form. It performs a variety of [techniques ](sqlglot/optimizer/optimizer.py ) to create a new canonical AST. This AST can be used to standardize queries or provide the foundations for implementing an actual engine. For example:
2025-02-13 06:15:54 +01:00
```python
import sqlglot
from sqlglot.optimizer import optimize
2025-02-13 14:51:09 +01:00
print(
optimize(
sqlglot.parse_one("""
SELECT A OR (B OR (C AND D))
FROM x
WHERE Z = date '2021-01-01' + INTERVAL '1' month OR 1 = 0
"""),
schema={"x": {"A": "INT", "B": "INT", "C": "INT", "D": "INT", "Z": "STRING"}}
).sql(pretty=True)
)
```
```
2025-02-13 06:15:54 +01:00
SELECT
(
2025-02-13 14:51:09 +01:00
"x"."A" OR "x"."B" OR "x"."C"
) AND (
"x"."A" OR "x"."B" OR "x"."D"
2025-02-13 06:15:54 +01:00
) AS "_col_0"
FROM "x" AS "x"
WHERE
"x"."Z" = CAST('2021-02-01' AS DATE)
```
2025-02-13 14:51:09 +01:00
### AST Introspection
2025-02-13 06:15:54 +01:00
2025-02-13 14:51:09 +01:00
You can see the AST version of the sql by calling `repr` :
2025-02-13 06:15:54 +01:00
```python
from sqlglot import parse_one
2025-02-13 14:51:09 +01:00
print(repr(parse_one("SELECT a + 1 AS z")))
```
2025-02-13 06:15:54 +01:00
2025-02-13 14:51:09 +01:00
```python
2025-02-13 06:15:54 +01:00
(SELECT expressions:
(ALIAS this:
(ADD this:
(COLUMN this:
(IDENTIFIER this: a, quoted: False)), expression:
(LITERAL this: 1, is_string: False)), alias:
(IDENTIFIER this: z, quoted: False)))
```
2025-02-13 14:51:09 +01:00
### AST Diff
2025-02-13 06:15:54 +01:00
2025-02-13 14:51:09 +01:00
SQLGlot can calculate the difference between two expressions and output changes in a form of a sequence of actions needed to transform a source expression into a target one:
2025-02-13 06:15:54 +01:00
```python
from sqlglot import diff, parse_one
diff(parse_one("SELECT a + b, c, d"), parse_one("SELECT c, a - b, d"))
2025-02-13 14:51:09 +01:00
```
2025-02-13 06:15:54 +01:00
2025-02-13 14:51:09 +01:00
```python
2025-02-13 06:15:54 +01:00
[
Remove(expression=(ADD this:
(COLUMN this:
(IDENTIFIER this: a, quoted: False)), expression:
(COLUMN this:
(IDENTIFIER this: b, quoted: False)))),
Insert(expression=(SUB this:
(COLUMN this:
(IDENTIFIER this: a, quoted: False)), expression:
(COLUMN this:
(IDENTIFIER this: b, quoted: False)))),
Move(expression=(COLUMN this:
(IDENTIFIER this: c, quoted: False))),
Keep(source=(IDENTIFIER this: b, quoted: False), target=(IDENTIFIER this: b, quoted: False)),
...
]
```
2025-02-13 14:51:09 +01:00
### Custom Dialects
2025-02-13 06:15:54 +01:00
2025-02-13 14:51:09 +01:00
[Dialects ](sqlglot/dialects ) can be added by subclassing `Dialect` :
2025-02-13 06:15:54 +01:00
```python
from sqlglot import exp
from sqlglot.dialects.dialect import Dialect
from sqlglot.generator import Generator
from sqlglot.tokens import Tokenizer, TokenType
class Custom(Dialect):
class Tokenizer(Tokenizer):
QUOTES = ["'", '"']
IDENTIFIERS = ["`"]
KEYWORDS = {
**Tokenizer.KEYWORDS,
"INT64": TokenType.BIGINT,
"FLOAT64": TokenType.DOUBLE,
}
class Generator(Generator):
TRANSFORMS = {exp.Array: lambda self, e: f"[{self.expressions(e)}]"}
TYPE_MAPPING = {
exp.DataType.Type.TINYINT: "INT64",
exp.DataType.Type.SMALLINT: "INT64",
exp.DataType.Type.INT: "INT64",
exp.DataType.Type.BIGINT: "INT64",
exp.DataType.Type.DECIMAL: "NUMERIC",
exp.DataType.Type.FLOAT: "FLOAT64",
exp.DataType.Type.DOUBLE: "FLOAT64",
exp.DataType.Type.BOOLEAN: "BOOL",
exp.DataType.Type.TEXT: "STRING",
}
2025-02-13 14:51:09 +01:00
print(Dialect["custom"])
```
2025-02-13 06:15:54 +01:00
2025-02-13 14:51:09 +01:00
```python
< class ' __main__ . Custom ' >
2025-02-13 06:15:54 +01:00
```
## Benchmarks
[Benchmarks ](benchmarks ) run on Python 3.10.5 in seconds.
2025-02-13 14:53:43 +01:00
| Query | sqlglot | sqlfluff | sqltree | sqlparse | moz_sql_parser | sqloxide |
| --------------- | --------------- | --------------- | --------------- | --------------- | --------------- | --------------- |
| tpch | 0.01308 (1.0) | 1.60626 (122.7) | 0.01168 (0.893) | 0.04958 (3.791) | 0.08543 (6.531) | 0.00136 (0.104) |
| short | 0.00109 (1.0) | 0.14134 (129.2) | 0.00099 (0.906) | 0.00342 (3.131) | 0.00652 (5.970) | 8.76621 (0.080) |
| long | 0.01399 (1.0) | 2.12632 (151.9) | 0.01126 (0.805) | 0.04410 (3.151) | 0.06671 (4.767) | 0.00107 (0.076) |
| crazy | 0.03969 (1.0) | 24.3777 (614.1) | 0.03917 (0.987) | 11.7043 (294.8) | 1.03280 (26.02) | 0.00625 (0.157) |
2025-02-13 06:15:54 +01:00
## Optional Dependencies
2025-02-13 14:51:09 +01:00
SQLGlot uses [dateutil ](https://github.com/dateutil/dateutil ) to simplify literal timedelta expressions. The optimizer will not simplify expressions like the following if the module cannot be found:
2025-02-13 06:15:54 +01:00
```sql
x + interval '1' month
```