sqlglot/README.md

# SQLGlot

SQLGlot is a no dependency Python SQL parser, transpiler, and optimizer. It can be used to format SQL or translate between different dialects like [DuckDB](https://duckdb.org/), [Presto](https://prestodb.io/), [Spark](https://spark.apache.org/), [Snowflake](https://www.snowflake.com/en/), and [BigQuery](https://cloud.google.com/bigquery/). It aims to read a wide variety of SQL inputs and output syntactically correct SQL in the targeted dialects.

It is a very comprehensive generic SQL parser with a robust [test suite](tests). It is also quite [performant](#benchmarks) while being written purely in Python.

You can easily [customize](#custom-dialects) the parser, [analyze](#metadata) queries, traverse expression trees, and programmatically [build](#build-and-modify-sql) SQL.

Syntax [errors](#parser-errors) are highlighted and dialect incompatibilities can warn or raise depending on configurations.

Contributions are very welcome in SQLGlot; read the [contribution guide](https://github.com/tobymao/sqlglot/blob/main/CONTRIBUTING.md) to get started!

## Table of Contents

* [Install](#install)
* [Documentation](#documentation)
* [Run Tests and Lint](#run-tests-and-lint)
* [Examples](#examples)
   * [Formatting and Transpiling](#formatting-and-transpiling)
   * [Metadata](#metadata)
   * [Parser Errors](#parser-errors)
   * [Unsupported Errors](#unsupported-errors)
   * [Build and Modify SQL](#build-and-modify-sql)
   * [SQL Optimizer](#sql-optimizer)
   * [AST Introspection](#ast-introspection)
   * [AST Diff](#ast-diff)
   * [Custom Dialects](#custom-dialects)
* [Benchmarks](#benchmarks)
* [Optional Dependencies](#optional-dependencies)

## Install

From PyPI:

```
pip3 install sqlglot
```

Or with a local checkout:

```
pip3 install -e .
```

Requirements for development (optional):

```
pip3 install -r dev-requirements.txt
```

## Documentation

SQLGlot uses [pdocs](https://pdoc.dev/) to serve its API documentation:

```
pdoc sqlglot --docformat google
```

## Run Tests and Lint

```
# set `SKIP_INTEGRATION=1` to skip integration tests
./run_checks.sh
```


## Examples

### Formatting and Transpiling

Easily translate from one dialect to another. For example, date/time functions vary from dialects and can be hard to deal with:

```python
import sqlglot
sqlglot.transpile("SELECT EPOCH_MS(1618088028295)", read="duckdb", write="hive")[0]
```

```sql
'SELECT FROM_UNIXTIME(1618088028295 / 1000)'
```

SQLGlot can even translate custom time formats:

```python
import sqlglot
sqlglot.transpile("SELECT STRFTIME(x, '%y-%-m-%S')", read="duckdb", write="hive")[0]
```

```sql
"SELECT DATE_FORMAT(x, 'yy-M-ss')"
```

As another example, let's suppose that we want to read in a SQL query that contains a CTE and a cast to `REAL`, and then transpile it to Spark, which uses backticks for identifiers and `FLOAT` instead of `REAL`:

```python
import sqlglot

sql = """WITH baz AS (SELECT a, c FROM foo WHERE a = 1) SELECT f.a, b.b, baz.c, CAST("b"."a" AS REAL) d FROM foo f JOIN bar b ON f.a = b.a LEFT JOIN baz ON f.a = baz.a"""
print(sqlglot.transpile(sql, write="spark", identify=True, pretty=True)[0])
```

```sql
WITH `baz` AS (
  SELECT
    `a`,
    `c`
  FROM `foo`
  WHERE
    `a` = 1
)
SELECT
  `f`.`a`,
  `b`.`b`,
  `baz`.`c`,
  CAST(`b`.`a` AS FLOAT) AS `d`
FROM `foo` AS `f`
JOIN `bar` AS `b`
  ON `f`.`a` = `b`.`a`
LEFT JOIN `baz`
  ON `f`.`a` = `baz`.`a`
```

Comments are also preserved in a best-effort basis when transpiling SQL code:

```python
sql = """
/* multi
   line
   comment
*/
SELECT
  tbl.cola /* comment 1 */ + tbl.colb /* comment 2 */,
  CAST(x AS INT), # comment 3
  y               -- comment 4
FROM
  bar /* comment 5 */,
  tbl #          comment 6
"""

print(sqlglot.transpile(sql, read='mysql', pretty=True)[0])
```

```sql
/* multi
   line
   comment
*/
SELECT
  tbl.cola /* comment 1 */ + tbl.colb /* comment 2 */,
  CAST(x AS INT), -- comment 3
  y -- comment 4
FROM bar /* comment 5 */, tbl /*          comment 6*/
```


### Metadata

You can explore SQL with expression helpers to do things like find columns and tables:

```python
from sqlglot import parse_one, exp

# print all column references (a and b)
for column in parse_one("SELECT a, b + 1 AS c FROM d").find_all(exp.Column):
  print(column.alias_or_name)

# find all projections in select statements (a and c)
for select in parse_one("SELECT a, b + 1 AS c FROM d").find_all(exp.Select):
  for projection in select.expressions:
    print(projection.alias_or_name)

# find all tables (x, y, z)
for table in parse_one("SELECT * FROM x JOIN y JOIN z").find_all(exp.Table):
  print(table.name)
```

### Parser Errors

A syntax error will result in a parser error:

```python
import sqlglot
sqlglot.transpile("SELECT foo( FROM bar")
```

```
sqlglot.errors.ParseError: Expecting ). Line 1, Col: 13.
  select foo( FROM bar
              ~~~~
```

### Unsupported Errors

Presto `APPROX_DISTINCT` supports the accuracy argument which is not supported in Hive:

```python
import sqlglot
sqlglot.transpile("SELECT APPROX_DISTINCT(a, 0.1) FROM foo", read="presto", write="hive")
```

```sql
APPROX_COUNT_DISTINCT does not support accuracy
'SELECT APPROX_COUNT_DISTINCT(a) FROM foo'
```

### Build and Modify SQL

SQLGlot supports incrementally building sql expressions:

```python
from sqlglot import select, condition

where = condition("x=1").and_("y=1")
select("*").from_("y").where(where).sql()
```

```sql
'SELECT * FROM y WHERE x = 1 AND y = 1'
```

You can also modify a parsed tree:

```python
from sqlglot import parse_one
parse_one("SELECT x FROM y").from_("z").sql()
```

```sql
'SELECT x FROM y, z'
```

There is also a way to recursively transform the parsed tree by applying a mapping function to each tree node:

```python
from sqlglot import exp, parse_one

expression_tree = parse_one("SELECT a FROM x")

def transformer(node):
    if isinstance(node, exp.Column) and node.name == "a":
        return parse_one("FUN(a)")
    return node

transformed_tree = expression_tree.transform(transformer)
transformed_tree.sql()
```

```sql
'SELECT FUN(a) FROM x'
```

### SQL Optimizer

SQLGlot can rewrite queries into an "optimized" form. It performs a variety of [techniques](sqlglot/optimizer/optimizer.py) to create a new canonical AST. This AST can be used to standardize queries or provide the foundations for implementing an actual engine. For example:

```python
import sqlglot
from sqlglot.optimizer import optimize

print(
    optimize(
        sqlglot.parse_one("""
            SELECT A OR (B OR (C AND D))
            FROM x
            WHERE Z = date '2021-01-01' + INTERVAL '1' month OR 1 = 0
        """),
        schema={"x": {"A": "INT", "B": "INT", "C": "INT", "D": "INT", "Z": "STRING"}}
    ).sql(pretty=True)
)
```

```
SELECT
  (
    "x"."A" OR "x"."B" OR "x"."C"
  ) AND (
    "x"."A" OR "x"."B" OR "x"."D"
  ) AS "_col_0"
FROM "x" AS "x"
WHERE
  "x"."Z" = CAST('2021-02-01' AS DATE)
```

### AST Introspection

You can see the AST version of the sql by calling `repr`:

```python
from sqlglot import parse_one
print(repr(parse_one("SELECT a + 1 AS z")))
```

```python
(SELECT expressions:
  (ALIAS this:
    (ADD this:
      (COLUMN this:
        (IDENTIFIER this: a, quoted: False)), expression:
      (LITERAL this: 1, is_string: False)), alias:
    (IDENTIFIER this: z, quoted: False)))
```

### AST Diff

SQLGlot can calculate the difference between two expressions and output changes in a form of a sequence of actions needed to transform a source expression into a target one:

```python
from sqlglot import diff, parse_one
diff(parse_one("SELECT a + b, c, d"), parse_one("SELECT c, a - b, d"))
```

```python
[
  Remove(expression=(ADD this:
    (COLUMN this:
      (IDENTIFIER this: a, quoted: False)), expression:
    (COLUMN this:
      (IDENTIFIER this: b, quoted: False)))),
  Insert(expression=(SUB this:
    (COLUMN this:
      (IDENTIFIER this: a, quoted: False)), expression:
    (COLUMN this:
      (IDENTIFIER this: b, quoted: False)))),
  Move(expression=(COLUMN this:
    (IDENTIFIER this: c, quoted: False))),
  Keep(source=(IDENTIFIER this: b, quoted: False), target=(IDENTIFIER this: b, quoted: False)),
  ...
]
```

### Custom Dialects

[Dialects](sqlglot/dialects) can be added by subclassing `Dialect`:

```python
from sqlglot import exp
from sqlglot.dialects.dialect import Dialect
from sqlglot.generator import Generator
from sqlglot.tokens import Tokenizer, TokenType


class Custom(Dialect):
    class Tokenizer(Tokenizer):
        QUOTES = ["'", '"']
        IDENTIFIERS = ["`"]

        KEYWORDS = {
            **Tokenizer.KEYWORDS,
            "INT64": TokenType.BIGINT,
            "FLOAT64": TokenType.DOUBLE,
        }

    class Generator(Generator):
        TRANSFORMS = {exp.Array: lambda self, e: f"[{self.expressions(e)}]"}

        TYPE_MAPPING = {
            exp.DataType.Type.TINYINT: "INT64",
            exp.DataType.Type.SMALLINT: "INT64",
            exp.DataType.Type.INT: "INT64",
            exp.DataType.Type.BIGINT: "INT64",
            exp.DataType.Type.DECIMAL: "NUMERIC",
            exp.DataType.Type.FLOAT: "FLOAT64",
            exp.DataType.Type.DOUBLE: "FLOAT64",
            exp.DataType.Type.BOOLEAN: "BOOL",
            exp.DataType.Type.TEXT: "STRING",
        }

print(Dialect["custom"])
```

```python
<class '__main__.Custom'>
```

## Benchmarks

[Benchmarks](benchmarks) run on Python 3.10.5 in seconds.

|           Query |         sqlglot |        sqlfluff |         sqltree |        sqlparse |  moz_sql_parser |        sqloxide |
| --------------- | --------------- | --------------- | --------------- | --------------- | --------------- | --------------- |
|            tpch |   0.01308 (1.0) | 1.60626 (122.7) | 0.01168 (0.893) | 0.04958 (3.791) | 0.08543 (6.531) | 0.00136 (0.104) |
|           short |   0.00109 (1.0) | 0.14134 (129.2) | 0.00099 (0.906) | 0.00342 (3.131) | 0.00652 (5.970) | 8.76621 (0.080) |
|            long |   0.01399 (1.0) | 2.12632 (151.9) | 0.01126 (0.805) | 0.04410 (3.151) | 0.06671 (4.767) | 0.00107 (0.076) |
|           crazy |   0.03969 (1.0) | 24.3777 (614.1) | 0.03917 (0.987) | 11.7043 (294.8) | 1.03280 (26.02) | 0.00625 (0.157) |


## Optional Dependencies

SQLGlot uses [dateutil](https://github.com/dateutil/dateutil) to simplify literal timedelta expressions. The optimizer will not simplify expressions like the following if the module cannot be found:

```sql
x + interval '1' month
```