Databricks: Add support for OPTIMIZE, PARTITIONED BY, and STRUCT syntax #2170

funcpp wants to merge 6 commits into apache:main from
Conversation
Add support for Databricks Delta Lake OPTIMIZE statement syntax:

`OPTIMIZE table_name [WHERE predicate] [ZORDER BY (col1, ...)]`

This extends the existing OptimizeTable AST to support both ClickHouse and Databricks syntax by adding:

- has_table_keyword: distinguishes OPTIMIZE TABLE (ClickHouse) from OPTIMIZE (Databricks)
- predicate: optional WHERE clause for partition filtering
- zorder: optional ZORDER BY clause for data colocation
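For illustration only, here is a minimal, self-contained sketch of how such an extended AST node can render both dialects. The struct below is a simplified stand-in using plain `String`s rather than the real sqlparser-rs `Ident`/`Expr` types, and is not the actual `OptimizeTable` definition:

```rust
use std::fmt;

// Simplified stand-in for the extended OptimizeTable AST node.
struct OptimizeTable {
    name: String,
    has_table_keyword: bool,     // ClickHouse: OPTIMIZE TABLE; Databricks: OPTIMIZE
    predicate: Option<String>,   // Databricks: WHERE <predicate>
    zorder: Option<Vec<String>>, // Databricks: ZORDER BY (col, ...)
}

impl fmt::Display for OptimizeTable {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "OPTIMIZE")?;
        if self.has_table_keyword {
            write!(f, " TABLE")?;
        }
        write!(f, " {}", self.name)?;
        if let Some(p) = &self.predicate {
            write!(f, " WHERE {}", p)?;
        }
        if let Some(cols) = &self.zorder {
            write!(f, " ZORDER BY ({})", cols.join(", "))?;
        }
        Ok(())
    }
}

fn main() {
    // Databricks form: no TABLE keyword, optional WHERE and ZORDER BY.
    let databricks = OptimizeTable {
        name: "t".into(),
        has_table_keyword: false,
        predicate: Some("date >= '2024-01-01'".into()),
        zorder: Some(vec!["col1".into(), "col2".into()]),
    };
    assert_eq!(
        databricks.to_string(),
        "OPTIMIZE t WHERE date >= '2024-01-01' ZORDER BY (col1, col2)"
    );

    // ClickHouse form: TABLE keyword present.
    let clickhouse = OptimizeTable {
        name: "t".into(),
        has_table_keyword: true,
        predicate: None,
        zorder: None,
    };
    assert_eq!(clickhouse.to_string(), "OPTIMIZE TABLE t");
}
```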
Databricks allows partition columns to be specified without types when referencing columns already defined in the table specification:

`CREATE TABLE t (col1 STRING, col2 INT) PARTITIONED BY (col1)`
`CREATE TABLE t (name STRING) PARTITIONED BY (year INT, month INT)`

This change introduces parse_column_def_for_partition(), which makes the data type optional by checking whether the next token is a comma or closing paren (indicating no type follows the column name).
Add support for Databricks/Hive-style STRUCT field syntax using colons: `STRUCT<field_name: field_type, ...>`

Changes:

- Add DatabricksDialect to STRUCT type parsing (alongside BigQuery/Generic)
- Modify parse_struct_field_def to handle an optional colon separator between field name and type, supporting both:
  - BigQuery style: `STRUCT<field_name field_type>`
  - Databricks/Hive style: `STRUCT<field_name: field_type>`

This enables parsing complex nested types like `ARRAY<STRUCT<finish_flag: STRING, survive_flag: STRING, score: INT>>`.
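To illustrate the two accepted field shapes, here is a toy sketch that splits a single struct field definition on either an optional colon or whitespace. It operates on plain strings rather than the real tokenizer, and `parse_struct_field` is an illustrative name, not a sqlparser-rs API:

```rust
// Parse one struct field definition, accepting both
// "name type" (BigQuery) and "name: type" (Databricks/Hive).
fn parse_struct_field(field: &str) -> Option<(String, String)> {
    let field = field.trim();
    // Prefer the colon separator; fall back to the first whitespace.
    let (name, rest) = match field.find(':') {
        Some(i) => (&field[..i], &field[i + 1..]),
        None => {
            let i = field.find(char::is_whitespace)?;
            (&field[..i], &field[i..])
        }
    };
    let name = name.trim();
    let ty = rest.trim();
    if name.is_empty() || ty.is_empty() {
        return None;
    }
    Some((name.to_string(), ty.to_string()))
}

fn main() {
    // Databricks/Hive style: colon separator.
    assert_eq!(
        parse_struct_field("finish_flag: STRING"),
        Some(("finish_flag".into(), "STRING".into()))
    );
    // BigQuery style: whitespace separator.
    assert_eq!(
        parse_struct_field("score INT"),
        Some(("score".into(), "INT".into()))
    );
    // A bare name with no type yields None.
    assert_eq!(parse_struct_field("noType"), None);
}
```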
LGTM @funcpp, but could you fix the clippy issue?
@andygrove sure, I fixed it.
andygrove left a comment
LGTM. I haven't been working in this repo lately, so let's wait for another maintainer to review before merging.
Thanks! @iffyio could you please take a look?
src/parser/mod.rs
Outdated
```rust
/// [Databricks](https://docs.databricks.com/en/sql/language-manual/delta-optimize.html)
pub fn parse_optimize_table(&mut self) -> Result<Statement, ParserError> {
    // Check for TABLE keyword (ClickHouse uses it, Databricks does not)
```
Suggested change:

```diff
-// Check for TABLE keyword (ClickHouse uses it, Databricks does not)
```

thinking we can skip the comment so that it doesn't become stale/incomplete if other dialects support the feature
src/parser/mod.rs
Outdated
```rust
let data_type = match self.peek_token().token {
    Token::Comma | Token::RParen => DataType::Unspecified,
    _ => self.parse_data_type()?,
};
```
I think we can instead do something like:

```rust
let data_type = self.maybe_parse(|parser| parser.parse_data_type())?;
```
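For readers unfamiliar with the helper: `maybe_parse` is the parser's backtracking primitive. It checkpoints the token position, attempts the sub-parse, and rewinds on failure so an optional element simply yields `None`. Below is a self-contained toy sketch of the pattern; it is simplified in that the real sqlparser-rs helper returns `Result<Option<T>, ParserError>` (hence the `?` in the suggestion), while this stand-in `Parser` returns a plain `Option`:

```rust
#[derive(Debug, Clone, PartialEq)]
enum Token {
    Word(String),
    Comma,
    RParen,
}

struct Parser {
    tokens: Vec<Token>,
    index: usize,
}

impl Parser {
    fn new(tokens: Vec<Token>) -> Self {
        Parser { tokens, index: 0 }
    }

    fn peek(&self) -> Option<&Token> {
        self.tokens.get(self.index)
    }

    /// Try a sub-parse; rewind the token index if it fails.
    fn maybe_parse<T>(
        &mut self,
        f: impl Fn(&mut Parser) -> Result<T, String>,
    ) -> Option<T> {
        let checkpoint = self.index;
        match f(self) {
            Ok(value) => Some(value),
            Err(_) => {
                self.index = checkpoint; // backtrack on failure
                None
            }
        }
    }

    /// Toy data-type parser: accepts a single word token like "INT".
    fn parse_data_type(&mut self) -> Result<String, String> {
        match self.peek().cloned() {
            Some(Token::Word(w)) => {
                self.index += 1;
                Ok(w)
            }
            other => Err(format!("expected data type, found {:?}", other)),
        }
    }
}

fn main() {
    // "col1 INT," -> the data type parse succeeds.
    let mut p = Parser::new(vec![Token::Word("INT".into()), Token::Comma]);
    assert_eq!(p.maybe_parse(|p| p.parse_data_type()), Some("INT".to_string()));

    // "col1," -> no type follows; maybe_parse rewinds and yields None.
    let mut p = Parser::new(vec![Token::Comma, Token::RParen]);
    assert_eq!(p.maybe_parse(|p| p.parse_data_type()), None);
    assert_eq!(p.peek(), Some(&Token::Comma)); // index was restored
}
```

This is why the reviewer's one-liner subsumes the manual `Token::Comma | Token::RParen` peek: any token that cannot start a data type makes the closure fail, and the rewind leaves the parser exactly where it was.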
src/parser/mod.rs
Outdated
```rust
    }
}

/// Parse column definition for PARTITIONED BY clause.
```
Suggested change:

```diff
-/// Parse column definition for PARTITIONED BY clause.
+/// Parse column definition for `PARTITIONED BY` clause.
```
src/parser/mod.rs
Outdated
````rust
/// ```
///
/// See [Databricks](https://docs.databricks.com/en/sql/language-manual/sql-ref-partition.html)
fn parse_column_def_for_partition(&mut self) -> Result<ColumnDef, ParserError> {
````
Suggested change:

```diff
-fn parse_column_def_for_partition(&mut self) -> Result<ColumnDef, ParserError> {
+fn parse_partitioned_by_column_def(&mut self) -> Result<ColumnDef, ParserError> {
```
src/parser/mod.rs
Outdated
````rust
/// ```
///
/// See [Databricks](https://docs.databricks.com/en/sql/language-manual/sql-ref-partition.html)
fn parse_column_def_for_partition(&mut self) -> Result<ColumnDef, ParserError> {
````
Is the reason for this rewrite to support an optional data type? If so, I think it would make sense to introduce a new type that reflects this instead of reusing the Unspecified data type, which has a different meaning. E.g. that this function returns `struct PartitionedByColumnDef { name, data_type: Option<DataType> }`, wdyt?
Since DataType::Unspecified already exists and is used by SQLite for similar "type-less" scenarios, I thought reusing it would keep things simpler without adding another AST type. Do the two cases have different meanings?
Oh I see, they're indeed the same case. It would seem that ColumnDef should have an optional data type instead of us introducing a special data type, but that issue is unrelated to this PR.

For this PR though, this new function seems to duplicate parse_column_def (minus the aforementioned optional data type case), so it would be good to reuse it instead. I can imagine something like:
```rust
pub fn parse_column_def(&mut self) -> Result<ColumnDef, ParserError> {
    self.parse_column_def_inner(true)
}

fn parse_column_def_inner(&mut self, optional_data_type: bool) -> Result<ColumnDef, ParserError> {
    // ...
    let data_type = if self.is_column_type_sqlite_unspecified() {
        DataType::Unspecified
    } else {
        self.maybe_parse(|parser| parser.parse_data_type())?
            .unwrap_or(DataType::Unspecified)
    };
    // ...
}
```
Makes sense! I've extracted the shared logic into `parse_column_def_inner(optional_data_type: bool)` so that `parse_column_def` delegates to it with `false` (type required) and the PARTITIONED BY path calls it with `true` (type optional via `maybe_parse`). This removes the duplicate function while reusing the full column def parsing logic, including options.
src/ast/mod.rs
Outdated
```rust
/// Whether the `TABLE` keyword was present (ClickHouse uses `OPTIMIZE TABLE`, Databricks uses `OPTIMIZE`).
has_table_keyword: bool,
/// Optional cluster identifier (ClickHouse).
```
Suggested change:

```diff
-/// Optional cluster identifier (ClickHouse).
+/// Optional cluster identifier.
+/// [Clickhouse](https://clickhouse.com/docs/en/sql-reference/statements/optimize)
```
src/ast/mod.rs
Outdated
```rust
/// Optional partition spec (ClickHouse).
partition: Option<Partition>,
/// Whether `FINAL` was specified (ClickHouse).
```
Suggested change:

```diff
-/// Whether `FINAL` was specified (ClickHouse).
+/// Whether `FINAL` was specified.
+/// [Clickhouse](https://clickhouse.com/docs/en/sql-reference/statements/optimize)
```
src/ast/mod.rs
Outdated
```rust
/// Whether `FINAL` was specified (ClickHouse).
include_final: bool,
/// Optional deduplication settings (ClickHouse).
```
Suggested change:

```diff
-/// Optional deduplication settings (ClickHouse).
+/// Optional deduplication settings.
+/// [Clickhouse](https://clickhouse.com/docs/en/sql-reference/statements/optimize)
```
src/ast/mod.rs
Outdated
```rust
/// Optional deduplication settings (ClickHouse).
deduplicate: Option<Deduplicate>,
/// Optional WHERE predicate (Databricks).
```
Suggested change:

```diff
-/// Optional WHERE predicate (Databricks).
+/// Optional WHERE predicate.
+/// [Databricks](https://docs.databricks.com/en/sql/language-manual/delta-optimize.html)
```
src/ast/mod.rs
Outdated
```rust
deduplicate: Option<Deduplicate>,
/// Optional WHERE predicate (Databricks).
predicate: Option<Expr>,
/// Optional ZORDER BY columns (Databricks).
```
Suggested change:

```diff
-/// Optional ZORDER BY columns (Databricks).
+/// Optional ZORDER BY columns.
+/// [Databricks](https://docs.databricks.com/en/sql/language-manual/delta-optimize.html)
```
- Update doc comments to use [Dialect](link) format
- Remove dialect-specific inline comments
- Rename parse_column_def_for_partition to parse_partitioned_by_column_def
- Use maybe_parse instead of manual token checking
- Use backticks for SQL keywords in doc comments
@funcpp could you take a look when you have some time to resolve the conflicts on the branch?
Summary
This PR adds several Databricks Delta Lake SQL syntax features:
1. OPTIMIZE statement support
Adds support for the Databricks `OPTIMIZE` statement syntax:

`OPTIMIZE table_name [WHERE predicate] [ZORDER BY (col1, col2, ...)]`

Reference: https://docs.databricks.com/en/sql/language-manual/delta-optimize.html

Key difference from ClickHouse: Databricks omits the `TABLE` keyword after `OPTIMIZE`.

2. PARTITIONED BY with optional column types
Databricks allows partition columns to reference existing table columns without specifying types:

`CREATE TABLE t (col1 STRING, col2 INT) PARTITIONED BY (col1)`

Reference: https://docs.databricks.com/en/sql/language-manual/sql-ref-partition.html
3. STRUCT type with colon syntax
Databricks uses the Hive-style colon separator for struct field definitions:

`STRUCT<field_name: field_type, ...>`

Reference: https://docs.databricks.com/en/sql/language-manual/data-types/struct-type.html
The colon is optional per the spec, so both `field: type` and `field type` syntaxes are now accepted.

Changes
- Extended the `OptimizeTable` AST to support Databricks-specific fields (`predicate`, `zorder`, `has_table_keyword`)
- Added `parse_column_def_for_partition()` to handle optional column types in `PARTITIONED BY`
- Added `DatabricksDialect` to STRUCT type parsing
- Modified `parse_struct_field_def()` to accept an optional colon separator

Test plan
- Added tests for `OPTIMIZE` statement variations
- Added tests for `PARTITIONED BY` with/without column types
- Added tests for `STRUCT` type with colon syntax
- All tests pass (`cargo test`)