Databricks: Add support for OPTIMIZE, PARTITIONED BY, and STRUCT syntax #2170

funcpp wants to merge 6 commits into apache:main from
Conversation
Add support for Databricks Delta Lake OPTIMIZE statement syntax:

`OPTIMIZE table_name [WHERE predicate] [ZORDER BY (col1, ...)]`

This extends the existing OptimizeTable AST to support both ClickHouse and Databricks syntax by adding:

- has_table_keyword: distinguishes OPTIMIZE TABLE (ClickHouse) from OPTIMIZE (Databricks)
- predicate: optional WHERE clause for partition filtering
- zorder: optional ZORDER BY clause for data colocation
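For illustration only, here is a minimal, self-contained sketch of how such an extended AST node can render both dialects. The struct below is a simplified stand-in using plain `String`s rather than the real sqlparser-rs `Ident`/`Expr` types, and is not the actual `OptimizeTable` definition:

```rust
use std::fmt;

// Simplified stand-in for the extended OptimizeTable AST node.
struct OptimizeTable {
    name: String,
    has_table_keyword: bool,     // ClickHouse: OPTIMIZE TABLE; Databricks: OPTIMIZE
    predicate: Option<String>,   // Databricks: WHERE <predicate>
    zorder: Option<Vec<String>>, // Databricks: ZORDER BY (col, ...)
}

impl fmt::Display for OptimizeTable {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "OPTIMIZE")?;
        if self.has_table_keyword {
            write!(f, " TABLE")?;
        }
        write!(f, " {}", self.name)?;
        if let Some(p) = &self.predicate {
            write!(f, " WHERE {}", p)?;
        }
        if let Some(cols) = &self.zorder {
            write!(f, " ZORDER BY ({})", cols.join(", "))?;
        }
        Ok(())
    }
}

fn main() {
    // Databricks form: no TABLE keyword, optional WHERE and ZORDER BY.
    let databricks = OptimizeTable {
        name: "t".into(),
        has_table_keyword: false,
        predicate: Some("date >= '2024-01-01'".into()),
        zorder: Some(vec!["col1".into(), "col2".into()]),
    };
    assert_eq!(
        databricks.to_string(),
        "OPTIMIZE t WHERE date >= '2024-01-01' ZORDER BY (col1, col2)"
    );

    // ClickHouse form: TABLE keyword present.
    let clickhouse = OptimizeTable {
        name: "t".into(),
        has_table_keyword: true,
        predicate: None,
        zorder: None,
    };
    assert_eq!(clickhouse.to_string(), "OPTIMIZE TABLE t");
}
```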
Databricks allows partition columns to be specified without types when referencing columns already defined in the table specification:

`CREATE TABLE t (col1 STRING, col2 INT) PARTITIONED BY (col1)`
`CREATE TABLE t (name STRING) PARTITIONED BY (year INT, month INT)`

This change introduces parse_column_def_for_partition(), which makes the data type optional by checking whether the next token is a comma or closing paren (indicating no type follows the column name).
Add support for Databricks/Hive-style STRUCT field syntax using colons: `STRUCT<field_name: field_type, ...>`

Changes:

- Add DatabricksDialect to STRUCT type parsing (alongside BigQuery/Generic)
- Modify parse_struct_field_def to handle an optional colon separator between field name and type, supporting both:
  - BigQuery style: `STRUCT<field_name field_type>`
  - Databricks/Hive style: `STRUCT<field_name: field_type>`

This enables parsing complex nested types like `ARRAY<STRUCT<finish_flag: STRING, survive_flag: STRING, score: INT>>`.
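To illustrate the two accepted field shapes, here is a toy sketch that splits a single struct field definition on either an optional colon or whitespace. It operates on plain strings rather than the real tokenizer, and `parse_struct_field` is an illustrative name, not a sqlparser-rs API:

```rust
// Parse one struct field definition, accepting both
// "name type" (BigQuery) and "name: type" (Databricks/Hive).
fn parse_struct_field(field: &str) -> Option<(String, String)> {
    let field = field.trim();
    // Prefer the colon separator; fall back to the first whitespace.
    let (name, rest) = match field.find(':') {
        Some(i) => (&field[..i], &field[i + 1..]),
        None => {
            let i = field.find(char::is_whitespace)?;
            (&field[..i], &field[i..])
        }
    };
    let name = name.trim();
    let ty = rest.trim();
    if name.is_empty() || ty.is_empty() {
        return None;
    }
    Some((name.to_string(), ty.to_string()))
}

fn main() {
    // Databricks/Hive style: colon separator.
    assert_eq!(
        parse_struct_field("finish_flag: STRING"),
        Some(("finish_flag".into(), "STRING".into()))
    );
    // BigQuery style: whitespace separator.
    assert_eq!(
        parse_struct_field("score INT"),
        Some(("score".into(), "INT".into()))
    );
    // A bare name with no type yields None.
    assert_eq!(parse_struct_field("noType"), None);
}
```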
LGTM @funcpp, but could you fix the clippy issue?
@andygrove sure, I fixed it.
andygrove left a comment
LGTM. I haven't been working in this repo lately, so let's wait for another maintainer to review before merging.
Thanks! @iffyio could you please take a look?
src/parser/mod.rs
Outdated
```rust
/// [Databricks](https://docs.databricks.com/en/sql/language-manual/delta-optimize.html)
pub fn parse_optimize_table(&mut self) -> Result<Statement, ParserError> {
    // Check for TABLE keyword (ClickHouse uses it, Databricks does not)
```
Suggested change:

```diff
-// Check for TABLE keyword (ClickHouse uses it, Databricks does not)
```

thinking we can skip the comment so that it doesn't become stale/incomplete if other dialects support the feature
src/parser/mod.rs
Outdated
```rust
let data_type = match self.peek_token().token {
    Token::Comma | Token::RParen => DataType::Unspecified,
    _ => self.parse_data_type()?,
};
```
I think we can instead do something like:

```rust
let data_type = self.maybe_parse(|parser| parser.parse_data_type())?;
```
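For readers unfamiliar with the helper: `maybe_parse` is the parser's backtracking primitive. It checkpoints the token position, attempts the sub-parse, and rewinds on failure so an optional element simply yields `None`. Below is a self-contained toy sketch of the pattern; it is simplified in that the real sqlparser-rs helper returns `Result<Option<T>, ParserError>` (hence the `?` in the suggestion), while this stand-in `Parser` returns a plain `Option`:

```rust
#[derive(Debug, Clone, PartialEq)]
enum Token {
    Word(String),
    Comma,
    RParen,
}

struct Parser {
    tokens: Vec<Token>,
    index: usize,
}

impl Parser {
    fn new(tokens: Vec<Token>) -> Self {
        Parser { tokens, index: 0 }
    }

    fn peek(&self) -> Option<&Token> {
        self.tokens.get(self.index)
    }

    /// Try a sub-parse; rewind the token index if it fails.
    fn maybe_parse<T>(
        &mut self,
        f: impl Fn(&mut Parser) -> Result<T, String>,
    ) -> Option<T> {
        let checkpoint = self.index;
        match f(self) {
            Ok(value) => Some(value),
            Err(_) => {
                self.index = checkpoint; // backtrack on failure
                None
            }
        }
    }

    /// Toy data-type parser: accepts a single word token like "INT".
    fn parse_data_type(&mut self) -> Result<String, String> {
        match self.peek().cloned() {
            Some(Token::Word(w)) => {
                self.index += 1;
                Ok(w)
            }
            other => Err(format!("expected data type, found {:?}", other)),
        }
    }
}

fn main() {
    // "col1 INT," -> the data type parse succeeds.
    let mut p = Parser::new(vec![Token::Word("INT".into()), Token::Comma]);
    assert_eq!(p.maybe_parse(|p| p.parse_data_type()), Some("INT".to_string()));

    // "col1," -> no type follows; maybe_parse rewinds and yields None.
    let mut p = Parser::new(vec![Token::Comma, Token::RParen]);
    assert_eq!(p.maybe_parse(|p| p.parse_data_type()), None);
    assert_eq!(p.peek(), Some(&Token::Comma)); // index was restored
}
```

This is why the reviewer's one-liner subsumes the manual `Token::Comma | Token::RParen` peek: any token that cannot start a data type makes the closure fail, and the rewind leaves the parser exactly where it was.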
src/parser/mod.rs
Outdated
```rust
    }
}

/// Parse column definition for PARTITIONED BY clause.
```
Suggested change:

```diff
-/// Parse column definition for PARTITIONED BY clause.
+/// Parse column definition for `PARTITIONED BY` clause.
```
src/parser/mod.rs
Outdated
````rust
/// ```
///
/// See [Databricks](https://docs.databricks.com/en/sql/language-manual/sql-ref-partition.html)
fn parse_column_def_for_partition(&mut self) -> Result<ColumnDef, ParserError> {
````
Suggested change:

```diff
-fn parse_column_def_for_partition(&mut self) -> Result<ColumnDef, ParserError> {
+fn parse_partitioned_by_column_def(&mut self) -> Result<ColumnDef, ParserError> {
```
src/parser/mod.rs
Outdated
````rust
/// ```
///
/// See [Databricks](https://docs.databricks.com/en/sql/language-manual/sql-ref-partition.html)
fn parse_column_def_for_partition(&mut self) -> Result<ColumnDef, ParserError> {
````
Is the reason for this rewrite to support an optional data type? If so, I think it would make sense to introduce a new type that reflects this instead of reusing the Unspecified data type, which has a different meaning. E.g. that this function returns `struct PartitionedByColumnDef { name, data_type: Option<DataType> }`, wdyt?
Since DataType::Unspecified already exists and is used by SQLite for similar "type-less" scenarios, I thought reusing it would keep things simpler without adding another AST type. Do the two cases have different meanings?
Oh I see, they're indeed the same case. It would seem that ColumnDef should have an optional data type instead of us introducing a special data type, but that issue is unrelated to this PR.

For this PR though, this new function seems to duplicate parse_column_def (minus the aforementioned optional data type case), so it would be good to reuse it instead. I can imagine something like:
```rust
pub fn parse_column_def(&mut self) -> Result<ColumnDef, ParserError> {
    self.parse_column_def_inner(true)
}

fn parse_column_def_inner(&mut self, optional_data_type: bool) -> Result<ColumnDef, ParserError> {
    // ...
    let data_type = if self.is_column_type_sqlite_unspecified() {
        DataType::Unspecified
    } else {
        self.maybe_parse(|parser| parser.parse_data_type())?
            .unwrap_or(DataType::Unspecified)
    };
    // ...
}
```
Makes sense! I've extracted the shared logic into `parse_column_def_inner(optional_data_type: bool)` so that `parse_column_def` delegates to it with `false` (type required) and the PARTITIONED BY path calls it with `true` (type optional via `maybe_parse`). This removes the duplicate function while reusing the full column def parsing logic, including options.
src/ast/mod.rs
Outdated
```rust
/// Whether the `TABLE` keyword was present (ClickHouse uses `OPTIMIZE TABLE`, Databricks uses `OPTIMIZE`).
has_table_keyword: bool,
/// Optional cluster identifier (ClickHouse).
```
Suggested change:

```diff
-/// Optional cluster identifier (ClickHouse).
+/// Optional cluster identifier.
+/// [Clickhouse](https://clickhouse.com/docs/en/sql-reference/statements/optimize)
```
src/ast/mod.rs
Outdated
```rust
/// Optional partition spec (ClickHouse).
partition: Option<Partition>,
/// Whether `FINAL` was specified (ClickHouse).
```
Suggested change:

```diff
-/// Whether `FINAL` was specified (ClickHouse).
+/// Whether `FINAL` was specified.
+/// [Clickhouse](https://clickhouse.com/docs/en/sql-reference/statements/optimize)
```
src/ast/mod.rs
Outdated
```rust
/// Whether `FINAL` was specified (ClickHouse).
include_final: bool,
/// Optional deduplication settings (ClickHouse).
```
Suggested change:

```diff
-/// Optional deduplication settings (ClickHouse).
+/// Optional deduplication settings.
+/// [Clickhouse](https://clickhouse.com/docs/en/sql-reference/statements/optimize)
```
src/ast/mod.rs
Outdated
```rust
/// Optional deduplication settings (ClickHouse).
deduplicate: Option<Deduplicate>,
/// Optional WHERE predicate (Databricks).
```
Suggested change:

```diff
-/// Optional WHERE predicate (Databricks).
+/// Optional WHERE predicate.
+/// [Databricks](https://docs.databricks.com/en/sql/language-manual/delta-optimize.html)
```
src/ast/mod.rs
Outdated
```rust
deduplicate: Option<Deduplicate>,
/// Optional WHERE predicate (Databricks).
predicate: Option<Expr>,
/// Optional ZORDER BY columns (Databricks).
```
Suggested change:

```diff
-/// Optional ZORDER BY columns (Databricks).
+/// Optional ZORDER BY columns.
+/// [Databricks](https://docs.databricks.com/en/sql/language-manual/delta-optimize.html)
```
- Update doc comments to use [Dialect](link) format
- Remove dialect-specific inline comments
- Rename parse_column_def_for_partition to parse_partitioned_by_column_def
- Use maybe_parse instead of manual token checking
- Use backticks for SQL keywords in doc comments
@funcpp could you take a look when you have some time to resolve the conflicts on the branch?
Summary
This PR adds several Databricks Delta Lake SQL syntax features:
1. OPTIMIZE statement support
Adds support for the Databricks `OPTIMIZE` statement syntax:

`OPTIMIZE table_name [WHERE predicate] [ZORDER BY (col1, col2, ...)]`

Reference: https://docs.databricks.com/en/sql/language-manual/delta-optimize.html

Key difference from ClickHouse: Databricks omits the `TABLE` keyword after `OPTIMIZE`.

2. PARTITIONED BY with optional column types
Databricks allows partition columns to reference existing table columns without specifying types:

`CREATE TABLE t (col1 STRING, col2 INT) PARTITIONED BY (col1)`

Reference: https://docs.databricks.com/en/sql/language-manual/sql-ref-partition.html
3. STRUCT type with colon syntax
Databricks uses the Hive-style colon separator for struct field definitions:

`STRUCT<field_name: field_type, ...>`

Reference: https://docs.databricks.com/en/sql/language-manual/data-types/struct-type.html
The colon is optional per the spec, so both `field: type` and `field type` syntaxes are now accepted.

Changes
- Extended the `OptimizeTable` AST to support Databricks-specific fields (`predicate`, `zorder`, `has_table_keyword`)
- Added `parse_column_def_for_partition()` to handle optional column types in `PARTITIONED BY`
- Added `DatabricksDialect` to STRUCT type parsing
- Modified `parse_struct_field_def()` to accept an optional colon separator

Test plan
- Added tests for `OPTIMIZE` statement variations
- Added tests for `PARTITIONED BY` with/without column types
- Added tests for `STRUCT` type with colon syntax
- All tests pass (`cargo test`)