
Intelligent code conversion: Databricks Spark SQL to BigQuery SQL via Gemini

As data platforms evolve and businesses diversify their cloud ecosystems, the need to migrate SQL workloads between engines is becoming increasingly common. Recently, I had the opportunity to work on translating a set of Databricks SQL queries to BigQuery SQL — a task that is deceptively complex due to differences in syntax, functions, and execution behavior.

To streamline the process, we turned to Google Gemini, a powerful AI assistant, to help bridge the gap between the two SQL dialects. In this blog post, I’ll walk you through the process, challenges we faced, how Gemini helped, and key takeaways from the experience.

The translation tightrope: Why it’s tricky

To boost operational efficiency and cut costs, we migrated analytics workloads from Databricks SQL (on Delta Lake tables) to Google BigQuery. This required rewriting numerous queries, from simple aggregations to intricate CTEs and window functions.

Databricks, with its powerful Spark SQL capabilities, and BigQuery, a serverless and highly scalable data warehouse, are both titans in the data world. However, their SQL dialects, while sharing common ANSI SQL foundations, have distinct variations. Translating between the two manually was possible, but would have been time-consuming and error-prone. This is where Google Gemini played a crucial role.

Below are some of the data type mappings between Databricks and BigQuery:

| Category | Databricks SQL Data Type | BigQuery SQL Data Type | Description |
| --- | --- | --- | --- |
| Integer Types | TINYINT | INT64 | 8-bit integer |
| | SMALLINT | INT64 | 16-bit integer |
| | INT or INTEGER | INT64 | 32-bit integer |
| | BIGINT | INT64 | 64-bit integer |
| Floating-Point Types | FLOAT or REAL | FLOAT64 | 32-bit floating point, widened to BigQuery's 64-bit FLOAT64 |
| | DOUBLE | FLOAT64 | 64-bit floating point (equivalent to BigQuery's FLOAT64) |
| Decimal/Exact Types | DECIMAL or NUMERIC | NUMERIC or BIGNUMERIC | Fixed-point decimal with user-defined precision and scale; BigQuery's extended BIGNUMERIC offers larger precision |
| Boolean Types | BOOLEAN | BOOL | True or False |
| String Types | STRING or VARCHAR | STRING | Variable-length string |
| | CHAR | Not supported | Fixed-length strings are not directly supported in BigQuery; use STRING instead |
| Date and Time Types | DATE | DATE | Calendar date (year, month, day) |
| | TIMESTAMP | TIMESTAMP | Timestamp with time zone information |
| | DATETIME | DATETIME | Timestamp without time zone |
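
As a minimal sketch of how these mappings apply in practice (the events table, its columns, and the mydataset dataset are hypothetical), a Databricks table definition and its BigQuery translation might look like this:

```
-- Databricks SQL (source DDL)
CREATE TABLE events (
  id        BIGINT,
  clicks    SMALLINT,
  score     DOUBLE,
  amount    DECIMAL(18, 2),
  is_active BOOLEAN,
  label     STRING,
  event_ts  TIMESTAMP
);

-- BigQuery SQL (translated DDL, using the mappings above)
CREATE TABLE mydataset.events (
  id        INT64,          -- BIGINT -> INT64
  clicks    INT64,          -- SMALLINT -> INT64 (widened)
  score     FLOAT64,        -- DOUBLE -> FLOAT64
  amount    NUMERIC(18, 2), -- DECIMAL -> NUMERIC
  is_active BOOL,           -- BOOLEAN -> BOOL
  label     STRING,         -- STRING -> STRING
  event_ts  TIMESTAMP       -- TIMESTAMP -> TIMESTAMP
);
```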

Syntax differences between Databricks and BigQuery

FIRST_VALUE:

Databricks

```
first_value(expr[, ignoreNull]) [FILTER ( WHERE cond ) ]
```

BigQuery

```
FIRST_VALUE (value_expression [{RESPECT | IGNORE} NULLS])
OVER over_clause

over_clause:
  { named_window | ( [ window_specification ] ) }

window_specification:
  [ named_window ]
  [ PARTITION BY partition_expression [, …] ]
  ORDER BY expression [ { ASC | DESC } ] [, …]
  [ window_frame_clause ]
```
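
To make the difference concrete, here is a minimal usage sketch (the orders table and its columns are hypothetical). Databricks passes the null-handling flag as a Boolean argument, while BigQuery expresses it with the IGNORE NULLS keyword:

```
-- Databricks SQL: second argument `true` means ignore NULLs
SELECT first_value(amount, true)
         OVER (PARTITION BY customer_id ORDER BY order_ts) AS first_amount
FROM orders;

-- BigQuery SQL: equivalent query with IGNORE NULLS
SELECT FIRST_VALUE(amount IGNORE NULLS)
         OVER (PARTITION BY customer_id ORDER BY order_ts) AS first_amount
FROM orders;
```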

In particular, working with H3 geospatial functions can present unique translation hurdles. Our resources provide clear mappings, like these:

| Databricks Function | BigQuery Equivalent Function | Description |
| --- | --- | --- |
| h3_boundaryasgeojson(h3CellIdExpr) | ST_ASGEOJSON(jslibs.h3.ST_H3_BOUNDARY(h3CellIdExpr)) | Returns the polygonal boundary of the input H3 cell in GeoJSON format. |
| h3_boundaryaswkb(h3CellIdExpr) | ST_ASBINARY(jslibs.h3.ST_H3_BOUNDARY(h3CellIdExpr)) | Returns the polygonal boundary of the input H3 cell in WKB format. |
| h3_boundaryaswkt(h3CellIdExpr) | ST_ASTEXT(jslibs.h3.ST_H3_BOUNDARY(h3CellIdExpr)) | Returns the polygonal boundary of the input H3 cell in WKT format. |

Providing precise details for complex functions like these is crucial. In fact, we've found that detailing these H3 translations helps even advanced AI models like Gemini generate more accurate and reliable BigQuery SQL from your original Databricks Spark SQL, ensuring your geospatial analyses remain intact.
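
For instance, the WKT variant translates as follows (a sketch; the cells table and h3_cell_id column are hypothetical, and the BigQuery side relies on the community jslibs.h3 UDFs referenced above):

```
-- Databricks SQL
SELECT h3_boundaryaswkt(h3_cell_id) AS boundary_wkt
FROM cells;

-- BigQuery SQL
SELECT ST_ASTEXT(jslibs.h3.ST_H3_BOUNDARY(h3_cell_id)) AS boundary_wkt
FROM cells;
```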

Architecture overview

Before diving into the translation logic, let me show you how the pieces fit together.

Pipeline components

Source SQL Storage:

  • All original Databricks SQL files were stored in Google Cloud Storage.

Function mapping guide:

  • A curated guide that maps Databricks-specific SQL functions (e.g., first_value, ucase) to their BigQuery equivalents (FIRST_VALUE, UPPER, TIMESTAMP, etc.)

  • This guide included examples and syntax rules, which were used as input context for Gemini.

Few-shot examples:

  • I selected a set of hand-translated queries to serve as high-quality training prompts to improve Gemini’s consistency; one such pair is sketched below.
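
For illustration, a minimal sketch of such a hand-translated pair, built from the ucase-to-UPPER mapping above (the customers table and its columns are invented):

```
-- Databricks SQL (source)
SELECT ucase(name) AS name_upper
FROM customers
WHERE active = true;

-- BigQuery SQL (hand-translated target)
SELECT UPPER(name) AS name_upper
FROM customers
WHERE active = true;
```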

Retrieval-Augmented Generation (RAG) layer:

  • Before querying Gemini, I leveraged the Vertex AI RAG Engine to retrieve relevant function mappings and example translations. This ensured Gemini had grounded knowledge, improving the accuracy of the output.

Gemini API integration:

  • The RAG-enriched prompt was sent to Gemini for translation.

  • Returned SQL was optionally post-processed to fix edge cases, as in the sketch below.
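
As a hypothetical example of such an edge case, Databricks’ nvl() has no direct BigQuery counterpart, so a post-processing pass might rewrite it to IFNULL():

```
-- Gemini output that still contains a Databricks-ism
SELECT nvl(discount, 0) AS discount FROM mydataset.orders;

-- After post-processing: nvl() rewritten to BigQuery's IFNULL()
SELECT IFNULL(discount, 0) AS discount FROM mydataset.orders;
```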

Validation layer:

  • Translated SQL queries were validated by executing them in BigQuery’s dry run mode to detect syntax issues.

Architecture diagram

Lessons learned

  1. RAG + Gemini = Smart SQL translation: Grounding Gemini with real-world examples and mapping logic made it significantly more accurate.

  2. A comprehensive function mapping guide is essential: Invest time in building a robust function mapping resource.

  3. Thorough validation is key: Use BigQuery’s dry run and INFORMATION_SCHEMA views to ensure translated queries are safe and optimized; one such check is sketched below.
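
For example, a minimal sketch of an INFORMATION_SCHEMA audit (the region qualifier and one-day window are assumptions):

```
-- BigQuery: review recent queries by bytes processed to spot unoptimized translations
SELECT query, total_bytes_processed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE job_type = 'QUERY'
  AND creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
ORDER BY total_bytes_processed DESC
LIMIT 10;
```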

Ready to streamline your SQL migrations?

Stop wrestling with SQL syntax and start leveraging the power of your data, wherever it resides. With the Gemini model, we can streamline your Databricks Spark SQL to BigQuery SQL translation process, making it faster, more reliable, and far less painful.

Dive in and accelerate your journey to cross-platform data success.
