
Intelligent code conversion: Databricks Spark SQL to BigQuery SQL via Gemini

As data platforms evolve and businesses diversify their cloud ecosystems, the need to migrate SQL workloads between engines is becoming increasingly common. Recently, I had the opportunity to work on translating a set of Databricks SQL queries to BigQuery SQL — a task that is deceptively complex due to differences in syntax, functions, and execution behavior.

To streamline the process, we turned to Google Gemini, a powerful AI assistant, to help bridge the gap between the two SQL dialects. In this blog post, I’ll walk you through the process, challenges we faced, how Gemini helped, and key takeaways from the experience.

The translation tightrope: Why it’s tricky

To boost operational efficiency and cut costs, we migrated analytics workloads from Databricks SQL (on Delta Lake tables) to Google BigQuery. This required rewriting numerous queries, from simple aggregations to intricate CTEs and window functions.

Databricks, with its powerful Spark SQL capabilities, and BigQuery, a serverless and highly scalable data warehouse, are both titans in the data world. However, their SQL dialects, while sharing common ANSI SQL foundations, have distinct variations. Translating between the two manually was possible, but would have been time-consuming and error-prone. This is where Google Gemini played a crucial role.

Below are some of the data type mappings between Databricks and BigQuery:

| Category | Databricks SQL Data Type | BigQuery SQL Data Type | Description |
| --- | --- | --- | --- |
| Integer Types | TINYINT | INT64 | 8-bit integer |
| | SMALLINT | INT64 | 16-bit integer |
| | INT or INTEGER | INT64 | 32-bit integer |
| | BIGINT | INT64 | 64-bit integer |
| Floating-Point Types | FLOAT or REAL | FLOAT64 | 32-bit floating point, widened to BigQuery's 64-bit FLOAT64 |
| | DOUBLE | FLOAT64 | 64-bit floating point (equivalent to BigQuery's FLOAT64) |
| Decimal/Exact Types | DECIMAL or NUMERIC | NUMERIC or BIGNUMERIC | Fixed-point decimal with user-defined precision and scale; BigQuery's extended BIGNUMERIC offers larger precision |
| Boolean Types | BOOLEAN | BOOL | True or False |
| String Types | STRING or VARCHAR | STRING | Variable-length string |
| | CHAR | Not supported | Fixed-length strings are not directly supported in BigQuery; use STRING instead |
| Date and Time Types | DATE | DATE | Calendar date (year, month, day) |
| | TIMESTAMP | TIMESTAMP | Timestamp with time zone information |
| | DATETIME | DATETIME | Timestamp without time zone |
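
As a minimal sketch of how these mappings apply in practice (the events table, its columns, and the mydataset dataset are hypothetical), a Databricks table definition and its BigQuery translation might look like this:

```
-- Databricks SQL (source DDL)
CREATE TABLE events (
  id        BIGINT,
  clicks    SMALLINT,
  score     DOUBLE,
  amount    DECIMAL(18, 2),
  is_active BOOLEAN,
  label     STRING,
  event_ts  TIMESTAMP
);

-- BigQuery SQL (translated DDL, using the mappings above)
CREATE TABLE mydataset.events (
  id        INT64,          -- BIGINT -> INT64
  clicks    INT64,          -- SMALLINT -> INT64 (widened)
  score     FLOAT64,        -- DOUBLE -> FLOAT64
  amount    NUMERIC(18, 2), -- DECIMAL -> NUMERIC
  is_active BOOL,           -- BOOLEAN -> BOOL
  label     STRING,         -- STRING -> STRING
  event_ts  TIMESTAMP       -- TIMESTAMP -> TIMESTAMP
);
```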

Syntax differences between Databricks and BigQuery

FIRST_VALUE:

Databricks

```
first_value(expr[, ignoreNull]) [FILTER ( WHERE cond ) ]
```

BigQuery

```
FIRST_VALUE (value_expression [{RESPECT | IGNORE} NULLS])
OVER over_clause

over_clause:
  { named_window | ( [ window_specification ] ) }

window_specification:
  [ named_window ]
  [ PARTITION BY partition_expression [, …] ]
  ORDER BY expression [ { ASC | DESC } ] [, …]
  [ window_frame_clause ]
```
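
To make the difference concrete, here is a minimal usage sketch (the orders table and its columns are hypothetical). Databricks passes the null-handling flag as a Boolean argument, while BigQuery expresses it with the IGNORE NULLS keyword:

```
-- Databricks SQL: second argument `true` means ignore NULLs
SELECT first_value(amount, true)
         OVER (PARTITION BY customer_id ORDER BY order_ts) AS first_amount
FROM orders;

-- BigQuery SQL: equivalent query with IGNORE NULLS
SELECT FIRST_VALUE(amount IGNORE NULLS)
         OVER (PARTITION BY customer_id ORDER BY order_ts) AS first_amount
FROM orders;
```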

In particular, working with H3 geospatial functions can present unique translation hurdles. Our resources provide clear mappings, like these:

| Databricks Function | BigQuery Equivalent Function | Description |
| --- | --- | --- |
| h3_boundaryasgeojson(h3CellIdExpr) | ST_ASGEOJSON(jslibs.h3.ST_H3_BOUNDARY(h3CellIdExpr)) | Returns the polygonal boundary of the input H3 cell in GeoJSON format. |
| h3_boundaryaswkb(h3CellIdExpr) | ST_ASBINARY(jslibs.h3.ST_H3_BOUNDARY(h3CellIdExpr)) | Returns the polygonal boundary of the input H3 cell in WKB format. |
| h3_boundaryaswkt(h3CellIdExpr) | ST_ASTEXT(jslibs.h3.ST_H3_BOUNDARY(h3CellIdExpr)) | Returns the polygonal boundary of the input H3 cell in WKT format. |

Providing precise details for complex functions like these is crucial. In fact, we've found that detailing these H3 translations helps even advanced AI models like Gemini generate more accurate and reliable BigQuery SQL from your original Databricks Spark SQL, ensuring your geospatial analyses remain intact.
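
For instance, the WKT variant translates as follows (a sketch; the cells table and h3_cell_id column are hypothetical, and the BigQuery side relies on the community jslibs.h3 UDFs referenced above):

```
-- Databricks SQL
SELECT h3_boundaryaswkt(h3_cell_id) AS boundary_wkt
FROM cells;

-- BigQuery SQL
SELECT ST_ASTEXT(jslibs.h3.ST_H3_BOUNDARY(h3_cell_id)) AS boundary_wkt
FROM cells;
```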

Architecture overview

Before diving into the translation logic, let me show you how the pieces fit together.

Pipeline components

Source SQL Storage:

  • All original Databricks SQL files were stored in Google Cloud Storage.

Function mapping guide:

  • A curated guide that maps Databricks-specific SQL functions (e.g., first_value, ucase) to their BigQuery equivalents (FIRST_VALUE, UPPER, TIMESTAMP, etc.)

  • This guide included examples and syntax rules, which were used as input context for Gemini.

Few-shot examples:

  • I selected a set of hand-translated queries to serve as high-quality training prompts to improve Gemini’s consistency; one such pair is sketched below.
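
For illustration, a minimal sketch of such a hand-translated pair, built from the ucase-to-UPPER mapping above (the customers table and its columns are invented):

```
-- Databricks SQL (source)
SELECT ucase(name) AS name_upper
FROM customers
WHERE active = true;

-- BigQuery SQL (hand-translated target)
SELECT UPPER(name) AS name_upper
FROM customers
WHERE active = true;
```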

Retrieval-Augmented Generation (RAG) layer:

  • Before querying Gemini, I leveraged the Vertex AI RAG Engine to retrieve relevant function mappings and example translations. This ensured Gemini had grounded knowledge, improving the accuracy of the output.

Gemini API integration:

  • The RAG-enriched prompt was sent to Gemini for translation.

  • Returned SQL was optionally post-processed to fix edge cases, as in the sketch below.
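
As a hypothetical example of such an edge case, Databricks’ nvl() has no direct BigQuery counterpart, so a post-processing pass might rewrite it to IFNULL():

```
-- Gemini output that still contains a Databricks-ism
SELECT nvl(discount, 0) AS discount FROM mydataset.orders;

-- After post-processing: nvl() rewritten to BigQuery's IFNULL()
SELECT IFNULL(discount, 0) AS discount FROM mydataset.orders;
```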

Validation layer:

  • Translated SQL queries were validated by executing them in BigQuery’s dry run mode to detect syntax issues.

Architecture diagram

Lessons learned

  1. RAG + Gemini = Smart SQL translation: Grounding Gemini with real-world examples and mapping logic made it significantly more accurate.

  2. A comprehensive function mapping guide is essential: Invest time in building a robust function mapping resource.

  3. Thorough validation is key: Use BigQuery’s dry run and INFORMATION_SCHEMA views to ensure translated queries are safe and optimized; one such check is sketched below.
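
For example, a minimal sketch of an INFORMATION_SCHEMA audit (the region qualifier and one-day window are assumptions):

```
-- BigQuery: review recent queries by bytes processed to spot unoptimized translations
SELECT query, total_bytes_processed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE job_type = 'QUERY'
  AND creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
ORDER BY total_bytes_processed DESC
LIMIT 10;
```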

Ready to streamline your SQL migrations?

Stop wrestling with SQL syntax and start leveraging the power of your data, wherever it resides. With the Gemini model, we can streamline your Databricks Spark SQL to BigQuery SQL translation process, making it faster, more reliable, and far less painful.

Dive in and accelerate your journey to cross-platform data success.
