The new Tower of Babel? Using multilingual embeddings and vector search in BigQuery

In today’s globalized marketplace, customers want to read reviews in their preferred language, but reviews are written in many languages, which makes them hard to search and understand. BigQuery is designed for managing and analyzing large datasets, including reviews. In this blog post, we present a solution that uses BigQuery multilingual embeddings, vector indexes, and vector search to let customers search for product or business reviews in their preferred language and receive results in that same language. Embeddings convert text into numerical vectors, so a search can match on meaning rather than on exact keywords, which improves the accuracy and relevance of the results.
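To make the idea of searching over numerical vectors concrete, here is a minimal Python sketch of ranking reviews by cosine similarity to a question. The three-dimensional vectors are made up for illustration; real embeddings come from an embedding model and have hundreds of dimensions. The review texts and all names in the code are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for embeddings of reviews in different languages.
review_vectors = {
    "Best egg tarts in Houston": [0.9, 0.1, 0.2],
    "Las mejores tartas de huevo": [0.8, 0.2, 0.3],
    "Slow oil change, rude staff": [0.1, 0.9, 0.1],
}

# Hypothetical embedding of a question about egg tarts, asked in any language.
query_vector = [0.85, 0.15, 0.25]

# Rank reviews from most to least similar to the question.
ranked = sorted(
    review_vectors,
    key=lambda text: cosine_similarity(query_vector, review_vectors[text]),
    reverse=True,
)

for text in ranked:
    score = cosine_similarity(query_vector, review_vectors[text])
    print(f"{score:.3f}  {text}")
```

The two bakery reviews rank above the unrelated one even though they share no keywords with each other, because the ranking depends only on how close the vectors are.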

To make the results easier to read and to add a further level of refinement, our solution also uses the Translation API, which is integrated into BigQuery, to translate reviews from various languages into the language of the user’s choice. This way, businesses can analyze and gain insights from reviews written in different languages, and users can read and understand reviews in their preferred language.

The architecture diagram below provides a visual representation of this solution.


Multilingual Review Insights with BigQuery, Multilingual Embeddings, Vector Search and Translation API

To illustrate, we extracted Google Local review data (including ratings, text, etc.) and business metadata (such as address, category, etc.) for Texas businesses through September 2021. This dataset includes reviews written in various languages. For customers who prefer to read reviews in their own language, our solution enables them to pose questions in their native language and receive the most relevant reviews in their preferred language, even if those reviews were initially written in a different language.

For instance, to explore Texas bakeries, we posed the question “Where can I find authentic Egg Tarts and Cantonese-style buns in Houston?” These two bakery items are distinctive and widely available in Asia but less common in Houston, making it challenging to locate pertinent reviews among thousands of business profiles. With our solution, users can ask the question in Chinese, and receive the most relevant results in Chinese, even if the reviews were originally written in English, Japanese, and so forth. Irrespective of the language used in the reviews, this solution aggregates the most relevant information and translates the reviews into the language requested by the user, significantly enhancing the user’s ability to extract valuable insights from reviews authored by individuals speaking different languages.

Before Translation:


After Translation in BigQuery: In the demo below, presented as a GIF, we showcase the search functionality in three languages: 

  • Chinese

  • English

  • Spanish

The BigQuery built-in functions used in this solution are shown below:

```sql
-- Empty backticks (``) are placeholders: substitute your own table and model paths.

-- Generate embeddings for the source data.
-- rating and category are passed through so the search step can return them.
CREATE OR REPLACE TABLE `` AS
(SELECT *
 FROM ML.GENERATE_EMBEDDING(
   MODEL ``,
   (SELECT CONCAT(extracted_text, ',', rating, ',', category) AS content,
           rating,
           category
    FROM ``)
 )
);

-- Create a vector index for vector search.
CREATE OR REPLACE VECTOR INDEX multilingual_review_index
ON ``(ml_generate_embedding_result)
OPTIONS(index_type = 'IVF',
        distance_type = 'COSINE',
        ivf_options = '{"num_lists":500}');

-- Check the information schema to confirm that the vector index was created.
SELECT table_name, index_name, index_status,
       coverage_percentage, last_refresh_time, disable_reason
FROM ``;

-- Run a vector search for your question.
-- fraction_lists_to_search = 0.08 scans about 8% of the index's 500 lists.
SELECT query.query, base.content, base.rating, base.category
FROM VECTOR_SEARCH(
  TABLE ``, 'ml_generate_embedding_result',
  (
    SELECT ml_generate_embedding_result, content AS query
    FROM ML.GENERATE_EMBEDDING(
      MODEL ``,
      (SELECT "休士頓哪裡有正宗的葡式蛋撻和港式麵包?" AS content))
  ),
  top_k => 10, options => '{"fraction_lists_to_search": 0.08}');

-- Use the Translation API to detect the language of the question.
SELECT
  ml_translate_result.languages[0].language_code AS target_language_code
FROM
  ML.TRANSLATE(MODEL ``, (
    SELECT "休士頓哪裡有正宗的葡式蛋撻和港式麵包?" AS text_content),
    STRUCT("detect_language" AS translate_mode));

-- Use the Translation API to translate reviews.
-- {txt_} and {lang_} are template placeholders filled in by the calling application.
SELECT
  text_content AS `Original Text`,
  "zh-CN" AS `Destination Language`,
  STRING(ml_translate_result.translations[0].translated_text) AS Translation
FROM ML.TRANSLATE(
  MODEL ``,
  (SELECT '{txt_}' AS text_content),
  STRUCT('translate_text' AS translate_mode, '{lang_}' AS target_language_code));
```
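The translation query above contains `{txt_}` and `{lang_}` placeholders, which suggests that the calling application fills them in before sending the SQL to BigQuery. The post does not show that step, so the following is only a sketch of one way to do it in Python. The function names, the language-code pattern, and the extra `{model}` placeholder (standing in for the model path that the original leaves blank) are all assumptions. The point it illustrates is that review text often contains apostrophes or line breaks, which must be escaped before being placed inside a single-quoted SQL string literal.

```python
import re

# Template mirroring the translation query. `{model}` is a hypothetical
# placeholder for the remote model path, which the original leaves blank.
TRANSLATE_SQL = """
SELECT
  text_content AS `Original Text`,
  STRING(ml_translate_result.translations[0].translated_text) AS Translation
FROM ML.TRANSLATE(
  MODEL `{model}`,
  (SELECT '{txt_}' AS text_content),
  STRUCT('translate_text' AS translate_mode, '{lang_}' AS target_language_code))
"""

# Assumed shape of a language code such as "en", "es", or "zh-CN".
LANGUAGE_CODE = re.compile(r"[A-Za-z]{2,3}(-[A-Za-z0-9]{2,8})*")


def escape_sql_string(text):
    """Escape text for use inside a single-quoted SQL string literal."""
    return (
        text.replace("\\", "\\\\")  # backslashes first, so later escapes survive
        .replace("'", "\\'")
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def build_translate_query(model, txt_, lang_):
    """Fill the template, rejecting values that could break out of the query."""
    if not LANGUAGE_CODE.fullmatch(lang_):
        raise ValueError(f"unexpected language code: {lang_!r}")
    if "`" in model:
        raise ValueError("model path must not contain backticks")
    return TRANSLATE_SQL.format(
        model=model, txt_=escape_sql_string(txt_), lang_=lang_
    )


# Example with a hypothetical model path and a review containing an apostrophe.
sql = build_translate_query(
    "my_project.my_dataset.translate_model", "It's the best bakery!", "zh-CN"
)
print(sql)
```

String templating is used here only because the original query is written that way. Where the statement supports them, BigQuery query parameters are generally a safer way to pass user-supplied text.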

Demonstration of the solution:


Multilingual Search on Review Datasets: Ask Questions and Get Results in Your Preferred Language with the Power of BigQuery!

With this solution, customers can search for and read reviews in their preferred language without language barriers. You could then extend the solution with Gemini to summarize or classify the retrieved reviews. You can also apply it to any product reviews, business reviews, or other multilingual datasets simply by adding a search feature, allowing users to get their questions answered in their language of choice. Give it a try and imagine how you can develop other valuable data and AI tools using BigQuery!