Categories: FAANG

Boost your log analysis with BigQuery vector search and LLMs

We recently launched BigQuery vector search, which enables semantic similarity search in BigQuery. This powerful capability can be used to analyze logs and asset metadata stored in BigQuery tables. For example, given a suspected log entry, your Site Reliability Engineering (SRE) or incident response (IR) team can search for semantically similar logs to validate whether that’s a true anomaly. These searches can be done proactively to auto-triage such anomalies, saving your users time on otherwise lengthy investigations. Furthermore, these relevant search results can provide more complete context for your large language model (LLM) to identify relationships, patterns and nuances that can otherwise be harder to capture with traditional schema-based querying alone. This can accelerate many user workflows — from threat detection and investigation, to network forensics, business insights and application troubleshooting.

In this blog, we provide an overview of how to use BigQuery’s new vector search capabilities against logs and asset metadata. First, we’ll present key components of this solution. Then we’ll go over a few real-world customer use cases on how to leverage BigQuery vector search over your logs and assets. We’ll also include pointers on how to cost-effectively deploy this at scale.

Vector search vs. text search

While traditional text search relies on matching exact words, or tokens, vector search identifies semantically similar records, by finding closely matching vector embeddings. A vector embedding is a high-dimensional numerical vector that encapsulates the semantic meaning of a given data record (text, image, audio or video file). In the case of logs or assets, the data record is either a log entry (e.g., audit, network, application) or a resource metadata (e.g., asset or policy config change).

BigQuery users can configure and manage search indexes and vector indexes for fast and efficient retrieval. While full-text search indexes yield deterministic results (full recall), vector search with a vector index returns approximate results. This is due to using the Approximate Nearest Neighbor (ANN) search technique, which offers better performance and lower cost in exchange for lower recall.

While text search helps with needle-in-haystack type searches, vector search helps with similarity and outlier-type searches. Both of these search techniques are commonly used for log analytics to answer questions such as the following: 

Questions using text search

Questions using vector search

  • Show all audit logs with the exact same principal email address.
  • Show all network logs with a specific source IP address.
  • Show application logs with a specific error message to help troubleshoot an outage.
  • Search data associated with a user to comply with regulations like GDPR.
  • Given this logged action, confirm if it’s abnormal, indicating a potential threat or accidental misconfiguration.

  • Given these network logs, confirm if this is abnormal traffic, indicating a misconfig or security issue.

  • Given this application usage log, confirm if this is an abnormal usage pattern, providing usage insights or indicating potential API abuse.

  • Given a resource or policy configuration change, confirm if it is abnormal, signaling  potential resource drift or a policy compliance issue.

  • Given a set of allowed rules, confirm if a resource or policy configuration doesn’t match any of the rules, and should therefore be flagged.

For the rest of this blog, we’ll focus on a few illustrative examples on how you can use vector search on logs data, including retrieval-augmented generation (RAG).

From semi-structured back to unstructured data

In addition to the classic challenges of big data (notably the 4 Vs of data — volume, variety, velocity and variability), analyzing logs involves making sense of their complex semi-structured format. Logs are often composed of deeply nested fields and non-uniform payloads like JSON. A typical organization can have billions of log entries with hundreds or thousands of different log payload formats.

To handle the massive volume of logs and to make them embedding-friendly, this solution employs two key steps as depicted in the above architecture diagram:

  1. Aggregation reduces data volume by combining multiple log entries into one. So instead of analyzing individual log entries, user actions can be grouped into a single entry per hour/day. For instance, a particular action done 20 times by a user on a given day (and therefore logged 20 times), can be coalesced into one record, achieving 20:1 reduction in space.

  2. NL Conversion: Additionally, converting verbose semi-structured JSON log payloads into simple natural language log summaries further reduces data volume. This conversion makes it easier to generate embeddings from the textual data, and to subsequently query with vector search. For instance, a 2KB audit log entry can be condensed into a 200-character sentence that semantically explains what this audit log entry means, achieving 10:1 additional reduction in space.

As an example, here are SQL samples (in Dataform SQLX) to process Cloud Audit Logs:

The natural language log summaries, alongside their embeddings, allow a subsequent LLM application to better understand and capture nuances that are otherwise harder to represent using structured or semi-structured raw data alone. We will demonstrate this in the next section. The raw logs can still be retained and referenced in the source log bucket or BigQuery dataset.

It’s important to note that the exact parameters for aggregation and conversion are ultimately dependent on your specific log type and use case. For example, the aggregation interval (10 min vs. 1 hr vs. 1 day) depends on your search lag-time requirements and allowable time-precision loss. Similarly, the conversion step requires carefully selecting the most relevant log fields to extract — this is typically done by a subject matter expert who’s familiar with that particular log type. Careful planning and configuration of both aggregation and conversion steps will help you strike the right balance between reducing log volume and maintaining the ability to identify meaningful patterns and insights.

Real-world applications

Now, let’s demonstrate how to use BigQuery SQL function VECTOR_SEARCH in SQL queries that you can try on your own. The examples below assume the logs are already preprocessed (aggregated, converted, and embedded) into a table of historical logs called actions_summary_embeddings, as implemented in this sample Dataform repo.

  • Example 1 uses vector search to validate potential anomalies against the historical logs. The test dataset (potential anomalies) is stored in table actions_summary_embeddings_suspected.

  • Example 2 uses vector search to find rule violations throughout our historical logs. Notice how the rule is expressed in natural language, and embedded dynamically.

  • Examples 3 and 4 are more complex uses of vector search: example 3 uses subsequent logic to triage anomalies based on nearest neighbor results like number and distance of matches, while example 4 uses nearest neighbor results to augment the LLM in-context prompt (RAG) and have the LLM such as Gemini reason and triage anomalies for us.

Note: Solutions built based on embeddings, vector search, and LLM inference are inherently probabilistic and approximate. Consider evaluating the recall rate of your vector searches and other metrics based on your higher-level task’s objectives and risk tolerance.

1) Validate anomalies using vector search (outlier detection)

code_block
<ListValue: [StructValue([(‘code’, “SELECTrn query.content as suspicious_action,rn base.content as past_similar_action,rn base.day as past_day, base.counter as past_counter,distancernFROM VECTOR_SEARCH(rn TABLE `[MY_PROJECT].[MY_DATASET].actions_summary_embeddings`,rn ‘text_embedding’,rn TABLE `[MY_PROJECT].[MY_DATASET].actions_summary_embeddings_suspected`,rn top_k => 5rn)rnWHERErn base.content != query.content — remove potential duplicatesrn AND base.day <= query.day;”), (‘language’, ”), (‘caption’, <wagtail.rich_text.RichText object at 0x3e59cf8a0040>)])]>

Example results:

2) Search policy violations using vector search (similarity search)

code_block
<ListValue: [StructValue([(‘code’, ‘WITH disallowed_rules AS (rn SELECTrn ml_generate_embedding_result AS text_embedding, contentrn FROM ML.GENERATE_EMBEDDING(rn MODEL `[MY_PROJECT].[MY_DATASET].text_embedding`,rn (SELECT “Principal X deleted or disabled a BigQuery dataset or table using Cloud Console” as content),rn STRUCT(TRUE AS flatten_json_output)rn )rn)rnSELECTrn query.content as rule_disallowed,rn base.content as non_compliant_action,rn base.day as day,rn base.counter as counter,rn distancernFROM (rn SELECT * FROM VECTOR_SEARCH(rn TABLE `[MY_PROJECT].[MY_DATASET].actions_summary_embeddings`,rn ‘text_embedding’,rn (SELECT * FROM disallowed_rules),rn top_k => 3rn )rn)rnORDER BY day DESC;’), (‘language’, ”), (‘caption’, <wagtail.rich_text.RichText object at 0x3e59cf8a04f0>)])]>

Example results:

3) Triage anomalies using vector search with rule-based logic

code_block
<ListValue: [StructValue([(‘code’, “SELECTrn suspicious_action,rn ANY_VALUE(day) as day,rn — Modify max distance threshold for your own use case and matching requirement (strict vs loose).rn — The distance between vector embeddings depends on the data and the embedding model used.rn — The max distance heuristic value used here is specific to this audit logs example and implementation (embeddings were generated with textembedding-gecko@003 model).rn SUM(IF(distance < 0.06, past_counter, 0)) as prior_total_occurences,rn — Flag High risk if there’s zero prior similar occurrence,rn — Medium risk if there’s one, Low risk for two or morern CASE SUM(IF(distance < 0.06, counter, 0))rn WHEN 0 THEN ‘HIGH’rn WHEN 1 THEN ‘MEDIUM’rn ELSE ‘LOW’rn END AS risk_factorrnFROM ( /*rn —-rn Copy/Paste same SQL query with VECTOR_SEARCH from Example 1rn —-rn*/ )rnGROUP BYrn suspicious_actionrnORDER BYrn prior_total_occurences ASC,rn day DESC;”), (‘language’, ”), (‘caption’, <wagtail.rich_text.RichText object at 0x3e59cf8a06d0>)])]>

Example results:

4) Triage anomalies using vector search with LLM reasoning (RAG)

code_block
<ListValue: [StructValue([(‘code’, ‘SELECT prompt, ml_generate_text_llm_result AS generatedrnFROM ML.GENERATE_TEXT(rn MODEL `[MY_PROJECT].[MY_DATASET].gemini_model`,rn (rn SELECT CONCAT(rn “You are a cloud administrator and a log forensics expert. “,rn “Determine if the following suspected administrative action is a high or low risk given the following prior history of valid administrative actions.”,rn “Include your reasons in a bullet list.”, “\n”,rn FORMAT(“New suspected administrative action:\n%s\n”, (SELECT content FROM suspected_admin_action LIMIT 1)),rn “Previous administrative actions:\n”,rn STRING_AGG(FORMAT(“- %s”, base.content),’.\n’)rn ) AS prompt,rn FROM ( /*rn —-rn Copy/Paste same SQL query with VECTOR_SEARCH from Example 1rn —-rn */ )rn ),rn STRUCT(rn 600 AS max_output_tokens, — increase tokens to account for explanationrn 0.1 AS temperature, — more deterministic responsern TRUE AS flatten_json_outputrn )rn;’), (‘language’, ”), (‘caption’, <wagtail.rich_text.RichText object at 0x3e59cf8a0e20>)])]>

Example output (exported to Google Sheets) showing the prompt enriched with vector search results and the generated LLM response:

Generated LLM response copied below for clarity:

**Low risk.**
– The principal (rarsan@google.com) has performed the same operation (google.logging.v2.ConfigServiceV2.DeleteSink) on the same logging sink (<redacted>) in the same project (<redacted>) multiple times in the past using Terraform.
– The IP address (<redacted>) used in the suspected administrative action is not significantly different from the IP addresses (<redacted> and <redacted>) used in previous administrative actions.
– The frequency of the suspected administrative action (1 time) is consistent with the frequency of previous administrative actions (1 time, 3 times, 1 time, and 1 time).

A note about scale, cost and performance

Vector search, like full-text search, employs a specialized index to enhance performance. By using a vector index, vector search queries can efficiently search embeddings, leading to faster response times and reduced data scanning or slot usage. However, this increased efficiency comes with the trade-off of potentially returning less precise results.

For example, consider a project with 50GB worth of audit logs over the last 12 months. Assuming 2KB average payload of an audit log entry, this translates to about 25M log records. By employing the aforementioned methods of reducing log volume, we can achieve a two-to-three orders of magnitude reduction in volume:

  • 25M log entries @ 2KB/each → 1.25M (aggregated) logs @ 200 bytes/each

  • 50GB raw logs → ~240MB log embeddings

That’s roughly the size of your vector index. This means fast (1-2 seconds) and cost-effective semantic search over the 12-months lookback.

Now, let’s assume the 50GB of logs are ingested daily. For a one-year retention, this translates to a total of 17TB of raw log data, but only 81.72GB of log embeddings. This is far less data volume to store and search in BigQuery using a fully-managed vector index.

Try BigQuery vector search today

In summary, BigQuery has evolved into a comprehensive platform offering not just robust data analytics, but also integrated full-text and vector search. This is particularly useful in the context of log analytics including detecting and investigating anomalies. Semantic search, powered by vector search, broadens the types of questions you can ask of your data and accelerates the time to get to insights. Additionally, by leveraging vector search for information retrieval within a RAG architecture, you can develop sophisticated LLM-powered applications that keep pace with the increasing volume of logs and potential security threats.

To experiment with BigQuery vector search over your audit logs, try this Log Anomaly Detection & Investigation notebook, which provides a step-by-step walkthrough of log preparation, querying, and visualization. To deploy the solution described in this blog, you can use this BigQuery Dataform repo which automates the end-to-end data pipeline. All the steps including log aggregation, conversion and embeddings generation are codified in SQL queries and periodically executed using Dataform scheduled workflow. Using Dataform is the fastest way to customize, deploy and evaluate this solution for your use case and for your logs in log analytics or BigQuery.

AI Generated Robotic Content

Recent Posts

5 Tools for Visualizing Machine Learning Models

Machine learning (ML) models are built upon data.

23 hours ago

AI Systems Governance through the Palantir Platform

Editor’s note: This is the second post in a series that explores a range of…

23 hours ago

Introducing Configurable Metaflow

David J. Berg*, David Casler^, Romain Cledat*, Qian Huang*, Rui Lin*, Nissan Pow*, Nurcan Sonmez*,…

23 hours ago

Arm lawsuit against Qualcomm ends in mistrial and favorable ruling for Qualcomm

Qualcomm did not violate a license with Arm when it acquired Nuvia for $1.4 billion,…

24 hours ago

2024 Was the Year the Bottom Fell Out of the Games Industry

From layoffs to the return of Gamergate, video games—and the people who make and play…

24 hours ago

Machine psychology: A bridge to general AI?

Artificial intelligence that is as intelligent as humans may become possible thanks to psychological learning…

24 hours ago