Categories: FAANG

How to do multivariate time series forecasting in BigQuery ML

1 Multivariate Time Series Forecasting.max 1000x1000 1

Companies across industries rely heavily on time series forecasting to project product demand, forecast sales, project online subscription/cancellation, and for many other use cases. This makes time series forecasting one of the most popular models in BigQuery ML.

What is multivariate time series forecasting? For example, if you want to forecast ice cream sales, it is helpful to forecast using the external covariant “weather” along with the target metric “past sales.” Multivariate time series forecasting in BigQuery lets you create more accurate forecasting models without having to move data out of BigQuery.

When it comes to time series forecasting, covariates or features besides the target time series are often used to provide better forecasting. Up until now, BigQuery ML has only supported univariate time series modeling using the ARIMA_PLUS model (documentation). It is one of the most popular BigQuery ML models.

While ARIMA_PLUS is widely used, forecasting using only the target variable is sometimes not sufficient. Some patterns inside the time series strongly depend on other features. We see strong customer demand for multivariate time series forecasting support that allows you to forecast using covariate and features.

We recently announced the public preview of multivariate time series forecasting with external regressors. We are introducing a new model type ARIMA_PLUS_XREG, where the XREG refers to external regressors or side features. You can use the SELECT statement to choose side features with the target time series. This new model leverages the BigQuery ML linear regression model to include the side features and the BigQuery ML ARIMA_PLUS model to model the linear regression residuals.

The ARIMA_PLUS_XREG model supports the following capabilities:

Automatic feature engineering for numerical, categorical, and array features.
All the model capabilities of the ARIMA_PLUS model, such as detecting seasonal trends, holidays, etc.

Headlight, an AI-powered ad agency, is using a multivariate forecasting model to determine conversion volumes for down-funnel metrics like subscriptions, cancellations, etc. based on cohort age. You can check out the customer video and demo here.

The following sections show some examples of the new ARIMA_PLUS_XREG model in BigQuery ML. In this example, we explore the bigquery-public-data.epa_historical_air_quality dataset, which has daily air quality and weather information. We use the model to forecast the PM_2.5¹ , based on its historical data and some covariates, such as temperature and wind speed.

An example: forecast Seattle’s air quality with weather information

Step 1. Create the dataset

The PM_2.5, temperature, and wind speed data are in separate tables. To simplify the queries, create a new table by joining those tables into a new table “bqml_test.seattle_air_quality_daily,” with the following columns:

date: the date of the observation
PM_2.5: the average PM_2.5 value for each day
wind_speed: the average wind speed for each day
temperature: the highest temperature for each day

The new table has daily data from 2009-08-11 to 2022-01-31.

code_block: [StructValue([(u’code’, u”CREATE TABLE `bqml_test.seattle_air_quality_daily`rnASrnWITHrn pm25_daily AS (rn SELECTrn avg(arithmetic_mean) AS pm25, date_local AS datern FROMrn `bigquery-public-data.epa_historical_air_quality.pm25_nonfrm_daily_summary`rn WHERErn city_name = ‘Seattle’rn AND parameter_name = ‘Acceptable PM2.5 AQI & Speciation Mass’rn GROUP BY date_localrn ),rn wind_speed_daily AS (rn SELECTrn avg(arithmetic_mean) AS wind_speed, date_local AS datern FROMrn `bigquery-public-data.epa_historical_air_quality.wind_daily_summary`rn WHERErn city_name = ‘Seattle’ AND parameter_name = ‘Wind Speed – Resultant’rn GROUP BY date_localrn ),rn temperature_daily AS (rn SELECTrn avg(first_max_value) AS temperature, date_local AS datern FROMrn `bigquery-public-data.epa_historical_air_quality.temperature_daily_summary`rn WHERErn city_name = ‘Seattle’ AND parameter_name = ‘Outdoor Temperature’rn GROUP BY date_localrn )rnSELECTrn pm25_daily.date AS date, pm25, wind_speed, temperaturernFROM pm25_dailyrnJOIN wind_speed_daily USING (date)rnJOIN temperature_daily USING (date)”), (u’language’, u”), (u’caption’, <wagtail.wagtailcore.rich_text.RichText object at 0x3e5f549d9a10>)])]

Here is a preview of the data:

Step 2. Create Model

The “CREATE MODEL” query of the new multivariate model, ARIMA_PLUS_XREG, is very similar to the current ARIMA_PLUS model. The major differences are the MODEL_TYPE and inclusion of feature columns in the SELECT statement.

code_block: [StructValue([(u’code’, u”CREATE OR REPLACErn MODELrn `bqml_test.seattle_pm25_xreg_model`rn OPTIONS (rn MODEL_TYPE = ‘ARIMA_PLUS_XREG’,rn time_series_timestamp_col = ‘date’,rn time_series_data_col = ‘pm25’)rnASrnSELECTrn date,rn pm25,rn temperature,rn wind_speedrnFROMrn `bqml_test.seattle_air_quality_daily`rnWHERErn datern BETWEEN DATE(‘2012-01-01’)rn AND DATE(‘2020-12-31′)”), (u’language’, u”), (u’caption’, <wagtail.wagtailcore.rich_text.RichText object at 0x3e5f56c47b10>)])]

Step 3. Forecast the future data

With the created model, you can use the ML.FORECAST function to forecast the future data. Compared to the ARIMA_PLUS model, you have to specify the future covariates as an input.

code_block: [StructValue([(u’code’, u”SELECTrn *rnFROMrn ML.FORECAST(rn MODELrn `bqml_test.seattle_pm25_xreg_model`,rn STRUCT(30 AS horizon),rn (rn SELECTrn date,rn temperature,rn wind_speedrn FROMrn `bqml_test.seattle_air_quality_daily`rn WHERErn date > DATE(‘2020-12-31′)rn ))”), (u’language’, u”), (u’caption’, <wagtail.wagtailcore.rich_text.RichText object at 0x3e5f45f1a390>)])]

After running the above query, you can see the forecasting results:

Step 4. Evaluate the model

You can use the ML.EVALUATE function to evaluate the forecasting errors. You can set perform_aggregation to “TRUE” to get the aggregated error metric or “FALSE” to see the per timestamp errors.

code_block: [StructValue([(u’code’, u”SELECTrn *rnFROMrn ML.EVALUATE(rn MODEL `bqml_test.seattle_pm25_xreg_model`,rn (rn SELECTrn date,rn pm25,rn temperature,rn wind_speedrn FROMrn `bqml_test.seattle_air_quality_daily`rn WHERErn date > DATE(‘2020-12-31′)rn ),rn STRUCT(rn TRUE AS perform_aggregation,rn 30 AS horizon))”), (u’language’, u”), (u’caption’, <wagtail.wagtailcore.rich_text.RichText object at 0x3e5f3abab450>)])]

The evaluation result of ARIMA_PLUS_XREG is as follows:

As a comparison, we also show the univariate forecasting ARIMA_PLUS result in the following table:

Compared to ARIMA_PLUS, ARIMA_PLUS_XREG performs better on all measured metrics on this specific dataset and date range.

Conclusion

In the previous example, we demonstrated how to create a multivariate time series forecasting model, forecast future values using the model, and evaluate the forecasted results. The ML.ARIMA_EVALUATE and ML.ARIMA_COEFFICIENTS table value functions are also helpful for investigating your model. Based on the feedback from users, the model does the following to improve user productivity.

It shortens the time spent preprocessing data and lets users keep their data in BigQuery when doing machine learning.
It reduces overhead for the users who know SQL to do machine learning work in BigQuery.

For more information about the ARIMA_PLUS_XREG model, please see thedocumentation here.

What’s Next?

In this blogpost, we described the BigQuery ML Multivariate Time Series Forecast model, which is now available for public preview. We also showed a code demo for a data scientist, data engineer, or data analyst to enable the multivariate time series forecast model.

The following features are coming soon:

Large-scale multivariate time series, i.e., training millions of models for millions of multivariate time series in a single CREATE MODEL statement
Multivariate time series anomaly detection

^{Thanks to Xi Cheng, Honglin Zheng, Jiashang Liu, Amir Hormati, Mingge Deng and Abhinav Khushraj from the BigQuery ML team. Also thanks to Weijie Shen from the Google Resource Efficiency Data Science team.}

^{1. A measure of air pollution from fine particulate matter}

AI Generated Robotic Content