Amp, the new live radio app from Amazon, is a reinvention of radio featuring human-curated live audio shows. It’s designed to provide a seamless customer experience to listeners and creators by debuting interactive live audio shows from your favorite artists, radio DJs, podcasters, and friends.
However, as a new product in a new space for Amazon, Amp needed more relevant data to inform their decision-making process. Amp wanted a scalable data and analytics platform that would enable easy access to data, support machine learning (ML) experiments for live audio transcription, content moderation, feature engineering, and a personalized show recommendation service, and make it possible to inspect and measure business KPIs and metrics.
This post is the first in a two-part series. Part 1 shows how data was collected and processed using the data and analytics platform, and Part 2 shows how the data was used to create show recommendations using Amazon SageMaker, a fully managed ML service. The personalized show recommendation list service has shown a 3% boost to customer engagement metrics tracked (such as liking a show, following a creator, or enabling upcoming show notifications) since its launch in May 2022.
The data sources for Amp can be broadly categorized as either streaming (near-real time) or batch (point in time). The source data is emitted from Amp-owned systems or other Amazon systems. The two different data types are as follows:
The following diagram illustrates the high-level architecture.
The Amp data and analytics platform can be broken down into three high-level systems:
In the following sections, we discuss each component in more detail.
Amp created a serverless streaming ingestion pipeline capable of tapping into data from sources without the need for infrastructure management, as shown in the following diagram.
The pipeline was able to ingest the Amp show catalog data (what shows are available on Amp) and pass it to the data lake for two different use cases: one for near-real-time analytics, and one for batch analytics.
As part of the ingestion pipeline, the Amp team has an Amazon Simple Queue Service (Amazon SQS) queue that receives messages from an upstream Amazon Simple Notification Service (Amazon SNS) topic that contains information on changes to shows in the catalog. These changes could be the addition of new shows or adjustments to existing ones that have been scheduled.
When a message is received by the SQS queue, it triggers an AWS Lambda function that makes an API call to the Amp catalog service. The Lambda function retrieves the desired show metadata, filters it, and then sends the output metadata to Amazon Kinesis Data Streams. Kinesis Data Firehose receives the records from the data stream, invokes a secondary Lambda function to flatten the incoming JSON records, and writes the transformed records to an Amazon Simple Storage Service (Amazon S3) data lake for consumption by Amp stakeholders.
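As an illustration, a minimal sketch of the first Lambda function might look like the following. The catalog API URL, stream name, and metadata field names are assumptions for this example, not Amp's actual implementation.

```python
import json
import os
import urllib.request

import boto3

# CATALOG_API_URL and STREAM_NAME are illustrative environment variables, not Amp's real names.
kinesis = boto3.client("kinesis")
CATALOG_API_URL = os.environ["CATALOG_API_URL"]
STREAM_NAME = os.environ["STREAM_NAME"]


def handler(event, context):
    # Each SQS record carries the SNS notification about a change to a show in the catalog.
    for record in event["Records"]:
        body = json.loads(record["body"])
        show_id = body.get("showId")  # hypothetical field name

        # Call the catalog service for the full show metadata (illustrative endpoint).
        with urllib.request.urlopen(f"{CATALOG_API_URL}/shows/{show_id}") as resp:
            metadata = json.loads(resp.read())

        # Keep only the fields downstream consumers need before streaming them on.
        filtered = {
            key: metadata[key]
            for key in ("showId", "title", "creator", "scheduledTime")
            if key in metadata
        }

        kinesis.put_record(
            StreamName=STREAM_NAME,
            Data=json.dumps(filtered).encode("utf-8"),
            PartitionKey=str(show_id),
        )
```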
Kinesis Data Firehose enabled buffering and writing data to Amazon S3 every 60 seconds. This helped Amp teams make near-real-time programming decisions that impacted external customers.
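The secondary transformation Lambda function mentioned earlier follows the standard Kinesis Data Firehose record-transformation contract (base64-encoded records in, base64-encoded records out). The following is a minimal sketch of a flattening transform; the flattening logic itself is an assumption for illustration.

```python
import base64
import json


def _flatten(obj, parent_key="", sep="_"):
    """Recursively flatten nested dictionaries into a single-level record."""
    items = {}
    for key, value in obj.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            items.update(_flatten(value, new_key, sep))
        else:
            items[new_key] = value
    return items


def handler(event, context):
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        flattened = _flatten(payload)
        # Newline-delimited JSON so each record lands as a separate row in Amazon S3.
        data_out = (json.dumps(flattened) + "\n").encode("utf-8")
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(data_out).decode("utf-8"),
        })
    return {"records": output}
```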
The streaming ingestion pipeline supported the following objectives: performance, availability, scalability, and the flexibility to send data to multiple downstream applications or services.
Amp created a transient batch (point in time) ingestion pipeline capable of data ingestion, processing and transformation, and storage, as shown in the following diagram.
A transient extract, transform, and load (ETL) and extract, load, and transform (ELT) job approach was implemented because of the batch nature of these workloads and the unknown data volumes. As part of the workflow automation, Amazon SQS was used to trigger a Lambda function, which in turn activated the AWS Glue crawler to infer the schema and data types. The crawler wrote the schema metadata to the AWS Glue Data Catalog, providing a unified metadata store for data sharing.
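A minimal sketch of that crawler-triggering Lambda function is shown below; the crawler name is a placeholder and the error handling is illustrative.

```python
import boto3

glue = boto3.client("glue")

# Illustrative placeholder for the crawler that scans the raw-data bucket.
CRAWLER_NAME = "amp-raw-data-crawler"


def handler(event, context):
    # Triggered by the SQS message signaling that a new batch of raw data has landed.
    try:
        glue.start_crawler(Name=CRAWLER_NAME)
    except glue.exceptions.CrawlerRunningException:
        # The crawler is already running for a previous batch; the new data is picked up next run.
        pass
```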
The ETL and ELT jobs needed to run on either a set schedule or an event-driven workflow. To handle these needs, Amp used Amazon Managed Workflows for Apache Airflow (Amazon MWAA). Apache Airflow is an open-source, Python-based workflow management platform, and Amazon MWAA is a fully managed service that automatically handles scaling and provides sequencing, error handling, retry logic, and state. With Amazon MWAA, Amp was able to take advantage of the benefits of Airflow for job orchestration without having to manage or maintain dedicated Airflow servers. Additionally, by using Amazon MWAA, Amp was able to create a code repository and workflow pipeline stored in Amazon S3 that Amazon MWAA could access. The pipeline allowed Amp data engineers to easily deploy Airflow DAGs or PySpark scripts across multiple environments.
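For example, a DAG deployed to the S3 code repository could be as simple as the following skeleton; the DAG name, schedule, and task body are placeholders rather than Amp's actual pipeline.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def run_etl_step(**context):
    # Placeholder for the actual ETL or Spark submission logic the job would perform.
    print("Running batch ETL step")


# Illustrative DAG: the ID, schedule, and start date are assumptions for this sketch.
with DAG(
    dag_id="amp_batch_etl",
    schedule_interval="@daily",
    start_date=datetime(2022, 1, 1),
    catchup=False,
) as dag:
    etl = PythonOperator(task_id="run_etl_step", python_callable=run_etl_step)
```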
Amp used Amazon EMR on Amazon Elastic Kubernetes Service (Amazon EKS) to configure and manage containers for their data processing and transformation jobs. Due to the unique nature of the Amp service, the data volumes that would initially need to be processed were relatively unknown. To provide flexibility as the service evolved, the team chose Amazon EMR on EKS to eliminate the unnecessary operational overhead required to bootstrap and scale Amazon EMR for data processing. This approach allowed them to run transient hybrid EMR clusters backed by a mix of AWS Fargate and Amazon Elastic Compute Cloud (Amazon EC2) nodes, where all system tasks and workloads were offloaded to Fargate, while Amazon EC2 handled the Apache Spark processing and transformation. This provided the flexibility to run a cluster with a single node while the Amazon EKS auto scaler dynamically instantiated and bootstrapped any additional EC2 nodes required for the job. When the job was complete, these nodes were automatically removed by the cluster auto scaler. This pattern eliminated the need for the team to manage cluster bootstrap actions or the scaling required to respond to evolving workloads.
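The following sketch shows one way to submit such a transient Spark job through the EMR on EKS (emr-containers) API; the virtual cluster ID, role ARN, release label, and script location are placeholders.

```python
import boto3

emr_containers = boto3.client("emr-containers")

# All identifiers below are illustrative placeholders, not Amp's actual resources.
response = emr_containers.start_job_run(
    name="amp-show-catalog-transform",
    virtualClusterId="abc123examplecluster",
    executionRoleArn="arn:aws:iam::111122223333:role/amp-emr-eks-job-role",
    releaseLabel="emr-6.5.0-latest",
    jobDriver={
        "sparkSubmitJobDriver": {
            "entryPoint": "s3://amp-code-bucket/scripts/transform_catalog.py",
            "sparkSubmitParameters": (
                "--conf spark.executor.instances=2 "
                "--conf spark.executor.memory=4G"
            ),
        }
    },
)

# The returned job run ID can be used to poll for completion.
print(response["id"])
```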
Amazon S3 was used as the central data lake, and data was stored in Apache Parquet (Parquet) format. Parquet is a columnar format, which speeds up data retrieval and provides efficient data compression. Amazon S3 provided the flexibility, scalability, and security needs for Amp. With Amazon S3, the Amp team was able to centralize data storage in one location and federate access to the data virtually across any service or tool within or outside of AWS. The data lake was split into two S3 buckets: one for raw data ingestion and one for transformed data output. Amazon EMR performed the transformation from raw data to transformed data. With Amazon S3 as the central data lake, Amp was able to securely expose and share the data with other teams across Amp and Amazon.
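As a simple illustration, a PySpark job moving data from the raw bucket to the transformed bucket might look like the following; the bucket names, paths, and partition column are assumptions for this sketch.

```python
from pyspark.sql import SparkSession

# Bucket names mirror the two-bucket layout described above but are illustrative only.
spark = SparkSession.builder.appName("amp-raw-to-transformed").getOrCreate()

# Read the flattened JSON records landed by the ingestion pipeline.
raw_df = spark.read.json("s3://amp-raw-data-bucket/show_catalog/")

# Write the curated output as partitioned Parquet for efficient, columnar downstream reads.
(raw_df
    .write
    .mode("overwrite")
    .partitionBy("ingest_date")  # hypothetical partition column
    .parquet("s3://amp-transformed-data-bucket/show_catalog/"))
```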
To simplify data definition, table access provisioning, and the addition and removal of tables, they used AWS Glue crawlers and the AWS Glue Data Catalog. Because Amp is a new service and constantly evolving, the team needed a way to easily define, access, and manage the tables in the data lake. The crawlers handled data definition (including schema changes) and the addition and removal of tables, while the Data Catalog served as a unified metadata store.
The following diagram illustrates the architecture for the BI and analytics component.
Amp chose to store the data in the S3 data lake rather than in the data warehouse. This enabled them to access it in a unified manner through the AWS Glue Data Catalog and gave data consumers greater flexibility, resulting in faster data access across a variety of services and tools. Storing the data in Amazon S3 also reduced data warehouse infrastructure costs, because those costs are a function of the compute type and the amount of data stored.
The Amazon Redshift RA3 node type was used as the compute layer to enable stakeholders to query data stored in Amazon S3. Amazon Redshift RA3 nodes decouple storage and compute, and can query data in Amazon S3 through external schemas defined in the AWS Glue Data Catalog. RA3 nodes also introduce Amazon Redshift Managed Storage, which is backed by Amazon S3. The combination of these features enabled Amp to right-size the clusters and provide better query performance for their customers while minimizing costs.
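For example, data in the S3 data lake can be exposed to Amazon Redshift by creating an external schema from the AWS Glue Data Catalog. The following sketch runs that DDL through the Amazon Redshift Data API; the cluster, database, role, and schema names are placeholders.

```python
import boto3

redshift_data = boto3.client("redshift-data")

# All identifiers below are illustrative placeholders.
sql = """
CREATE EXTERNAL SCHEMA IF NOT EXISTS amp_data_lake
FROM DATA CATALOG
DATABASE 'amp_glue_database'
IAM_ROLE 'arn:aws:iam::111122223333:role/amp-redshift-spectrum-role';
"""

redshift_data.execute_statement(
    ClusterIdentifier="amp-analytics-cluster",
    Database="dev",
    DbUser="admin",  # temporary credentials are issued for this database user
    Sql=sql,
)
```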
Amazon Redshift configuration was automated using a Lambda function, which connected to a given cluster and ran parameterized SQL statements. The SQL statements contained the logic to deploy schemas, user groups, and users, while AWS Secrets Manager was used to automatically generate, store, and rotate Amazon Redshift user passwords. The underlying configuration variables were stored in Amazon DynamoDB. The Lambda function retrieved the variables and requested temporary Amazon Redshift credentials to perform the configuration. This process enabled the Amp team to set up Amazon Redshift clusters in a consistent manner.
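A simplified sketch of such a configuration Lambda function follows. It uses the Amazon Redshift Data API as one way to run SQL with temporary credentials; the table name, secret structure, and configuration keys are illustrative assumptions.

```python
import json

import boto3

dynamodb = boto3.resource("dynamodb")
secrets = boto3.client("secretsmanager")
redshift_data = boto3.client("redshift-data")

# Illustrative placeholder for the DynamoDB table holding configuration variables.
CONFIG_TABLE = "amp-redshift-config"


def handler(event, context):
    # Pull the per-cluster configuration variables stored in DynamoDB.
    config = dynamodb.Table(CONFIG_TABLE).get_item(
        Key={"cluster_id": event["cluster_id"]}
    )["Item"]

    # Fetch the generated password for the analyst user from Secrets Manager.
    secret = json.loads(
        secrets.get_secret_value(SecretId=config["analyst_secret_arn"])["SecretString"]
    )

    statements = [
        f"CREATE SCHEMA IF NOT EXISTS {config['schema_name']};",
        f"CREATE GROUP {config['analyst_group']};",
        f"CREATE USER {secret['username']} PASSWORD '{secret['password']}' "
        f"IN GROUP {config['analyst_group']};",
        f"GRANT USAGE ON SCHEMA {config['schema_name']} TO GROUP {config['analyst_group']};",
    ]
    for sql in statements:
        redshift_data.execute_statement(
            ClusterIdentifier=config["cluster_id"],
            Database=config["database"],
            DbUser=config["admin_user"],  # the Data API issues temporary credentials for this user
            Sql=sql,
        )
```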
Amp was able to achieve the following business outcomes:
By implementing a scalable, cost-efficient architecture, Amp was able to achieve the following:
In this post, we saw how Amp created their data and analytics platform using user behavioral data from streaming and batch data sources. The key factors that drove the implementation were the need for a flexible, scalable, cost-efficient, and effort-efficient data analytics platform. Design choices were made after evaluating various AWS services.
Part 2 of this series shows how we used this data and built out the personalized show recommendation list using SageMaker.
As next steps, we recommend doing a deep dive into each stage of your data pipeline system and making design choices that would be cost-effective and scalable for your needs. For more information, you can also check out other customer use cases in the AWS Analytics Blog.
If you have feedback about this post, submit it in the comments section.