This post was written with NVIDIA, and the authors would like to thank Adi Margolin, Eliuth Triana, and Maryam Motamedi for their collaboration.
Organizations today face the challenge of processing large volumes of audio data, from customer calls and meeting recordings to podcasts and voice messages, to unlock valuable insights. Automatic Speech Recognition (ASR) is a critical first step in this process, converting speech to text so that further analysis can be performed. However, running ASR at scale is computationally intensive and can be expensive. This is where asynchronous inference on Amazon SageMaker AI comes in. By deploying state-of-the-art ASR models (like NVIDIA Parakeet models) on SageMaker AI with asynchronous endpoints, you can handle large audio files and batch workloads efficiently. With asynchronous inference, long-running requests can be processed in the background (with results delivered later); it also supports auto-scaling to zero when there’s no work and handles spikes in demand without blocking other jobs.
In this blog post, we’ll explore how to host the NVIDIA Parakeet ASR model on SageMaker AI and integrate it into an asynchronous pipeline for scalable audio processing. We’ll also highlight the benefits of Parakeet’s architecture and the NVIDIA Riva toolkit for speech AI, and discuss how to use NVIDIA NIM for deployment on AWS.
NVIDIA offers a comprehensive suite of speech AI technologies, combining high-performance models with efficient deployment solutions. At its core, the Parakeet ASR model family represents state-of-the-art speech recognition capabilities, achieving industry-leading accuracy with low word error rates (WERs). The model’s architecture pairs a Fast Conformer encoder with a CTC or transducer decoder, enabling 2.4× faster processing than standard Conformer models while maintaining accuracy.
NVIDIA speech NIM is a collection of GPU-accelerated microservices for building customizable speech AI applications. NVIDIA Speech models deliver accurate transcription and natural, expressive voices in over 36 languages, making them ideal for customer service, contact centers, accessibility, and global enterprise workflows. Developers can fine-tune and customize models for specific languages, accents, domains, and vocabularies, supporting accuracy and brand voice alignment.
Seamless integration with LLMs and the NVIDIA NeMo Retriever makes NVIDIA models ideal for agentic AI applications, helping your organization stand out with more secure, high-performing voice AI. The NIM framework delivers these services as containerized solutions, making deployment straightforward through Docker containers that include the necessary dependencies and optimizations.
This combination of high-performance models and deployment tools provides organizations with a complete solution for implementing speech recognition at scale.
The architecture illustrated in the diagram showcases a comprehensive asynchronous inference pipeline designed specifically for ASR and summarization workloads. The solution provides a robust, scalable, and cost-effective processing pipeline.
The architecture consists of five key components working together to create an efficient audio processing pipeline. At its core, the SageMaker AI asynchronous endpoint hosts the Parakeet ASR model with auto scaling capabilities that can scale to zero when idle for cost optimization.
In this section, we provide a detailed walkthrough of the solution implementation.
To run the example notebooks, you need an AWS account with an AWS Identity and Access Management (IAM) role that has least-privilege permissions to manage the resources created. For details, refer to Create an AWS account. You might need to request a service quota increase for the corresponding SageMaker asynchronous hosting instances. In this example, we need one ml.g5.xlarge SageMaker asynchronous hosting instance and one ml.g5.xlarge SageMaker notebook instance. You can also choose a different integrated development environment (IDE), but make sure the environment has GPU compute resources for local testing.
When you deploy a custom model like Parakeet, SageMaker AI gives you a few options: bring your own container (BYOC) with the NVIDIA NIM image, use a Large Model Inference (LMI) container, or extend a prebuilt SageMaker PyTorch Deep Learning Container (DLC). We’ll provide examples for all three approaches.
NVIDIA NIM provides a streamlined approach to deploying optimized AI models through containerized solutions. Our implementation takes this concept further by creating a unified SageMaker AI endpoint that intelligently routes between HTTP and gRPC protocols to help maximize both performance and capabilities while simplifying the deployment process.
Innovative dual-protocol architecture
The key innovation is the combined HTTP + gRPC architecture that exposes a single SageMaker AI endpoint with intelligent routing capabilities. This design addresses the common challenge of choosing between protocol efficiency and feature completeness by automatically selecting the optimal transport method. The HTTP route is optimized for simple transcription tasks with files under 5MB, providing faster processing and lower latency for common use cases. Meanwhile, the gRPC route supports larger files (SageMaker AI real-time endpoints support a max payload of 25MB) and advanced features like speaker diarization with precise word-level timing information. The system’s auto-routing functionality analyzes incoming requests to determine file size and requested features, then automatically selects the most appropriate protocol without requiring manual configuration. For applications that need explicit control, the endpoint also supports forced routing through /invocations/http for simple transcription or /invocations/grpc when speaker diarization is required. This flexibility allows both automated optimization and fine-grained control based on specific application requirements.
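The routing decision itself can be expressed in a few lines. The following is a minimal sketch of that logic, assuming the 5MB threshold described above; the function name, parameters, and forced-path strings are illustrative rather than the container's actual implementation.

```python
from typing import Optional

# Illustrative sketch of the auto-routing decision; names are hypothetical.
HTTP_MAX_BYTES = 5 * 1024 * 1024  # simple transcriptions under 5 MB go over HTTP


def choose_route(audio_bytes: bytes,
                 wants_diarization: bool,
                 forced_path: Optional[str] = None) -> str:
    """Return "http" or "grpc" for an incoming /invocations request."""
    # Explicit override via /invocations/http or /invocations/grpc.
    if forced_path == "/invocations/http":
        return "http"
    if forced_path == "/invocations/grpc":
        return "grpc"
    # Speaker diarization and word-level timing are only served over gRPC.
    if wants_diarization:
        return "grpc"
    # Small files take the lower-latency HTTP route; larger ones go to gRPC.
    return "http" if len(audio_bytes) < HTTP_MAX_BYTES else "grpc"
```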
Advanced speech recognition and speaker diarization capabilities
The NIM container enables a comprehensive audio processing pipeline that seamlessly combines speech recognition with speaker identification through the NVIDIA Riva integrated capabilities. The container handles audio preprocessing, including format conversion and segmentation, while ASR and speaker diarization processes run concurrently on the same audio stream. Results are automatically aligned using overlapping time segments, with each transcribed segment receiving appropriate speaker labels (for example, Speaker_0, Speaker_1). The inference handler processes audio files through the complete pipeline, initializing both ASR and speaker diarization services, running them in parallel, and aligning transcription segments with speaker labels. The output includes the full transcription, timestamped segments with speaker attribution, confidence scores, and total speaker count in a structured JSON format.
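The exact response schema depends on the container build, but an aligned result has roughly the following shape; the field names below are illustrative assumptions, not an authoritative contract.

```python
# Hypothetical shape of an aligned ASR + diarization result (field names are illustrative).
example_result = {
    "transcription": "Thanks for calling, how can I help you today? "
                     "Hi, I'd like to check my order status.",
    "segments": [
        {"start": 0.00, "end": 2.10, "speaker": "Speaker_0",
         "text": "Thanks for calling, how can I help you today?", "confidence": 0.97},
        {"start": 2.40, "end": 4.85, "speaker": "Speaker_1",
         "text": "Hi, I'd like to check my order status.", "confidence": 0.95},
    ],
    "speaker_count": 2,
}
```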
Implementation and deployment
The implementation extends the NVIDIA parakeet-1-1b-ctc-en-us NIM container as the foundation, adding a Python aiohttp server that manages the complete NIM lifecycle by automatically starting and monitoring the service. The server handles protocol adaptation by translating SageMaker inference requests to the appropriate NIM APIs, implements the intelligent routing logic that analyzes request characteristics, and provides comprehensive error handling with detailed error messages and fallback mechanisms for robust production deployment. The containerized solution streamlines deployment through standard Docker and AWS CLI commands, featuring a preconfigured Dockerfile with the necessary dependencies and optimizations. The system accepts multiple input formats, including multipart form-data (recommended for maximum compatibility), JSON with base64 encoding for simple integration scenarios, and raw binary uploads for direct audio processing.
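As a quick illustration, the snippet below sends a JSON request with base64-encoded audio to the endpoint using boto3. The endpoint name and JSON field names are assumptions for this sketch; align them with your deployment and the request format documented in the notebook.

```python
import base64
import json

import boto3

runtime = boto3.client("sagemaker-runtime")

with open("sample.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

# Field names ("audio", "enable_diarization") are illustrative assumptions.
payload = {"audio": audio_b64, "enable_diarization": True}

response = runtime.invoke_endpoint(
    EndpointName="parakeet-nim-endpoint",  # hypothetical endpoint name
    ContentType="application/json",
    Body=json.dumps(payload),
)
result = json.loads(response["Body"].read())
print(result.get("transcription"))
```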
For detailed implementation instructions and working examples, teams can reference the complete implementation and deployment notebook in the AWS samples repository, which provides comprehensive guidance on deploying Parakeet ASR with NIM on SageMaker AI using the bring your own container (BYOC) approach. For organizations with specific architectural preferences, separate HTTP-only and gRPC-only implementations are also available, providing simpler deployment models for teams with well-defined use cases while the combined implementation offers maximum flexibility and automatic optimization.
AWS customers can deploy these models either as production-grade NVIDIA NIM containers directly from SageMaker Marketplace or JumpStart, or as open-source NVIDIA models available on Hugging Face, which can be deployed through custom containers on SageMaker or Amazon Elastic Kubernetes Service (Amazon EKS). This lets organizations choose between fully managed, enterprise-tier endpoints with auto-scaling and security, or flexible open-source development for research or constrained use cases.
LMI containers are designed to simplify hosting large models on AWS. These containers include optimized inference engines like vLLM, FasterTransformer, or TensorRT-LLM that can automatically handle things like model parallelism, quantization, and batching for large models. The LMI container is essentially a pre-configured Docker image that runs an inference server (for example a Python server with these optimizations) and allows you to specify model parameters by using environment variables.
To use the LMI container for Parakeet, we would typically specify the model location (for example, a Hugging Face model ID or an Amazon S3 path) and serving settings through environment variables or a serving.properties file, select the appropriate LMI container image, and deploy it to a SageMaker endpoint.
AWS LMI containers deliver high performance and scalability through advanced optimization techniques, including continuous batching, tensor parallelism, and state-of-the-art quantization methods. LMI containers integrate multiple inference backends (such as vLLM and TensorRT-LLM) through a single unified configuration, helping users seamlessly experiment and switch between frameworks to find the optimal performance stack for their specific use case.
SageMaker offers PyTorch Deep Learning Containers (DLCs) that come with PyTorch and many common libraries pre-installed. In this example, we demonstrate how to extend a prebuilt container to install the packages the model needs. You can download the model directly from Hugging Face during endpoint creation, or download the Parakeet model artifacts, package them with the necessary configuration files into a model.tar.gz archive, and upload the archive to Amazon S3. Along with the model artifacts, an inference.py script is required as the entry point to define model loading and inference logic, including audio preprocessing and transcription handling. When you use the SageMaker Python SDK to create a PyTorchModel, the SDK automatically repackages the model archive to include the inference script under /opt/ml/model/code/inference.py, while keeping model artifacts in /opt/ml/model/ on the endpoint. Once the endpoint is deployed, it can be invoked through the predict API by sending audio files as byte streams to get transcription results.
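A condensed sketch of that flow with the SageMaker Python SDK is shown below. The S3 path, framework versions, and endpoint name are placeholders, and the response shape depends on what your inference.py returns.

```python
import sagemaker
from sagemaker.pytorch import PyTorchModel
from sagemaker.serializers import DataSerializer
from sagemaker.deserializers import JSONDeserializer

role = sagemaker.get_execution_role()

# model.tar.gz holds the Parakeet artifacts; inference.py defines loading and prediction.
pytorch_model = PyTorchModel(
    model_data="s3://<your-bucket>/parakeet/model.tar.gz",  # placeholder path
    role=role,
    entry_point="inference.py",
    source_dir="code",
    framework_version="2.1",  # assumed versions; match your tested container
    py_version="py310",
)

predictor = pytorch_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.xlarge",
    endpoint_name="parakeet-pytorch-endpoint",  # hypothetical name
    serializer=DataSerializer(content_type="audio/wav"),
    deserializer=JSONDeserializer(),
)

# Send raw audio bytes; the transcription format is defined by inference.py.
with open("sample.wav", "rb") as f:
    print(predictor.predict(f.read()))
```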
SageMaker real-time endpoints currently allow a maximum payload size of 25MB, so make sure the container is also configured to accept requests of that size. If you plan to use the same model behind an asynchronous endpoint, note that async endpoints support payloads of up to 1GB and response times of up to 1 hour, so the container should be set up for that payload size and timeout. When using the PyTorch containers, the key configuration parameters to consider are the model server’s maximum request and response sizes and its response timeout.
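The sketch below shows one way to apply those settings: TorchServe environment variables raise the request-size and response-timeout limits, and AsyncInferenceConfig wires in the S3 output location and optional SNS topics. The variable names are the ones commonly used with the SageMaker PyTorch inference container, but validate them, along with the placeholder ARNs and paths, against your container version.

```python
import boto3
import sagemaker
from sagemaker.async_inference import AsyncInferenceConfig
from sagemaker.pytorch import PyTorchModel

role = sagemaker.get_execution_role()

# Assumed TorchServe settings for large payloads and long-running requests.
async_env = {
    "TS_MAX_REQUEST_SIZE": "1000000000",    # accept payloads up to ~1 GB
    "TS_MAX_RESPONSE_SIZE": "1000000000",
    "TS_DEFAULT_RESPONSE_TIMEOUT": "3600",  # allow up to 1 hour per request
}

async_model = PyTorchModel(
    model_data="s3://<your-bucket>/parakeet/model.tar.gz",  # placeholder
    role=role,
    entry_point="inference.py",
    source_dir="code",
    framework_version="2.1",
    py_version="py310",
    env=async_env,
)

async_predictor = async_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.xlarge",
    async_inference_config=AsyncInferenceConfig(
        output_path="s3://<your-bucket>/async-output/",  # placeholder
        notification_config={  # optional SNS topics for success and error notifications
            "SuccessTopic": "arn:aws:sns:<region>:<account>:success-topic",
            "ErrorTopic": "arn:aws:sns:<region>:<account>:error-topic",
        },
    ),
)

# Async invocations reference the audio object in S3 instead of inlining the bytes.
sm_runtime = boto3.client("sagemaker-runtime")
response = sm_runtime.invoke_endpoint_async(
    EndpointName=async_predictor.endpoint_name,
    InputLocation="s3://<your-bucket>/input/sample.wav",  # placeholder
    ContentType="audio/wav",
)
print(response["OutputLocation"])  # where the transcription result will be written
```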
In the example notebook, we also showcase how to leverage the SageMaker local session provided by the SageMaker Python SDK. It helps you create estimators and run training, processing, and inference jobs locally using Docker containers instead of managed AWS infrastructure, providing a fast way to test and debug your machine learning scripts before scaling to production.
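A minimal local-mode sketch looks like the following, assuming Docker is running on a GPU notebook instance; the artifact path and role ARN are placeholders.

```python
from sagemaker.local import LocalSession
from sagemaker.pytorch import PyTorchModel

# Local mode runs the inference container in Docker on this machine.
local_session = LocalSession()
local_session.config = {"local": {"local_code": True}}

local_model = PyTorchModel(
    model_data="file://./model.tar.gz",  # local artifact path (placeholder)
    role="arn:aws:iam::<account-id>:role/<sagemaker-execution-role>",  # placeholder
    entry_point="inference.py",
    source_dir="code",
    framework_version="2.1",
    py_version="py310",
    sagemaker_session=local_session,
)

# "local_gpu" uses the instance's GPU; use "local" for CPU-only testing.
local_predictor = local_model.deploy(initial_instance_count=1, instance_type="local_gpu")
```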
Before deploying this solution, make sure you have the AWS CDK and AWS CLI installed and configured, Docker running locally, and the SageMaker asynchronous endpoint from the previous section deployed.
The solution deployment begins with provisioning the necessary AWS resources using Infrastructure as Code (IaC) principles. The AWS CDK stack creates the foundational components, including the Amazon S3 bucket for audio uploads, the Lambda functions that orchestrate processing and handle success and error notifications, and the DynamoDB table that tracks processing status.
Clone the repository and install dependencies:
Update the SageMaker endpoint configuration in bin/aws-blog-sagemaker.ts:
If you have followed the notebook to deploy the endpoint, you should have already created the two SNS topics. Otherwise, make sure you create the correct SNS topics using the AWS CLI:
Before you deploy the AWS CloudFormation template, make sure Docker is running.
After successful deployment, note the output values:
Update the upload_audio_invoke_lambda.sh
Run the script:
AWS_PROFILE=default ./scripts/upload_audio_invoke_lambda.sh
This script will upload a sample audio file to the S3 input bucket and invoke the Lambda function that starts the asynchronous transcription workflow.
You can check the processing status and results in the DynamoDB table:
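A short boto3 sketch for that check is below; the table name and attribute names are placeholders to replace with the values from your CDK stack.

```python
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("audio-processing-tracking")  # hypothetical table name

# A scan is sufficient for a small tracking table; use query() with a key for larger ones.
for item in table.scan().get("Items", []):
    print(item.get("invocation_id"), item.get("status"), item.get("output_location"))
```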
The core processing workflow follows an event-driven pattern:
Initial processing and metadata extraction: When audio files are uploaded to S3, the triggered Lambda function analyzes the file metadata, validates format compatibility, and creates detailed invocation records in DynamoDB. This facilitates comprehensive tracking from the moment audio content enters the system.
Asynchronous speech recognition: Audio files are processed through the SageMaker endpoint using optimized ASR models. The asynchronous process can handle various file sizes and durations without timeout concerns. Each processing request is assigned a unique identifier for tracking purposes.
Success path processing: Upon successful transcription, the system automatically initiates the summarization workflow. The transcribed text is sent to Amazon Bedrock, where advanced language models generate contextually appropriate summaries based on configurable parameters such as summary length, focus areas, and output format (a sketch of this Bedrock call follows these steps).
Error handling and recovery: Failed processing attempts trigger dedicated Lambda functions that log detailed error information, update processing status, and can initiate retry logic for transient failures. This robust error handling results in minimal data loss and provides clear visibility into processing issues.
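As an illustration of the summarization step, the following sketch sends a transcript to Amazon Bedrock using the Converse API; the model ID, prompt, and transcript are assumptions rather than the pipeline's exact configuration.

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

# Example transcript; in the pipeline this comes from the async endpoint's output in S3.
transcript = (
    "Speaker_0: Thanks for calling, how can I help you today? "
    "Speaker_1: Hi, I'd like to check the status of my order."
)

response = bedrock.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",  # assumed model ID
    messages=[{
        "role": "user",
        "content": [{"text": f"Summarize this call transcript in three bullet points:\n\n{transcript}"}],
    }],
    inferenceConfig={"maxTokens": 512, "temperature": 0.2},
)

print(response["output"]["message"]["content"][0]["text"])
```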
Customer service analytics: Organizations can process thousands of customer service call recordings to generate transcriptions and summaries, enabling sentiment analysis, quality assurance, and insights extraction at scale.
Meeting and conference processing: Enterprise teams can automatically transcribe and summarize meeting recordings, creating searchable archives and actionable summaries for participants and stakeholders.
Media and content processing: Media companies can process podcast episodes, interviews, and video content to generate transcriptions and summaries for improved accessibility and content discoverability.
Compliance and legal documentation: Legal and compliance teams can process recorded depositions, hearings, and interviews to create accurate transcriptions and summaries for case preparation and documentation.
Once you have finished using the solution, delete the SageMaker endpoints to prevent incurring additional costs. Remove both the real-time and asynchronous inference endpoints:
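A minimal cleanup sketch with boto3 follows; replace the endpoint names with the ones from your deployment.

```python
import boto3

sm = boto3.client("sagemaker")

# Endpoint names are placeholders; use the names from your deployment.
for endpoint_name in ["parakeet-pytorch-endpoint", "parakeet-async-endpoint"]:
    config_name = sm.describe_endpoint(EndpointName=endpoint_name)["EndpointConfigName"]
    sm.delete_endpoint(EndpointName=endpoint_name)
    sm.delete_endpoint_config(EndpointConfigName=config_name)
```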
You should also delete all the resources created by the CDK stack.
The integration of powerful NVIDIA speech AI technologies with AWS cloud infrastructure creates a comprehensive solution for large-scale audio processing. By combining Parakeet ASR’s industry-leading accuracy and speed with NVIDIA Riva’s optimized deployment framework on the Amazon SageMaker asynchronous inference pipeline, organizations can achieve both high-performance speech recognition and cost-effective scaling. The solution leverages the managed services of AWS (SageMaker AI, Lambda, S3, and Bedrock) to create an automated, scalable pipeline for processing audio content. With features like auto scaling to zero, comprehensive error handling, and real-time monitoring through DynamoDB, organizations can focus on extracting business value from their audio content rather than managing infrastructure complexity. Whether processing customer service calls, meeting recordings, or media content, this architecture delivers reliable, efficient, and cost-effective audio processing capabilities. To experience the full potential of this solution, we encourage you to explore it and reach out to us if you have specific business requirements and would like to customize the solution for your use case.