This post is co-authored with Travis Mehlinger and Karthik Raghunathan from Cisco.
Webex by Cisco is a leading provider of cloud-based collaboration solutions which includes video meetings, calling, messaging, events, polling, asynchronous video and customer experience solutions like contact center and purpose-built collaboration devices. Webex’s focus on delivering inclusive collaboration experiences fuels our innovation, which leverages AI and Machine Learning, to remove the barriers of geography, language, personality, and familiarity with technology. Its solutions are underpinned with security and privacy by design. Webex works with the world’s leading business and productivity apps – including AWS.
Cisco’s Webex AI (WxAI) team plays a crucial role in enhancing these products with AI-driven features and functionalities, leveraging LLMs to improve user productivity and experiences. In the past year, the team has increasingly focused on building artificial intelligence (AI) capabilities powered by large language models (LLMs) to improve productivity and experience for users. Notably, the team’s work extends to Webex Contact Center, a cloud-based omni-channel contact center solution that empowers organizations to deliver exceptional customer experiences. By integrating LLMs, WxAI team enables advanced capabilities such as intelligent virtual assistants, natural language processing, and sentiment analysis, allowing Webex Contact Center to provide more personalized and efficient customer support. However, as these LLM models grew to contain hundreds of gigabytes of data, WxAI team faced challenges in efficiently allocating resources and starting applications with the embedded models. To optimize its AI/ML infrastructure, Cisco migrated its LLMs to Amazon SageMaker Inference, improving speed, scalability, and price-performance.
This blog post highlights how Cisco implemented faster autoscaling release reference. For more details on Cisco’s Use Cases, Solution & Benefits see How Cisco accelerated the use of generative AI with Amazon SageMaker Inference.
In this post, we will discuss the following:
Webex is applying generative AI to its contact center solutions, enabling more natural, human-like conversations between customers and agents. The AI can generate contextual, empathetic responses to customer inquiries, as well as automatically draft personalized emails and chat messages. This helps contact center agents work more efficiently while maintaining a high level of customer service.
Initially, WxAI embedded LLM models directly into the application container images running on Amazon Elastic Kubernetes Service (Amazon EKS). However, as the models grew larger and more complex, this approach faced significant scalability and resource utilization challenges. Operating the resource-intensive LLMs through the applications required provisioning substantial compute resources, which slowed down processes like allocating resources and starting applications. This inefficiency hampered WxAI’s ability to rapidly develop, test, and deploy new AI-powered features for the Webex portfolio.
To address these challenges, WxAI team turned to SageMaker Inference – a fully managed AI inference service that allows seamless deployment and scaling of models independently from the applications that use them. By decoupling the LLM hosting from the Webex applications, WxAI could provision the necessary compute resources for the models without impacting the core collaboration and communication capabilities.
“The applications and the models work and scale fundamentally differently, with entirely different cost considerations, by separating them rather than lumping them together, it’s much simpler to solve issues independently.”
– Travis Mehlinger, Principal Engineer at Cisco.
This architectural shift has enabled Webex to harness the power of generative AI across its suite of collaboration and customer engagement solutions.
Today Sagemaker endpoint uses autoscaling with invocation per instance. However, it takes ~6 minutes to detect need for autoscaling.
Cisco Webex AI team wanted to improve their inference auto scaling times, so they worked with Amazon SageMaker to improve inference.
Amazon SageMaker’s real-time inference endpoint offers a scalable, managed solution for hosting Generative AI models. This versatile resource can accommodate multiple instances, serving one or more deployed models for instant predictions. Customers have the flexibility to deploy either a single model or multiple models using SageMaker InferenceComponents on the same endpoint. This approach allows for efficient handling of diverse workloads and cost-effective scaling.
To optimize real-time inference workloads, SageMaker employs application automatic scaling (auto scaling). This feature dynamically adjusts both the number of instances in use and the quantity of model copies deployed (when using inference components), responding to real-time changes in demand. When traffic to the endpoint surpasses a predefined threshold, auto scaling increases the available instances and deploys additional model copies to meet the heightened demand. Conversely, as workloads decrease, the system automatically removes unnecessary instances and model copies, effectively reducing costs. This adaptive scaling ensures that resources are optimally utilized, balancing performance needs with cost considerations in real-time.
Working with Cisco, Amazon SageMaker releases new sub-minute high-resolution pre-defined metric type SageMakerVariantConcurrentRequestsPerModelHighResolution
for faster autoscaling and reduced detection time. This newer high-resolution metric has shown to reduce scaling detection times by up to 6x (compared to existing SageMakerVariantInvocationsPerInstance
metric) and thereby improving overall end-to-end inference latency by up to 50%, on endpoints hosting Generative AI models like Llama3-8B.
With this new release, SageMaker real-time endpoints also now emits new ConcurrentRequestsPerModel
and ConcurrentRequestsPerModelCopy
CloudWatch metrics as well, which are more suited for monitoring and scaling Amazon SageMaker endpoints hosting LLMs and FMs.
Cisco evaluated Amazon SageMaker’s new pre-defined metric types for faster autoscaling on their Generative AI workloads. They observed up to a 50% latency improvement in end-to-end inference latency by using the new SageMakerequestsPerModelHighResolution
metric, compared to the existing SageMakerVariantInvocationsPerInstance
metric.
The setup involved using their Generative AI models, on SageMaker’s real-time inference endpoints. SageMaker’s autoscaling feature dynamically adjusted both the number of instances and the quantity of model copies deployed to meet real-time changes in demand. The new high-resolution SageMakerVariantConcurrentRequestsPerModelHighResolution
metric reduced scaling detection times by up to 6x, enabling faster autoscaling and lower latency.
In addition, SageMaker now emits new CloudWatch metrics, including ConcurrentRequestsPerModel
and ConcurrentRequestsPerModelCopy
, which are better suited for monitoring and scaling endpoints hosting large language models (LLMs) and foundation models (FMs). This enhanced autoscaling capability has been a game-changer for Cisco, helping to improve the performance and efficiency of their critical Generative AI applications.
“We are really pleased with the performance improvements we’ve seen from Amazon SageMaker’s new autoscaling metrics. The higher-resolution scaling metrics have significantly reduced latency during initial load and scale-out on our Gen AI workloads. We’re excited to do a broader rollout of this feature across our infrastructure”
– Travis Mehlinger, Principal Engineer at Cisco.
Cisco further plans to work with SageMaker inference to drive improvements in rest of the variables that impact autoscaling latencies. Like model download and load times.
Cisco’s Webex AI team is continuing to leverage Amazon SageMaker Inference to power generative AI experiences across its Webex portfolio. Evaluation with faster autoscaling from SageMaker has shown Cisco up to 50% latency improvements in its GenAI inference endpoints. As WxAI team continues to push the boundaries of AI-driven collaboration, its partnership with Amazon SageMaker will be crucial in informing upcoming improvements and advanced GenAI inference capabilities. With this new feature Cisco looks forward to further optimizing its AI Inference performance by rolling it broadly in multiple regions and delivering even more impactful generative AI features to its customers.
Neural contextual biasing allows speech recognition models to leverage contextually relevant information, leading to improved…
Today we are announcing the general availability of Amazon Bedrock Prompt Management, with new features…
The rise of Natural Language Processing (NLP) combined with traditional Structured Query Language (SQL) has…
Through his wealth and cultural influence, Elon Musk undoubtedly strengthened the Trump campaign. WIRED unpacks…
The growing use of artificial intelligence (AI)-based models is placing greater demands on the electronics…
This post is co-written with Steven Craig from Hearst. To maintain their competitive edge, organizations…