
Amazon SageMaker AI introduces EAGLE-based adaptive speculative decoding to accelerate generative AI inference

Generative AI models continue to expand in scale and capability, increasing the demand for faster and more efficient inference. Applications need low latency and consistent performance without compromising output quality. Amazon SageMaker AI introduces new enhancements to its inference optimization toolkit that bring EAGLE-based adaptive speculative decoding to more model architectures. These updates make it easier to accelerate decoding, optimize performance using your own data, and deploy higher-throughput models using the familiar SageMaker AI workflow.

EAGLE, short for Extrapolation Algorithm for Greater Language-model Efficiency, is a technique that speeds up large language model decoding by predicting future tokens directly from the hidden layers of the model. When you guide optimization using your own application data, the improvements align with the actual patterns and domains you serve, producing faster inference that reflects your real workloads rather than generic benchmarks. Based on the model architecture, SageMaker AI trains EAGLE 3 or EAGLE 2 heads.

Note that this training and optimization is not limited to a one-time operation. You can start with the datasets provided by SageMaker AI for the initial training, and as you continue to gather your own data you can fine-tune with your own curated dataset for highly adaptive, workload-specific performance. For example, you can use a tool such as Data Capture to build a dataset over time from the real-time requests hitting your hosted model. This can be an iterative process, with multiple cycles of training that continuously improve performance.
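For illustration, a minimal Boto3 sketch of enabling Data Capture on an endpoint configuration is shown below; the endpoint configuration name, model name, instance type, and S3 destination are placeholders rather than values from this post.

import boto3

sm = boto3.client("sagemaker")

# Illustrative sketch: enable Data Capture so request/response payloads are
# written to S3, where they can later be curated into an EAGLE training dataset.
# All names and the S3 URI below are placeholders.
sm.create_endpoint_config(
    EndpointConfigName="my-llm-endpoint-config",
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "ModelName": "my-llm-model",
            "InstanceType": "ml.g5.12xlarge",
            "InitialInstanceCount": 1,
        }
    ],
    DataCaptureConfig={
        "EnableCapture": True,
        "InitialSamplingPercentage": 100,
        "DestinationS3Uri": "s3://my-bucket/data-capture/",
        "CaptureOptions": [{"CaptureMode": "Input"}, {"CaptureMode": "Output"}],
    },
)

The captured payloads can then be transformed into one of the supported training data formats described later in this post.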

In this post, we explain how to use EAGLE 2 and EAGLE 3 speculative decoding in Amazon SageMaker AI.

Solution overview

SageMaker AI now offers native support for both EAGLE 2 and EAGLE 3 speculative decoding, enabling each model architecture to apply the technique that best matches its internal design. For your base LLM, you can either use SageMaker JumpStart models or bring your own model artifacts to Amazon S3 from other model hubs, such as Hugging Face.

Speculative decoding is a widely employed technique for accelerating inference in LLMs without compromising quality. This method involves using a smaller draft model to generate preliminary tokens, which are then verified by the target LLM. The extent of the speedup achieved through speculative decoding is heavily dependent on the selection of the draft model.

The sequential, token-by-token nature of LLM decoding makes it expensive and slow, and speculative decoding has proven to be an effective solution to this problem. Methods like EAGLE improve upon it by reusing features from the target model, leading to better results. However, a current trend in the LLM community is to increase training data to boost model intelligence without adding inference costs. Unfortunately, this approach yields limited benefits for the original EAGLE because of its constraints on feature prediction. EAGLE-3 addresses this by predicting tokens directly instead of features and by combining features from multiple layers using a technique called training-time test. These changes significantly improve performance and allow the model to fully benefit from increased training data.

To give customers maximum flexibility, SageMaker supports every major workflow for building or refining an EAGLE model. You can train an EAGLE model entirely from scratch using the SageMaker curated open dataset, or train it from scratch with your own data to align speculative behavior with your traffic patterns. You can also start from an existing EAGLE base model: either retraining it with the default open dataset for a fast, high-quality baseline, or fine-tuning that base model with your own dataset for highly adaptive, workload-specific performance. In addition, SageMaker JumpStart provides fully pre-trained EAGLE models so you can begin optimizing immediately without preparing any artifacts.

The solution spans six supported architectures and includes a pre-trained, pre-cached EAGLE base to accelerate experimentation. SageMaker AI also supports widely used training data formats, specifically ShareGPT and the OpenAI chat and completions formats, so existing corpora can be used directly. Customers can also provide data captured from their own SageMaker AI endpoints, provided it is in one of the formats specified above. Whether you rely on the SageMaker open dataset or bring your own, optimization jobs typically deliver around a 2.5x throughput improvement over standard decoding while adapting naturally to the nuances of your specific use case.
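For illustration, a single training record in each of these formats typically looks like the following (the prompts and responses are placeholders; confirm the exact schema against the SageMaker AI documentation):

# ShareGPT format (one JSON object per line)
{"conversations": [{"from": "human", "value": "What is speculative decoding?"}, {"from": "gpt", "value": "A technique that drafts tokens and verifies them in one pass."}]}

# OpenAI chat format (one JSON object per line)
{"messages": [{"role": "user", "content": "What is speculative decoding?"}, {"role": "assistant", "content": "A technique that drafts tokens and verifies them in one pass."}]}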

All optimization jobs automatically produce benchmark results, giving you clear visibility into latency and throughput improvements. You can run the entire workflow using SageMaker Studio or the AWS CLI, and you can deploy the optimized model through the same interface you already use for standard SageMaker AI inference.

SageMaker AI currently supports LlamaForCausalLM, Qwen3ForCausalLM, Qwen3MoeForCausalLM, Qwen2ForCausalLM, and GptOssForCausalLM with EAGLE 3, and Qwen3NextForCausalLM with EAGLE 2. You can use one optimization pipeline across a mix of architectures while still gaining the benefits of model-specific behavior.

How EAGLE works inside the model

Speculative decoding can be thought of as a seasoned chief scientist guiding the flow of discovery. In traditional setups, a smaller “assistant” model runs ahead, quickly sketching out several possible token continuations, while the larger model examines and corrects those suggestions. This pairing reduces the number of slow, sequential steps by verifying multiple drafts at once.

EAGLE streamlines this process even further. Instead of depending on an external assistant, the model effectively becomes its own lab partner: it inspects its internal hidden-layer representations to anticipate several future tokens in parallel. Because these predictions arise from the model’s own learned structure, they tend to be more accurate upfront, leading to deeper speculative steps, fewer rejections, and smoother throughput.

By removing the overhead of coordinating a secondary model and enabling highly parallel verification, this approach alleviates memory bandwidth bottlenecks and delivers notable speedups, often around 2.5x, while maintaining the same output quality the baseline model would produce.
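To make the draft-and-verify idea concrete, here is a minimal, self-contained Python sketch of greedy speculative decoding. The target_next_token and draft_next_token functions are toy stand-ins rather than the actual EAGLE heads or the SageMaker AI implementation, and a real system verifies the entire draft in a single batched forward pass instead of a Python loop.

def target_next_token(context):
    # Toy stand-in for the full model's next-token prediction.
    return (sum(context) + 1) % 50

def draft_next_token(context):
    # Toy stand-in for the draft head; it usually agrees with the target model.
    guess = target_next_token(context)
    return guess if len(context) % 7 else (guess + 1) % 50  # occasional miss

def speculative_decode(prompt, num_tokens, draft_len=4):
    tokens = list(prompt)
    while len(tokens) - len(prompt) < num_tokens:
        # 1. The draft head speculates draft_len tokens ahead.
        draft, ctx = [], list(tokens)
        for _ in range(draft_len):
            t = draft_next_token(ctx)
            draft.append(t)
            ctx.append(t)
        # 2. The target model verifies the draft; the longest correct prefix is
        #    kept and the first mismatch is replaced with the target's own token.
        ctx = list(tokens)
        for t in draft:
            expected = target_next_token(ctx)
            if t != expected:
                tokens.append(expected)
                break
            tokens.append(t)
            ctx.append(t)
    return tokens[len(prompt):len(prompt) + num_tokens]

print(speculative_decode([1, 2, 3], num_tokens=12))

Because several drafted tokens are usually accepted per verification step, the number of slow sequential passes through the large model drops, which is where the speedup comes from.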

Running optimization jobs from the SDK or CLI

You can interface with the optimization toolkit using the AWS Python (Boto3) SDK, the Studio UI, or the AWS CLI. In this section we use the AWS CLI; the same API calls map directly to the Boto3 SDK. The core API calls for endpoint creation remain the same: create_model, create_endpoint_config, and create_endpoint. The workflow we showcase here begins with model registration using the create_model API call, where you specify your serving container and stack. Alternatively, you can skip creating a SageMaker model object and specify the model data directly in the optimization job API call.

For the EAGLE heads optimization, we specify the model data through the ModelDataSource parameter; specifying a Hugging Face Hub model ID is not currently supported. Pull your artifacts, upload them to an S3 bucket, and reference that location in the ModelDataSource parameter. By default, checks are performed to verify that the appropriate files are present, so make sure you have the standard model data expected for LLMs:

# traditional model data needed
model/
  config.json
  tokenizer.json
  tokenizer_config.json
  special_tokens_map.json
  generation_config.json
  vocab.json
  model.safetensors
  model.safetensors.index.json 

Let’s look at a few paths here:

  • Using your own model data with your own EAGLE curated dataset
  • Bringing your own trained EAGLE that you may want to train more
  • Bringing your own model data and using SageMaker AI built-in datasets

1. Using your own model data with your own EAGLE curated dataset

We can start an optimization job with the create-optimization-job API call. Here is an example with a Qwen3 32B model. Note that you can bring your own data or use the built-in SageMaker-provided datasets. First, we create a SageMaker Model object that points to the S3 bucket containing our model artifacts:

aws sagemaker --region us-west-2 create-model \
  --model-name <target-model-name> \
  --primary-container '{ "Image": "763104351884.dkr.ecr.{region}.amazonaws.com/djl-inference:{CONTAINER_VERSION}",
  "ModelDataSource": { "S3DataSource": { "S3Uri": "Enter model path",
  "S3DataType": "S3Prefix", "CompressionType": "None" } } }' \
  --execution-role-arn "Enter Execution Role ARN"

Our optimization call then pulls down these model artifacts when you specify the SageMaker Model and a TrainingDataSource parameter, as follows:

aws sagemaker --region us-west-2 create-optimization-job \
    --optimization-job-name <job-name> \
    --account-id <account-id> \
    --deployment-instance-type ml.p5.48xlarge \
    --max-instance-count 10 \
    --model-source '{
        "SageMakerModel": { "ModelName": "Created Model name" }
    }' \
    --optimization-configs '{
            "ModelSpeculativeDecodingConfig": {
                "Technique": "EAGLE",
                "TrainingDataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": "Enter custom train data location"
                }
            }
        }' \
    --output-config '{
        "S3OutputLocation": "Enter optimization output location"
    }' \
    --stopping-condition '{"MaxRuntimeInSeconds": 432000}' \
    --role-arn "Enter Execution Role ARN"

2. Bringing your own trained EAGLE that you may want to train more

For your own trained EAGLE model, you can specify an additional parameter in the create_model API call that points to your EAGLE artifacts. Optionally, you can also specify a SageMaker JumpStart model ID to pull down the packaged model artifacts.

# Enable additional model data source with EAGLE artifacts
aws sagemaker --region us-west-2 create-model \
  --model-name <target-model-name> \
  --primary-container '{ "Image": "763104351884.dkr.ecr.{region}.amazonaws.com/djl-inference:{CONTAINER_VERSION}",
  "ModelDataSource": { "S3DataSource": { "S3Uri": "<model path>",
  "S3DataType": "S3Prefix", "CompressionType": "None" } },
  "AdditionalModelDataSources": [ { "ChannelName": "eagle_model",
  "S3DataSource": { "S3Uri": "<pre-trained EAGLE path>",
  "S3DataType": "S3Prefix", "CompressionType": "None" } } ] }' \
  --execution-role-arn "Enter Execution Role ARN"

Similarly, the optimization job API call then inherits this model object along with the necessary model data:

aws sagemaker --region us-west-2 create-optimization-job \
 --account-id <account-id> \
 --optimization-job-name <job-name> \
 --deployment-instance-type ml.p5.48xlarge \
 --max-instance-count 10 \
 --model-source '{
 "SageMakerModel": {
    "ModelName": "Created Model Name"
    }
 }' \
 --optimization-configs '{
    "ModelSpeculativeDecodingConfig": {
    "Technique": "EAGLE",
    "TrainingDataSource": {
    "S3Uri": "Enter training data path",
    "S3DataType": "S3Prefix"
    }
   }
 }' \
 --output-config '{
    "SageMakerModel": {
    "ModelName": "Model Name"
   },
   "S3OutputLocation": "Enter output data location"
 }' \
 --stopping-condition '{"MaxRuntimeInSeconds": 432000}' \
 --role-arn "Enter Execution Role ARN"

3. Bringing your own model data and using SageMaker built-in datasets

Optionally, you can use the SageMaker-provided datasets:

# SageMaker Provided Optimization Datasets 
gsm8k_training.jsonl (https://huggingface.co/datasets/openai/gsm8k)
magicoder.jsonl (https://huggingface.co/datasets/ise-uiuc/Magicoder-OSS-Instruct-75K)
opencodeinstruct.jsonl (https://huggingface.co/datasets/nvidia/OpenCodeInstruct)
swebench_oracle_train.jsonl (https://huggingface.co/datasets/princeton-nlp/SWE-bench_oracle)
ultrachat_0_8k_515292.jsonl (https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k)

After completion, SageMaker AI stores evaluation metrics in S3 and records the optimization lineage in Studio. You can deploy the optimized model to an inference endpoint with either the create_endpoint API call or through the UI.
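
For example, a minimal Boto3 deployment sketch might look like the following; the endpoint, configuration, and model names are hypothetical, and it assumes you have already created a SageMaker Model object that references the optimized artifacts (for example, from the job's S3OutputLocation):

import boto3

sm = boto3.client("sagemaker", region_name="us-west-2")

# Hypothetical names; the model object is assumed to point at the optimized artifacts.
sm.create_endpoint_config(
    EndpointConfigName="qwen3-32b-eagle-config",
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "ModelName": "qwen3-32b-eagle-optimized",
            "InstanceType": "ml.p5.48xlarge",
            "InitialInstanceCount": 1,
        }
    ],
)
sm.create_endpoint(
    EndpointName="qwen3-32b-eagle-endpoint",
    EndpointConfigName="qwen3-32b-eagle-config",
)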

Benchmarks

To benchmark this, we compared three configurations:

  • No EAGLE: Base model without EAGLE, as a baseline
  • Base EAGLE: EAGLE trained using the built-in datasets provided by SageMaker AI
  • Trained EAGLE: EAGLE trained using the built-in datasets provided by SageMaker AI, then retrained with our own custom dataset

The numbers displayed below are for Qwen3 32B across metrics such as Time to First Token (TTFT) and overall throughput.

Configuration   Concurrency   TTFT (ms)   TPOT (ms)   ITL (ms)   Request Throughput   Output Throughput (tokens/sec)   OTPS per request (tokens/sec)
No EAGLE        4             168.04      45.95       45.95      0.04                 86.76                            21.76
No EAGLE        8             219.53      51.02       51.01      0.08                 156.46                           19.6
Base EAGLE      1             89.76       21.71       53.01      0.02                 45.87                            46.07
Base EAGLE      2             132.15      20.78       50.75      0.05                 95.73                            48.13
Base EAGLE      4             133.06      20.11       49.06      0.1                  196.67                           49.73
Base EAGLE      8             154.44      20.58       50.15      0.19                 381.86                           48.59
Trained EAGLE   1             83.6        17.32       46.37      0.03                 57.63                            57.73
Trained EAGLE   2             129.07      18          48.38      0.05                 110.86                           55.55
Trained EAGLE   4             133.11      18.46       49.43      0.1                  214.27                           54.16
Trained EAGLE   8             151.19      19.15       51.5       0.2                  412.25                           52.22
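
At a concurrency of 8, for example, output throughput rises from 156.46 tokens/sec without EAGLE to 381.86 tokens/sec with the base EAGLE and 412.25 tokens/sec with the trained EAGLE, roughly a 2.4-2.6x improvement that is consistent with the approximately 2.5x figure cited earlier.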

Pricing considerations

Optimization jobs run on SageMaker AI training instances; you are billed based on the instance type and job duration. Deploying the resulting optimized model uses standard SageMaker AI inference pricing.

Conclusion

EAGLE-based adaptive speculative decoding gives you a faster and more effective path to improving generative AI inference performance on Amazon SageMaker AI. By working inside the model rather than relying on a separate draft network, EAGLE accelerates decoding, increases throughput, and maintains generation quality. When you optimize using your own dataset, the improvements reflect the unique behavior of your applications, resulting in better end-to-end performance. With built-in dataset support, benchmark automation, and streamlined deployment, the inference optimization toolkit helps you deliver low-latency generative applications at scale.


About the authors

Kareem Syed-Mohammed is a Product Manager at AWS. He focuses on enabling generative AI model development and governance on SageMaker HyperPod. Prior to this, at Amazon QuickSight, he led embedded analytics and developer experience. In addition to QuickSight, he has been with AWS Marketplace and Amazon retail as a Product Manager. Kareem started his career as a developer for call center technologies, worked on Local Expert and Ads for Expedia, and was a management consultant at McKinsey.

Xu Deng is a Software Engineer Manager with the SageMaker team. He focuses on helping customers build and optimize their AI/ML inference experience on Amazon SageMaker. In his spare time, he loves traveling and snowboarding.

Ram Vegiraju is an ML Architect with the Amazon SageMaker Service team. He focuses on helping customers build and optimize their AI/ML solutions on SageMaker. In his spare time, he loves traveling and writing.

Vinay Arora is a Specialist Solution Architect for Generative AI at AWS, where he collaborates with customers to design cutting-edge AI solutions leveraging AWS technologies. Prior to AWS, Vinay spent over two decades in finance, including roles at banks and hedge funds, where he built risk models, trading systems, and market data platforms. Vinay holds a master's degree in computer science and business management.

Siddharth Shah is a Principal Engineer at AWS SageMaker, specializing in large-scale model hosting and optimization for Large Language Models. He previously worked on the launch of Amazon Textract, performance improvements in the model-hosting platform, and expedited retrieval systems for Amazon S3 Glacier. Outside of work, he enjoys hiking, video games, and hobby robotics.

Andy Peng is a builder with curiosity, motivated by scientific research and product innovation. He helped build key initiatives that span AWS SageMaker and Bedrock, Amazon S3, AWS App Runner, AWS Fargate, Alexa Health & Wellness, and AWS Payments, from 0-1 incubation to 10x scaling. Open-source enthusiast.

Johna Liu is a Software Development Engineer on the Amazon SageMaker team, where she builds and explores AI/LLM-powered tools that enhance efficiency and enable new capabilities. Outside of work, she enjoys tennis, basketball and baseball.

Anisha Kolla is a Software Development Engineer on the SageMaker Inference team with over 10 years of industry experience. She is passionate about building scalable and efficient solutions that empower customers to deploy and manage machine learning applications seamlessly. Anisha thrives on tackling complex technical challenges and contributing to innovative AI capabilities. Outside of work, she enjoys exploring new Seattle restaurants, traveling, and spending time with family and friends.
