Enhancing Just Walk Out technology with multi-modal AI

Since its launch in 2018, Just Walk Out technology by Amazon has transformed the shopping experience by allowing customers to enter a store, pick up items, and leave without standing in line to pay. You can find this checkout-free technology in over 180 third-party locations worldwide, including travel retailers, sports stadiums, entertainment venues, conference centers, theme parks, convenience stores, hospitals, and college campuses. Just Walk Out technology’s end-to-end system automatically determines which products each customer chose in the store and provides digital receipts, eliminating the need for checkout lines.

In this post, we showcase the latest generation of Just Walk Out technology by Amazon, powered by a multi-modal foundation model (FM). We designed this multi-modal FM for physical stores using a transformer-based architecture similar to the one underlying many generative artificial intelligence (AI) applications. The model will help retailers generate highly accurate shopping receipts using data from multiple inputs, including a network of overhead video cameras, specialized weight sensors on shelves, digital floor plans, and catalog images of products. In plain terms, a multi-modal model is one that draws on data from several types of inputs at once.
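
As a rough sketch of what those multiple inputs look like when bundled together, the structure below groups the four modalities into a single observation. The field names and array shapes are illustrative assumptions, not a production schema.

```python
from dataclasses import dataclass
import numpy as np

# Hypothetical sketch of the four input modalities named above. The field
# names and array shapes are illustrative assumptions, not Amazon's schema.
@dataclass
class StoreObservation:
    camera_frames: np.ndarray   # (num_cameras, height, width, 3) overhead RGB views
    shelf_weights: np.ndarray   # (num_sensors,) weight readings from shelf sensors
    floor_plan: np.ndarray      # (H, W) rasterized digital floor plan of the store
    catalog_images: np.ndarray  # (num_products, h, w, 3) reference product photos
```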

Our research and development (R&D) investments in state-of-the-art multi-modal FMs enable the Just Walk Out system to be deployed in a wide range of shopping situations with greater accuracy and at lower cost. Similar to large language models (LLMs) that generate text, the new Just Walk Out system is designed to generate an accurate sales receipt for every shopper visiting the store.

The challenge: Tackling complicated long-tail shopping scenarios

Because of their innovative checkout-free environment, Just Walk Out stores presented us with a unique technical challenge. Retailers, shoppers, and Amazon alike demand nearly 100 percent checkout accuracy, even in the most complex shopping situations, including unusual shopping behaviors that can create long, complicated sequences of activities requiring additional effort to analyze what happened.

Previous generations of the Just Walk Out system used a modular architecture: they tackled complex shopping situations by breaking down the shopper's visit into discrete tasks, such as detecting shopper interactions, tracking items, identifying products, and counting what was selected. These individual components were then integrated into sequential pipelines to enable the overall system functionality. While this approach produced highly accurate receipts, significant engineering effort was required to address challenges in new, previously unencountered situations and complex shopping scenarios, which limited the scalability of this approach.

The solution: Just Walk Out multi-modal AI

To meet these challenges, we introduced a new multi-modal FM that we designed specifically for retail store environments, enabling Just Walk Out technology to handle complex real-world shopping scenarios. The new multi-modal FM further enhances the Just Walk Out system’s capabilities by generalizing more effectively to new store formats, products, and customer behaviors, which is crucial for scaling up Just Walk Out technology.

Incorporating continuous learning enables the model to automatically adapt to and learn from new challenging scenarios as they arise. This self-improving capability helps ensure the system maintains high performance, even as shopping environments continue to evolve.

Through this combination of end-to-end learning and enhanced generalization, the Just Walk Out system can tackle a wider range of dynamic and complex retail settings. Retailers can confidently deploy this technology, knowing it will provide a frictionless checkout-free experience for their customers.

The following video shows our system’s architecture in action.

Key elements of our Just Walk Out multi-modal AI model include:

  • Flexible data inputs – The system tracks how shoppers interact with products and fixtures, such as shelves or fridges. It relies primarily on multi-view video feeds as inputs, using weight sensors only to track small items. The model maintains a digital 3D representation of the store and can access catalog images to identify products, even if a shopper returns an item to the wrong shelf.
  • Multi-modal AI tokens to represent shoppers’ journeys – The multi-modal data inputs are processed by encoders, which compress them into transformer tokens, the basic unit of input for the receipt model. This allows the model to interpret hand movements, differentiate between items, and accurately count the number of items picked up or returned to the shelf with speed and precision (see the sketch after this list).
  • Continuously updating receipts – The system uses these tokens to create digital receipts for each shopper. It can differentiate between concurrent shopper sessions and dynamically updates each receipt as shoppers pick up or return items.
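
The following minimal sketch, written in PyTorch, illustrates the token pipeline described in this list: per-modality encoders compress inputs into transformer tokens, and a transformer consumes them to score receipt updates. The dimensions, module choices, and single-model design are illustrative assumptions, not the production Just Walk Out architecture.

```python
import torch
import torch.nn as nn

# Minimal sketch of the token pipeline: per-modality encoders produce tokens,
# a transformer fuses them, and a head scores products for the receipt.
# All dimensions and module choices are illustrative assumptions.
class ReceiptModel(nn.Module):
    def __init__(self, d_model=512, n_products=10_000):
        super().__init__()
        # Per-modality encoders compress raw features into transformer tokens.
        self.video_encoder = nn.Linear(2048, d_model)   # stand-in for a video backbone
        self.weight_encoder = nn.Linear(1, d_model)     # one token per weight reading
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=6)
        # Head scores each catalog product for inclusion on the receipt.
        self.receipt_head = nn.Linear(d_model, n_products)

    def forward(self, video_feats, weight_feats):
        # video_feats: (batch, v_tokens, 2048); weight_feats: (batch, w_tokens, 1)
        tokens = torch.cat(
            [self.video_encoder(video_feats), self.weight_encoder(weight_feats)],
            dim=1,
        )
        ctx = self.transformer(tokens)              # fuse tokens across modalities
        return self.receipt_head(ctx.mean(dim=1))   # pooled session -> product scores

# Toy usage: 2 sessions, 16 video tokens and 4 weight-sensor tokens each.
model = ReceiptModel()
scores = model(torch.randn(2, 16, 2048), torch.randn(2, 4, 1))
```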

Training the Just Walk Out FM

By feeding vast amounts of multi-modal data into the Just Walk Out FM, we found it could consistently generate (or, technically, “predict”) accurate receipts for shoppers. To improve accuracy, we designed over 10 auxiliary tasks, such as detection, tracking, image segmentation, grounding (linking abstract concepts to real-world objects), and activity recognition. All of these are learned within a single model, enhancing the model’s ability to handle new, never-before-seen store formats, products, and customer behaviors. This is crucial for bringing Just Walk Out technology to new locations.
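
As a simplified sketch of how auxiliary tasks can be learned within a single model, the snippet below combines per-task losses into one weighted training objective. The task weights and the use of a single cross-entropy loss for every task are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# Simplified sketch: the receipt task is trained jointly with auxiliary tasks
# via a weighted sum of per-task losses. The weights and the uniform use of
# cross-entropy are illustrative assumptions, not the production objective.
TASK_WEIGHTS = {"receipt": 1.0, "detection": 0.2, "tracking": 0.2,
                "segmentation": 0.1, "grounding": 0.1, "activity": 0.1}

def multi_task_loss(outputs, targets):
    """outputs: task name -> logits; targets: task name -> class labels."""
    total = torch.tensor(0.0)
    for task, weight in TASK_WEIGHTS.items():
        total = total + weight * F.cross_entropy(outputs[task], targets[task])
    return total

# Toy usage with random logits and labels for each task.
outputs = {t: torch.randn(4, 10) for t in TASK_WEIGHTS}
targets = {t: torch.randint(0, 10, (4,)) for t in TASK_WEIGHTS}
loss = multi_task_loss(outputs, targets)
```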

AI model training—in which curated data is fed to selected algorithms—helps the system refine itself to produce accurate results. We quickly discovered we could accelerate the training of our model by using a data flywheel that continuously mines and labels high-quality data in a self-reinforcing cycle. The system is designed to integrate these progressive improvements with minimal manual intervention. The following diagram illustrates the process.
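
Conceptually, the flywheel is a loop of mining hard cases, auto-labeling them, and retraining. In the sketch below, every function is a hypothetical placeholder for a production component, not a real API.

```python
# Conceptual sketch of the data flywheel: mine hard cases, auto-label them,
# and retrain in a self-reinforcing loop. All names here are hypothetical
# placeholders for production components.

def mine_hard_cases(model, sessions, threshold=0.9):
    """Keep sessions where the model's receipt confidence is low."""
    return [s for s in sessions if model.confidence(s) < threshold]

def run_flywheel(model, production_sessions, training_set, auto_label, train,
                 num_cycles=5):
    for _ in range(num_cycles):
        hard_cases = mine_hard_cases(model, production_sessions)
        labeled = [auto_label(case) for case in hard_cases]  # automated labeling
        training_set.extend(labeled)           # fold new examples into the set
        model = train(model, training_set)     # retrain with minimal manual work
    return model
```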

To train an FM effectively, we invested in a robust infrastructure that can efficiently process the massive amounts of data needed to train high-capacity neural networks that mimic human decision-making. We built the infrastructure for our Just Walk Out model with the help of several Amazon Web Services (AWS) services, including Amazon Simple Storage Service (Amazon S3) for data storage and Amazon SageMaker for training.
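
For illustration, a training job of this kind can be launched with the SageMaker Python SDK, reading data directly from Amazon S3. The script name, IAM role, bucket, instance type, and hyperparameters below are placeholders; the actual Just Walk Out training configuration is not described in this post.

```python
from sagemaker.pytorch import PyTorch

# Minimal sketch of launching a distributed training job with the SageMaker
# Python SDK, with training data read from Amazon S3. All names and values
# below are placeholders, not the real Just Walk Out setup.
estimator = PyTorch(
    entry_point="train_receipt_model.py",  # hypothetical training script
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",  # placeholder
    instance_count=8,                      # multi-node distributed training
    instance_type="ml.p4d.24xlarge",       # GPU instances for large models
    framework_version="2.1",
    py_version="py310",
    hyperparameters={"epochs": 10, "batch-size": 256},
)

# Each named channel maps to an S3 prefix that SageMaker provides to the job.
estimator.fit({"train": "s3://example-bucket/multimodal-training-data/"})
```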

Here are some key steps we followed in training our FM:

  • Selecting challenging data sources – To train our AI model for Just Walk Out technology, we focus on training data from especially difficult shopping scenarios that test the limits of our model. Although these complex cases constitute only a small fraction of shopping data, they are the most valuable for helping the model learn from its mistakes.
  • Leveraging auto labeling – To increase operational efficiency, we developed algorithms and models that automatically attach meaningful labels to the data. In addition to receipt prediction, our automated labeling algorithms cover the auxiliary tasks, ensuring the model gains comprehensive multi-modal understanding and reasoning capabilities.
  • Pre-training the model – Our FM is pre-trained on a vast collection of multi-modal data across a diverse range of tasks, which enhances the model’s ability to generalize to new store environments never encountered before.
  • Fine-tuning the model – Finally, we refined the model further and used quantization techniques to create a smaller, more efficient model that can run on edge computing hardware in stores (see the sketch after this list).
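
As one example of the kind of quantization that shrinks a model for edge deployment, PyTorch's post-training dynamic quantization converts linear-layer weights to 8-bit integers. This is an illustrative technique choice; the post does not specify the exact method used.

```python
import torch
import torch.nn as nn

# Illustrative sketch: post-training dynamic quantization with PyTorch, one
# common way to shrink a fine-tuned model for edge hardware. The model below
# is a stand-in, and the technique choice is an assumption.
model = nn.Sequential(
    nn.Linear(512, 512),
    nn.ReLU(),
    nn.Linear(512, 10_000),  # stand-in for the receipt prediction head
)

quantized = torch.quantization.quantize_dynamic(
    model,
    {nn.Linear},        # layer types whose weights are quantized
    dtype=torch.qint8,  # 8-bit integer weights
)
# The quantized model is smaller and faster on CPU, suiting in-store edge hardware.
```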

As the data flywheel continues to operate, it will progressively identify and incorporate more high-quality, challenging cases to test the robustness of the model. These additional difficult samples are then fed into the training set, further enhancing the model’s accuracy and applicability across new physical store environments.

Conclusion

In this post, we showed how our multi-modal AI system opens up significant new possibilities for Just Walk Out technology. With our innovative approach, we are moving away from modular AI systems that rely on human-defined subcomponents and interfaces. Instead, we’re building simpler and more scalable AI systems that can be trained end-to-end. Although we’ve just scratched the surface, multi-modal AI has raised the bar for our already highly accurate receipt system and will enable us to improve the shopping experience at more Just Walk Out technology stores around the world.

Visit About Amazon to read the official announcement about the new multi-modal AI system and learn more about the latest improvements in Just Walk Out technology.

To find Just Walk Out technology locations, visit Just Walk Out technology locations near you. Learn more about how to power your store or venue with Just Walk Out technology by Amazon on the Just Walk Out technology product page.

Visit Build and scale the next wave of AI innovation on AWS to learn more about how AWS can reinvent customer experiences with the most comprehensive set of AI and ML services.


About the Authors

Tian Lan is a Principal Scientist at AWS. He currently leads the research efforts in developing the next-generation Just Walk Out 2.0 technology, transforming it into an end-to-end learned, store domain–focused multi-modal foundation model.

Chris Broaddus is a Senior Manager at AWS. He currently manages all the research efforts for Just Walk Out technology, including the multi-modal AI model and other projects, such as deep learning for human pose estimation and Radio Frequency Identification (RFID) receipt prediction.
