Since its launch in 2018, Just Walk Out technology by Amazon has transformed the shopping experience by allowing customers to enter a store, pick up items, and leave without standing in line to pay. You can find this checkout-free technology in over 180 third-party locations worldwide, including travel retailers, sports stadiums, entertainment venues, conference centers, theme parks, convenience stores, hospitals, and college campuses. Just Walk Out technology’s end-to-end system automatically determines which products each customer chose in the store and provides digital receipts, eliminating the need for checkout lines.
In this post, we showcase the latest generation of Just Walk Out technology by Amazon, powered by a multi-modal foundation model (FM). We designed this multi-modal FM for physical stores using a transformer-based architecture similar to that underlying many generative artificial intelligence (AI) applications. The model will help retailers generate highly accurate shopping receipts using data from multiple inputs, including a network of overhead video cameras, specialized weight sensors on shelves, digital floor plans, and catalog images of products. In plain terms, a multi-modal model is one that combines several different kinds of input data rather than relying on a single source.
Our research and development (R&D) investments in state-of-the-art multi-modal FMs enable the Just Walk Out system to be deployed in a wide range of shopping situations with greater accuracy and at lower cost. Similar to large language models (LLMs) that generate text, the new Just Walk Out system is designed to generate an accurate sales receipt for every shopper visiting the store.
Because of their innovative checkout-free environment, Just Walk Out stores presented us with a unique technical challenge. Retailers and shoppers as well as Amazon demand nearly 100 percent checkout accuracy, even in the most complex shopping situations. These include unusual shopping behaviors that can create a long and complicated sequence of activities requiring additional effort to analyze what happened.
Previous generations of the Just Walk Out system used a modular architecture that tackled complex shopping situations by breaking down the shopper’s visit into discrete tasks, such as detecting shopper interactions, tracking items, identifying products, and counting what is selected. These individual components were then integrated into sequential pipelines to provide the overall system functionality. Although this approach produced highly accurate receipts, significant engineering effort was required to handle new, previously unencountered situations and complex shopping scenarios, which limited the scalability of the approach.
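To make the idea of a sequential pipeline of discrete tasks concrete, the following is a minimal sketch in Python. It is not the Just Walk Out implementation: the stage names, the `VisitState` structure, and the event format are all hypothetical and exist only to illustrate how hand-built stages are chained together.

```python
from dataclasses import dataclass, field
from typing import Callable

# All names here are hypothetical illustrations of a modular pipeline,
# not actual Just Walk Out components.

@dataclass
class VisitState:
    events: list[dict]                                   # raw events for one shopper visit
    interactions: list[dict] = field(default_factory=list)
    receipt: dict[str, int] = field(default_factory=dict)

def detect_interactions(state: VisitState) -> VisitState:
    # Stage 1: keep only events that look like a pick or a return.
    state.interactions = [e for e in state.events if e["kind"] in ("pick", "return")]
    return state

def count_items(state: VisitState) -> VisitState:
    # Stage 2: a pick adds one unit of a product, a return removes one.
    for e in state.interactions:
        delta = 1 if e["kind"] == "pick" else -1
        state.receipt[e["product"]] = state.receipt.get(e["product"], 0) + delta
    # Drop products whose net count is zero or negative.
    state.receipt = {p: n for p, n in state.receipt.items() if n > 0}
    return state

# The pipeline is an ordered list of stages; each consumes the previous stage's output.
PIPELINE: list[Callable[[VisitState], VisitState]] = [detect_interactions, count_items]

def run_pipeline(events: list[dict]) -> dict[str, int]:
    state = VisitState(events=events)
    for stage in PIPELINE:
        state = stage(state)
    return state.receipt
```

In a design like this, every stage encodes human-defined assumptions about its inputs and outputs, so an unfamiliar shopping behavior typically means revisiting one or more stages by hand.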
To meet these challenges, we introduced a new multi-modal FM that we designed specifically for retail store environments, enabling Just Walk Out technology to handle complex real-world shopping scenarios. The new multi-modal FM further enhances the Just Walk Out system’s capabilities by generalizing more effectively to new store formats, products, and customer behaviors, which is crucial for scaling up Just Walk Out technology.
The incorporation of continuous learning enables the model training to automatically adapt and learn from new challenging scenarios as they arise. This self-improving capability helps ensure the system maintains high performance, even as shopping environments continue to evolve.
Through this combination of end-to-end learning and enhanced generalization, the Just Walk Out system can tackle a wider range of dynamic and complex retail settings. Retailers can confidently deploy this technology, knowing it will provide a frictionless checkout-free experience for their customers.
The following video shows our system’s architecture in action.
Key elements of our Just Walk Out multi-modal AI model include:
By feeding vast amounts of multi-modal data into the Just Walk Out FM, we found it could consistently generate (or, technically, “predict”) accurate receipts for shoppers. To improve accuracy, we designed over 10 auxiliary tasks, such as detection, tracking, image segmentation, grounding (linking abstract concepts to real-world objects), and activity recognition. All of these are learned within a single model, enhancing the model’s ability to handle new, never-before-seen store formats, products, and customer behaviors. This is crucial for bringing Just Walk Out technology to new locations.
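The post does not say how the receipt objective and the auxiliary tasks are combined during training. One common multi-task learning approach, shown here purely as an assumption, is to add a weighted sum of the auxiliary losses to the primary loss. The function name, task names, and weights below are hypothetical.

```python
def combined_loss(
    receipt_loss: float,
    auxiliary_losses: dict[str, float],
    weights: dict[str, float],
    default_weight: float = 0.1,
) -> float:
    """Primary loss plus a weighted sum of auxiliary task losses.

    Assumption: a simple weighted sum; the actual training objective
    used for the Just Walk Out FM is not described in the post.
    """
    total = receipt_loss
    for task, loss in auxiliary_losses.items():
        # Tasks without an explicit weight fall back to default_weight.
        total += weights.get(task, default_weight) * loss
    return total


# Example: two auxiliary tasks, one with an explicit weight.
loss = combined_loss(
    receipt_loss=1.0,
    auxiliary_losses={"detection": 2.0, "tracking": 4.0},
    weights={"detection": 0.5},
)
```

Because every task contributes to a single scalar, one optimizer step updates the shared model for all tasks at once, which is one way a single model can learn them jointly.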
AI model training—in which curated data is fed to selected algorithms—helps the system refine itself to produce accurate results. We quickly discovered we could accelerate the training of our model by using a data flywheel that continuously mines and labels high-quality data in a self-reinforcing cycle. The system is designed to integrate these progressive improvements with minimal manual intervention. The following diagram illustrates the process.
To train an FM effectively, we invested in a robust infrastructure that can efficiently process the massive amounts of data needed to train high-capacity neural networks that mimic human decision-making. We built the infrastructure for our Just Walk Out model with the help of several Amazon Web Services (AWS) services, including Amazon Simple Storage Service (Amazon S3) for data storage and Amazon SageMaker for training.
Here are some key steps we followed in training our FM:
As the data flywheel continues to operate, it will progressively identify and incorporate more high-quality, challenging cases to test the robustness of the model. These additional difficult samples are then fed into the training set, further enhancing the model’s accuracy and applicability across new physical store environments.
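One round of such a flywheel can be sketched as follows. The post does not describe how challenging cases are selected or labeled, so this sketch assumes a simple rule (treat low-confidence predictions as hard cases) and uses hypothetical `predict` and `label_fn` callables supplied by the caller.

```python
from typing import Any, Callable

def mine_hard_samples(
    samples: list[Any],
    predict: Callable[[Any], tuple[Any, float]],
    confidence_threshold: float = 0.8,
) -> list[Any]:
    """Return samples the current model is unsure about.

    Assumption: `predict` returns (prediction, confidence in [0, 1]) and
    low confidence marks a sample as challenging.
    """
    hard = []
    for sample in samples:
        _prediction, confidence = predict(sample)
        if confidence < confidence_threshold:
            hard.append(sample)
    return hard

def flywheel_round(
    training_set: list[tuple[Any, Any]],
    new_samples: list[Any],
    predict: Callable[[Any], tuple[Any, float]],
    label_fn: Callable[[Any], Any],
    confidence_threshold: float = 0.8,
) -> list[tuple[Any, Any]]:
    """Mine hard samples, label them, and return the enlarged training set."""
    hard = mine_hard_samples(new_samples, predict, confidence_threshold)
    labeled = [(sample, label_fn(sample)) for sample in hard]
    # Return a new list so the caller's original training set is not mutated.
    return training_set + labeled
```

Repeating `flywheel_round` after each retraining yields the self-reinforcing cycle: a better model changes which samples count as hard, and those samples in turn feed the next training run.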
In this post, we showed how our multi-modal AI system opens up significant new possibilities for Just Walk Out technology. With our innovative approach, we are moving away from modular AI systems that rely on human-defined subcomponents and interfaces. Instead, we’re building simpler and more scalable AI systems that can be trained end-to-end. Although we’ve just scratched the surface, multi-modal AI has raised the bar for our already highly accurate receipt system and will enable us to improve the shopping experience at more Just Walk Out technology stores around the world.
Visit About Amazon to read the official announcement about the new multi-modal AI system and learn more about the latest improvements in Just Walk Out technology.
To find a store or venue that uses Just Walk Out technology, visit Just Walk Out technology locations near you. Learn more about how to power your store or venue with Just Walk Out technology by Amazon on the Just Walk Out technology product page.
Visit Build and scale the next wave of AI innovation on AWS to learn more about how AWS can reinvent customer experiences with the most comprehensive set of AI and ML services.