This three-part series demonstrates how to use graph neural networks (GNNs) and Amazon Neptune to generate movie recommendations using the IMDb and Box Office Mojo Movies/TV/OTT licensable data package, which provides a wide range of entertainment metadata, including over 1 billion user ratings; credits for more than 11 million cast and crew members; 9 million movie, TV, and entertainment titles; and global box office reporting data from more than 60 countries. Many AWS media and entertainment customers license IMDb data through AWS Data Exchange to improve content discovery and increase customer engagement and retention.
In Part 1, we discussed the applications of GNNs, and how to transform and prepare our IMDb data for querying. In this post, we discuss the process of using Neptune to generate the embeddings used to conduct our out-of-catalog search in Part 3. We also go over Amazon Neptune ML, the machine learning (ML) feature of Neptune, and the code we use in our development process. In Part 3, we walk through how to apply our knowledge graph embeddings to an out-of-catalog search use case.
Large connected datasets often contain valuable information that can be hard to extract using queries based on human intuition alone. ML techniques can help find hidden correlations in graphs with billions of relationships. These correlations can be helpful for recommending products, predicting credit worthiness, identifying fraud, and many other use cases.
Neptune ML makes it possible to build and train useful ML models on large graphs in hours instead of weeks. To accomplish this, Neptune ML uses GNN technology powered by Amazon SageMaker and the open-source Deep Graph Library (DGL). GNNs are an emerging field in artificial intelligence (for an example, see A Comprehensive Survey on Graph Neural Networks). For a hands-on tutorial about using GNNs with the DGL, see Learning graph neural networks with Deep Graph Library.
In this post, we show how to use Neptune in our pipeline to generate embeddings.
The following diagram depicts the overall flow of IMDb data from download to embedding generation.
We use the following AWS services to implement the solution:
In this post, we walk you through the following high-level steps:
We use the following commands as part of implementing this solution:
We use neptune_ml export to check the status or start a Neptune ML export process, and neptune_ml training to start and check the status of a Neptune ML model training job.
For more information about these and other commands, refer to Using Neptune workbench magics in your notebooks.
To follow along with this post, you should have the following:
Before we begin, set up your environment by defining the following variables: s3_bucket_uri and processed_folder. s3_bucket_uri is the name of the bucket used in Part 1, and processed_folder is the Amazon S3 location for the output from the export job.
In Part 1, we created a SageMaker notebook and export service to export our data from the Neptune DB cluster to Amazon S3 in the required format.
Now that our data is loaded and the export service is created, we need to create an export job and start it. To do this, we use NeptuneExportApiUri and create parameters for the export job. In the following code, we use the variables expo and export_params. Set expo to your NeptuneExportApiUri value, which you can find on the Outputs tab of your CloudFormation stack. For export_params, we use the endpoint of your Neptune cluster and provide the value for outputS3Path, which is the Amazon S3 location for the output from the export job.
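As a rough illustration of what these two variables might hold, the following sketch assembles an export request body. The key names (command, params, profile, cloneCluster, additionalParams, jobSize) are written as we recall them from the Neptune export service documentation and should be verified there; the helper build_export_params, the API URI, the cluster endpoint, and the bucket name are all hypothetical placeholders.

```python
import json


def build_export_params(neptune_endpoint: str, s3_bucket_uri: str) -> dict:
    """Assemble a request body for a property-graph export (hypothetical helper)."""
    return {
        "command": "export-pg",  # property-graph export
        "params": {
            "endpoint": neptune_endpoint,
            "profile": "neptune_ml",  # ask for the layout Neptune ML expects
            "cloneCluster": False,
        },
        # Amazon S3 location that receives the output of the export job
        "outputS3Path": f"{s3_bucket_uri}/neptune-export",
        "additionalParams": {"neptune_ml": {"version": "v2.0"}},
        "jobSize": "medium",
    }


# Placeholder values: replace with the NeptuneExportApiUri output of your
# CloudFormation stack, your cluster endpoint, and your bucket.
expo = "https://example.execute-api.us-east-1.amazonaws.com/v1"
export_params = build_export_params(
    neptune_endpoint="my-cluster.cluster-example.us-east-1.neptune.amazonaws.com",
    s3_bucket_uri="s3://my-bucket",
)
print(json.dumps(export_params, indent=2))
```

Keeping the body in a plain dictionary makes it easy to inspect before the job is submitted.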
To submit the export job, use the following command:
To check the status of the export job, use the following command:
After your job is complete, set the processed_folder variable to provide the Amazon S3 location of the processed results:
Now that the export is done, we create a data processing job to prepare the data for the Neptune ML training process. This can be done a few different ways. For this step, you can change the job_name and modelType variables, but all other parameters must remain the same. The main portion of this code is the modelType parameter, which can either be heterogeneous graph models (heterogeneous) or knowledge graphs (kge).
The export job also includes training-data-configuration.json. Use this file to add or remove any nodes or edges that you don’t want to provide for training (for example, if you want to predict the link between two nodes, you can remove that link in this configuration file). For this blog post, we use the original configuration file. For additional information, see Editing a training configuration file.
Create your data processing job with the following code:
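A minimal sketch of how the data processing parameters could be assembled is shown here. The flag names are written as we recall them from the Neptune workbench magics and should be checked against that documentation; build_processing_params, the job name, and the S3 paths are hypothetical. Only the two modelType values named above are accepted.

```python
VALID_MODEL_TYPES = {"kge", "heterogeneous"}


def build_processing_params(job_name, export_output_uri, processed_folder,
                            model_type="kge"):
    """Build the argument string for a data processing job (hypothetical helper)."""
    if model_type not in VALID_MODEL_TYPES:
        raise ValueError(f"model_type must be one of {sorted(VALID_MODEL_TYPES)}")
    return " ".join([
        "--config-file-name training-data-configuration.json",
        f"--job-id {job_name}",
        f"--s3-input-uri {export_output_uri}",      # output of the export job
        f"--s3-processed-uri {processed_folder}",   # where processed data lands
        f"--model-type {model_type}",
    ])


# Placeholder values for illustration only.
processing_params = build_processing_params(
    job_name="imdb-kge-dp",
    export_output_uri="s3://my-bucket/neptune-export/example-export-id/",
    processed_folder="s3://my-bucket/processed/",
)
print(processing_params)
```

In a Neptune notebook, a string like this would typically be passed to the neptune_ml dataprocessing magic.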
To check the status of the data processing job, use the following command:
After the processing job is complete, we can begin our training job, which is where we create our embeddings. We recommend an instance type of ml.m5.24xlarge, but you can change this to suit your computing needs. See the following code:
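The training parameters can be sketched in the same spirit. The flag names below are again recalled from the Neptune workbench magics and should be verified; build_training_params, the job naming scheme, and the output path layout are assumptions made for illustration. The default instance type is the one recommended above.

```python
def build_training_params(data_processing_id, s3_bucket_uri,
                          instance_type="ml.m5.24xlarge"):
    """Return (job_id, argument string) for a training job (hypothetical helper)."""
    job_id = f"{data_processing_id}-train"  # hypothetical naming scheme
    output_uri = f"{s3_bucket_uri.rstrip('/')}/training/{job_id}/"
    params = " ".join([
        f"--job-id {job_id}",
        f"--data-processing-id {data_processing_id}",  # links to the processed data
        f"--instance-type {instance_type}",
        f"--s3-output-uri {output_uri}",
    ])
    return job_id, params


# Placeholder values for illustration only.
training_job_id, training_params = build_training_params(
    data_processing_id="imdb-kge-dp",
    s3_bucket_uri="s3://my-bucket/",
)
print(training_params)
```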
We print the training_results variable to get the ID for the training job. Use the following command to check the status of your job:
%neptune_ml training status --job-id {training_results['id']} --store-to training_status_results
After your training job is complete, the last step is to download your raw embeddings. The following steps show you how to download embeddings created by using KGE (you can use the same process for RGCN).
In the following code, we use neptune_ml.get_mapping() and get_embeddings() to download the mapping file (mapping.info) and the raw embeddings file (entity.npy). Then we need to map the appropriate embeddings to their corresponding IDs.
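The mapping step can be sketched as follows. This assumes that the mapping file contains a node2id dictionary (original node ID to a per-type local index) and a node2gid lookup (local index to a row in the embedding matrix); that layout is an assumption and should be confirmed against your own mapping.info. The node IDs and vectors are made up, and plain lists stand in for the array loaded from entity.npy.

```python
import csv
import io


def embeddings_by_node_id(mapping, embeddings, node_type):
    """Return {original node ID: embedding vector} for one node type."""
    node2id = mapping["node2id"][node_type]        # node ID -> local index
    local2global = mapping["node2gid"][node_type]  # local index -> embedding row
    return {
        node_id: list(embeddings[local2global[local_idx]])
        for node_id, local_idx in node2id.items()
    }


# Synthetic stand-ins. In practice the mapping would be loaded from
# mapping.info (assumed here to be a pickle) and the matrix from entity.npy.
mapping = {
    "node2id": {"movie": {"tt001": 0, "tt002": 1}, "person": {"nm001": 0}},
    "node2gid": {"movie": [1, 2], "person": [0]},
}
embeddings = [[0.5, 0.6], [0.3, 0.4], [0.7, 0.8]]

movie_vectors = embeddings_by_node_id(mapping, embeddings, "movie")

# Write one CSV row per movie: the ID followed by its embedding values.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["node_id", "dim_0", "dim_1"])
for node_id, vector in movie_vectors.items():
    writer.writerow([node_id, *vector])
print(buffer.getvalue())
```

Going through the global index matters because all node types share a single embedding matrix under this assumed layout, so a per-type local index alone would point at the wrong row.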
To download RGCN embeddings, follow the same process with a new training job name: process the data with the modelType parameter set to heterogeneous, then train your model with the modelName parameter set to rgcn (see here for more details). Once that is finished, call the get_mapping and get_embeddings functions to download your new mapping.info and entity.npy files. After you have the entity and mapping files, the process to create the CSV file is identical.
Finally, upload your embeddings to your desired Amazon S3 location:
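One way to perform the upload from a notebook is with boto3, as in this sketch. The helpers split_s3_uri and upload_embeddings, the bucket, and the file names are hypothetical; upload_file is the standard boto3 S3 client call. boto3 is imported inside the upload function so that the URI helper can be used without AWS dependencies installed.

```python
from urllib.parse import urlparse


def split_s3_uri(s3_uri):
    """Split 's3://bucket/key' into (bucket, key)."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {s3_uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def upload_embeddings(local_path, s3_uri):
    """Upload a local embeddings file to the given S3 URI."""
    import boto3  # lazy import; needs AWS credentials when actually called

    bucket, key = split_s3_uri(s3_uri)
    boto3.client("s3").upload_file(local_path, bucket, key)


# Placeholder destination; upload_embeddings is not called here because it
# requires credentials and an existing bucket.
bucket, key = split_s3_uri("s3://my-bucket/embeddings/kge_embeddings.csv")
print(bucket, key)
```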
Make sure you remember this S3 location; you will need it in Part 3.
When you’re done using the solution, be sure to clean up any resources to avoid ongoing charges.
In this post, we discussed how to use Neptune ML to train GNN embeddings from IMDb data.
Some related applications of knowledge graph embeddings include out-of-catalog search, content recommendations, targeted advertising, predicting missing links, general search, and cohort analysis. Out-of-catalog search is the process of searching for content that you don’t own, and finding or recommending content in your catalog that is as close as possible to what the user searched for. We dive deeper into out-of-catalog search in Part 3.