This three-part series demonstrates how to use graph neural networks (GNNs) and Amazon Neptune to generate movie recommendations using the IMDb and Box Office Mojo Movies/TV/OTT licensable data package, which provides a wide range of entertainment metadata, including over 1 billion user ratings; credits for more than 11 million cast and crew members; 9 million movie, TV, and entertainment titles; and global box office reporting data from more than 60 countries. Many AWS media and entertainment customers license IMDb data through AWS Data Exchange to improve content discovery and increase customer engagement and retention.
In Part 1, we discussed the applications of GNNs and how to transform and prepare our IMDb data for querying. In this post, we discuss how to use Neptune to generate the embeddings that power our out-of-catalog search in Part 3. We also go over Amazon Neptune ML, the machine learning (ML) feature of Neptune, and the code we use in our development process. In Part 3, we walk through how to apply our knowledge graph embeddings to an out-of-catalog search use case.
Large connected datasets often contain valuable information that can be hard to extract using queries based on human intuition alone. ML techniques can help find hidden correlations in graphs with billions of relationships. These correlations can be helpful for recommending products, predicting credit worthiness, identifying fraud, and many other use cases.
Neptune ML makes it possible to build and train useful ML models on large graphs in hours instead of weeks. To accomplish this, Neptune ML uses GNN technology powered by Amazon SageMaker and the open-source Deep Graph Library (DGL). GNNs are an emerging field in artificial intelligence (for an overview, see A Comprehensive Survey on Graph Neural Networks). For a hands-on tutorial about using GNNs with the DGL, see Learning graph neural networks with Deep Graph Library.
In this post, we show how to use Neptune in our pipeline to generate embeddings.
The following diagram depicts the overall flow of IMDb data from download to embedding generation.
We use the following AWS services to implement the solution:
In this post, we walk you through the following high-level steps:
We use the following commands as part of implementing this solution:
We use neptune_ml export to check the status or start a Neptune ML export process, and neptune_ml training to start and check the status of a Neptune ML model training job.
For more information about these and other commands, refer to Using Neptune workbench magics in your notebooks.
To follow along with this post, you should have the following:
Before we begin, set up your environment by defining the following variables: s3_bucket_uri and processed_folder. s3_bucket_uri is the name of the bucket used in Part 1, and processed_folder is the Amazon S3 location for the output from the export job.
In Part 1, we created a SageMaker notebook and export service to export our data from the Neptune DB cluster to Amazon S3 in the required format.
Now that our data is loaded and the export service is created, we need to create an export job and start it. To do this, we use NeptuneExportApiUri and create parameters for the export job. In the following code, we use the variables expo and export_params. Set expo to your NeptuneExportApiUri value, which you can find on the Outputs tab of your CloudFormation stack. For export_params, we use the endpoint of your Neptune cluster and provide the value for outputS3Path, which is the Amazon S3 location for the output from the export job.
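As a rough illustration of the parameters described above, the export request could be assembled as a Python dictionary. This is a minimal sketch, not the exact code from this series: the `export-pg` command, the `neptune_ml` profile, the helper name `build_export_params`, and all placeholder values are assumptions that you should replace with the values for your own stack.

```python
import json


def build_export_params(neptune_endpoint: str, output_s3_path: str) -> dict:
    """Assemble parameters for a Neptune export job (illustrative sketch)."""
    return {
        # Assumed: "export-pg" requests a property-graph export.
        "command": "export-pg",
        "params": {
            "endpoint": neptune_endpoint,
            # Assumed: this profile formats the output for Neptune ML.
            "profile": "neptune_ml",
        },
        # Amazon S3 location that receives the exported files.
        "outputS3Path": output_s3_path,
    }


# Hypothetical placeholder values; substitute your own.
s3_bucket_uri = "s3://<your-bucket>"
processed_folder = f"{s3_bucket_uri}/<export-output-prefix>/"
expo = "<NeptuneExportApiUri from the CloudFormation Outputs tab>"

export_params = build_export_params("<your-neptune-endpoint>", processed_folder)
print(json.dumps(export_params, indent=2))
```

Keeping the parameters in a plain dictionary makes them easy to inspect before you hand them to the export service.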
To submit the export job, use the following command:
To check the status of the export job, use the following command:
After your job is complete, set the processed_folder variable to provide the Amazon S3 location of the processed results:
Now that the export is done, we create a data processing job to prepare the data for the Neptune ML training process. This can be done in a few different ways. For this step, you can change the job_name and modelType variables, but all other parameters must remain the same. The most important setting is the modelType parameter, which can be either heterogeneous (for heterogeneous graph models) or kge (for knowledge graph embedding models).
The export job also includes training-data-configuration.json. Use this file to add or remove any nodes or edges that you don’t want to provide for training (for example, if you want to predict the link between two nodes, you can remove that link in this configuration file). For this blog post, we use the original configuration file. For additional information, see Editing a training configuration file.
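If you do want to remove an edge type before training, one way to do it is to filter the parsed configuration in Python. The sketch below is illustrative only. It assumes the file has a top-level `graph` object with an `edges` list, and that each edge entry stores its type as the second element of a `relation` list. Check your own training-data-configuration.json before relying on this layout. The helper name `drop_edge_type` and the toy configuration are hypothetical.

```python
import copy


def drop_edge_type(config: dict, relation_name: str) -> dict:
    """Return a copy of the config without edges of the given relation type."""
    pruned = copy.deepcopy(config)  # leave the caller's config untouched
    graph = pruned.setdefault("graph", {})
    graph["edges"] = [
        edge
        for edge in graph.get("edges", [])
        # Assumed layout: "relation" is a list whose last item is the edge type.
        if edge.get("relation", ["", ""])[-1] != relation_name
    ]
    return pruned


# Toy configuration with hypothetical edge types.
toy_config = {
    "graph": {
        "nodes": [],
        "edges": [
            {"file_name": "edges/a.csv", "relation": ["", "acted_in"]},
            {"file_name": "edges/b.csv", "relation": ["", "directed"]},
        ],
    }
}

pruned_config = drop_edge_type(toy_config, "directed")
print([edge["relation"][-1] for edge in pruned_config["graph"]["edges"]])
```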
Create your data processing job with the following code:
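A sketch of what this step could look like is shown below. It only builds the argument string; in a Neptune notebook you would then pass that string to the `%neptune_ml dataprocessing start` magic. The helper name `build_processing_params`, the `-DP` job ID suffix, the `preloading` output prefix, and the exact flag names are assumptions based on the workflow described in this post, so verify them against the Neptune workbench magics documentation.

```python
ALLOWED_MODEL_TYPES = ("heterogeneous", "kge")


def build_processing_params(
    job_name: str,
    export_output_uri: str,
    s3_bucket_uri: str,
    model_type: str = "kge",
) -> str:
    """Build the argument string for a data processing job (sketch)."""
    if model_type not in ALLOWED_MODEL_TYPES:
        raise ValueError(f"modelType must be one of {ALLOWED_MODEL_TYPES}")
    return " ".join(
        [
            # The configuration file produced by the export job.
            "--config-file-name training-data-configuration.json",
            f"--job-id {job_name}-DP",
            f"--s3-input-uri {export_output_uri}",
            f"--s3-processed-uri {s3_bucket_uri}/preloading",
            f"--model-type {model_type}",
        ]
    )


# Hypothetical placeholder values.
processing_params = build_processing_params(
    job_name="<your-job-name>",
    export_output_uri="s3://<your-bucket>/<export-output-prefix>/",
    s3_bucket_uri="s3://<your-bucket>",
)
print(processing_params)
# In a notebook cell (not plain Python), assuming the magic accepts this form:
# %neptune_ml dataprocessing start --wait --store-to processing_results {processing_params}
```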
To check the status of the data processing job, use the following command:
After the processing job is complete, we can begin our training job, which is where we create our embeddings. We recommend an instance type of ml.m5.24xlarge, but you can change this to suit your computing needs. See the following code:
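The training arguments could be assembled along these lines. As before, this is a hedged sketch: the helper name `build_training_params`, the `train-` prefix, the `-DP` suffix on the data processing ID, the output prefix, and the flag names are assumptions rather than the exact code from this series.

```python
def build_training_params(
    job_name: str,
    s3_bucket_uri: str,
    instance_type: str = "ml.m5.24xlarge",
) -> str:
    """Build the argument string for a model training job (sketch)."""
    return " ".join(
        [
            f"--job-id train-{job_name}",
            # Must match the ID used for the data processing job.
            f"--data-processing-id {job_name}-DP",
            # Instance type recommended in this post; change it to suit your needs.
            f"--instance-type {instance_type}",
            f"--s3-output-uri {s3_bucket_uri}/training/{job_name}/",
        ]
    )


# Hypothetical placeholder values.
training_params = build_training_params("<your-job-name>", "s3://<your-bucket>")
print(training_params)
# In a notebook cell (not plain Python), assuming the magic accepts this form:
# %neptune_ml training start --wait --store-to training_results {training_params}
```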
We print the training_results variable to get the ID for the training job. Use the following command to check the status of your job:
%neptune_ml training status --job-id {training_results['id']} --store-to training_status_results
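Training can take a while, so you may prefer to poll until the job reaches a terminal state rather than rerun the status command by hand. The following generic sketch shows one way to do that. The `get_status` callable is a hypothetical stand-in for whatever you use to read the job status, and the status strings are assumptions about the values the service reports.

```python
import time
from typing import Callable

# Assumed terminal status values; adjust them to what your jobs actually report.
TERMINAL_STATES = {"Completed", "Failed", "Stopped"}


def wait_for_job(
    get_status: Callable[[], str],
    poll_seconds: float = 60.0,
    max_polls: int = 1000,
) -> str:
    """Poll get_status until it returns a terminal state, then return that state."""
    for _ in range(max_polls):
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"Job did not finish after {max_polls} polls")


# Usage with a simulated job that completes on the third check.
simulated = iter(["InProgress", "InProgress", "Completed"])
print(wait_for_job(lambda: next(simulated), poll_seconds=0))
```

Passing the status lookup in as a callable keeps the loop independent of how the status is fetched, so the same helper works for export, data processing, and training jobs.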
After your training job is complete, the last step is to download your raw embeddings. The following steps show you how to download embeddings created by using KGE (you can use the same process for RGCN).
In the following code, we use neptune_ml.get_mapping() and get_embeddings() to download the mapping file (mapping.info) and the raw embeddings file (entity.npy). Then we need to map the appropriate embeddings to their corresponding IDs.
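The mapping step can be sketched as follows, using only the standard library. This is not the post's original code. It assumes you have already loaded the mapping as a dictionary from node ID to row index, and the embeddings as a sequence of vectors (for example, the array loaded from entity.npy). The function name `write_embeddings_csv`, the column names, and the sample IDs are hypothetical.

```python
import csv
from typing import Mapping, Sequence


def write_embeddings_csv(
    node_to_index: Mapping[str, int],
    embeddings: Sequence[Sequence[float]],
    out_path: str,
) -> int:
    """Write one CSV row per node: its ID followed by its embedding values."""
    dim = len(embeddings[0]) if len(embeddings) else 0
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["node_id"] + [f"dim_{i}" for i in range(dim)])
        # Iterate in row-index order so the output is deterministic.
        for node_id, idx in sorted(node_to_index.items(), key=lambda kv: kv[1]):
            writer.writerow([node_id] + [float(x) for x in embeddings[idx]])
    return len(node_to_index)


# Toy data standing in for the real mapping and embedding matrix.
toy_mapping = {"node-a": 0, "node-b": 1}
toy_embeddings = [[0.5, 1.5], [2.5, 3.5]]
rows_written = write_embeddings_csv(toy_mapping, toy_embeddings, "embeddings_sample.csv")
print(rows_written)
```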
To download RGCN embeddings, follow the same process with a new training job name: process the data with the modelType parameter set to heterogeneous, then train your model with the modelName parameter set to rgcn (refer to the Neptune ML documentation for more details). Once that is finished, call the get_mapping and get_embeddings functions to download your new mapping.info and entity.npy files. After you have the entity and mapping files, the process to create the CSV file is identical.
Finally, upload your embeddings to your desired Amazon S3 location:
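One possible way to perform the upload is with the `upload_file` method of a boto3 S3 client, sketched below. The helper names, the local file name, and the destination URI are hypothetical placeholders. The client is passed in as an argument so that the URI handling can be exercised without AWS credentials.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split an s3://bucket/key URI into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not a valid S3 URI: {uri}")
    return parsed.netloc, parsed.path.lstrip("/")


def upload_embeddings(s3_client, local_path: str, dest_uri: str) -> str:
    """Upload a local file to dest_uri and return the final S3 location."""
    bucket, key = split_s3_uri(dest_uri)
    # upload_file(Filename, Bucket, Key) is the boto3 S3 client upload call.
    s3_client.upload_file(local_path, bucket, key)
    return f"s3://{bucket}/{key}"


# Usage (requires boto3 and AWS credentials; shown as comments only):
# import boto3
# location = upload_embeddings(
#     boto3.client("s3"),
#     "embeddings.csv",
#     "s3://<your-bucket>/<your-prefix>/embeddings.csv",
# )
```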
Make a note of this S3 location; you will need it in Part 3.
When you’re done using the solution, be sure to clean up any resources to avoid ongoing charges.
In this post, we discussed how to use Neptune ML to train GNN embeddings from IMDb data.
Knowledge graph embeddings have related applications in out-of-catalog search, content recommendations, targeted advertising, predicting missing links, general search, and cohort analysis. Out-of-catalog search is the process of searching for content that you don’t own and then finding or recommending content in your catalog that is as close as possible to what the user searched for. We dive deeper into out-of-catalog search in Part 3.