Amazon Neptune ML is a machine learning (ML) capability of Amazon Neptune that helps you make accurate and fast predictions on your graph data. Under the hood, Neptune ML uses Graph Neural Networks (GNNs) to take advantage of both graph structure and node/edge properties to solve the task at hand. Traditional methods either use only properties and no graph structure (e.g., XGBoost, neural networks), or only graph structure and no properties (e.g., node2vec, label propagation). To make good use of the node/edge properties, ML algorithms require well-behaved numerical data, but raw data in a database can have other types, such as raw text. To make use of these other types of data, we need specialized processing steps that convert them from their native type into numerical data, and the quality of the ML results is strongly dependent on the quality of these data transformations. Raw text, like sentences, is among the most difficult types to transform, but recent progress in the field of Natural Language Processing (NLP) has produced strong methods that can handle text in many languages and of widely varying lengths.
Beginning with version 1.1.0.0, Neptune ML supports multiple text encoders (text_fasttext, text_sbert, text_word2vec, and text_tfidf), which bring the benefits of recent advances in NLP and enable support for multilingual text properties as well as additional inference requirements around languages and text length. For example, in a job recommendation use case, job posts in different countries can be described in different languages, and the lengths of job descriptions vary considerably. Additionally, Neptune ML supports an auto option that automatically chooses the best encoding method based on the characteristics of the text feature in the data.
In this post, we illustrate the usage of each text encoder, compare their advantages and disadvantages, and show an example of how to choose the right text encoders for a job recommendation task.
The goal of text encoding is to convert the text-based edge/node properties in Neptune into fixed-size vectors for use in downstream machine learning models for either node classification or link prediction tasks. The length of the text feature can vary a lot. It can be a word, phrase, sentence, paragraph, or even a document with multiple sentences (the maximum size of a single property is 55 MB in Neptune). Additionally, the text features can be in different languages. There may also be sentences that contain words in several different languages, which we define as code-switching.
Beginning with the 1.1.0.0 release, Neptune ML allows you to choose from several different text encoders. Each encoder works slightly differently, but has the same goal of converting a text value field from Neptune into a fixed-size vector that we use to build our GNN model using Neptune ML. The new encoders are as follows:
- text_fasttext – Recommended for features that use one and only one of the five languages that fastText supports (English, Chinese, Hindi, Spanish, and French). The text_fasttext method can optionally take a max_length field, which specifies the maximum number of tokens in a text property value that will be encoded, after which the string is truncated. You can regard a token as a word. This can improve performance when text property values contain long strings, because if max_length is not specified, fastText encodes all the tokens regardless of the string length.
- text_sbert – Recommended when the language is not supported by text_fasttext. Neptune supports two SBERT methods: text_sbert128, which is the default if you just specify text_sbert, and text_sbert512. The difference between them is the maximum number of tokens in a text property that get encoded: text_sbert128 encodes only the first 128 tokens, whereas text_sbert512 encodes up to 512 tokens. As a result, text_sbert512 can require more processing time than text_sbert128. Both methods are slower than text_fasttext.

Note that text_word2vec and text_tfidf were already supported in earlier releases; the new text_fasttext and text_sbert methods are recommended over them.
The following table shows a detailed comparison of the supported model-based text encoding options (text_fasttext, text_sbert, and text_word2vec). text_tfidf is not a model-based encoding method, but rather a count-based measure that evaluates how relevant a token (for example, a word) is to the text features in other nodes or edges, so we don't include it in the comparison; a short sketch of the idea follows the table. We recommend using text_tfidf when you want to quantify the importance or relevance of certain words in one node or edge property relative to those in all the other node or edge properties.
| | | text_fasttext | text_sbert | text_word2vec |
| --- | --- | --- | --- | --- |
| Model capability | Supported languages | English, Chinese, Hindi, Spanish, and French | More than 50 languages | English |
| Model capability | Can encode text properties that contain words in different languages | No | Yes | No |
| Model capability | Max-length support | No maximum length limit | Encodes the text sequence with a maximum length of 128 or 512 tokens | No maximum length limit |
| Time cost | Loading | Approximately 10 seconds | Approximately 2 seconds | Approximately 2 seconds |
| Time cost | Inference | Fast | Slow | Medium |
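To build intuition for what the count-based text_tfidf encoding measures, here is a minimal sketch using scikit-learn's TfidfVectorizer. This illustrates the general TF-IDF idea rather than Neptune ML's internal implementation, and the toy job titles are made up for the example.

```python
# A minimal sketch of the count-based idea behind text_tfidf, using
# scikit-learn rather than Neptune ML's internal implementation.
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the same text property across three nodes.
titles = [
    "human resources manager",
    "human resources administrator",
    "senior payroll specialist",
]

vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(titles)  # one fixed-size vector per node

# Tokens that appear in many nodes (such as "human") receive lower weights
# than tokens that are distinctive to a single node (such as "payroll").
for token, column in sorted(vectorizer.vocabulary_.items()):
    print(f"{token}: {vectors[0, column]:.3f}")
```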
Note the following usage tips:

- text_fasttext is the recommended encoding. However, it can't handle cases where the same sentence contains words in more than one language. For languages other than the five that fastText supports, use the text_sbert encoding.
- If your text property values contain long strings, use the max_length field to limit the number of tokens in each string that text_fasttext encodes.

To summarize, depending on your use case, we recommend the following encoding method:
- If your text properties are in one of the five languages that fastText supports, use text_fasttext due to its fast inference. text_fasttext is the recommended choice, and you can also use text_sbert in the following two exceptions.
- If your text properties contain words in several different languages, use text_sbert, because it's the only supported method that can encode text properties containing words in several different languages.
- If your text properties are in a language that fastText doesn't support, use text_sbert, because it supports more than 50 languages.
- If your text properties are longer than 128 tokens, consider text_sbert512 or text_fasttext. Both methods can encode longer text sequences.
- If your text properties are in English only, you can use text_word2vec, but we recommend text_fasttext for its fast inference.

The goal of the job recommendation task is to predict which jobs users will apply for based on their previous applications, demographic information, and work history. This post uses an open Kaggle dataset. We construct the dataset as a graph with three node types: job, user, and city.
A job is characterized by its title, description, requirements, and the city and state where it is located. A user is described by properties such as major, degree type, number of work history entries, total years of work experience, and more. For this use case, job titles, job descriptions, job requirements, and majors are all in the form of text.
The city node type (for example, Washington DC or Orlando FL) has only an identifier for each node. In the following section, we analyze the characteristics of the different text features and illustrate how to select the proper text encoders for different text properties.
For our example, the Major and Title properties are in multiple languages and have short text sequences, so text_sbert is recommended. The sample code for the export parameters is as follows. For the text_sbert type, there are no other parameter fields. Here we choose text_sbert128 rather than text_sbert512, because the text is typically shorter than 128 tokens.
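The following is a minimal sketch of the feature specifications, assuming user nodes carry the Major property and job nodes carry the Title property; the rest of the export configuration is abridged.

```python
# Minimal sketch of Neptune export parameters for text_sbert128 features;
# node labels and property names are assumptions based on the dataset above,
# and the rest of the export configuration is abridged.
export_params = {
    "command": "export-pg",
    "additionalParams": {
        "neptune_ml": {
            "features": [
                {"node": "user", "property": "Major", "type": "text_sbert128"},
                {"node": "job", "property": "Title", "type": "text_sbert128"},
            ]
        }
    },
}
```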
The Description and Requirements properties are usually long text sequences. The average length of a description is around 192 words, which is longer than the maximum input length of text_sbert (128 tokens). We could use text_sbert512, but that may result in slower inference. In addition, the text is in a single language (English). Therefore, we recommend text_fasttext with the en language value because of its fast inference and lack of an input length limit. The sample code for the export parameters is as follows. The text_fasttext encoding can be customized using the language and max_length fields; the language value is required, but max_length is optional.
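Here is a minimal sketch for the text_fasttext features, with the required language value and an illustrative optional max_length; the node labels, property names, and max_length value are assumptions.

```python
# Minimal sketch of Neptune export parameters for text_fasttext features;
# "language" is required, "max_length" is optional, and the node labels,
# property names, and max_length value shown here are assumptions.
export_params = {
    "command": "export-pg",
    "additionalParams": {
        "neptune_ml": {
            "features": [
                {
                    "node": "job",
                    "property": "Description",
                    "type": "text_fasttext",
                    "language": "en",
                    "max_length": 1024,
                },
                {
                    "node": "job",
                    "property": "Requirements",
                    "type": "text_fasttext",
                    "language": "en",
                },
            ]
        }
    },
}
```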
More details about the job recommendation use case can be found in the Neptune notebook tutorial.
For demonstration purposes, we select one user, user 443931, who holds a Master's degree in Management and Human Resources. The user has applied to five different jobs, titled “Human Resources (HR) Manager”, “HR Generalist”, “Human Resources Manager”, “Human Resources Administrator”, and “Senior Payroll Specialist”. To evaluate the performance of the recommendation task, we delete 50% of the user's apply edges (here we delete “Human Resources Administrator” and “Human Resources (HR) Manager”) and try to predict the top 10 jobs this user is most likely to apply for.
After encoding the job features and user features, we perform a link prediction task by training a relational graph convolutional network (RGCN) model. Training a Neptune ML model requires three steps: data processing, model training, and endpoint creation. After the inference endpoint has been created, we can make recommendations for user 443931. From the predicted top 10 jobs for this user (“HR Generalist”, “Human Resources (HR) Manager”, “Senior Payroll Specialist”, “Human Resources Administrator”, “HR Analyst”, and so on), we observe that the two deleted jobs are among the top 10 predictions.
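Once the endpoint is up, a link prediction can be issued as a Gremlin inference query. The following is a minimal sketch that posts such a query to Neptune's Gremlin HTTP endpoint; the cluster address, the inference endpoint name, and the apply edge label are assumptions, and IAM request signing is omitted for brevity.

```python
# Minimal sketch of a Neptune ML link-prediction query over Gremlin HTTP.
# The cluster address, inference endpoint name, and "apply" edge label are
# assumptions; IAM authentication/signing is omitted for brevity.
import requests

GREMLIN_URL = "https://your-neptune-cluster:8182/gremlin"  # assumption

query = (
    'g.with("Neptune#ml.endpoint", "job-rec-endpoint")'  # inference endpoint (assumption)
    '.with("Neptune#ml.limit", 10)'                      # return the top 10 predictions
    '.V("443931").out("apply")'                          # predict new apply edges for the user
    '.with("Neptune#ml.prediction")'
    '.hasLabel("job").values("title")'
)

response = requests.post(GREMLIN_URL, json={"gremlin": query})
print(response.json())
```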
In this post, we showed the usage of the newly supported text encoders in Neptune ML. These text encoders are simple to use and can support multiple requirements. In summary, use text_fasttext for text in one of its five supported languages when fast inference matters, and use text_sbert for text in other languages or text that mixes several languages.
For more details about the solution, see the GitHub repo. We recommend using the text encoders on your graph data to meet your requirements. You can just choose an encoder name and set some encoder attributes, while keeping the GNN model unchanged.