ML 13704 imagetest
Searching for insights in a repository of free-form text documents can be like finding a needle in a haystack. A traditional approach might be to use word counting or other basic analysis to parse documents, but with the power of Amazon AI and machine learning (ML) tools, we can gain a deeper understanding of the content.
Amazon Comprehend is a fully managed service that uses natural language processing (NLP) to extract insights about the content of documents. Amazon Comprehend develops insights by recognizing the entities, key phrases, sentiment, themes, and custom elements in a document. Amazon Comprehend can create new insights based on understanding the document structure and entity relationships. For example, with Amazon Comprehend, you can scan an entire document repository for key phrases.
Amazon Comprehend lets non-ML experts easily do tasks that normally take hours of time. Amazon Comprehend eliminates much of the time needed to clean, build, and train your own model. For building deeper custom models in NLP or any other domain, Amazon SageMaker enables you to build, train, and deploy models in a much more conventional ML workflow if desired.
In this post, we use Amazon Comprehend and other AWS services to analyze and extract new insights from a repository of documents. Then, we use Amazon QuickSight to generate a simple yet powerful word cloud visual to easily spot themes or trends.
The following diagram illustrates the solution architecture.
To begin, we gather the data to be analyzed and load it into an Amazon Simple Storage Service (Amazon S3) bucket in an AWS account. In this example, we use text-formatted files. The data is then analyzed by Amazon Comprehend, which creates JSON-formatted output. AWS Glue transforms that output into a database format. We then use Amazon Athena to verify the data and extract specifically formatted tables, which QuickSight uses to build a word cloud. For more information about visualizations, refer to Visualizing data in Amazon QuickSight.
For this walkthrough, you should have the following prerequisites:
Upload your data to an S3 bucket. For this post, we use UTF-8 formatted text of the US Constitution as the input file. Then you’re ready to analyze the data and create visualizations.
Amazon Comprehend can process many types of text-based and image information. In addition to text files, its one-step classification and entity recognition can accept image files, PDF files, and Microsoft Word files as input. Those formats are not discussed in this post.
To analyze your data, complete the following steps:
The job will run and the status will be displayed on the Analysis jobs page.
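The console steps above can also be scripted. The following is a minimal sketch using the AWS SDK for Python (boto3), assuming the analysis type is key phrases and the documents are in English. The bucket paths, role ARN, and job name are hypothetical placeholders, not values from this post.

```python
import time


def build_key_phrases_job_request(input_s3_uri, output_s3_uri, role_arn, job_name):
    """Assemble keyword arguments for start_key_phrases_detection_job."""
    return {
        "JobName": job_name,
        "LanguageCode": "en",  # assumption: English-language documents
        "DataAccessRoleArn": role_arn,
        "InputDataConfig": {
            "S3Uri": input_s3_uri,
            # Treat each file as one document; ONE_DOC_PER_LINE is the alternative.
            "InputFormat": "ONE_DOC_PER_FILE",
        },
        "OutputDataConfig": {"S3Uri": output_s3_uri},
    }


def run_key_phrases_job(request, poll_seconds=30):
    """Start the job, then poll until it is no longer submitted or in progress."""
    import boto3  # imported lazily so the request builder works without the SDK

    comprehend = boto3.client("comprehend")
    job_id = comprehend.start_key_phrases_detection_job(**request)["JobId"]
    while True:
        response = comprehend.describe_key_phrases_detection_job(JobId=job_id)
        properties = response["KeyPhrasesDetectionJobProperties"]
        if properties["JobStatus"] not in ("SUBMITTED", "IN_PROGRESS"):
            return properties  # COMPLETED, FAILED, STOP_REQUESTED, or STOPPED
        time.sleep(poll_seconds)
```

Calling run_key_phrases_job requires AWS credentials and an IAM role that Amazon Comprehend can assume to read the input folder and write to the output folder.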
Wait for the analysis job to complete. Amazon Comprehend will create a file and place it in the output data folder you provided. The file is in .gz or GZIP format.
This file needs to be downloaded and converted to a non-compressed format. You can download an object from the data folder or S3 bucket using the Amazon S3 console.
The uncompressed file must be uploaded to the output folder before the AWS Glue crawler can process it. For this example, we upload the uncompressed file into the same output folder that we use in later steps.
After you upload the file, delete the original zipped file.
This will leave one file remaining in the output folder: the uncompressed file.
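If you prefer to decompress the downloaded file with a script rather than a desktop tool, the following sketch uses only the Python standard library. It assumes the archive is already on local disk and handles both a plain gzip file and a gzip-compressed tar archive, because the exact packaging may vary. The function name and paths are illustrative.

```python
import gzip
import shutil
import tarfile
from pathlib import Path


def decompress_output(archive_path, dest_dir):
    """Decompress a downloaded result archive; return the extracted file paths."""
    archive_path = Path(archive_path)
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    extracted = []

    if tarfile.is_tarfile(archive_path):
        # Gzip-compressed tar archive: copy out each regular file, flattening paths.
        with tarfile.open(archive_path, "r:*") as tar:
            for member in tar.getmembers():
                if not member.isfile():
                    continue
                target = dest_dir / Path(member.name).name
                with tar.extractfile(member) as src, open(target, "wb") as dst:
                    shutil.copyfileobj(src, dst)
                extracted.append(target)
    else:
        # Plain gzip file: strip the final suffix (output.gz becomes output).
        target = dest_dir / archive_path.stem
        with gzip.open(archive_path, "rb") as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
        extracted.append(target)

    return extracted
```

The returned paths point to the uncompressed files that you then upload to the output folder.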
In this step, you prepare the Amazon Comprehend output to be used as input into Athena. The Amazon Comprehend output is in JSON format. You can use AWS Glue to convert JSON into a database structure to ultimately be read by QuickSight.
Be sure to add the trailing / to the path name. AWS Glue will search the folder path for all files.
You can monitor the crawler status on the AWS Glue console.
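As an alternative to the console, the crawler can be created and started with boto3. This is a sketch under assumptions: the crawler name, IAM role, database name, and S3 path are hypothetical, and the helper normalize_s3_prefix is introduced here only to guarantee the trailing slash mentioned above.

```python
def normalize_s3_prefix(path):
    """Return the S3 folder path with exactly one trailing slash."""
    return path.rstrip("/") + "/"


def create_and_start_crawler(name, role_arn, database, s3_path):
    """Create an AWS Glue crawler over the output folder and start it."""
    import boto3  # imported lazily so the path helper works without the SDK

    glue = boto3.client("glue")
    glue.create_crawler(
        Name=name,
        Role=role_arn,
        DatabaseName=database,
        Targets={"S3Targets": [{"Path": normalize_s3_prefix(s3_path)}]},
    )
    glue.start_crawler(Name=name)


def crawler_state(name):
    """Return the crawler state: READY, RUNNING, or STOPPING."""
    import boto3

    return boto3.client("glue").get_crawler(Name=name)["Crawler"]["State"]
```

A state of READY after a run indicates that the crawler has finished and the tables should be visible in the AWS Glue Data Catalog.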
Athena will extract data from the database tables the AWS Glue crawler created to provide a format that QuickSight will use to create the word cloud.
To create a table compatible with QuickSight, the data must be unnested from the arrays.
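The unnesting can be sketched as an Athena CREATE TABLE AS query built and submitted from Python. This assumes the crawler produced a table with an array column named keyphrases whose elements have text and score fields, which is what a key phrases job would typically yield. The database, table names, and results location are hypothetical, so check the actual schema in Athena before running it.

```python
def build_unnest_query(database, source_table, target_table):
    """Build a CTAS query that emits one row per key phrase."""
    return (
        f'CREATE TABLE "{database}"."{target_table}" AS\n'
        f"SELECT phrase.text AS text, phrase.score AS score\n"
        f'FROM "{database}"."{source_table}"\n'
        f"CROSS JOIN UNNEST(keyphrases) AS t(phrase)"
    )


def run_athena_query(query, database, results_s3_uri):
    """Submit the query to Athena and return its execution ID."""
    import boto3  # imported lazily so the query builder works without the SDK

    athena = boto3.client("athena")
    response = athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": results_s3_uri},
    )
    return response["QueryExecutionId"]
```

Each row of the resulting table holds a single phrase, which is the shape a word cloud needs when it groups by the text column.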
Finally, you can create the visual output from the analysis.
Make sure QuickSight has access to the S3 buckets where the Athena tables are stored.
By configuring access to AWS services, QuickSight can access the data in those services. Access by users and groups can be controlled through the options.
Now you can create the word cloud.
Choose the options menu (three dots) in the visualization to access the edit options. For example, you might want to hide the term “other” from the display. You can also edit items such as the title and subtitle for your visual. To download the word cloud as a PDF, choose Download on the QuickSight toolbar.
To avoid incurring ongoing charges, delete any unused data, processes, and resources that you provisioned, using their respective service consoles.
Amazon Comprehend uses NLP to extract insights about the content of documents. It develops insights by recognizing the entities, key phrases, language, sentiments, and other common elements in a document. You can use Amazon Comprehend to create new products based on understanding the structure of documents. For example, with Amazon Comprehend, you can scan an entire document repository for key phrases.
This post described the steps to build a word cloud in QuickSight that visualizes a text content analysis from Amazon Comprehend, using AWS Glue and Athena to prepare the data.
Let’s stay in touch via the comments section!