Discovering Amazon SageMaker Ground Truth
Added to Amazon SageMaker in late 2018, Amazon SageMaker Ground Truth helps you quickly build accurate training datasets. Machine learning practitioners can distribute labeling work to public and private workforces of human labelers. Labelers can be productive immediately, thanks to built-in workflows and graphical interfaces for common image, video, and text tasks. In addition, Ground Truth can enable automatic labeling, a technique that trains a machine learning model able to label data without additional human intervention.
In this section, you'll learn how to use Ground Truth to label images and text.
Using workforces
The first step in using Ground Truth is to create a workforce, a group of workers in charge of labeling data samples.
Let's head over to the SageMaker console: in the left-hand vertical menu, we click on Ground Truth, then on Labeling workforces. Three types of workforces are available: Amazon Mechanical Turk, Vendor, and Private. Let's discuss what they are, and when you should use them.
Amazon Mechanical Turk
Amazon Mechanical Turk (https://www.mturk.com/) makes it easy to break down large batch jobs into small work units that can be processed by a distributed workforce.
With Mechanical Turk, you can enroll tens or even hundreds of thousands of workers located across the globe. This is a great option when you need to label extremely large datasets. For example, think about a dataset for autonomous driving, made up of 1,000 hours of video: each frame would need to be processed in order to identify other vehicles, pedestrians, road signs, and more. If you wanted to annotate every single frame, you'd be looking at 1,000 hours x 3,600 seconds x 24 frames per second = 86.4 million images! Clearly, you would have to scale out your labeling workforce to get the job done, and Mechanical Turk lets you do that.
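The arithmetic above is easy to check, and to adapt to your own dataset. The short Python sketch below reuses the figures from the example; the variable names are only there for illustration.

```python
# Back-of-the-envelope estimate of the number of frames to annotate.
hours_of_video = 1_000
seconds_per_hour = 3_600
frames_per_second = 24  # frame rate used in the example above

total_frames = hours_of_video * seconds_per_hour * frames_per_second
print(f"{total_frames:,} frames to annotate")
```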
Vendor workforce
As scalable as Mechanical Turk is, sometimes you need more control over who your data is shared with, and over the quality of annotations, particularly if additional domain knowledge is required.
For this purpose, AWS has vetted a number of data labeling companies, which have integrated Ground Truth in their workflows. You can find the list of companies on AWS Marketplace (https://aws.amazon.com/marketplace/), under Machine Learning | Data Labeling Services | Amazon SageMaker Ground Truth Services.
Private workforce
Sometimes, data can't be processed by third parties. Maybe it's just too sensitive, or maybe it requires expert knowledge that only your company's employees have. In this case, you can create a private workforce, made up of well-identified individuals who will access and label your data.
Creating a private workforce
Creating a private workforce is the quickest and simplest option. Let's see how it's done:
Starting from the Labeling workforces entry in the SageMaker console, we select the Private tab, as seen in the following screenshot. Then, we click on Create private team:
We give the team a name, then we have to decide whether we're going to invite workers by email, or whether we're going to import users that belong to an existing Amazon Cognito group.
Amazon Cognito (https://aws.amazon.com/cognito/) is a managed service that lets you build and manage user directories at any scale. Cognito supports both social identity providers (Google, Facebook, and Amazon), and enterprise identity providers (Microsoft Active Directory, SAML).
This makes a lot of sense in an enterprise context, but let's keep things simple and use email instead. Here, I will use some sample email addresses: please make sure to use your own, otherwise you won't be able to join the team!
Then, we need to enter an organization name, and more importantly a contact email that workers can use for questions and feedback on the labeling job. These conversations are extremely important in order to fine-tune labeling instructions, pinpoint problematic data samples, and more.
Optionally, we can set up notifications with Amazon Simple Notification Service (https://aws.amazon.com/sns/), to let workers know that they have work to do.
The screen should look like the following screenshot. Then, we click on Create private team:
A few seconds later, the team has been set up. Invitations have been sent to workers, requesting that they join the workforce by logging in to a specific URL. The invitation email looks like that shown in the following screenshot:
Clicking on the link opens a login window. Once we've logged in and defined a new password, we're taken to a new screen showing available jobs, as in the following screenshot. As we haven't defined one yet, it's obviously empty:
Let's keep our workers busy and create an image labeling job.
Uploading data for labeling
As you would expect, Amazon SageMaker Ground Truth uses Amazon S3 to store datasets:
Using the AWS CLI, we create an S3 bucket hosted in the same region we're running SageMaker in. Bucket names are globally unique, so please make sure to pick your own unique name when you create the bucket. Use the following code:
$ aws s3 mb s3://sagemaker-book --region eu-west-1
Then, we copy the cat images located in the chapter2 folder of our GitHub repository as follows:
$ aws s3 cp --recursive cat/ s3://sagemaker-book/chapter2/cat/
Now that we have some data waiting to be labeled, let's create a labeling job.
Creating a labeling job
As you would expect, we need to define the location of the data, what type of task we want to label it for, and what our instructions are:
In the left-hand vertical menu of the SageMaker console, we click on Ground Truth, then on Labeling jobs, then on the Create labeling job button.
First, we give the job a name, say my-cat-job. Then, we define the location of the data in S3. Ground Truth expects a manifest file, a JSON file that lists the objects to be labeled, which lets you leave out the ones that shouldn't be. Once the job is complete, a new file, called the augmented manifest, will contain labeling information, and we'll be able to use this to feed data to training jobs.
Then, we define the location and the type of our input data, just like in the following screenshot:
As is visible in the next screenshot, we select the IAM role that we created for SageMaker in the first chapter (your name will be different), and we then click on the Complete data setup button to validate this section:
Clicking on View more details, you can learn about what is happening under the hood. SageMaker Ground Truth crawls your data in S3 and creates a JSON file called the manifest file. You can go and download it from S3 if you're curious. This file points at your objects in S3 (images, text files, and so on).
Optionally, we could decide to work either with the full manifest, a random sample, or a filtered subset based on a SQL query. We could also provide an AWS KMS key to encrypt the output of the job. Let's stick to the defaults here.
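If you'd rather build the manifest yourself, here is a minimal sketch of what it could look like: one JSON object per line, each pointing at an S3 object through a source-ref key (the same key that shows up in the augmented manifest). The build_manifest_lines helper is hypothetical, and so are the cat2.jpg and cat3.jpg file names: the actual names in your bucket may differ.

```python
import json

def build_manifest_lines(bucket, keys):
    """Return one JSON line per S3 object, using the 'source-ref' key."""
    return [json.dumps({"source-ref": f"s3://{bucket}/{key}"}) for key in keys]

# Hypothetical object keys: the actual file names in your bucket may differ.
keys = [
    "chapter2/cat/cat1.jpg",
    "chapter2/cat/cat2.jpg",
    "chapter2/cat/cat3.jpg",
]
manifest_lines = build_manifest_lines("sagemaker-book", keys)
print("\n".join(manifest_lines))
```

Leaving a key out of the list is all it takes to exclude the corresponding object from the labeling job.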
The Task type section asks us what kind of job we'd like to run. Please take a minute to explore the different task categories that are available (text, image, video, point cloud, and custom). You'll see that SageMaker Ground Truth can help you with the following tasks:
a) Text classification
b) Named entity recognition
c) Image classification: Categorizing images in specific classes
d) Object detection: Locating and labeling objects in images with bounding boxes
e) Semantic segmentation: Locating and labeling objects in images with pixel-level precision
f) Video clip classification: Categorizing videos in specific classes
g) Multi-frame video object detection and tracking
h) 3D point clouds: Locating and labeling objects in 3D data, such as LIDAR data for autonomous driving
i) Custom user-defined tasks
As shown in the next screenshot, let's select the Image task category and the Semantic segmentation task, and then click Next:
On the next screen, visible in the following screenshot, we first select our private team of workers:
If we had a lot of samples (say, tens of thousands or more), we should consider enabling automated data labeling, as this feature would reduce both the duration and the cost of the labeling job. Indeed, as workers start labeling data samples, SageMaker Ground Truth uses these labeled samples as the training set for a supervised learning model. With enough worker-labeled data, this model could become accurate enough to take over from workers and label much of the remaining dataset automatically. If you'd like to know more about this feature, please read the documentation at https://docs.aws.amazon.com/sagemaker/latest/dg/sms-automated-labeling.html.
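The general idea behind automated labeling can be illustrated with a toy routing rule: predictions that the model is confident about are kept, and everything else goes back to human workers. The sketch below is a simplification for illustration only, not Ground Truth's actual implementation; the route_samples function, the threshold value, and the sample predictions are all made up.

```python
def route_samples(predictions, threshold=0.9):
    """Split predictions into auto-labeled samples and samples for human review.

    predictions: list of (sample_id, label, confidence) tuples.
    threshold: hypothetical confidence cut-off, chosen for illustration.
    """
    auto_labeled, needs_human = [], []
    for sample_id, label, confidence in predictions:
        if confidence >= threshold:
            auto_labeled.append((sample_id, label))
        else:
            needs_human.append(sample_id)
    return auto_labeled, needs_human

# Made-up model predictions on samples that workers haven't labeled yet
predictions = [
    ("img-001", "cat", 0.97),
    ("img-002", "cat", 0.55),
    ("img-003", "dog", 0.92),
]
auto_labeled, needs_human = route_samples(predictions)
print("Labeled by the model:", auto_labeled)
print("Sent to workers:", needs_human)
```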
The last step in configuring our labeling job is to enter instructions for the workers. This is an important step, especially if your job is distributed to third-party workers. The better our instructions, the higher the quality of the annotations. Here, let's explain what the job is about, and enter a cat label for workers to apply. In a real-life scenario, you should add detailed instructions, provide sample images for good and bad examples, explain what your expectations are, and so on. The following screenshot shows what our instructions look like:
Once we're done with instructions, we click on Create to launch the labeling job. After a few minutes, the job is ready to be distributed to workers.
Labeling images
Logging in to the worker URL, we can see from the screen shown in the following screenshot that we have work to do:
Clicking on Start working opens a new window, visible in the next picture. It displays instructions as well as a first image to work on:
Using the graphical tools in the toolbar, and especially the auto-segment tool, we can very quickly produce high-quality annotations. Please take a few minutes to practice, and you'll be able to do the same in no time.
Once we're done with the three images, the job is complete, and we can visualize the labeled images under Labeling jobs in the SageMaker console. Your screen should look like the following screenshot:
More importantly, we can find labeling information in the S3 output location.
In particular, the augmented manifest (output/my-cat-job/manifests/output/output.manifest) contains annotation information on each data sample, such as the classes present in the image, and a link to the segmentation mask:
{"source-ref":"s3://sagemaker-book/chapter2/cat/cat1.jpg","my-cat-job-ref":"s3://sagemaker-book/chapter2/cat/output/my-cat-job/annotations/consolidated-annotation/output/0_2020-04-21T13:48:00.091190.png","my-cat-job-ref-metadata":{"internal-color-map":{"0":{"class-name":"BACKGROUND","hex-color":"#ffffff","confidence":0.8054600000000001},"1":{"class-name":"cat","hex-color":"#2ca02c","confidence":0.8054600000000001}},"type":"groundtruth/semantic-segmentation","human-annotated":"yes","creation-date":"2020-04-21T13:48:00.562419","job-name":"labeling-job/my-cat-job"}}
Yes, that's quite a mouthful! Don't worry though: in Chapter 5, Training Computer Vision Models, we'll see how we can feed this information directly to the built-in computer vision algorithms implemented in Amazon SageMaker. Of course, we could also parse this information, and convert it for whatever framework we use to train our computer vision model.
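If you'd like to parse the augmented manifest yourself, here is a minimal Python sketch. It assumes the record layout shown above, with a shortened copy of that record embedded in the code; the parse_segmentation_record helper is hypothetical.

```python
import json

# One line of the augmented manifest, shortened from the example above
line = json.dumps({
    "source-ref": "s3://sagemaker-book/chapter2/cat/cat1.jpg",
    "my-cat-job-ref": "s3://sagemaker-book/chapter2/cat/output/my-cat-job/"
                      "annotations/consolidated-annotation/output/"
                      "0_2020-04-21T13:48:00.091190.png",
    "my-cat-job-ref-metadata": {
        "internal-color-map": {
            "0": {"class-name": "BACKGROUND", "hex-color": "#ffffff"},
            "1": {"class-name": "cat", "hex-color": "#2ca02c"},
        },
        "type": "groundtruth/semantic-segmentation",
    },
})

def parse_segmentation_record(line, job_name):
    """Extract the image, the mask, and the class colors from one manifest line."""
    record = json.loads(line)
    color_map = record[f"{job_name}-ref-metadata"]["internal-color-map"]
    return {
        "image": record["source-ref"],
        "mask": record[f"{job_name}-ref"],
        "classes": {c["class-name"]: c["hex-color"] for c in color_map.values()},
    }

parsed = parse_segmentation_record(line, "my-cat-job")
print(parsed["image"])
print(parsed["mask"])
print(parsed["classes"])
```

To process a full manifest, you would apply the same function to each line of the file.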
As you can see, SageMaker Ground Truth makes it easy to label image datasets. You just need to upload your data to S3 and create a workforce. Ground Truth will then distribute the work automatically, and store the results in S3.
We just saw how to label images, but what about text tasks? Well, they're equally easy to set up and run. Let's go through an example.
Labeling text
This is a quick example of labeling text for named entity recognition. The dataset is made up of text fragments from one of my blog posts, where we'd like to label all AWS service names. These are available in our GitHub repository:
$ cat ner/1.txt
With great power come great responsibility. The second you create AWS resources, you're responsible for them: security of course, but also cost and scaling. This makes monitoring and alerting all the more important, which is why we built services like Amazon CloudWatch, AWS Config and AWS Systems Manager.
We will start labeling text using the following steps:
First, let's upload text fragments to S3 with the following line of code:
$ aws s3 cp --recursive ner/ s3://sagemaker-book/chapter2/ner/
Just like in the previous example, we configure a text labeling job, set up input data, and select an IAM role, as shown in the following screenshot:
Then, we select Text as the category, and Named entity recognition as the task.
On the next screen, shown in the following screenshot, we simply select our private team again, add a label, and enter instructions:
Once the job is ready, we log in to the worker console and start labeling. You can see a labeled example in the following screenshot:
We're done quickly, and we can find the labeling information in our S3 bucket. Here's what we find for one of the text fragments: for each entity, we get a start offset, an end offset, and a label:
{"source": "Since 2006, Amazon Web Services has been striving to simplify IT infrastructure. Thanks to services like Amazon Elastic Compute Cloud (EC2), Amazon Simple Storage Service (S3), Amazon Relational Database Service (RDS), AWS CloudFormation and many more, millions of customers can build reliable, scalable, and secure platforms in any AWS region in minutes. Having spent 10 years procuring, deploying and managing more hardware than I care to remember, I'm still amazed every day by the pace of innovation that builders achieve with our services.","my-ner-job": {"annotations": {"entities":[{"endOffset":133,"startOffset":105,"label":"aws_service"}, {"endOffset":138,"startOffset":135,"label":"aws_service"},{"endOffset":170,"startOffset":141,"label":"aws_service"}, {"endOffset":174,"startOffset":172,"label":"aws_service"}, {"endOffset":211,"startOffset":177,"label":"aws_service"}, {"endOffset":216,"startOffset":213,"label":"aws_service"}, {"endOffset":237,"startOffset":219,"label":"aws_service"}], "labels":[{"label":"aws_service"}]}}, "my-ner-job-metadata": {"entities":[{"confidence":0.12},{"confidence":0.08},{"confidence":0.13},{"confidence":0.07},{"confidence":0.14},{"confidence":0.08},{"confidence":0.1}],"job-name":"labeling-job/my-ner-job","type":"groundtruth/text-span","creation-date":"2020-04-21T14:32:36.573013","human-annotated":"yes"}}
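Turning these offsets back into text only takes a few lines of Python. The sketch below assumes, as the offsets in the output suggest, that the start offset is inclusive and the end offset is exclusive. The record is a made-up one in the same format, and the extract_entities helper is hypothetical.

```python
import json

# Made-up record, in the same format as the labeling output above
line = json.dumps({
    "source": "We built services like Amazon CloudWatch, AWS Config "
              "and AWS Systems Manager.",
    "my-ner-job": {"annotations": {"entities": [
        {"startOffset": 23, "endOffset": 40, "label": "aws_service"},
        {"startOffset": 42, "endOffset": 52, "label": "aws_service"},
        {"startOffset": 57, "endOffset": 76, "label": "aws_service"},
    ]}},
})

def extract_entities(line, job_name):
    """Return (entity text, label) pairs for one line of labeling output."""
    record = json.loads(line)
    text = record["source"]
    entities = record[job_name]["annotations"]["entities"]
    # Offsets are treated as a half-open [start, end) character range
    return [(text[e["startOffset"]:e["endOffset"]], e["label"]) for e in entities]

entities = extract_entities(line, "my-ner-job")
for text, label in entities:
    print(f"{label}: {text}")
```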
Amazon SageMaker Ground Truth really makes it easy to label datasets at scale. It has many nice features including job chaining and custom workflows, which I encourage you to explore at https://docs.aws.amazon.com/sagemaker/latest/dg/sms.html.
Next, we're going to learn about Amazon SageMaker Processing, and how you can easily use it to process any dataset using your own code.