Module 8: Analyzing Videos and Module 9: Developing Custom Vision

 

Video Indexer

The Video Indexer service is designed to help you extract information from videos. It provides functionality that you can use for:

  • Facial recognition - detecting the presence of individual people in the video.

  • Optical character recognition - reading text in the video.

  • Speech transcription - creating a text transcript of spoken dialog in the video.

  • Topics - Identification of key topics discussed in the video.

  • Sentiment - Analysis of how positive or negative segments within the video are.

  • Labels - tags that identify key objects or themes throughout the video.

  • Content moderation - detection of adult or violent themes in the video.

  • Scene segmentation - a breakdown of the video into its constituent scenes.

The Video Indexer service provides the Video Indexer portal, a website that you can use to upload, view, and analyze videos interactively.

You can use a free, standalone version of the Video Indexer service (with some limitations), or you can connect it to an Azure Media Services resource in your Azure subscription for full functionality.

Custom Insights

Video Indexer includes predefined models that can recognize well-known celebrities and brands, and transcribe spoken phrases into text. You can extend the capabilities of Video Indexer recognition by creating custom models for:

  • People. Add images of the faces of people you want to recognize in videos, and train a model. Video Indexer will then recognize these people in all of your videos.

  • Language. If your organization uses specific terminology that may not be in common usage, you can train a custom model to detect and transcribe it.

  • Brands. You can train a model to recognize specific names as brands, for example to identify products, projects, or companies that are relevant to your business.

  • Animated characters. In addition to recognizing human individuals, you may want to be able to detect the presence of individual animated characters in a video.

Video Indexer Widgets and API

While you can perform all video analysis tasks in the Video Indexer portal, you may want to incorporate the Video Indexer service into custom applications. There are two ways you can accomplish this.

Video Indexer widgets

The widgets used in the Video Indexer portal to play, analyze, and edit videos can be embedded in your own custom HTML interfaces. You can use this technique to share insights from specific videos with others without giving them full access to your Video Indexer account in the Video Indexer portal.

Video Indexer API

Video Indexer provides a REST API that you can subscribe to in order to get a subscription key. You can then use your subscription key to consume the REST API and automate video indexing tasks, such as uploading and indexing videos, retrieving insights, and determining endpoints for Video Indexer widgets.
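As a sketch of this flow, the helpers below build the documented request URLs and fetch an access token with only the standard library. The `trial` location, account ID, and subscription key used in any real call are placeholders you would replace with values from your own Video Indexer account; the network call itself is defined but not executed here.

```python
# Sketch of calling the Video Indexer REST API (the network call is
# defined but not executed). Location, account ID, and subscription key
# are hypothetical placeholders for your own account's values.
import urllib.request

API_ROOT = "https://api.videoindexer.ai"

def access_token_url(location: str, account_id: str) -> str:
    """URL used to request an access token for subsequent API calls."""
    return f"{API_ROOT}/Auth/{location}/Accounts/{account_id}/AccessToken"

def list_videos_url(location: str, account_id: str, access_token: str) -> str:
    """URL used to list the videos indexed in an account."""
    return f"{API_ROOT}/{location}/Accounts/{account_id}/Videos?accessToken={access_token}"

def get_access_token(location: str, account_id: str, subscription_key: str) -> str:
    """Request an access token; the subscription key goes in a header."""
    request = urllib.request.Request(
        access_token_url(location, account_id),
        headers={"Ocp-Apim-Subscription-Key": subscription_key},
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8").strip('"')
```

The access token returned by the `Auth` endpoint is then appended to subsequent indexing and insight-retrieval requests, as `list_videos_url` shows.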


The Custom Vision Service

The Custom Vision service enables you to build your own computer vision models for:

  • Image classification - predicting a label that describes the class (category) of an image based on its contents.

  • Object detection - identifying the location of one or more classes of object in an image.

To use the Custom Vision service, you must provision two kinds of Azure resource:

  • A training resource (used to train your models). This can be:

    • A Cognitive Services resource

    • A Custom Vision (Training) resource

  • A prediction resource (used by client applications to get predictions from your model). This can be:

    • A Cognitive Services resource

    • A Custom Vision (Prediction) resource

You can use a single Cognitive Services resource for both training and prediction, and you can mix and match resource types (for example, using a Custom Vision (Training) resource to train a model that you then publish using a Cognitive Services resource).

What is Image Classification?

Image classification is a computer vision technique in which a model is trained to predict a class label for an image based on its contents. Usually, the class label relates to the main subject of the image (in other words, it indicates what the image is of).

Models can be trained for multiclass classification (in other words, there are multiple classes, but each image can belong to only one class) or multilabel classification (in other words, an image might be associated with multiple labels).
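The distinction can be illustrated by how a set of prediction scores is interpreted; the tag names and probability values below are made up for the example.

```python
# Made-up per-tag probability scores for a single image.
scores = {"apple": 0.85, "banana": 0.10, "orange": 0.62}

# Multiclass: every image belongs to exactly one class, so take the
# single highest-scoring tag.
multiclass_prediction = max(scores, key=scores.get)

# Multilabel: an image can carry several tags, so keep every tag whose
# score clears a chosen confidence threshold.
THRESHOLD = 0.5
multilabel_prediction = [tag for tag, p in scores.items() if p >= THRESHOLD]
```

With these scores, the multiclass interpretation yields only "apple", while the multilabel interpretation keeps both "apple" and "orange".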

Training an Image Classifier

To train an image classification model with the Custom Vision service, you can use the Custom Vision portal, the Custom Vision REST API or SDK, or a combination of both approaches.

In most cases, you'll use the Custom Vision portal to train your model. The portal provides a graphical interface that you can use to:

  1. Create an image classification project for your model and associate it with a training resource.

  2. Upload images, assigning class label tags to them.

  3. Review and edit tagged images.

  4. Train and evaluate a classification model.

  5. Test a trained model.

  6. Publish a trained model to a prediction resource.

The REST API and SDKs enable you to perform the same tasks programmatically, which is useful if you need to automate model training and publishing as part of a DevOps process.
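As a minimal sketch of the REST route, the code below builds the training endpoint URL and defines (but does not execute) a project-creation request using only the standard library. The resource endpoint, API version (`v3.3`), and training key are assumptions; check the Custom Vision REST reference for the version your resource supports.

```python
# Sketch of automating Custom Vision training over REST (the network
# call is defined but not executed). Endpoint, API version, and
# training key are hypothetical placeholders.
import json
import urllib.request

ENDPOINT = "https://westus2.api.cognitive.microsoft.com"  # placeholder
API_VERSION = "v3.3"  # assumed; newer versions may exist

def projects_url() -> str:
    """Base URL for training project operations."""
    return f"{ENDPOINT}/customvision/{API_VERSION}/training/projects"

def create_project(name: str, training_key: str) -> dict:
    """POST a new project and return its metadata (including the project id)."""
    request = urllib.request.Request(
        f"{projects_url()}?name={name}",
        method="POST",
        headers={"Training-Key": training_key},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

The same pattern (authenticated POST/GET requests against the training endpoint) extends to uploading tagged images, triggering training, and publishing an iteration, and the Azure SDKs wrap these calls in client classes.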


What is Object Detection?

Object detection is a form of computer vision in which a model is trained to detect the presence and location of one or more classes of object in an image.

There are two components to an object detection prediction:

  • The class label of each object detected in the image. For example, you might ascertain that an image contains one apple and two oranges.

  • The location of each object within the image, indicated as coordinates of a bounding box that encloses the object.

Training an Object Detector

The options for training an object detection model are similar to those for training an image classification model. You can use the Custom Vision portal to upload and label images, before training, evaluating, testing, and publishing the model; or you can use the REST API or SDK to perform the training tasks programmatically.

The most significant difference between training an image classification model and training an object detection model is the labeling of the images with tags. While image classification requires one or more tags that apply to the whole image, object detection requires that each label consist of a tag and a region that defines the bounding box for each object in an image.
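The difference in label shape can be sketched as plain data structures; the tag names and coordinate values below are invented for illustration.

```python
# Illustrative shape of object detection labels: each labeled object
# pairs a tag with a normalized bounding-box region (values are
# proportions of the image size, so each lies in [0, 1]).
detection_labels = [
    {"tag": "apple",  "region": {"left": 0.10, "top": 0.20, "width": 0.30, "height": 0.40}},
    {"tag": "orange", "region": {"left": 0.55, "top": 0.25, "width": 0.25, "height": 0.35}},
]

# A whole-image classification label, by contrast, is just the tag itself.
classification_label = "fruit"

# Sanity check: every region value is a valid proportion.
all_valid = all(
    0.0 <= v <= 1.0
    for label in detection_labels
    for v in label["region"].values()
)
```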

Labeling Images

The easiest option for labeling images for object detection is to use the interactive interface in the Custom Vision portal. This interface automatically suggests regions that contain objects, to which you can assign tags or adjust by dragging the bounding box to enclose the object you want to label. Additionally, after tagging an initial batch of images, you can train the model; and subsequent labeling of new images can benefit from the smart labeler tool in the portal, which can suggest not only the regions, but the classes of object they contain.

Alternatively, you can use a separate labeling tool, such as the one provided in Azure Machine Learning Studio or the Microsoft Visual Object Tagging Tool (VOTT), to take advantage of additional features, such as assigning image labeling tasks to multiple team members.

Bounding box measurement units

If you choose to use a labeling tool other than the Custom Vision portal, you may need to adjust the output to match the measurement units expected by the Custom Vision API. Bounding boxes are defined by four values that represent the left (X) and top (Y) coordinates of the top-left corner of the bounding box, and the width and height of the bounding box. These values are expressed as proportional values relative to the source image size. For example, consider this bounding box:

  • Left: 0.1

  • Top: 0.5

  • Width: 0.5

  • Height: 0.25

This defines a box in which the left is located 0.1 (one tenth) from the left edge of the image, and the top is 0.5 (half the image height) from the top. The box is half the width and a quarter of the height of the overall image.
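To apply these proportional values in practice, you multiply them by the actual image dimensions. The sketch below converts the example box above to pixel coordinates for a hypothetical 640x480 image.

```python
def to_pixels(box: dict, image_width: int, image_height: int) -> tuple:
    """Convert a proportional bounding box to pixel values
    (left and width scale by image width; top and height by image height)."""
    return (
        round(box["left"] * image_width),
        round(box["top"] * image_height),
        round(box["width"] * image_width),
        round(box["height"] * image_height),
    )

# The example box from the text, applied to a hypothetical 640x480 image.
box = {"left": 0.1, "top": 0.5, "width": 0.5, "height": 0.25}
left_px, top_px, width_px, height_px = to_pixels(box, 640, 480)
# → a box starting 64 px from the left and 240 px from the top,
#   measuring 320 px wide by 120 px tall.
```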



LAB



Analyze Images with Computer Vision

Computer vision is an artificial intelligence capability that enables software systems to interpret visual input by analyzing images. In Microsoft Azure, the Computer Vision cognitive service provides pre-built models for common computer vision tasks, including analysis of images to suggest captions and tags, detection of common objects, landmarks, celebrities, brands, and the presence of adult content. You can also use the Computer Vision service to analyze image color and formats, and to generate "smart-cropped" thumbnail images.


Analyze Video with Video Indexer

A large proportion of the data created and consumed today is in the format of video. Video Indexer is an AI-powered service that you can use to index videos and extract insights from them.


LABS MODULE9



Classify Images with Custom Vision

The Custom Vision service enables you to create computer vision models that are trained on your own images. You can use it to train image classification and object detection models, which you can then publish and consume from applications.

In this exercise, you will use the Custom Vision service to train an image classification model that can identify three classes of fruit (apple, banana, and orange).
