Module 12: Creating a knowledge mining solution

 

Azure Cognitive Search

Azure Cognitive Search is a service that supports AI-powered search and knowledge mining solutions. With Azure Cognitive Search, you can:

  • Index documents and data from a range of sources.

  • Use cognitive skills to enrich index data.

  • Store extracted insights in a knowledge store for analysis and integration.

Azure Resources for Cognitive Search

To use Azure Cognitive Search, you must provision an Azure Cognitive Search resource in your Azure subscription. Additionally, if you plan to use cognitive skills to enrich index data, you will need a Cognitive Services resource; and if you plan to persist the enriched data to a knowledge store, you will need a Storage Account resource.

Core Components of a Cognitive Search Solution

A cognitive search solution consists of multiple components, each playing an important part in the process of extracting, enriching, indexing, and searching data.

Data source

Most search solutions start with a data source containing the data you want to search. Azure Cognitive Search supports multiple types of data source, including:

  • Unstructured files in Azure blob storage containers.

  • Tables in Azure SQL Database.

  • Documents in Azure Cosmos DB.

Azure Cognitive Search can pull data from these data sources for indexing.

Alternatively, applications can push JSON data directly into an index, without pulling it from an existing data store.
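The push model described above can be sketched as a small REST call. The sketch below is illustrative only: the service name, index name, and API key are placeholders, and the payload shape follows the documented "index documents" request, where each document carries an `@search.action` value such as `upload`.

```python
import json
from urllib import request

def build_index_batch(docs):
    """Build the payload for the 'index documents' REST call.
    Each 'upload' action adds or replaces one document in the index."""
    return {"value": [dict(doc, **{"@search.action": "upload"}) for doc in docs]}

# Hypothetical service details -- replace with your own.
SERVICE = "https://my-search-service.search.windows.net"
INDEX = "margies-index"
API_KEY = "<admin-api-key>"

def push_documents(docs):
    """POST a batch of JSON documents directly into the index."""
    payload = json.dumps(build_index_batch(docs)).encode("utf-8")
    req = request.Request(
        f"{SERVICE}/indexes/{INDEX}/docs/index?api-version=2020-06-30",
        data=payload,
        headers={"Content-Type": "application/json", "api-key": API_KEY},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return json.load(resp)
```

Because the documents are pushed as-is, no data source or indexer is involved; the application is responsible for shaping each JSON document to match the index schema.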

Skillset

In a basic search solution, you might simply index the data extracted from the data source. The information that can be extracted depends on the data source. For example, when indexing data in a database, the fields in the database tables might be extracted; or when indexing a set of documents, file metadata such as file name, modified date, size, and author might be extracted along with the text content of the document.

While a basic search solution that indexes data values extracted directly from the data source can be useful, the expectations of modern application users have driven a need for richer insights into the data. In Azure Cognitive Search, you can apply artificial intelligence (AI) skills as part of the indexing process to enrich the source data with new information, which can be mapped to index fields. The skills used by an indexer are encapsulated in a skillset that defines an enrichment pipeline in which each step enhances the source data with insights obtained by a specific AI skill. Examples of the kind of information that can be extracted by an AI skill include:

  • The language in which a document is written.

  • Key phrases that might help determine the main themes or topics discussed in a document.

  • A sentiment score that quantifies how positive or negative a document is.

  • Specific locations, people, organizations, or landmarks mentioned in the content.

  • AI-generated descriptions of images, or image text extracted by optical character recognition.

  • Custom skills that you develop to meet specific requirements.

Indexer

The indexer is the engine that drives the overall indexing process. It takes the outputs extracted using the skills in the skillset, along with the data and metadata values extracted from the original data source, and maps them to fields in the index.

Index

The index is the searchable result of the indexing process. It consists of a collection of JSON documents, with fields that contain the values extracted during indexing. Client applications can query the index to retrieve, filter, and sort information.
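Querying the index is done through the docs endpoint of the REST API. The helper below composes such a query URL; the service and index names are placeholders, and the parameters shown (`search`, `$filter`, `$orderby`) are standard query options.

```python
from urllib import parse

def build_search_url(service, index, query, filter_expr=None, order_by=None):
    """Compose a GET query against an index's docs endpoint (REST API)."""
    params = {"api-version": "2020-06-30", "search": query}
    if filter_expr:
        params["$filter"] = filter_expr   # e.g. restrict by an index field
    if order_by:
        params["$orderby"] = order_by     # e.g. sort by a sortable field
    return f"{service}/indexes/{index}/docs?{parse.urlencode(params)}"

# Example: search for "hotel", keeping only English documents.
url = build_search_url(
    "https://my-search-service.search.windows.net",
    "margies-index",
    "hotel",
    filter_expr="language eq 'en'",
)
```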

How an Enrichment Pipeline Works

The enrichment pipeline works by iteratively constructing a document that represents the enriched data. You can think of this document as a JSON structure, which initially consists of a document with the index fields you have mapped to fields extracted directly from the source data, like this:

  • document

    • metadata_storage_name

    • metadata_author

    • content

When the documents in the data source contain images, you can configure the indexer to extract the image data and place each image in a normalized_images collection, like this:

  • document

    • metadata_storage_name

    • metadata_author

    • content

    • normalized_images

      • image0

      • image1

Normalizing the image data in this way enables you to use the collection of images as an input for skills that extract information from image data.

Each skill adds fields to the document, so for example a skill that detects the language in which a document is written might store its output in a language field, like this:

  • document

    • metadata_storage_name

    • metadata_author

    • content

    • normalized_images

      • image0

      • image1

    • language

The document is structured hierarchically, and the skills are applied to a specific context within the hierarchy, enabling you to run the skill for each item at a particular level of the document. For example, you could run an optical character recognition (OCR) skill for each image in the normalized images collection to extract any text they contain:

  • document

    • metadata_storage_name

    • metadata_author

    • content

    • normalized_images

      • image0

        • Text

      • image1

        • Text

    • language

The output fields from each skill can be used as inputs for other skills later in the pipeline, which in turn store their outputs in the document structure. For example, we could use a merge skill to combine the original text content with the text extracted from each image to create a new merged_content field that contains all of the text in the document, including image text.

  • document

    • metadata_storage_name

    • metadata_author

    • content

    • normalized_images

      • image0

        • Text

      • image1

        • Text

    • language

    • merged_content
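The enrichment steps described above (detect the language, run OCR on each normalized image, then merge the image text into the content) can be sketched as a skillset definition. It is expressed here as a Python dict for readability; the skillset name is hypothetical, while the skill types and input/output names follow the built-in language detection, OCR, and merge skills.

```python
# Sketch of the skillset behind the document structure above; serializing this
# dict to JSON would give the body for the skillsets REST endpoint.
skillset = {
    "name": "enrichment-pipeline",
    "skills": [
        {   # Detect the document language; output lands in /document/language
            "@odata.type": "#Microsoft.Skills.Text.LanguageDetectionSkill",
            "context": "/document",
            "inputs": [{"name": "text", "source": "/document/content"}],
            "outputs": [{"name": "languageCode", "targetName": "language"}],
        },
        {   # Run OCR once per image in the normalized_images collection
            "@odata.type": "#Microsoft.Skills.Vision.OcrSkill",
            "context": "/document/normalized_images/*",
            "inputs": [{"name": "image", "source": "/document/normalized_images/*"}],
            "outputs": [{"name": "text", "targetName": "text"}],
        },
        {   # Merge original content with the text extracted from each image
            "@odata.type": "#Microsoft.Skills.Text.MergeSkill",
            "context": "/document",
            "inputs": [
                {"name": "text", "source": "/document/content"},
                {"name": "itemsToInsert", "source": "/document/normalized_images/*/text"},
                {"name": "offsets", "source": "/document/normalized_images/*/contentOffset"},
            ],
            "outputs": [{"name": "mergedText", "targetName": "merged_content"}],
        },
    ],
}
```

Note how each skill's `context` determines where in the hierarchy it runs: once per document for language detection and merging, but once per image for OCR.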

The fields in the final document structure at the end of the pipeline are mapped to index fields by the indexer in one of two ways:

  1. Fields extracted directly from the source data are all mapped to index fields. These mappings can be implicit (fields are automatically mapped to fields with the same name in the index) or explicit (a mapping is defined to match a source field to an index field, often to rename the field to something more useful or to apply a function to the data value as it is mapped).

  2. Output fields from the skills in the skillset are explicitly mapped from their hierarchical location in the output to the target field in the index.
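The two mapping styles can be sketched in an indexer definition. The names below (indexer, data source, index, skillset) are hypothetical; the `fieldMappings` and `outputFieldMappings` properties are the documented places for each kind of mapping.

```python
# Sketch of an indexer definition showing both mapping styles.
indexer = {
    "name": "margies-indexer",
    "dataSourceName": "margies-data",
    "targetIndexName": "margies-index",
    "skillsetName": "enrichment-pipeline",
    # 1. Source fields: an explicit mapping renames a field; fields with
    #    matching names need no entry at all (implicit mapping).
    "fieldMappings": [
        {"sourceFieldName": "metadata_storage_name", "targetFieldName": "file_name"}
    ],
    # 2. Skill outputs: always mapped explicitly from their hierarchical
    #    location in the enriched document to the target index field.
    "outputFieldMappings": [
        {"sourceFieldName": "/document/language", "targetFieldName": "language"},
        {"sourceFieldName": "/document/merged_content", "targetFieldName": "merged_content"},
    ],
}
```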

Introduction to Custom Skills

You can use the predefined skills in Azure Cognitive Search to greatly enrich an index by extracting additional information from the source data. However, there may be occasions when you have specific data extraction needs that cannot be met with the predefined skills and require some custom functionality.

For example:

  • Integrate Form Recognizer

  • Consume an Azure Machine Learning model

  • Any other custom logic

To support these scenarios, you can implement custom skills as web-hosted services (such as Azure Functions) that support the required interface for integration into a skillset.


Custom Skill Interfaces

Your custom skill must implement the input and output data schema expected by skills in an Azure Cognitive Search skillset.

Input Schema

The input schema for a custom skill defines a JSON structure containing a record for each document to be processed. Each document has a unique identifier, and a data payload with one or more inputs, like this:

{
  "values": [
    {
      "recordId": "<unique_identifier>",
      "data": {
        "<input1_name>": "<input1_value>",
        "<input2_name>": "<input2_value>",
        ...
      }
    },
    {
      "recordId": "<unique_identifier>",
      "data": {
        "<input1_name>": "<input1_value>",
        "<input2_name>": "<input2_value>",
        ...
      }
    },
    ...
  ]
}

Output schema

The schema for the results returned by your custom skill reflects the input schema. It is assumed that the output will contain a record for each input record, with either the results produced by the skill or details of any errors that occurred.

{
  "values": [
    {
      "recordId": "<unique_identifier_from_input>",
      "data": {
        "<output1_name>": "<output1_value>",
        ...
      },
      "errors": [...],
      "warnings": [...]
    },
    {
      "recordId": "<unique_identifier_from_input>",
      "data": {
        "<output1_name>": "<output1_value>",
        ...
      },
      "errors": [...],
      "warnings": [...]
    },
    ...
  ]
}

The output value in this schema is a property bag that can contain any JSON structure, reflecting the fact that index fields are not necessarily simple data values, but can contain complex types.
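A custom skill's request handling can be sketched as a small function that walks the input records and echoes each `recordId` back with its results, per the schemas above. The `word_count` transform is purely illustrative skill logic, not part of the interface.

```python
def run_custom_skill(body, transform):
    """Process a custom skill request body: apply `transform` to each record's
    data payload and return results in the expected output schema."""
    results = []
    for record in body.get("values", []):
        result = {"recordId": record["recordId"], "data": {},
                  "errors": [], "warnings": []}
        try:
            result["data"] = transform(record["data"])
        except Exception as ex:
            # Report per-record failures without failing the whole batch.
            result["errors"].append({"message": str(ex)})
        results.append(result)
    return {"values": results}

# Hypothetical skill logic: count the words in an input named "text".
def word_count(data):
    return {"wordCount": len(data["text"].split())}
```

In an Azure Function, `run_custom_skill` would be called with the parsed JSON request body, and its return value serialized as the HTTP response.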


Adding a Custom Skill to a Skillset

To integrate a custom skill into your indexing solution, you must add a skill for it to a skillset using the Custom.WebApiSkill skill type.

The skill definition must:

  • Specify the URI to your web API endpoint, including parameters and headers if necessary.

  • Set the context to specify at which point in the document hierarchy the skill should be called.

  • Assign input values, usually from existing document fields.

  • Store output in a new field, optionally specifying a target field name (otherwise the output name is used).

{
  "skills": [
    ...,
    {
      "@odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
      "description": "<custom skill description>",
      "uri": "https://<web_api_endpoint>?<params>",
      "httpHeaders": {
        "<header_name>": "<header_value>"
      },
      "context": "/document/<where_to_apply_skill>",
      "inputs": [
        {
          "name": "<input1_name>",
          "source": "/document/<path_to_input_field>"
        }
      ],
      "outputs": [
        {
          "name": "<output1_name>",
          "targetName": "<optional_field_name>"
        }
      ]
    }
  ]
}

What is a Knowledge Store?

While the index might be considered the primary output from an indexing process, the enriched data it contains might also be useful in other ways. For example:

  • Since the index is essentially a collection of JSON objects, each representing an indexed record, it might be useful to export the objects as JSON files for integration into a data orchestration process using tools such as Azure Data Factory.

  • You may want to normalize the index records into a relational schema of tables for analysis and reporting with tools such as Microsoft Power BI.

  • Having extracted embedded images from documents during the indexing process, you might want to save those images as files.

Azure Cognitive Search supports these scenarios by enabling you to define a knowledge store in the skillset that encapsulates your enrichment pipeline. The knowledge store consists of projections of the enriched data, which can be JSON objects, tables, or image files. When an indexer runs the pipeline to create or update an index, the projections are generated and persisted in the knowledge store.


Using the Shaper Skill for Projections

The process of indexing incrementally creates a complex document that contains the various output fields from the skills in the skillset. This can result in a schema that is difficult to work with, and which includes collections of primitive data values that don't map easily to well-formed JSON.

To simplify the mapping of these field values to projections in a knowledge store, it's common to use the Shaper skill to create a new field containing a simpler structure for the fields you want to map to projections.

For example, consider the following Shaper skill definition:

{
  "@odata.type": "#Microsoft.Skills.Util.ShaperSkill",
  "name": "define-projection",
  "description": "Prepare projection fields",
  "context": "/document",
  "inputs": [
    {
      "name": "file_name",
      "source": "/document/metadata_content_name"
    },
    {
      "name": "url",
      "source": "/document/url"
    },
    {
      "name": "sentiment",
      "source": "/document/sentimentScore"
    },
    {
      "name": "key_phrases",
      "source": null,
      "sourceContext": "/document/merged_content/keyphrases/*",
      "inputs": [
        {
          "name": "phrase",
          "source": "/document/merged_content/keyphrases/*"
        }
      ]
    }
  ],
  "outputs": [
    {
      "name": "output",
      "targetName": "projection"
    }
  ]
}

This creates a projection field with the following structure:


{
  "file_name": "file_name.pdf",
  "url": "https://<storage_path>/file_name.pdf",
  "sentiment": 1.0,
  "key_phrases": [
    { "phrase": "first key phrase" },
    { "phrase": "second key phrase" },
    { "phrase": "third key phrase" },
    ...
  ]
}


Implementing a Knowledge Store

To define the knowledge store and the projections you want to create in it, you must create a knowledgeStore object in the skillset that specifies the Azure Storage connection string for the storage account where you want to create projections, and the definitions of the projections themselves.

You can define object projections, table projections, and file projections depending on what you want to store. However, projection types are mutually exclusive in a projection definition: even though each projection definition contains lists for objects, tables, and files, only one of those lists can be populated. If you want to create all three kinds of projection, you must include a separate projection definition for each type, as shown here:

"knowledgeStore": {
  "storageConnectionString": "<storage_connection_string>",
  "projections": [
    {
      "objects": [
        {
          "storageContainer": "<container>",
          "source": "/projection"
        }
      ],
      "tables": [],
      "files": []
    },
    {
      "objects": [],
      "tables": [
        {
          "tableName": "KeyPhrases",
          "generatedKeyName": "keyphrase_id",
          "source": "/projection/key_phrases/*"
        },
        {
          "tableName": "docs",
          "generatedKeyName": "document_id",
          "source": "/projection"
        }
      ],
      "files": []
    },
    {
      "objects": [],
      "tables": [],
      "files": [
        {
          "storageContainer": "<container>",
          "source": "/document/normalized_images/*"
        }
      ]
    }
  ]
}

For object and file projections, the specified container will be created if it does not already exist. An Azure Storage table will be created for each table projection, with the mapped fields and a unique key field with the name specified in the generatedKeyName property. These key fields can be used to define relational joins between the tables for analysis and reporting.
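To illustrate how those generated key fields support relational joins, the sketch below uses illustrative in-memory rows shaped like the docs and KeyPhrases table projections above. The assumption that each child row carries the parent's key (document_id) reflects how related table projections reference their parent; the data values themselves are invented.

```python
# Illustrative rows as the table projections above might produce them:
# the child table carries the parent's generated key for joins.
docs = [
    {"document_id": "d1", "file_name": "brochure.pdf"},
]
key_phrases = [
    {"keyphrase_id": "k1", "document_id": "d1", "phrase": "city breaks"},
    {"keyphrase_id": "k2", "document_id": "d1", "phrase": "luxury hotels"},
]

def phrases_for(doc_id):
    """Relational-style join: all key phrases belonging to one document."""
    return [row["phrase"] for row in key_phrases if row["document_id"] == doc_id]
```

In practice this join would be expressed in the reporting tool (for example, a Power BI relationship between the two tables on document_id) rather than in code.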


LAB


Create an Azure Cognitive Search solution

All organizations rely on information to make decisions, answer questions, and function efficiently. The problem for most organizations is not a lack of information, but the challenge of finding and extracting the information from the massive set of documents, databases, and other sources in which the information is stored.

For example, suppose Margie's Travel is a travel agency that specializes in organizing trips to cities around the world. Over time, the company has amassed a huge amount of information in documents such as brochures, as well as reviews of hotels submitted by customers. This data is a valuable source of insights for travel agents and customers as they plan trips, but the sheer volume of data can make it difficult to find relevant information to answer a specific customer question.

To address this challenge, Margie's Travel can use Azure Cognitive Search to implement a solution in which the documents are indexed and enriched by using AI-based cognitive skills to make them easier to search.


