Module 3: Getting started with Natural Language Processing
The Text Analytics Service
The Text Analytics service is designed to help you extract information from text. It provides functionality that you can use for:
Language detection - determining the language in which text is written.
Key phrase extraction - identifying important words and phrases in the text that indicate the main points.
Sentiment analysis - quantifying how positive or negative the text is.
Named entity recognition - detecting references to entities, including people, locations, time periods, organizations, and more.
Entity linking - identifying specific entities by providing reference links to Wikipedia articles.
You can provision Text Analytics as a single-service resource, or you can use the Text Analytics API in a multi-service Cognitive Services resource.
Language Detection
The Language Detection API evaluates text input and, for each document submitted, returns language identifiers with a score indicating the strength of the analysis. Text Analytics recognizes up to 120 languages.
This capability is useful for content stores that collect arbitrary text, where language is unknown. Another scenario could involve a chat bot. If a user starts a session with the chat bot, language detection can be used to determine which language they are using and allow you to configure your bot responses in the appropriate language.
You can parse the results of this analysis to determine which language is used in the input document. The response also returns a score, which reflects the confidence of the model (a value between 0 and 1).
Language detection can work with documents or single phrases. It's important to note that the document size must be under 5,120 characters. The size limit is per document and each collection is restricted to 1,000 items (IDs). A sample of a properly formatted JSON payload that you might submit to the service in the request body is shown here, including a collection of documents, each containing a unique id and the text to be analyzed. Optionally, you can provide a countryHint to improve prediction performance.
{
"documents": [
{
"countryHint": "US",
"id": "1",
"text": "Hello world"
},
{
"id": "2",
"text": "Bonjour tout le monde"
}
]
}
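A payload like this can be assembled programmatically. The following sketch builds the request body from a list of strings; the helper name and the optional country-hint mapping are illustrative conveniences, not part of the service API.

```python
import json

def build_detection_payload(texts, country_hints=None):
    """Build a Language Detection request body from a list of strings.

    country_hints maps a document id to an optional country code that
    the service can use to improve its prediction.
    """
    country_hints = country_hints or {}
    documents = []
    for i, text in enumerate(texts, start=1):
        doc = {"id": str(i), "text": text}
        if str(i) in country_hints:
            doc["countryHint"] = country_hints[str(i)]
        documents.append(doc)
    return {"documents": documents}

payload = build_detection_payload(
    ["Hello world", "Bonjour tout le monde"],
    country_hints={"1": "US"},
)
print(json.dumps(payload, indent=2))
```

The resulting dictionary serializes to the same JSON shape shown above and can be sent as the request body.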
The service will return a JSON response that contains a result for each document in the request body, including the predicted language and a value indicating the confidence level of the prediction. The confidence level is a value ranging from 0 to 1 with values closer to 1 being a higher confidence level. Here's an example of a standard JSON response that maps to the above request JSON.
{
"documents": [
{
"id": "1",
"detectedLanguage": {
"name": "English",
"iso6391Name": "en",
"confidenceScore": 1
},
"warnings": []
},
{
"id": "2",
"detectedLanguage": {
"name": "French",
"iso6391Name": "fr",
"confidenceScore": 1
},
"warnings": []
}
],
"errors": [],
"modelVersion": "2020-04-01"
}
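Parsing this response is a simple transformation. The sketch below (a local illustration, not service code) maps each document id to its detected language name and confidence score, using the sample response above.

```python
sample_response = {
    "documents": [
        {"id": "1",
         "detectedLanguage": {"name": "English", "iso6391Name": "en",
                              "confidenceScore": 1},
         "warnings": []},
        {"id": "2",
         "detectedLanguage": {"name": "French", "iso6391Name": "fr",
                              "confidenceScore": 1},
         "warnings": []},
    ],
    "errors": [],
    "modelVersion": "2020-04-01",
}

def detected_languages(response):
    """Map each document id to (language name, confidence score)."""
    return {
        doc["id"]: (doc["detectedLanguage"]["name"],
                    doc["detectedLanguage"]["confidenceScore"])
        for doc in response["documents"]
    }

print(detected_languages(sample_response))
```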
In this sample, both documents return a confidence score of 1, because the text is short and unambiguous, making the language easy to identify.
If you pass in a document that has multilingual content, the service will behave a bit differently. Mixed language content within the same document returns the language with the largest representation in the content, but with a lower positive rating, reflecting the marginal strength of that assessment. In the following example, the input is a blend of English, Spanish, and French. The analyzer uses statistical analysis of the text to determine the predominant language.
{
"documents": [
{
"id": "1",
"text": "Hello, I would like to take a class at your University. ¿Se ofrecen clases en español? Es mi primera lengua y más fácil para escribir. Que diriez-vous des cours en français?"
}
]
}
The following sample shows a response for this multi-language example.
{
"documents": [
{
"id": "1",
"detectedLanguages": [
{
"name": "Spanish",
"iso6391Name": "es",
"score": 0.9375
}
]
}
],
"errors": []
}
The last condition to consider is ambiguity about the language content. This scenario can occur if you submit textual content that the analyzer is not able to parse, for example because of character encoding issues when converting the text to a string variable. In this case, the response for the language name and ISO code indicates (unknown), and the score value is returned as NaN (Not a Number).
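Defensive client code should check for this case before trusting a result. A minimal sketch, assuming the sentinel values described above (a name of (unknown) and a NaN score; the exact casing of the sentinel string is an assumption here, so it is compared case-insensitively):

```python
import math

def is_unknown_language(detected):
    """True when the service could not identify the language.

    Assumes the sentinels described above: a name of "(unknown)"
    (compared case-insensitively) or a NaN confidence score.
    """
    name = str(detected.get("name", "")).lower()
    score = detected.get("confidenceScore", float("nan"))
    return name == "(unknown)" or (isinstance(score, float) and math.isnan(score))

print(is_unknown_language({"name": "(Unknown)", "confidenceScore": float("nan")}))
print(is_unknown_language({"name": "English", "confidenceScore": 1.0}))
```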
Key Phrase Extraction
Key phrase extraction is the process of evaluating the text of a document, or documents, and then identifying the main points around the context of the document(s).
Key phrase extraction works best for larger documents (the maximum size that can be analyzed is 5,120 characters).
Just as with language detection, the REST interface enables you to submit one or more documents for analysis.
{
"documents": [
{
"id": "1",
"language": "en",
"text": "You must be the change you wish
to see in the world."
},
{
"id": "2",
"language": "en",
"text": "The journey of a thousand miles
begins with a single step."
}
]
}
The response contains a list of key phrases detected in each document.
{ "documents": [ { "id": "1", "keyPhrases": [ "change", "world" ], "warnings": [] }, { "id": "2", "keyPhrases": [ "miles", "single step", "journey" ], "warnings": [] } ], "errors": [], "modelVersion": "2020-04-01" }
Sentiment Analysis
Sentiment analysis is used to evaluate how positive or negative a text document is, which can be useful in a variety of workloads, such as:
Evaluating a movie, book, or product by quantifying sentiment based on reviews.
Prioritizing customer service responses to correspondence received through email or social media messaging.
When using the Text Analytics service to evaluate sentiment, the response includes overall document sentiment and individual sentence sentiment for each document submitted to the service.
For example, you could submit a single document for sentiment analysis like this:
{
"documents": [
{
"language": "en",
"id": "1",
"text": "Smile! Life is good!"
}
]
}
The response from the service might look like this:
{
"documents": [
{
"id": "1",
"sentiment": "positive",
"confidenceScores": {
"positive": 0.99,
"neutral": 0.01,
"negative": 0.00
},
"sentences": [
{
"text": "Smile!",
"sentiment": "positive",
"confidenceScores": {
"positive": 0.97,
"neutral": 0.02,
"negative": 0.01
},
"offset": 0,
"length": 6
},
{
"text": "Life is good!",
"sentiment": "positive",
"confidenceScores": {
"positive": 0.98,
"neutral": 0.02,
"negative": 0.00
},
"offset": 7,
"length": 13
}
],
"warnings": []
}
],
"errors": [],
"modelVersion": "2020-04-01"
}
Sentence sentiment is based on confidence scores for positive, negative, and neutral classification values between 0 and 1.
Overall document sentiment is based on sentences:
If all sentences are neutral, the overall sentiment is neutral.
If sentence classifications include only positive and neutral, the overall sentiment is positive.
If the sentence classifications include only negative and neutral, the overall sentiment is negative.
If the sentence classifications include positive and negative, the overall sentiment is mixed.
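These rules can be expressed directly in code. The sketch below derives document sentiment from a list of sentence labels; it is a local re-implementation of the rules for illustration only, since the service computes the overall sentiment for you.

```python
def overall_sentiment(sentence_labels):
    """Derive document sentiment from sentence labels using the rules above."""
    labels = set(sentence_labels)
    if labels <= {"neutral"}:              # all sentences neutral
        return "neutral"
    if {"positive", "negative"} <= labels:  # both present
        return "mixed"
    return "positive" if "positive" in labels else "negative"

print(overall_sentiment(["positive", "positive"]))  # positive
print(overall_sentiment(["positive", "negative"]))  # mixed
```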
Named Entity Recognition
Named Entity Recognition identifies entities that are mentioned in the text. Entities are grouped into categories and subcategories, for example:
Person
Location
DateTime
Organization
Address
Email
URL
Note: For a full list of categories, see the Text Analytics documentation.
Input for entity recognition is similar to input for other Text Analytics functions:
{
"documents": [
{
"language": "en",
"id": "1",
"text": "Joe went to London on Saturday"
}
]
}
The response includes a list of categorized entities found in each document:
{ "documents":[ { "id":"1", "entities":[ { "text":"Joe", "category":"Person", "offset":0, "length":3, "confidenceScore":0.62 }, { "text":"London", "category":"Location", "subcategory":"GPE", "offset":12, "length":6, "confidenceScore":0.88 }, { "text":"Saturday", "category":"DateTime", "subcategory":"Date", "offset":22, "length":8, "confidenceScore":0.8 } ], "warnings":[] } ], "errors":[], "modelVersion":"2021-01-15" }
Entity Linking
In some cases, the same name might be applicable to more than one entity. For example, does an instance of the word “Venus” refer to the planet or the Greek goddess?
Entity linking can be used to disambiguate entities of the same name by referencing an article in a knowledge base. Wikipedia provides the knowledge base for the Text Analytics service. Specific article links are determined based on entity context within the text.
For example, “I saw Venus shining in the sky” is associated with the link https://en.wikipedia.org/wiki/Venus; while "Venus, the goddess of beauty" is associated with https://en.wikipedia.org/wiki/Venus_(mythology).
As with all Text Analytics functions, you can submit one or more documents for analysis:
{
"documents": [
{
"language": "en",
"id": "1",
"text": "I saw Venus shining in the sky"
}
]
}
The response includes the entities identified in the text along with links to associated articles:
{ "documents": [ { "id":"1", "entities":[ { "name":"Venus", "matches":[ { "text":"Venus", "offset":6, "length":5, "confidenceScore":0.01 } ], "language":"en", "id":"Venus", "url":"https://en.wikipedia.org/wiki/Venus", "dataSource":"Wikipedia" } ], "warnings":[] } ], "errors":[], "modelVersion":"2020-02-01" }
The Translator Service
The Translator service provides a multilingual text translation API that you can use for:
Language detection
One-to-many translation
Script transliteration (converting text from its native script to an alternative script)
You can provision Translator as a single-service resource, or use the Translator API in a multi-service Cognitive Services resource.
Detection, Translation, and Transliteration
Let's explore the capabilities of the Translator service.
Language detection
You can use the detect REST function to detect the language in which text is written.
For example, you could submit the following request:
{ "Text" : "こんにちは" }
The response to this request looks like this, indicating that the text is written in Japanese:
[
{
"isTranslationSupported": true,
"isTransliterationSupported": true,
"language": "ja",
"score": 1.0
}
]
Translation
To translate text from one language to another, use the translate function; specifying a single from parameter to indicate the source language, and one or more to parameters to specify the languages into which you want the text translated.
For example, you could submit the same JSON we previously used to detect the language, specifying a from parameter of ja (Japanese) and two to parameters with the values en (English) and fr (French). This would produce the following result:
[
{"translations":
[
{"text": "Hello", "to": "en"},
{"text": "Bonjour", "to": "fr"}
]
}
]
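The from and to parameters are passed as query-string parameters, with to repeated once per target language. The sketch below builds such a URL; the base URL is a placeholder for illustration, not necessarily the real service endpoint.

```python
from urllib.parse import urlencode

def build_translate_url(base_url, from_lang, to_langs, **extra_params):
    """Build a translate request URL; 'to' is repeated for each target."""
    params = [("api-version", "3.0"), ("from", from_lang)]
    params.extend(("to", lang) for lang in to_langs)
    params.extend(extra_params.items())
    return base_url + "?" + urlencode(params)

url = build_translate_url(
    "https://example.cognitiveservices.azure.com/translate",  # placeholder
    "ja", ["en", "fr"],
)
print(url)
```

Optional parameters (such as those described in the next section) can be passed as extra keyword arguments.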
Transliteration
Our Japanese text is written using Kanji script, so rather than translate it to a different language, you may want to transliterate it to a different script - for example to render the text in Latin script (as used by English language text).
To accomplish this, we can submit the Japanese text to the transliterate function with a fromScript parameter of Jpan and a toScript parameter of Latn to get the following result:
[ { "script": "Latn", "text": "Kon'nichiwa" } ]
Translation Options
The translate function supports numerous parameters that affect the output.
Word alignment
In written English (using Latin script), spaces are used to separate words. However, in some other languages (and more specifically, scripts) this is not always the case.
For example, translating “Cognitive Services” from en (English) to zh (Simplified Chinese) produces the result "认知服务", and it's difficult to understand the relationship between the characters in the source text and the corresponding characters in the translation. To resolve this problem, you can specify the includeAlignment parameter with a value of true to produce the following result:
[
{"translations":
[
{"text": "认知服务", "to": "zh-Hans",
"alignment": {"proj": "0:8-0:1 10:17-2:3"}
}
]
}
]
These results tell us that characters 0 to 8 in the source ("Cognitive") correspond to characters 0 to 1 in the translation ("认知"), while characters 10 to 17 in the source ("Services") correspond to characters 2 to 3 in the translation ("服务").
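The alignment string has a simple structure that client code can parse: space-separated pairs of srcStart:srcEnd-tgtStart:tgtEnd character ranges. A minimal parser (an illustration, not part of any service SDK):

```python
def parse_alignment(proj):
    """Parse an alignment string like '0:8-0:1 10:17-2:3' into span pairs.

    Each pair maps an inclusive source character range to an inclusive
    target character range.
    """
    pairs = []
    for item in proj.split():
        src, tgt = item.split("-")
        src_span = tuple(int(n) for n in src.split(":"))
        tgt_span = tuple(int(n) for n in tgt.split(":"))
        pairs.append((src_span, tgt_span))
    return pairs

print(parse_alignment("0:8-0:1 10:17-2:3"))
```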
Sentence length
Sometimes it might be useful to know the length of a translation, for example to determine how best to display it in a user interface. You can get this information by setting the includeSentenceLength parameter to true.
For example, specifying this parameter when translating the English (en) text “Hello world!” to French (fr) produces the following results:
[
{"translations":
[
{"text": "Salut tout le monde!", "to": "fr",
"sentLen":{"srcSentLen":[12], "transSentLen":[20]}
}
]
}
]
Profanity filtering
Sometimes text contains profanities, which you might want to obscure or omit altogether in a translation. You can handle profanities by specifying the profanityAction parameter, which can have one of the following values:
NoAction: Profanities are translated along with the rest of the text.
Deleted: Profanities are omitted in the translation.
Marked: Profanities are indicated using the technique indicated in the profanityMarker parameter (if supplied). The default value for this parameter is Asterisk, which replaces characters in profanities with "*". As an alternative, you can specify a profanityMarker value of Tag, which causes profanities to be enclosed in XML tags.
For example, translating the English (en) text “JSON is ▇▇▇▇ great!” (where the blocked out word is a profanity) to French (fr) with a profanityAction of Marked and a profanityMarker of Asterisk produces the following result:
[ {"translations": [ {"text": "JSON est *** génial!", "to": "fr"} ] } ]
Custom Translation
While the default translation model used by the Translator service is effective for general translation, you may need to develop a translation solution for businesses or industries that have specific vocabularies of terms that require custom translation.
To solve this problem, you can create a custom model that maps your own sets of source and target terms for translation. To create a custom model, use the Custom Translator portal to:
Create a workspace linked to your Translator resource
Create a project
Upload training data files
Train a model
Your custom model is assigned a unique category Id, which you can specify in translate calls to your Translator resource by using the category parameter, causing translation to be performed by your custom model instead of the default model.
Analyze Text
The Text Analytics API is a cognitive service that supports analysis of text, including language detection, sentiment analysis, key phrase extraction, and entity recognition.
For example, suppose a travel agency wants to process hotel reviews that have been submitted to the company's web site. By using the Text Analytics API, they can determine the language each review is written in, the sentiment (positive, neutral, or negative) of the reviews, key phrases that might indicate the main topics discussed in the review, and named entities, such as places, landmarks, or people mentioned in the reviews.
Translate Text
The Translator service is a cognitive service that enables you to translate text between languages.
For example, suppose a travel agency wants to examine hotel reviews that have been submitted to the company's web site, standardizing on English as the language used for analysis. By using the Translator service, they can determine the language each review is written in and, if it is not already English, translate it into English from whatever source language it was written in.