Topic Extraction
Topic extraction determines the dominant topics in the text.
This functionality is also known as:
- theme identification
- subject detection
- key topic recognition
Tisane stores the topics under the topics array (strings without topic_stats, objects with topic_stats). The topics are document-level.
When a word has multiple interpretations, its sense must be determined from the current context. For example, Jupiter is both a planet and a Roman deity. Which of the two is meant depends on the text.
For example, the sentence Juno is the wife of Jupiter refers to the deity. Tisane determines the relevant topics as Roman mythology, supernatural (gods), relationship, and family (since the spousal connection is mentioned).
{
  "text": "Juno is the wife of Jupiter",
  "topics": [
    "supernatural",
    "Roman mythology",
    "relationship",
    "family"
  ]
}
On the other hand, the sentence Jupiter is farther from the sun than Mars refers to planets. Tisane determines the topics to be outer space and astronomy.
{
  "text": "Jupiter is farther from the sun than Mars",
  "topics": [
    "outer space",
    "astronomy"
  ]
}
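The contrast between the two responses can be checked programmatically. The following is a minimal Python sketch that only parses the two example responses shown above; the variable names are illustrative and no Tisane client library is assumed.

```python
import json

# The two example responses from this section, as JSON strings.
deity_response = json.loads("""
{
  "text": "Juno is the wife of Jupiter",
  "topics": ["supernatural", "Roman mythology", "relationship", "family"]
}
""")
planet_response = json.loads("""
{
  "text": "Jupiter is farther from the sun than Mars",
  "topics": ["outer space", "astronomy"]
}
""")

deity_topics = set(deity_response["topics"])
planet_topics = set(planet_response["topics"])

# Both sentences mention Jupiter, yet the topic sets do not overlap.
shared_topics = deity_topics & planet_topics
print(sorted(shared_topics))  # prints []
```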
Topic Statistics
If the setting topic_stats is set to true, the response also reports the portion of the input where each topic is active. Each topic is then provided not as a string but as an object made of the topic itself (the topic attribute, a string) and its distribution statistic (the coverage attribute, a float).
Example
Request:
{
  "language": "en",
  "content": "Jupiter is farther from the sun than Mars. Which is not important in the current context",
  "settings": {
    "topic_stats": true
  }
}
Response:
{
  "text": "Jupiter is farther from the sun than Mars. Which is not important in the current context",
  "topics": [
    {
      "topic": "outer space",
      "coverage": 0.5
    },
    {
      "topic": "astronomy",
      "coverage": 0.5
    }
  ]
}
Both detected topics appear in 1 sentence out of 2, so the coverage of each is 0.5.
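Because the topics array holds strings in one mode and objects in the other, client code may want to accept both shapes. The sketch below is one way to do that in Python. Both normalize_topics and sentence_coverage are hypothetical helper names, not part of Tisane, and sentence_coverage only reproduces the 1-out-of-2 arithmetic from the example rather than Tisane's internal computation.

```python
from typing import Any, Dict, List, Optional, Tuple


def normalize_topics(response: Dict[str, Any]) -> List[Tuple[str, Optional[float]]]:
    """Return (topic, coverage) pairs; coverage is None when topic_stats was off."""
    pairs: List[Tuple[str, Optional[float]]] = []
    for entry in response.get("topics", []):
        if isinstance(entry, str):
            # Plain string form: no statistics were requested.
            pairs.append((entry, None))
        else:
            # Object form: {"topic": ..., "coverage": ...}
            pairs.append((entry["topic"], entry.get("coverage")))
    return pairs


def sentence_coverage(sentences_with_topic: int, total_sentences: int) -> float:
    """Fraction of sentences in which a topic is active (assumed definition)."""
    if total_sentences <= 0:
        raise ValueError("total_sentences must be positive")
    return sentences_with_topic / total_sentences


with_stats = {
    "topics": [
        {"topic": "outer space", "coverage": 0.5},
        {"topic": "astronomy", "coverage": 0.5},
    ]
}
without_stats = {"topics": ["outer space", "astronomy"]}

print(normalize_topics(with_stats))     # [('outer space', 0.5), ('astronomy', 0.5)]
print(normalize_topics(without_stats))  # [('outer space', None), ('astronomy', None)]
print(sentence_coverage(1, 2))          # 0.5
```

Returning None for the missing coverage keeps the two modes distinguishable, so callers do not mistake "no statistics requested" for a coverage of zero.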
Standards
There are common taxonomy standards that Tisane can use with the topic_standard setting:
- native: native Tisane topic names, based on standard English terms for the topic. The default standard.
- iptc_code: codes of the IPTC (International Press Telecommunications Council) Media Topics classification, a standard used in the media.
- iptc_description: English descriptions of the IPTC codes.
- iab_code: codes of the IAB (Interactive Advertising Bureau) content taxonomy.
- iab_description: English descriptions of the IAB codes.
- wikidata: Wikidata codes (usually of the form Qnnnnn, e.g. Q123).
To specify the standard, add the topic_standard setting.
Example
Request:
{
  "language": "en",
  "content": "Jupiter is farther from the sun than Mars.",
  "settings": {
    "topic_standard": "wikidata"
  }
}
Response:
{
  "text": "Jupiter is farther from the sun than Mars.",
  "topics": [
    "Q4169",
    "Q333"
  ]
}
The standard taxonomies cover only a small fraction of the native topics. When a concept is not covered by the selected taxonomy, it is omitted from the response.
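A request body with these settings can be assembled and validated before it is sent. The following Python sketch assumes only the request fields shown in the examples in this section (language, content, settings). The helper build_request is hypothetical, and the set of accepted values is copied from the list of standards above. Sending the request is left out, since the endpoint and authentication are not covered here.

```python
import json
from typing import Any, Dict, Optional

# Values of topic_standard listed in this section.
TOPIC_STANDARDS = {
    "native",
    "iptc_code",
    "iptc_description",
    "iab_code",
    "iab_description",
    "wikidata",
}


def build_request(
    content: str,
    language: str = "en",
    topic_standard: Optional[str] = None,
    topic_stats: bool = False,
) -> Dict[str, Any]:
    """Assemble a request body; hypothetical helper, not part of Tisane."""
    settings: Dict[str, Any] = {}
    if topic_standard is not None:
        if topic_standard not in TOPIC_STANDARDS:
            raise ValueError(f"unknown topic_standard: {topic_standard!r}")
        settings["topic_standard"] = topic_standard
    if topic_stats:
        settings["topic_stats"] = True
    return {"language": language, "content": content, "settings": settings}


body = build_request(
    "Jupiter is farther from the sun than Mars.",
    topic_standard="wikidata",
)
print(json.dumps(body, indent=2))
```

Since concepts missing from the chosen taxonomy are omitted from the response, a caller using a non-native standard should be prepared for a topics array shorter than the native one, or an empty one.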