Topic Extraction

Topic extraction determines the dominant topics in the text.

This functionality is also known as:

theme identification
subject detection
key topic recognition

Tisane returns the topics in the topics array. Each entry is a string by default, or an object when the topic_stats setting is enabled. Topics are detected at the document level.

When a word has multiple interpretations, its sense must be determined from the current context. For example, Jupiter is both a planet and a Roman deity; which one is meant depends on the text.

For example, the sentence "Juno is the wife of Jupiter" refers to the deity. Tisane determines the relevant topics as Roman mythology, supernatural (gods), relationship, and family (since the spousal connection is mentioned).

{
	"text": "Juno is the wife of Jupiter",
	"topics": [
		"supernatural",
		"Roman mythology",
		"relationship",
		"family"
	]
}

On the other hand, the sentence "Jupiter is farther from the sun than Mars" refers to planets. Tisane determines the topics to be outer space and astronomy.

{
	"text": "Jupiter is farther from the sun than Mars",
	"topics": [
		"outer space",
		"astronomy"
	]
}

Topic Statistics

If the topic_stats setting is set to true, Tisane also reports the portion of the input in which each topic is active. Each topic is then returned as an object rather than a string: the topic attribute (string) holds the topic itself, and the coverage attribute (float) holds its distribution statistic.
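Because the shape of each entry depends on topic_stats, client code may want to accept both forms. The following is a minimal Python sketch, assuming the response has already been parsed into a dict; the helper name normalize_topics is hypothetical and not part of Tisane.

```python
from typing import Any, Optional


def normalize_topics(response: dict[str, Any]) -> list[tuple[str, Optional[float]]]:
    """Return (topic, coverage) pairs from a parsed response (hypothetical helper).

    Coverage is None when the topic was returned as a plain string,
    that is, when topic_stats was not enabled.
    """
    pairs: list[tuple[str, Optional[float]]] = []
    for entry in response.get("topics", []):
        if isinstance(entry, str):
            # Default form: the topic is a plain string.
            pairs.append((entry, None))
        else:
            # topic_stats form: an object with "topic" and "coverage" attributes.
            pairs.append((entry["topic"], entry.get("coverage")))
    return pairs


plain = {"topics": ["outer space", "astronomy"]}
with_stats = {"topics": [{"topic": "outer space", "coverage": 0.5}]}
print(normalize_topics(plain))
print(normalize_topics(with_stats))
```

Mapping both forms to the same pair structure lets the rest of the client ignore whether topic_stats was requested.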

Example

Request:

{
  "language":"en",
  "content":"Jupiter is farther from the sun than Mars. Which is not important in the current context",
  "settings": 
  {
    "topic_stats": true
  }
}

Response:

{
	"text": "Jupiter is farther from the sun than Mars. Which is not important in the current context",
	"topics": [
		{
			"topic": "outer space",
			"coverage": 0.5
		},
		{
			"topic": "astronomy",
			"coverage": 0.5
		}
	]
}

Both detected topics appear in 1 sentence out of 2, so the coverage of each topic is 0.5.
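The arithmetic behind the coverage value can be sketched in Python. This assumes, as the note above suggests, that coverage is the fraction of sentences in which a topic is active. The per-sentence topic sets are written by hand for illustration, and the function name coverage_by_topic is hypothetical; it is not Tisane's internal implementation.

```python
from collections import Counter


def coverage_by_topic(sentence_topics: list[set[str]]) -> dict[str, float]:
    """Fraction of sentences in which each topic is active (illustrative only)."""
    total = len(sentence_topics)
    if total == 0:
        return {}
    # Count, for every topic, the number of sentences that mention it.
    counts = Counter(topic for topics in sentence_topics for topic in topics)
    return {topic: n / total for topic, n in counts.items()}


# Sentence 1 is about planets; sentence 2 carries no detected topic.
sentences = [{"outer space", "astronomy"}, set()]
print(coverage_by_topic(sentences))
```

With two sentences and both topics active only in the first, each topic is counted once and divided by 2, which matches the 0.5 values in the response.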

Standards

Tisane can output topics according to several common taxonomy standards, selected with the topic_standard setting:

native - native Tisane topic names, based on standard English terms for the topics. This is the default standard.
iptc_code - codes of the IPTC (International Press Telecommunications Council) Media Topics classification - a standard used in the media.
iptc_description - English descriptions of the IPTC codes.
iab_code - codes of the IAB (Interactive Advertising Bureau) content taxonomy.
iab_description - English descriptions of the IAB codes.
wikidata - Wikidata codes (usually of the form Qnnnnn, e.g. Q123).

To specify the standard, add the topic_standard setting to the settings of the request.
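As a rough illustration, a request body that selects a standard might be assembled as follows. This Python sketch assumes that topic_standard is placed in the same settings object as topic_stats in the earlier request example; the function name build_request is hypothetical, and no endpoint or authentication details are shown because they are not covered here.

```python
import json

# Values listed in the Standards section above.
TOPIC_STANDARDS = {
    "native",
    "iptc_code",
    "iptc_description",
    "iab_code",
    "iab_description",
    "wikidata",
}


def build_request(content: str, language: str = "en",
                  topic_standard: str = "native",
                  topic_stats: bool = False) -> str:
    """Build a JSON request body (hypothetical helper, not part of Tisane)."""
    if topic_standard not in TOPIC_STANDARDS:
        raise ValueError(f"unknown topic_standard: {topic_standard!r}")
    # Assumption: topic_standard sits next to topic_stats inside "settings".
    settings: dict[str, object] = {"topic_standard": topic_standard}
    if topic_stats:
        settings["topic_stats"] = True
    return json.dumps({"language": language, "content": content, "settings": settings})


print(build_request("Jupiter is farther from the sun than Mars",
                    topic_standard="iptc_code", topic_stats=True))
```

Validating the value locally catches typos such as iptc-code before the request is sent.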