Lexical Chunking and Tokenization

Tisane uses a unified, logical, morpheme-based representation for lexical chunks.

In compounding languages such as German, compounds are split into their constituents.

Idiomatic multi-word expressions ("kung fu", "power plant", "clay pigeon") are treated as single lexemes.

Examples

  • English: "I don't see the power plant." => ["I", "do", "n't", "see", "the", "power plant", "."]
  • German: "Jetzt sollen die Stahlkugeln ersetzt werden." => ["Jetzt", "sollen", "die", "Stahl", "kugeln", "ersetzt", "werden", "."]
  • Simplified Chinese: "我给了老张三本书" => ["我", "给了", "老张", "三", "本", "书"] (In languages written without whitespace, particles are often joined to the word they modify.)
  • Spanish: "Asimismo, San Francisco es una de las mejores ciudades de EE. UU." => ["Asimismo", ",", "San Francisco", "es", "una", "de", "las", "mejores", "ciudades", "de", "EE. UU."]
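
In the response, each tokenized sentence carries its chunks in a words array under sentence_list. The fragment below is a minimal sketch of how the English example above might come back; apart from sentence_list and words, which the steps in the next section reference, field names such as text are assumptions rather than the documented schema.

```python
# Illustrative response fragment only. Field names other than
# "sentence_list" and "words" are assumptions about the shape.
response_fragment = {
    "sentence_list": [
        {
            "words": [
                {"text": "I"},
                {"text": "do"},
                {"text": "n't"},
                {"text": "see"},
                {"text": "the"},
                {"text": "power plant"},  # multi-word expression kept whole
                {"text": "."},
            ]
        }
    ]
}
```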

How To Use

To use Tisane for tokenization/lexical chunking:

  1. Specify "words":true in your settings.
  2. In the response, traverse all elements in the sentence_list section (individual sentences).
  3. The lexical chunks are in the words array of each sentence; see the sketch below.
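
Putting the steps together, the following is a minimal sketch in Python. The endpoint URL, the Ocp-Apim-Subscription-Key header, and the text field on each word entry are assumptions about a typical Tisane cloud setup; substitute the values for your own deployment.

```python
import json
import urllib.request

# Assumed endpoint and auth header for the Tisane cloud API; adjust
# these for your own deployment.
TISANE_URL = "https://api.tisane.ai/parse"
API_KEY = "your-api-key"  # placeholder

payload = {
    "language": "en",
    "content": "I don't see the power plant.",
    "settings": {"words": True},  # step 1: request lexical chunks
}

request = urllib.request.Request(
    TISANE_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "Ocp-Apim-Subscription-Key": API_KEY,  # assumed auth scheme
    },
)

with urllib.request.urlopen(request) as response:
    result = json.load(response)

# Steps 2-3: traverse sentence_list and collect the chunks under words.
for sentence in result.get("sentence_list", []):
    print([word.get("text") for word in sentence.get("words", [])])
```

For the English example above, this should print the token list shown in the Examples section, with "power plant" kept as a single chunk.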