Lexical Chunking and Tokenization
Tisane uses a unified, logical, morpheme-based representation for lexical chunks.
In compounding languages like German, compounds are split into their constituents.
Conversely, idiomatic multi-word expressions ("kung fu", "power plant", "clay pigeon") are treated as single lexemes.
Examples
- English: "I don't see the power plant." => ["I", "do", "n't", "see", "the", "power plant", "."]
- German: "Jetzt sollen die Stahlkugeln ersetzt werden." => ["Jetzt", "sollen", "die", "Stahl", "kugeln", "ersetzt", "werden", "."]
- Simplified Chinese: "我给了老张三本书" => ["我", "给了", "老张", "三", "本", "书"] (In languages written without white space, particles are often joined to the word they modify.)
- Spanish: "Asimismo, San Francisco es una de las mejores ciudades de EE. UU." => ["Asimismo", ",", "San Francisco", "es", "una", "de", "las", "mejores", "ciudades", "de", "EE. UU."]
How To Use
To use Tisane for tokenization/lexical chunking (see the sketch after these steps):

- Specify `"words": true` in your `settings`.
- In the response, traverse all elements in the `sentence_list` section (individual sentences).
- The lexical chunks are under `words`.
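A minimal sketch of these steps in Python. The endpoint URL, the `Ocp-Apim-Subscription-Key` header, and the per-word `text` field are assumptions based on Tisane's hosted API and should be verified against the current API reference:

```python
# A minimal sketch, not an official client.
import requests

API_KEY = "YOUR_TISANE_API_KEY"           # placeholder; substitute your own key
ENDPOINT = "https://api.tisane.ai/parse"  # assumed hosted parse endpoint

payload = {
    "language": "en",
    "content": "I don't see the power plant.",
    "settings": {"words": True},  # request lexical chunks in the response
}

response = requests.post(
    ENDPOINT,
    json=payload,
    headers={"Ocp-Apim-Subscription-Key": API_KEY},
)
response.raise_for_status()

# Traverse every sentence in sentence_list; each sentence carries its
# lexical chunks under "words". The "text" field holding the surface form
# of each chunk is an assumption; verify against an actual response.
for sentence in response.json().get("sentence_list", []):
    chunks = [word.get("text") for word in sentence.get("words", [])]
    print(chunks)
    # expected, per the English example above:
    # ['I', 'do', "n't", 'see', 'the', 'power plant', '.']
```

Note that multi-word expressions like "power plant" come back as a single chunk, and compound constituents (as in the German example) appear as separate entries, matching the representation described at the top of this section.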