Lexical Chunking and Tokenization
Tisane uses a unified, logical, morpheme-based representation for lexical chunks.
In compounding languages like German, compounds are split into their constituents.
Conversely, idiomatic multi-word expressions ("kung fu", "power plant", "clay pigeon") are treated as single lexemes.
Examples
- English: "I don't see the power plant." => ["I", "do", "n't", "see", "the", "power plant", "."]
- German: "Jetzt sollen die Stahlkugeln ersetzt werden." => ["Jetzt", "sollen", "die", "Stahl", "kugeln", "ersetzt", "werden", "."]
- Simplified Chinese: "我给了老张三本书" => ["我", "给了", "老张", "三", "本", "书"] (In languages written without white space, particles are often joined to the word they modify.)
- Spanish: "Asimismo, San Francisco es una de las mejores ciudades de EE. UU." => ["Asimismo", ",", "San Francisco", "es", "una", "de", "las", "mejores", "ciudades", "de", "EE. UU."]
How To Use
To use Tisane for tokenization/lexical chunking (see the sketch after these steps):

- Specify `"words": true` in your `settings`.
- In the response, traverse all elements in the `sentence_list` section (individual sentences).
- The lexical chunks are under `words`.
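A minimal sketch of these steps in Python. The endpoint URL, the `Ocp-Apim-Subscription-Key` header, and the per-word `text` field are assumptions based on Tisane's hosted API and should be verified against the current API reference:

```python
# A minimal sketch, not an official client.
import requests

API_KEY = "YOUR_TISANE_API_KEY"           # placeholder; substitute your own key
ENDPOINT = "https://api.tisane.ai/parse"  # assumed hosted parse endpoint

payload = {
    "language": "en",
    "content": "I don't see the power plant.",
    "settings": {"words": True},  # request lexical chunks in the response
}

response = requests.post(
    ENDPOINT,
    json=payload,
    headers={"Ocp-Apim-Subscription-Key": API_KEY},
)
response.raise_for_status()

# Traverse every sentence in sentence_list; each sentence carries its
# lexical chunks under "words". The "text" field holding the surface form
# of each chunk is an assumption; verify against an actual response.
for sentence in response.json().get("sentence_list", []):
    chunks = [word.get("text") for word in sentence.get("words", [])]
    print(chunks)
    # expected, per the English example above:
    # ['I', 'do', "n't", 'see', 'the', 'power plant', '.']
```

Note that multi-word expressions like "power plant" come back as a single chunk, and compound constituents (as in the German example) appear as separate entries, matching the representation described at the top of this section.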