Algospeak and Adversarial Text Manipulations
Tisane uses a special type of built-in spellchecker module to process text with both unintentional errors (misspellings) and adversarial text manipulations (e.g. algospeak).
The spellchecker employs several techniques to handle different types of manipulations (masking characters, substitutions, etc.). These corrections are not limited to profanities or slurs, and they take context into account: the same misspelled word may be interpreted differently in different sentences.
If corrections were found to be necessary in a sentence, the sentence gets a corrected_text attribute where the corrected text is logged. (Set words to true to output sentence data.)
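As a minimal sketch of how a client might read these corrections, the Python snippet below walks a parsed response and collects the original and corrected text of every sentence that carries a corrected_text attribute. The function name collect_corrections and the sample dictionary are hypothetical, introduced only for illustration; the field names (sentence_list, text, corrected_text) are taken from the example response on this page.

```python
from typing import Dict, List, Tuple


def collect_corrections(response: Dict) -> List[Tuple[str, str]]:
    """Return (original, corrected) pairs for sentences that were corrected.

    Hypothetical helper: sentences without a corrected_text attribute
    are assumed to have needed no correction and are skipped.
    """
    pairs = []
    for sentence in response.get("sentence_list", []):
        corrected = sentence.get("corrected_text")
        if corrected is not None:
            pairs.append((sentence["text"], corrected))
    return pairs


# Trimmed-down sample shaped like the response in the Example section.
sample = {
    "text": "I will br*k his neck and kll him",
    "sentence_list": [
        {
            "offset": 0,
            "text": "I will br*k his neck and kll him",
            "corrected_text": "I will break his neck and kill him",
        }
    ],
}

for original, corrected in collect_corrections(sample):
    print(f"{original!r} -> {corrected!r}")
```

Note that sentence_list is only present when sentence data is requested, so the helper falls back to an empty list when the key is missing.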
Limitations
The spellchecker is not a "did you mean" tool, contrary to what many people assume:
- If the word is a legitimate word, no matter how misused or esoteric, Tisane will not correct it. For example, if noun is misspelled as nun, or house is misspelled as horse, Tisane won't help (unless it's part of a known, frequently obfuscated concept, e.g. corn star in English).
- The primary purpose of the spellchecker is to decipher obfuscations. Therefore, the spellchecker is biased toward more profane, objectionable, or heavily used concepts.
Excluding Esoteric Senses And Words To Get Better Results
To get around the issue, you can use the min_generic_frequency parameter.
This allows you to exclude the most esoteric senses and words.
The frequency is graded between 0 and 10, with 10 being the most frequent.
Some esoteric senses are also graded at -10.
We recommend you initially set min_generic_frequency to 1 or 2 to see if it works in your situation.
Excluding Potential Proper Nouns
If you need to avoid spell-checking potential proper nouns, set lowercase_spellcheck_only to true.
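The two settings described above could be combined in a single request body, roughly as sketched below. This is an illustration under assumptions, not a definitive client: the helper name build_request is hypothetical, and the sketch assumes that min_generic_frequency and lowercase_spellcheck_only are passed inside the settings object next to words, as the other settings in the example request are.

```python
import json
from typing import Optional


def build_request(
    content: str,
    language: str = "en",
    min_generic_frequency: Optional[int] = None,
    lowercase_spellcheck_only: bool = False,
) -> dict:
    """Build a request body with optional spellchecker-related settings.

    Hypothetical helper; assumes both parameters belong in "settings".
    """
    # "words": True asks for sentence data, which carries corrected_text.
    settings = {"words": True}
    if min_generic_frequency is not None:
        # Exclude senses and words graded below this frequency.
        settings["min_generic_frequency"] = min_generic_frequency
    if lowercase_spellcheck_only:
        # Skip spell-checking of potential proper nouns.
        settings["lowercase_spellcheck_only"] = True
    return {"language": language, "content": content, "settings": settings}


body = build_request(
    "I will br*k his neck and kll him",
    min_generic_frequency=1,
    lowercase_spellcheck_only=True,
)
print(json.dumps(body, indent=2))
```

Starting with min_generic_frequency at 1 and raising it to 2 only if esoteric interpretations still appear keeps the change easy to evaluate.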
Example
Request:
{
"language":"en",
"content":"I will br*k his neck and kll him",
"settings":
{
"words":true,"topics":false,"sentiment":false,"snippets":true
}
}
Response:
{
"text": "I will br*k his neck and kll him",
"abuse": [
{
"sentence_index": 0,
"offset": 0,
"length": 32,
"text": "I will br*k his neck and kll him",
"type": "criminal_activity",
"severity": "medium",
"tags": [
"threat",
"violence",
"death"
]
}
],
"sentence_list": [
{
"offset": 0,
"text": "I will br*k his neck and kll him",
"words": [
{
"type": "word",
"offset": 0,
"text": "I",
"lettercase": "capitalized",
"role": "agent",
"lexeme": 63061,
"family": 301,
"grammar": [
"PRON"
],
"stopword": true
},
{
"type": "word",
"offset": 2,
"text": "will",
"lexeme": 146938,
"family": 316,
"grammar": [
"VERB"
],
"stopword": true
},
{
"type": "word",
"offset": 7,
"text": "br*k",
"role": "verb",
"lexeme": 20996,
"family": 107846,
"grammar": [
"VERB"
]
},
{
"type": "word",
"offset": 12,
"text": "his",
"lexeme": 63064,
"family": 303,
"grammar": [
"DET"
],
"stopword": true
},
{
"type": "word",
"offset": 16,
"text": "neck",
"lexeme": 93293,
"family": 40510,
"wikidata": "Q9633",
"grammar": [
"NOUN"
]
},
{
"type": "word",
"offset": 21,
"text": "and",
"lexeme": 4096,
"family": 322,
"grammar": [
"CCONJ"
],
"stopword": true
},
{
"type": "word",
"offset": 25,
"text": "kll",
"role": "verb",
"lexeme": 77380,
"family": 113102,
"grammar": [
"VERB"
]
},
{
"type": "word",
"offset": 29,
"text": "him",
"role": "patient",
"lexeme": 63062,
"family": 303,
"grammar": [
"PRON"
],
"stopword": true
}
],
"corrected_text": "I will break his neck and kill him"
}
]
}
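The response reports the corrected sentence as a whole rather than word by word. If a per-word view is needed, one possible approach is to diff the original and corrected text on the client side. The sketch below uses Python's standard difflib module for this; the function name changed_words is hypothetical, and splitting on whitespace is a simplifying assumption that will not suit every language or input.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def changed_words(original: str, corrected: str) -> List[Tuple[str, str]]:
    """Return (original fragment, corrected fragment) pairs that differ.

    Hypothetical helper: tokens are obtained by splitting on whitespace,
    then aligned with difflib so that unchanged words are skipped.
    """
    source_tokens = original.split()
    target_tokens = corrected.split()
    matcher = SequenceMatcher(a=source_tokens, b=target_tokens, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append(
                (" ".join(source_tokens[i1:i2]), " ".join(target_tokens[j1:j2]))
            )
    return changes


pairs = changed_words(
    "I will br*k his neck and kll him",
    "I will break his neck and kill him",
)
print(pairs)  # [('br*k', 'break'), ('kll', 'kill')]
```

Aligning with a diff instead of pairing words by position keeps the helper usable when a correction changes the number of words.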