Similarity and Training Data

ALTCHA Sentinel’s Similarity Detection feature scores how closely input text resembles known examples of spam, abuse, or other unwanted content. It helps keep user-generated platforms clean and safe by automatically identifying problematic submissions.

What It Does

Similarity Detection compares incoming text to a curated set of example phrases using fast, language-aware matching. It is especially effective at spotting recurring spam, phishing patterns, or specific banned content, even when the text has been modified.

Detect spam and phishing attempts using pre-trained examples
Customize detection sensitivity with configurable thresholds and weighting
Continuously improve results by training on user-reported examples

This feature is ideal for chat systems, forums, comment sections, and any environment where moderation at scale is required.

How It Works

Similarity scoring is based on the cosine similarity between embeddings generated by the open-source model all-MiniLM-L6-v2. This model is optimized for sentence-level comparison in English and supports other languages with reasonable accuracy.

Input text longer than 256 tokens is truncated automatically
Precision may vary for non-English inputs
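
The cosine-similarity comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not Sentinel's implementation: the helper names (`cosine_similarity`, `best_match`), the example labels, and the 3-dimensional toy vectors are all assumptions made for readability. Real all-MiniLM-L6-v2 embeddings have 384 dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Convention chosen for this sketch: a zero vector matches nothing.
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(input_vec, examples):
    """Return the (label, score) pair of the most similar stored example."""
    scored = ((label, cosine_similarity(input_vec, vec))
              for label, vec in examples.items())
    return max(scored, key=lambda pair: pair[1])


# Toy vectors standing in for embeddings of known spam examples (hypothetical).
examples = {
    "free-crypto-giveaway": [0.9, 0.1, 0.0],
    "account-verification-phish": [0.1, 0.9, 0.2],
}

# Toy vector standing in for the embedding of an incoming message.
incoming = [0.8, 0.2, 0.1]

label, score = best_match(incoming, examples)
print(f"closest example: {label} (similarity {score:.3f})")
```

Scores close to 1 mean the input points in nearly the same direction as a stored example, which is why lightly reworded spam can still match.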

When used with the Classifier endpoint, content that matches a known example increases the overall spam score. The configurable thresholds and weighting let you fine-tune how and when content is flagged.
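
One way to picture how a threshold and a weight could interact with the spam score is the sketch below. The combination rule, the function name `combined_spam_score`, the parameter names, the default values, and the 0-to-1 score scale are all assumptions for illustration; the actual formula used by the Classifier endpoint is not described here.

```python
def combined_spam_score(base_score, similarity, threshold=0.8, weight=0.5):
    """Hypothetical rule: boost the score only when similarity passes the threshold.

    base_score -- spam score from other signals, assumed to lie in [0, 1]
    similarity -- best cosine similarity against the known examples
    threshold  -- minimum similarity that counts as a match (assumed default)
    weight     -- how strongly a match raises the score (assumed default)
    """
    if similarity < threshold:
        # Below the threshold the match is ignored and the score is unchanged.
        return base_score
    # Above the threshold, add a weighted bonus and cap the result at 1.0.
    return min(1.0, base_score + weight * similarity)


weak = combined_spam_score(0.3, 0.5)     # similarity below threshold
strong = combined_spam_score(0.3, 0.9)   # similarity above threshold
print(weak, strong)
```

Under this rule, raising the threshold makes matching stricter, while raising the weight makes each match count for more.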

Example Use Cases

Chat & forum moderation: Instantly flag known spam and promotional scams
User-reported spam training: Improve results using reports from real users
Phrase-based filtering: Block or monitor specific words, phrases, or behaviors
