
Similarity and Training Data

ALTCHA Sentinel’s Similarity Detection feature evaluates how closely input text resembles known examples of spam, abuse, or other unwanted content. It helps keep user-generated platforms clean and safe by automatically identifying problematic submissions.

What It Does

Similarity Detection compares incoming text to a curated set of example phrases using fast, language-aware matching. It’s especially effective at spotting recurring spam, phishing patterns, or specific banned content, even when the wording has been modified.

  • Detect spam and phishing attempts using pre-trained examples
  • Customize detection sensitivity with configurable thresholds and weighting
  • Continuously improve results by training on user-reported examples

This feature is ideal for chat systems, forums, comment sections, and any environment where moderation at scale is required.

How It Works

Similarity scoring is based on cosine similarity between sentence embeddings generated by the open-source model all-MiniLM-L6-v2. This model is optimized for sentence-level comparison in English and supports multiple languages with reasonable accuracy.

  • Input text longer than 256 tokens is truncated automatically
  • Precision may vary for non-English inputs
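
The scoring step described above can be sketched in a few lines of Python. This is a minimal illustration, not ALTCHA Sentinel's implementation: the function names (truncate, cosine_similarity, best_match) are hypothetical, and the short vectors stand in for the real embeddings, which in practice come from all-MiniLM-L6-v2.

```python
import math

MAX_TOKENS = 256  # truncation limit stated in the list above


def truncate(tokens, limit=MAX_TOKENS):
    """Keep only the first `limit` tokens (hypothetical helper)."""
    return tokens[:limit]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def best_match(query, examples):
    """Return (label, score) for the stored example closest to `query`.

    `examples` maps a label to its embedding vector.
    """
    scored = ((label, cosine_similarity(query, vec)) for label, vec in examples.items())
    return max(scored, key=lambda pair: pair[1])


# Toy 3-dimensional vectors standing in for real embeddings.
examples = {"spam": [1.0, 0.0, 0.0], "greeting": [0.0, 1.0, 0.0]}
label, score = best_match([0.9, 0.1, 0.0], examples)
```

A score near 1 means the input points in almost the same direction as a stored example; a score near 0 means the two are unrelated.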

When used with the Classifier endpoint, matching content increases the overall spam score, so you can fine-tune how and when content is flagged.
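
One way to picture how a match could feed into the overall score is the sketch below. Everything in it is an assumption made for illustration: the function name apply_similarity, the default threshold and weight values, and the additive formula are hypothetical and do not describe the Classifier endpoint's actual scoring.

```python
def apply_similarity(base_score, similarity, threshold=0.8, weight=0.5):
    """Raise a spam score when a similarity match is strong enough.

    Hypothetical model: matches below `threshold` are ignored; matches at or
    above it add `weight * similarity`, capped at 1.0.
    """
    if similarity < threshold:
        return base_score
    return min(1.0, base_score + weight * similarity)


strong = apply_similarity(0.3, 0.9)  # match above the threshold raises the score
weak = apply_similarity(0.3, 0.5)    # match below the threshold leaves it unchanged
```

In this model, raising the threshold makes the feature fire on fewer, closer matches, while the weight controls how much each match contributes.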

Example Use Cases

  • Chat & forum moderation: Instantly flag known spam and promotional scams
  • User-reported spam training: Improve results using reports from real users
  • Phrase-based filtering: Block or monitor specific words, phrases, or behaviors

Get Started