Similarity and Training Data

The Similarity Detection feature in ALTCHA Sentinel analyzes content by comparing input text against predefined example sets. This matching helps identify patterns, detect unwanted content, and maintain quality standards wherever users submit content, such as chats, forums, and social platforms.

Feature Highlights

  • Detects spam or phishing attempts from pre-trained examples
  • Configurable thresholds and weights for the pre-trained data

Common Use Cases

Similarity matching is commonly used to automatically flag or block unwanted content by comparing user submissions against known spam, phishing, or banned material. For example:

  • Spam detection in chats/forums: Match messages against pre-trained spam templates (e.g., phishing links, promotional scams).
  • User-reported spam training: Dynamically improve detection by incorporating user-reported spam (e.g., via “Report Spam” buttons).
  • Content moderation: Enforce bans on specific phrases, harassment, or prohibited topics in user-generated content.

Precision

The similarity matching system uses cosine similarity on text embeddings generated by the open-source model all-MiniLM-L6-v2. This model is designed as a sentence and short-paragraph encoder, and any input text exceeding 256 word pieces is automatically truncated.
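
For intuition, cosine similarity is the dot product of two embedding vectors divided by the product of their lengths. The sketch below is a plain-Python illustration on toy 3-dimensional vectors, not Sentinel's implementation. Real all-MiniLM-L6-v2 embeddings have 384 dimensions, and the function name `cosine_similarity` is chosen here for illustration only.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between vectors a and b."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction; returning 0.0 is an assumption
        # made for this sketch, not documented Sentinel behavior.
        return 0.0
    return dot / (norm_a * norm_b)


# Toy vectors standing in for sentence embeddings.
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # same direction
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # orthogonal
```

Vectors pointing in the same direction score close to 1, and unrelated (orthogonal) vectors score 0, regardless of vector length.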

While the model is primarily trained on English content, it supports multiple languages with varying levels of precision.

Implementation Guide

Using the Similarity Detection Endpoint

To score a text against ad hoc examples, send a list of examples together with the text to match. In the request below, the first three examples match with a high score (>= 0.7, i.e. 70%), while the last example scores 0.

POST /v1/similarity
Content-Type: application/json

{
  "examples": [
    "Claim your exclusive reward now by clicking the link below!",
    "Get your exclusive prize now by visiting this link!",
    "Don't miss out—claim your unique prize by clicking below!",
    "The weather today is sunny and perfect for a walk in the park."
  ],
  "text": "Claim your exclusive prize now by clicking the link below!"
}
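
The same request could be issued from Python roughly as follows. This is a minimal sketch using only the standard library. The base URL `https://sentinel.example.com` and the helper names are placeholders, and any authentication headers your deployment requires are omitted because they are not covered in this section.

```python
import json
import urllib.request


def build_similarity_request(
    base_url: str, text: str, examples: list[str]
) -> urllib.request.Request:
    """Build (but do not send) a POST /v1/similarity request."""
    body = json.dumps({"examples": examples, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/similarity",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


# Hypothetical base URL; replace it with your own Sentinel instance.
request = build_similarity_request(
    "https://sentinel.example.com",
    text="Claim your exclusive prize now by clicking the link below!",
    examples=[
        "Get your exclusive prize now by visiting this link!",
        "The weather today is sunny and perfect for a walk in the park.",
    ],
)
# result = send(request)  # uncomment to perform the network call
```

Building the request separately from sending it keeps the payload easy to inspect and test without a running server.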

Example response:

{
  "matches": {
    "examples": {
      "matches": [
        {
          "example": "Claim your exclusive reward now by clicking the link below!",
          "score": 0.85
        },
        {
          "example": "Get your exclusive prize now by visiting this link!",
          "score": 0.9
        },
        {
          "example": "Don't miss out—claim your unique prize by clicking below!",
          "score": 0.74
        },
        {
          "example": "The weather today is sunny and perfect for a walk in the park.",
          "score": 0
        }
      ],
      "time": 24.833
    }
  }
}
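
A client will typically want only the matches that clear some cutoff. The sketch below walks the response shape shown above and keeps matches scoring at or above a threshold, best first. The helper name `matches_above` and the default of 0.7 are illustrative choices made here, not part of the API.

```python
def matches_above(response: dict, threshold: float = 0.7) -> list[dict]:
    """Return matches scoring at or above threshold, highest score first."""
    matches = response.get("matches", {}).get("examples", {}).get("matches", [])
    hits = [m for m in matches if m["score"] >= threshold]
    return sorted(hits, key=lambda m: m["score"], reverse=True)


# The example response from above, as a Python dict.
response = {
    "matches": {
        "examples": {
            "matches": [
                {"example": "Claim your exclusive reward now by clicking the link below!", "score": 0.85},
                {"example": "Get your exclusive prize now by visiting this link!", "score": 0.9},
                {"example": "Don't miss out—claim your unique prize by clicking below!", "score": 0.74},
                {"example": "The weather today is sunny and perfect for a walk in the park.", "score": 0},
            ],
            "time": 24.833,
        }
    }
}

for match in matches_above(response):
    print(f'{match["score"]:.2f}  {match["example"]}')
```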

To check against pre-trained data, define your data in the app and submit a groups: string[] parameter containing the names of groups instead of examples.
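
As a rough sketch, a request body for either mode could be assembled like this. The helper name `build_payload` and the group names `spam` and `phishing` are hypothetical, and treating `examples` and `groups` as mutually exclusive is an assumption drawn from the wording "instead of" above; check the API Documentation for the actual rules.

```python
from typing import Optional


def build_payload(
    text: str,
    examples: Optional[list[str]] = None,
    groups: Optional[list[str]] = None,
) -> dict:
    """Build a /v1/similarity body using either inline examples or group names."""
    # Assumption for this sketch: exactly one of the two must be given.
    if (examples is None) == (groups is None):
        raise ValueError("provide exactly one of examples or groups")
    if groups is not None:
        return {"groups": groups, "text": text}
    return {"examples": examples, "text": text}


# Hypothetical group names; use the names defined in your own app.
payload = build_payload(
    "Claim your exclusive prize now by clicking the link below!",
    groups=["spam", "phishing"],
)
print(payload)
```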

For more details, refer to the API Documentation.

Using the Classifier Endpoint

The Classifier endpoint accepts a similarityGroups: string[] property where you can provide the names of groups containing pre-defined training data. When matches meet the configured threshold, the classifier increases the overall score and classifies the input as spam.

For more details, refer to the Classifier Documentation.