Spam Filter API

ALTCHA’s Spam Filter classifies text and related data, such as email addresses and IP addresses, so you can filter out spam and identify legitimate messages. It evaluates a range of factors and returns a numeric score indicating whether a message appears legitimate or is likely spam.

The Spam Filter utilizes natural language processing and machine learning to analyze data quickly and reliably. For optimal results, it is recommended to use languages with full support: Dutch, English, French, German, Italian, Spanish, and Portuguese. While other languages can also be analyzed, some scoring factors may not be available.

Privacy and GDPR compliance are paramount for all ALTCHA services. The Spam Filter respects user privacy and ensures GDPR compliance to protect both you and your customers. Learn more about privacy considerations.

Use Cases

  • Comprehensive anti-spam: Quickly and reliably detect spam submitted through online forms or APIs by analyzing text and validating factors such as email addresses and IP addresses.
  • Email address validation: Detect fake or suspicious email addresses and distinguish between “free” and “work” emails.
  • IP address validation: Identify whether an IP address is associated with a data center, proxy, or TOR exit, and check against blocklists for malicious activity.
  • Security firewall: Protect against common HTML and SQL injection attempts in text, as well as identify known attackers through extensive blocklists.
  • Language detection: Automatically detect up to 160 languages from provided text.
  • Geo-location: Reliably detect user geo-location, commonly spoken languages, currency, and other information from IP addresses or user time zones.
  • Geo-fencing: Effectively block certain countries, regions, or continents from accessing or using your website or APIs.

Authorization

Access to the API requires an API Key. Refer to the API authorization documentation for more information.

Usage with the Widget

If you’re using the ALTCHA widget as Captcha protection, integrating the Spam Filter into your website is simple. The form is classified during ALTCHA’s verification, before the data reaches your server.

To enable the Spam Filter, add the spamfilter attribute to the widget (version 0.3+ required):

<altcha-widget
  challengeurl="https://eu.altcha.org/api/v1/challenge?apiKey=ckey_..."
  spamfilter
></altcha-widget>

For additional information and required server changes, consult the documentation.

Text classification

The Spam Filter API analyzes provided text, searching for common patterns seen in spam. It scores various factors and provides a cumulative score indicating the text’s quality.

It can provide valuable insights into the text:

  • Language detection
  • Overall sentiment evaluation
  • Identification of spam words and profanities
  • URL detection
  • Detection of HTML and harmful JavaScript injections
  • Identification of potential SQL injections

Refer to text rules for more details.

Email verification

The API can verify email addresses, checking their legitimacy. A higher score indicates a suspicious or fake email address.

  • DNS record checking
  • Detection of free-email providers
  • Blocklist checks for known spammers

See email rules for more details.
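
The request format recommends submitting only @<domain> to respect user privacy. A minimal sketch of how a caller might strip the local part before sending is shown here; `toDomainOnly` is a hypothetical helper name, not part of the ALTCHA API.

```javascript
// Hypothetical helper: reduce "user@example.com" to "@example.com".
// Returns null when the input has no usable domain part.
function toDomainOnly(email) {
  const trimmed = email.trim();
  const at = trimmed.lastIndexOf('@');
  // No "@" at all, or nothing after it: there is no domain to send.
  if (at === -1 || at === trimmed.length - 1) return null;
  return trimmed.slice(at).toLowerCase();
}

// Only the domain part ends up in the request payload.
const payload = { email: toDomainOnly('Jane.Doe@Example.com') };
console.log(payload.email);
```

Lowercasing the domain is a convenience in this sketch; domain names are case-insensitive, so it should not change the outcome of DNS-based checks.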

IP address verification

Verifying user IP addresses is crucial for detecting abuse. The API identifies proxies, TOR exits, data centers, and known malicious IP addresses.

  • Geo-location detection
  • Geo-fencing
  • Identification of datacenters, proxies, VPNs, and TOR exits
  • Blocklist check for known malicious actors

See IP address rules for more details.

Time-zone verification

User time-zone detection via the browser provides reliable geo-location information. The API resolves time-zones to specific countries.

  • Geo-location detection
  • Geo-fencing

Refer to time-zone rules for more details.

You can retrieve a user’s IANA time-zone using the following JavaScript code:

const timeZone = Intl.DateTimeFormat().resolvedOptions().timeZone;

API endpoint

API Reference

To classify your data, send a POST request to the /api/v1/classify endpoint:

POST https://eu.altcha.org/api/v1/classify?apiKey=ckey_...
Content-Type: application/json
Referer: https://example.com/

{
  "text": "To spam or not to spam, that is the question."
}

Request

To use the API, POST a JSON-encoded body in the following format:

{
  "email": "@gmail.com",
  "ipAddress": "auto",
  "text": "Your text here...",
  "timeZone": "Europe/London"
}

All request properties are optional:

  • blockedCountries - An array of country codes (ISO 3166-1 alpha-2) that you want to block.
  • classifier - Enforce a specific classifier. Supported classifiers: cs, en, de, es, fr, it, nl, pt.
  • disableRules - An array of rules to disable. E.g. ["text.EMOJI"].
  • email - An email address to verify. To respect user privacy, submit only @<domain>.
  • expectedCountries - An array of country codes (ISO 3166-1 alpha-2) that you’re expecting the user to be from.
  • expectedLanguages - An array of two-letter language codes (ISO 639-1) that you’re expecting the text to be written in.
  • fields - Submit textual fields as a key-value object. Can be used instead of text (the text property takes precedence).
  • ipAddress - The user’s IP address. Use auto to use the caller’s IP. Both IPv4 and IPv6 are supported.
  • text - The text to classify. An array of strings can also be submitted.
  • timeZone - The user’s time-zone in IANA format, provided by the browser.
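
As a rough sketch, the request described above could be assembled and sent from JavaScript as follows. The endpoint URL and property names come from this page; the helper names `buildClassifyRequest` and `classify`, and the placeholder key `ckey_example`, are assumptions made for illustration. The Referer header from the earlier example is not set here, because browsers typically send it on their own; a server-side caller may need to add it explicitly (see the API authorization documentation).

```javascript
// Endpoint taken from the request example on this page.
const CLASSIFY_URL = 'https://eu.altcha.org/api/v1/classify';

// Hypothetical helper: builds the URL and fetch options without sending anything.
function buildClassifyRequest(apiKey, input) {
  return {
    url: `${CLASSIFY_URL}?apiKey=${encodeURIComponent(apiKey)}`,
    options: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(input),
    },
  };
}

// Hypothetical helper: sends the request using the standard fetch API.
async function classify(apiKey, input) {
  const { url, options } = buildClassifyRequest(apiKey, input);
  const response = await fetch(url, options);
  if (!response.ok) {
    throw new Error(`Classification request failed with HTTP ${response.status}`);
  }
  return response.json();
}

// All properties are optional; 'ckey_example' is a placeholder, not a real key.
const request = buildClassifyRequest('ckey_example', {
  email: '@gmail.com',
  ipAddress: 'auto',
  text: 'Your text here...',
  timeZone: Intl.DateTimeFormat().resolvedOptions().timeZone,
  expectedLanguages: ['en'],
});
console.log(request.url);
```

Separating request construction from sending keeps the payload easy to inspect and test without a network call.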

Response

The API responds with a JSON-encoded classification of your data:

{
  "classification": "GOOD",
  "country": {
    "code": "gb",
    "name": "United Kingdom",
    "native": "United Kingdom",
    "phone": [
      44
    ],
    "continent": "eu",
    "capital": "London",
    "currency": [
      "GBP"
    ],
    "languages": [
      "en"
    ]
  },
  "ipAddress": {
    "city": "London",
    "country": "gb",
    "ipAddress": "10.0.0.1",
    "rules": { ... },
    "score": 0.5,
    "zip": null
  },
  "reasons": [
    "ipAddress.PROXY"
  ],
  "score": 0.5,
  "text": {
    "classifier": "en",
    "detectedLanguage": "en",
    "rules": { ... },
    "score": 0
  }
}

The result is determined by properties in the response:

  • classification - Can be GOOD (< 1), NEUTRAL (1…2), or BAD (> 2), indicating overall scoring.
  • score - The overall numeric score. A score > 2 indicates spam.
  • reasons - An array of matching rules, sorted by score.
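
The thresholds above can be written out as a small function, shown here as a sketch. `classificationFromScore` mirrors the documented ranges; `decide` is a hypothetical application-side policy (accept, review, reject) and is not something the API prescribes.

```javascript
// Mirrors the documented ranges: GOOD (< 1), NEUTRAL (1 to 2), BAD (> 2).
function classificationFromScore(score) {
  if (score < 1) return 'GOOD';
  if (score <= 2) return 'NEUTRAL';
  return 'BAD';
}

// Hypothetical policy: what an application might do with a result.
function decide(result) {
  switch (result.classification) {
    case 'GOOD':
      return 'accept';
    case 'BAD':
      return 'reject';
    default:
      // NEUTRAL (or anything unexpected) goes to manual review.
      return 'review';
  }
}

// Trimmed-down version of the sample response shown above.
const sample = { classification: 'GOOD', score: 0.5, reasons: ['ipAddress.PROXY'] };
console.log(classificationFromScore(sample.score), decide(sample));
```

In practice the classification field of the response can be used directly; recomputing it from score is only useful if you want to apply your own thresholds.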

Scoring Rules

The classification API evaluates several scoring rules for each attribute you provide. The individual rules and their scores are returned by the API in the response. The resulting overall score is a sum of all rule scores.
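
To make the summation concrete, here is a worked example using scores documented in the rule sections of this page. The `matchedRules` array is a hypothetical shape chosen for illustration; it is not the format of the rules objects in the API response, and the rule identifiers simply follow the `category.RULE` pattern seen in reasons and disableRules.

```javascript
// Hypothetical message: two exclamation marks, one URL, sent via a proxy.
const matchedRules = [
  { rule: 'text.EXCLAMATION', score: 2 * 0.25 }, // n × 0.25 with n = 2
  { rule: 'text.URL', score: 1 * 0.5 },          // n × 0.5 with n = 1
  { rule: 'ipAddress.PROXY', score: 0.5 },       // 0 | 0.5
];

// The overall score is the sum of the individual rule scores.
const totalScore = matchedRules.reduce((sum, r) => sum + r.score, 0);
console.log(totalScore); // 1.5
```

A total of 1.5 falls in the NEUTRAL range (1 to 2) described in the Response section.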

Scoring rules fall into four categories, depending on the input provided: text, email, IP address, and time-zone.

Text Rules

The text is analyzed with the following rules using natural language processing and machine learning. These rules are designed to detect common patterns used in unsolicited messages, such as spam and promotion, but also detect profanities and harmful content.

CAPITALIZATION

This rule finds CAPITALIZED words in the text. Capitalization of text suggests an unsolicited message.

  • Significance: low
  • Score: n × 0.25 where n is the number of occurrences.

CURRENCY

This rule finds all tokens matching common price or currency formats. Prices in the text indicate a commercial offer.

  • Significance: low
  • Score: n × 0.25 where n is the number of occurrences.

EMOJI

This rule finds all emoji characters. An excessive use of emoji is considered detrimental.

  • Significance: low
  • Score: n × 0.25 where n is the number of occurrences.

EXCLAMATION

This rule finds all exclamation characters. An excessive use of exclamation is considered detrimental.

  • Significance: low
  • Score: n × 0.25 where n is the number of occurrences.

HASH_TAGS

This rule finds all #hash-tags. An excessive use of hash-tags is considered detrimental.

  • Significance: low
  • Score: n × 0.25 where n is the number of occurrences.

HTML

This rule finds all HTML tags. The use of HTML is considered detrimental.

  • Significance: medium
  • Score: n × 1 where n is the number of occurrences.

HTML_INJECTION

This rule finds all harmful HTML tags, such as <script>, <style> and <iframe>, which indicate a malicious attempt.

  • Significance: high
  • Score: n × 5 where n is the number of occurrences.

NUMBERS_ONLY

This rule matches if the whole text consists only of numbers, which suggests random key-strokes.

  • Significance: medium
  • Score: 0 | 2

PROFANITY

This rule finds commonly used profanities in the text.

  • Significance: high
  • Score: n × x where n is the number of occurrences, x is a varying word score.

RANDOM_CHARS

This rule finds character sequences that seem to fit random key-strokes.

  • Significance: medium
  • Score: n × 1 where n is the number of occurrences.

SENTIMENT

This rule evaluates the overall sentiment of the text. Bad or harmful sentiment increases the score.

  • Significance: medium
  • Score: 0 | 1

SHORT_TEXT

This rule matches if the text is too short, below 40 characters.

  • Significance: medium
  • Score: 0 | 1

SPAM_WORDS

This rule finds commonly used spam words in the text.

  • Significance: medium
  • Score: n × x where n is the number of occurrences, x is a varying word score.

SPECIAL_CHARS

This rule finds non-alphanumeric sequences longer than 5 characters.

  • Significance: medium
  • Score: n × 1 where n is the number of occurrences.

SQL_INJECTION

This rule finds potential SQL injection attempts, such as 1; drop table ....

  • Significance: high
  • Score: n × 5 where n is the number of occurrences.

UNEXPECTED_LANGUAGE

This rule matches if the detected language does not match expectedLanguages.

  • Significance: high
  • Score: 0 | 5

UNKNOWN_LANGUAGE

This rule matches if the language cannot be detected from the text.

  • Significance: medium
  • Score: 0 | 1

URL

This rule finds URLs in the text. An excessive use of URLs is considered detrimental.

  • Significance: low
  • Score: n × 0.5 where n is the number of occurrences.
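
For intuition only, the sketch below approximates three of the simpler text rules locally. The scores (0.25 per exclamation mark, 2 for numbers-only text, 1 for text under 40 characters) come from this page; the matching logic itself is an assumption, since the API's exact tokenization and whitespace handling are not described here. `approximateTextScore` is a hypothetical name.

```javascript
// Rough local approximation of EXCLAMATION, NUMBERS_ONLY, and SHORT_TEXT.
// This is not the API's implementation; real results may differ.
function approximateTextScore(text) {
  const exclamations = (text.match(/!/g) || []).length;
  const scores = {
    EXCLAMATION: exclamations * 0.25,                // n × 0.25
    NUMBERS_ONLY: /^\d+$/.test(text.trim()) ? 2 : 0, // 0 | 2
    SHORT_TEXT: text.length < 40 ? 1 : 0,            // 0 | 1
  };
  const total = Object.values(scores).reduce((sum, s) => sum + s, 0);
  return { scores, total };
}

console.log(approximateTextScore('Buy now!!!!'));
```

For the input 'Buy now!!!!', this approximation counts four exclamation marks (4 × 0.25 = 1) plus the short-text score of 1, for a total of 2.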

Email Rules

If you provide an email address to the classifier API, it will be analyzed with the following rules designed to validate the address. It can tell you whether the email is a “free email” such as Gmail, or whether it can actually receive messages.

FREE_PROVIDER

This rule matches if the domain name of the email address is recognized as a known free-email provider such as Gmail. A score of 0 indicates a “work” email with a custom domain name, and a score of 0.5 indicates a free email provider from a list of the most popular “trusted providers”.

  • Significance: low
  • Score: 0 | 0.5 | 1

DMARC

This rule checks the DNS for a TXT record under the _dmarc. subdomain and matches if the record is not configured. A missing DMARC record indicates that the domain is poorly configured.

  • Significance: low
  • Score: 0 | 0.5

MX

This rule checks the DNS for an MX record and matches if the record is not configured. A missing MX record indicates that the email address is not valid because email cannot be delivered.

  • Significance: high
  • Score: 0 | 5

REPORTED

This rule matches if the email address is found in one of the block-lists of known forum spammers.

  • Significance: high
  • Score: 0 | 5

INVALID

This rule matches if the format of the email address is invalid, such as an invalid domain name.

  • Significance: high
  • Score: 0 | 5

IP Address Rules

If you provide an ipAddress to the classifier API, it will be analyzed with the following rules designed to evaluate how harmful the actor is. It will tell you whether the user is using a proxy server or TOR, whether the IP address is located in a datacenter, or whether it is a known malicious IP address. You can use the IP evaluation for geo-blocking.

BLOCKED_COUNTRY

This rule matches if the detected geo-location matches blockedCountries.

  • Significance: high
  • Score: 0 | 5

HOSTING

This rule matches if the IP address is known to be located in a datacenter.

  • Significance: medium
  • Score: 0 | 2

MALICIOUS

This rule matches if the IP address is found in one of the block-lists of known malicious actors.

  • Significance: high
  • Score: 0 | 5

PROXY

This rule matches if the IP address is known to be a proxy server such as a VPN.

  • Significance: low
  • Score: 0 | 0.5

TOR

This rule matches if the IP address is known to be a TOR exit.

  • Significance: medium
  • Score: 0 | 1

UNEXPECTED_COUNTRY

This rule matches if the detected geo-location does not match expectedCountries.

  • Significance: medium
  • Score: 0 | 1

Time-zone Rules

The user’s time-zone (provided by the browser) is used to determine the user’s geo-location. This is often more accurate than IP-based detection, because IP datasets can be inaccurate and users may be behind proxies.

BLOCKED_COUNTRY

This rule matches if the detected geo-location matches blockedCountries.

  • Significance: high
  • Score: 0 | 5

UNEXPECTED_COUNTRY

This rule matches if the detected geo-location does not match expectedCountries.

  • Significance: medium
  • Score: 0 | 1
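
As a sketch of geo-fencing with these rules, a caller could pass blockedCountries in the request and then look for BLOCKED_COUNTRY in the returned reasons. The helper `matchedRule` is hypothetical, and it compares only the part after the dot so that it works for both IP address and time-zone rules without assuming a category prefix. The country codes are written in lowercase to match the sample response; whether uppercase codes are also accepted is not stated on this page. The `blockedResult` object is made-up data for illustration, not real API output.

```javascript
// Request payload combining IP and time-zone checks with a country block list.
// 'xx' is a placeholder, not a real country code.
const geoFenceRequest = {
  ipAddress: 'auto',
  timeZone: 'Europe/London',
  blockedCountries: ['xx'],
  expectedCountries: ['gb'],
};

// Hypothetical helper: true if any reason ends with the given rule name,
// e.g. "ipAddress.BLOCKED_COUNTRY" matches "BLOCKED_COUNTRY".
function matchedRule(result, ruleName) {
  return (result.reasons || []).some(
    (reason) => reason.split('.').pop() === ruleName
  );
}

// Made-up result for illustration only.
const blockedResult = {
  classification: 'BAD',
  score: 5,
  reasons: ['ipAddress.BLOCKED_COUNTRY'],
};

console.log(matchedRule(blockedResult, 'BLOCKED_COUNTRY')); // true
console.log(matchedRule(blockedResult, 'UNEXPECTED_COUNTRY')); // false
```

Because BLOCKED_COUNTRY scores 5 on its own, a match already pushes the overall score past the BAD threshold; checking reasons explicitly is mainly useful when you want to show a specific message or log why a request was rejected.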