Spam Filter API

ALTCHA’s Spam Filter classifies text and related data, such as email addresses and IP addresses, so that you can filter out spam and identify legitimate messages. It evaluates various factors and returns a numeric score indicating whether a message appears legitimate or is likely spam.

The Spam Filter uses natural language processing and machine learning to analyze data quickly and reliably. For best results, use one of the fully supported languages. Other languages can also be analyzed, but some scoring factors may not be available.

Privacy and GDPR compliance are paramount for all ALTCHA services. The Spam Filter respects user privacy and ensures GDPR compliance to protect both you and your customers. Learn more about privacy considerations.

Use Cases

Comprehensive anti-spam: Quickly and reliably detect spam submitted through online forms or APIs by analyzing text and validating factors such as email addresses and IP addresses.
Email address validation: Detect fake or suspicious email addresses and distinguish between “free” and “work” emails.
IP address validation: Identify whether an IP address is associated with a data center, proxy, or TOR exit, and check against blocklists for malicious activity.
Security firewall: Protect against common HTML and SQL injection attempts in text, as well as identify known attackers through extensive blocklists.
Language detection: Automatically detect up to 160 languages from provided text.
Geo-location: Reliably detect user geo-location, commonly spoken languages, currency, and other information from IP addresses or user time zones.
Geo-fencing: Effectively block certain countries, regions, or continents from accessing or using your website or APIs.

Authorization

Access to the API requires an API Key. Refer to the API authorization documentation for more information.

Usage with the Widget

If you’re using the ALTCHA widget for Captcha protection, you can integrate the Spam Filter directly into your website. The form is classified during ALTCHA’s verification, before the data reaches your server.

To utilize the Spam Filter, add the spamfilter attribute to the widget (version 0.3+ required):

<altcha-widget
  challengeurl="https://eu.altcha.org/api/v1/challenge?apiKey=ckey_..."
  spamfilter
></altcha-widget>

For additional information and required server changes, consult the documentation.

Modes of Operation

The Spam Filter offers several advanced features for spam detection. Depending on your use case and target audience, some features, such as text field classification, can be privacy-invasive. Fortunately, you can easily configure the behavior of the Spam Filter and set the verification mode.

Default Mode

In the default mode, the Spam Filter performs:

Text classification on all text fields in the form
Email address verification
IP verification
Language verification

IP Address Mode

Set spamfilter="ipAddress" to verify only the IP address and the user’s time zone. This mode does not submit text fields or email addresses, making it a more privacy-friendly option that avoids sending any personally identifiable information.

<altcha-widget
  challengeurl="https://eu.altcha.org/api/v1/challenge?apiKey=ckey_..."
  spamfilter="ipAddress"
></altcha-widget>

While the IP Address Mode cannot detect human-generated spam, it effectively identifies bots through comprehensive IP address checking.

Custom Modes

You can further customize the Spam Filter’s behavior using programmatic configuration. Provide spamfilter as an object with individual settings tailored to your needs.

Text classification

The Spam Filter API analyzes provided text, searching for common patterns seen in spam. It scores various factors and provides a cumulative score indicating the text’s quality.

It can provide valuable insights into the text:

Language detection
Overall sentiment evaluation
Identification of spam words and profanities
URL detection
Detection of HTML and harmful JavaScript injections
Identification of potential SQL injections

Refer to text rules for more details.

Language Support

The Spam Filter currently supports text classification in the languages listed below. For texts in languages not included in this list, the default English classifier will be used. This provides base-level functionality for spam detection, even for unsupported languages.

Bulgarian
Czech
Danish
Dutch
English
Finnish
French
German
Greek
Hungarian
Italian
Norwegian
Polish
Portuguese
Romanian
Russian
Slovak
Spanish
Swedish

Email verification

The API can verify email addresses, checking their legitimacy. A higher score indicates a suspicious or fake email address.

DNS record checking
Detection of free-email providers
Blocklist checks for known spammers

See email rules for more details.

IP address verification

Verifying user IP addresses is crucial for detecting abuse. The API identifies proxies, TOR exits, data centers, and known malicious IP addresses.

Geo-location detection
Geo-fencing
Identification of datacenters, proxies, VPNs, and TOR exits
Blocklist check for known malicious actors

See IP address rules for more details.

Time-zone verification

User time-zone detection via the browser provides reliable geo-location information. The API resolves time-zones to specific countries.

Geo-location detection
Geo-fencing

Refer to time-zone rules for more details.

You can retrieve a user’s IANA time-zone using the following JavaScript code:

const timeZone = Intl.DateTimeFormat().resolvedOptions().timeZone;

API endpoint

To classify your data, utilize the /api/v1/classify endpoint:

POST https://eu.altcha.org/api/v1/classify?apiKey=ckey_...
Content-Type: application/json
Referer: https://example.com/

{
  "text": "To spam or not to spam, that is the question."
}
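
The request above could be sent from JavaScript roughly as follows. This is a minimal sketch, not an official client: the function names `buildClassifyRequest` and `classify` are hypothetical, and the error handling assumes that a non-2xx status means the request failed.

```javascript
// Endpoint taken from the request shown above.
const CLASSIFY_URL = "https://eu.altcha.org/api/v1/classify";

// Build the URL and fetch options without sending anything (hypothetical helper).
function buildClassifyRequest(apiKey, payload, referer) {
  const headers = { "Content-Type": "application/json" };
  // Browsers set Referer themselves and drop a manually supplied value;
  // server-side runtimes may need it set explicitly (assumption).
  if (referer) headers["Referer"] = referer;
  return {
    url: `${CLASSIFY_URL}?apiKey=${encodeURIComponent(apiKey)}`,
    options: { method: "POST", headers, body: JSON.stringify(payload) },
  };
}

// Send the request and parse the JSON response (hypothetical helper).
async function classify(apiKey, payload, referer) {
  const { url, options } = buildClassifyRequest(apiKey, payload, referer);
  const response = await fetch(url, options);
  if (!response.ok) {
    throw new Error(`Classification request failed: HTTP ${response.status}`);
  }
  return response.json();
}
```

Keeping the request construction separate from the network call makes it possible to inspect or log the outgoing request before anything is sent.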

Request

To use the API, POST a JSON-encoded body in the following format:

{
  "email": "@gmail.com",
  "ipAddress": "auto",
  "text": "Your text here...",
  "timeZone": "Europe/London"
}

All request properties are optional:

blockedCountries - An array of country codes (ISO 3166 alpha-2) that you want to block.
classifier - Enforce a specific classifier. Supported classifiers: cs, en, de, es, fr, it, nl, pt.
disableRules - An array of rules to disable. E.g. ["text.EMOJI"].
email - An email address to verify. To respect user privacy, submit only @<domain>.
expectedCountries - An array of country codes (ISO 3166 alpha-2) that you’re expecting the user to be from.
expectedLanguages - An array of two-letter language codes (ISO 639-1) that you’re expecting the text to be written in.
fields - Submit textual fields as a key-value object. Can be used instead of text (the text property takes precedence).
ipAddress - The user’s IP address. Use auto to use the caller’s IP. Both IPv4 and IPv6 are supported.
text - The text to classify. An array of strings can also be submitted.
timeZone - The user’s time-zone in IANA format, provided by the browser.
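
As an illustration of the properties above, a request body might be assembled like this. The sketch assumes a browser or Node.js environment; the helpers `emailDomainOnly` and `buildClassifyBody` are hypothetical names, and defaulting `ipAddress` to `auto` is a choice made for this example.

```javascript
// Reduce an email address to "@<domain>" so the local part is never sent.
function emailDomainOnly(email) {
  const at = email.lastIndexOf("@");
  return at === -1 ? null : email.slice(at);
}

// Assemble a request body, leaving out any property that was not provided.
function buildClassifyBody({ email, text, timeZone, expectedLanguages } = {}) {
  const body = { ipAddress: "auto" }; // let the API use the caller's IP
  const domain = email ? emailDomainOnly(email) : null;
  if (domain) body.email = domain;
  if (text) body.text = text;
  if (timeZone) body.timeZone = timeZone;
  if (expectedLanguages) body.expectedLanguages = expectedLanguages;
  return body;
}

// Example: the browser-reported IANA time-zone is passed through unchanged.
const exampleBody = buildClassifyBody({
  email: "jane@example.com",
  text: "Your text here...",
  timeZone: Intl.DateTimeFormat().resolvedOptions().timeZone,
  expectedLanguages: ["en"],
});
```

Stripping the local part of the email address before sending follows the privacy recommendation given for the email property.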

Response

The API responds with a JSON-encoded classification of your data:

{
  "classification": "GOOD",
  "country": {
    "code": "gb",
    "name": "United Kingdom",
    "native": "United Kingdom",
    "phone": [
      44
    ],
    "continent": "eu",
    "capital": "London",
    "currency": [
      "GBP"
    ],
    "languages": [
      "en"
    ]
  },
  "ipAddress": {
    "city": "London",
    "country": "gb",
    "ipAddress": "10.0.0.1",
    "rules": { ... },
    "score": 0.5,
    "zip": null
  },
  "reasons": [
    "ipAddress.PROXY"
  ],
  "score": 0.5,
  "text": {
    "classifier": "en",
    "detectedLanguage": "en",
    "rules": { ... },
    "score": 0
  }
}

The most important properties of the response are:

classification - Can be GOOD (< 1), NEUTRAL (1…2), or BAD (> 2), indicating overall scoring.
score - The overall numeric score. A score > 2 indicates spam.
reasons - An array of matching rules, sorted by score.
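
The thresholds above can be expressed as a small function. This is a sketch only: `classificationForScore` and `decide` are hypothetical names, and mapping NEUTRAL to manual review is an assumption about how an application might react, not something the API prescribes.

```javascript
// Map a numeric score to the documented classification bands:
// GOOD below 1, NEUTRAL from 1 to 2 inclusive, BAD above 2.
function classificationForScore(score) {
  if (score < 1) return "GOOD";
  if (score <= 2) return "NEUTRAL";
  return "BAD";
}

// Turn an API response into an application-level decision (assumed policy).
function decide(result) {
  switch (result.classification) {
    case "GOOD":
      return "accept";
    case "NEUTRAL":
      return "review";
    default:
      return "reject";
  }
}
```

In practice the `classification` property returned by the API can be used directly; the first function only makes the score boundaries explicit.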

Scoring Rules

The classification API evaluates several scoring rules for each attribute you provide. The individual rules and their scores are returned by the API in the response. The resulting overall score is a sum of all rule scores.

There are 4 distinct categories of scoring rules, based on the input provided:

Text Rules
Email Rules
IP Address Rules
Time-zone Rules
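
To illustrate how the overall score adds up, the sketch below sums the per-category `score` values of a response. The example response only shows the `text` and `ipAddress` categories, so the `email` and `timeZone` keys used here are assumptions, as is the helper name `sumCategoryScores`.

```javascript
// Category keys: "text" and "ipAddress" appear in the example response;
// "email" and "timeZone" are assumed to follow the same shape.
const CATEGORIES = ["text", "email", "ipAddress", "timeZone"];

// Add up the score of every category present in the response.
function sumCategoryScores(result) {
  return CATEGORIES.reduce((total, key) => {
    const category = result[key];
    const score = category && typeof category.score === "number" ? category.score : 0;
    return total + score;
  }, 0);
}

// Scores taken from the example response: ipAddress 0.5 plus text 0.
const total = sumCategoryScores({ ipAddress: { score: 0.5 }, text: { score: 0 } });
```

With the values from the example response, `total` is 0.5, which matches the overall `score` shown there.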

Text Rules

The text is analyzed with the following rules using natural language processing and machine learning. These rules are designed to detect common patterns used in unsolicited messages, such as spam and promotion, but also detect profanities and harmful content.

`CAPITALIZATION`

This rule finds CAPITALIZED words in the text. Capitalization of text suggests an unsolicited message.

Significance: low
Score: n × 0.25 where n is the number of occurrences.

`CURRENCY`

This rule finds all tokens matching common price or currency formats. Prices in the text indicate a commercial offer.

Significance: low
Score: n × 0.25 where n is the number of occurrences.

`EMOJI`

This rule finds all emoji characters. An excessive use of emoji is considered detrimental.

Significance: low
Score: n × 0.25 where n is the number of occurrences.

`EXCLAMATION`

This rule finds all exclamation characters. An excessive use of exclamation is considered detrimental.

Significance: low
Score: n × 0.25 where n is the number of occurrences.

`HASH_TAGS`

This rule finds all #hash-tags. An excessive use of hash-tags is considered detrimental.

Significance: low
Score: n × 0.25 where n is the number of occurrences.

`HTML`

This rule finds all HTML tags. The use of HTML is considered detrimental.

Significance: medium
Score: n × 1 where n is the number of occurrences.

`HTML_INJECTION`

This rule finds all harmful HTML tags, such as <script>, <style> and <iframe>, which indicate a malicious attempt.

Significance: high
Score: n × 5 where n is the number of occurrences.

`NUMBERS_ONLY`

This rule matches if the whole text consists only of numbers, which suggests random key-strokes.

Significance: medium
Score: 0 | 2

`PROFANITY`

This rule finds commonly used profanities in the text.

Significance: high
Score: n × x where n is the number of occurrences, x is a varying word score.

`RANDOM_CHARS`

This rule finds character sequences that seem to fit random key-strokes.

Significance: medium
Score: n × 1 where n is the number of occurrences.

`SENTIMENT`

This rule evaluates the overall sentiment of the text. Bad or harmful sentiment increases the score.

Significance: medium
Score: 0 | 1

`SHORT_TEXT`

This rule matches if the text is shorter than 40 characters.

Significance: medium
Score: 0 | 1

`SPAM_WORDS`

This rule finds commonly used spam words in the text.

Significance: medium
Score: n × x where n is the number of occurrences, x is a varying word score.

`SPECIAL_CHARS`

This rule finds non-alphanumeric sequences longer than 5 characters.

Significance: medium
Score: n × 1 where n is the number of occurrences.

`SQL_INJECTION`

This rule finds potential SQL injection attempts, such as 1; drop table ....

Significance: high
Score: n × 5 where n is the number of occurrences.

`UNEXPECTED_LANGUAGE`

This rule matches if the detected language does not match expectedLanguages.

Significance: high
Score: 0 | 5

`UNKNOWN_LANGUAGE`

This rule matches if the language cannot be detected from the text.

Significance: medium
Score: 0 | 1

`URL`

This rule finds URL addresses in the text. An excessive use of URLs is considered detrimental.

Significance: low
Score: n × 0.5 where n is the number of occurrences.
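
As a worked example of the per-occurrence text rules, the sketch below multiplies occurrence counts by the weights listed above. It covers only rules with a fixed weight (PROFANITY and SPAM_WORDS use varying word scores, and the 0-or-fixed rules are left out), and it takes the counts as input because the service's actual matching logic is not described here. The names `TEXT_RULE_WEIGHTS` and `estimateTextScore` are hypothetical.

```javascript
// Per-occurrence weights copied from the text rules listed above.
const TEXT_RULE_WEIGHTS = {
  CAPITALIZATION: 0.25,
  CURRENCY: 0.25,
  EMOJI: 0.25,
  EXCLAMATION: 0.25,
  HASH_TAGS: 0.25,
  URL: 0.5,
  HTML: 1,
  RANDOM_CHARS: 1,
  SPECIAL_CHARS: 1,
  HTML_INJECTION: 5,
  SQL_INJECTION: 5,
};

// Sum n × weight over the supplied occurrence counts.
function estimateTextScore(occurrences) {
  return Object.entries(occurrences).reduce((total, [rule, n]) => {
    const weight = TEXT_RULE_WEIGHTS[rule];
    if (weight === undefined) throw new Error(`No fixed weight for rule: ${rule}`);
    return total + n * weight;
  }, 0);
}

// 4 exclamation marks, 2 URLs and 1 <script> tag:
// 4 × 0.25 + 2 × 0.5 + 1 × 5 = 7
const estimate = estimateTextScore({ EXCLAMATION: 4, URL: 2, HTML_INJECTION: 1 });
```

An estimate of 7 is well above 2, so such a text would fall into the BAD classification on these rules alone.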

Email Rules

If you provide an email address to the classifier API, it will be analyzed with the following rules designed to validate the address. It can tell you whether the email is a “free email” such as Gmail, or whether it can actually receive messages.

`FREE_PROVIDER`

This rule matches if the domain name of the email address is recognized as a known free-email provider such as Gmail. A score of 0 indicates a “work” email with a custom domain name, and a score of 0.5 indicates a free email provider from a list of the most popular “trusted providers”.

Significance: low
Score: 0 | 0.5 | 1

`DMARC`

This rule checks the DNS for a _dmarc. record and matches if the record is not configured. A missing DMARC record indicates that the domain is poorly configured.

Significance: low
Score: 0 | 0.5

`MX`

This rule checks the DNS for an MX record and matches if the record is not configured. A missing MX record indicates that the email address is not valid because email cannot be delivered.

Significance: high
Score: 0 | 5

`REPORTED`

This rule matches if the email address is found in one of the block-lists of known forum spammers.

Significance: high
Score: 0 | 5

`INVALID`

This rule matches if the format of the email address is invalid, such as an invalid domain name.

Significance: high
Score: 0 | 5

IP Address Rules

If you provide an ipAddress to the classifier API, it will be analyzed with the following rules designed to evaluate how harmful the actor is. It will tell you whether the user is using a proxy server or TOR, whether the IP address is located in a datacenter, or whether it is a known malicious IP address. You can use the IP evaluation for geo-blocking.

`BLOCKED_COUNTRY`

This rule matches if the detected geo-location matches blockedCountries.

Significance: high
Score: 0 | 5

`HOSTING`

This rule matches if the IP address is known to be located in a datacenter.

Significance: medium
Score: 0 | 2

`MALICIOUS`

This rule matches if the IP address is found in one of the block-lists of known malicious actors.

Significance: high
Score: 0 | 5

`PROXY`

This rule matches if the IP address is known to be a proxy server such as a VPN.

Significance: low
Score: 0 | 0.5

`TOR`

This rule matches if the IP address is known to be a TOR exit.

Significance: medium
Score: 0 | 1

`UNEXPECTED_COUNTRY`

This rule matches if the detected geo-location does not match expectedCountries.

Significance: medium
Score: 0 | 1
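
The IP address rules can be picked out of a response by inspecting the `reasons` array. The sketch below assumes that every IP rule is reported with the `ipAddress.` prefix, as `ipAddress.PROXY` is in the example response. The helper names and the choice to treat only high-significance rules as blocking are illustrative assumptions.

```javascript
const IP_PREFIX = "ipAddress.";

// Return the IP rule names found in the response's reasons, without the prefix.
function ipReasons(result) {
  return (result.reasons || [])
    .filter((reason) => reason.startsWith(IP_PREFIX))
    .map((reason) => reason.slice(IP_PREFIX.length));
}

// IP rules documented above with high significance (score 5).
const HIGH_SIGNIFICANCE_IP_RULES = new Set(["BLOCKED_COUNTRY", "MALICIOUS"]);

// True if any high-significance IP rule matched.
function hasHighSignificanceIpMatch(result) {
  return ipReasons(result).some((rule) => HIGH_SIGNIFICANCE_IP_RULES.has(rule));
}
```

A check like this lets an application block geo-fenced or known malicious addresses outright while treating a proxy or TOR match as a softer signal.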

Time-zone Rules

The user’s time-zone (provided by the browser) is evaluated to determine the user’s geo-location. This is often more accurate than IP-based geo-location, because IP datasets can be inaccurate and users may connect through proxies.

`BLOCKED_COUNTRY`

This rule matches if the detected geo-location matches blockedCountries.

Significance: high
Score: 0 | 5

`UNEXPECTED_COUNTRY`

This rule matches if the detected geo-location does not match expectedCountries.

Significance: medium
Score: 0 | 1
