Performance-friendly way of blocking Referrer spam

Danila Vershinin

3 years ago

What is referrer spam

Referrer spam typically targets your Google Analytics account, as well as log monitoring software.
Spammers request your website's pages and send the URLs they want to promote in the Referer HTTP header.
Since this traffic is completely useless, you want to block it to reduce both log noise and CPU load.

Where the `map` has gone wrong

Unfortunately, many solutions online direct you to maintain a map with multiple regular expressions, each representing an invalid referring domain name.
The following configuration is a particularly bad example: in the worst case, it runs up to 4 thousand regular expressions for a single request to NGINX:

map $http_referer $bad_referer {
    default 0;
    "~*0\-0\.fr" 1;
    "~*000free\.us" 1;
    "~*00author\.com" 1;
    # 4 thousand regular expressions more below...
}

This is bad because it carries a substantial performance cost and, in particular, reduces the number of requests NGINX can handle concurrently.

Many bloggers have already tested the performance impact of huge maps. See, for example, “Use and performance test of map module in Nginx”.

Some NGINX forum members have also asked about the performance penalty of maps with exact strings versus maps with regular expressions.

One of the NGINX maintainers replied:

The regular expressions tested sequentially.

In other words, an NGINX map with hundreds of regular expressions is bad for you.
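To make the cost of sequential testing concrete, here is a minimal Python sketch. It is not NGINX code; the `patterns_tried` helper and the `spamN.example` patterns are made up for illustration. It counts how many patterns are evaluated before a decision is reached. Note that a legitimate referer, which matches nothing, is the worst case: every pattern is tried.

```python
import re


def patterns_tried(patterns, referer):
    """Return how many patterns are evaluated before the first match.

    If nothing matches, every pattern has been tried.
    """
    for tried, pattern in enumerate(patterns, start=1):
        if pattern.search(referer):
            return tried
    return len(patterns)


# Hypothetical blocklist of 4000 case-insensitive patterns,
# mimicking the "~*domain\.tld" entries of a regex-based map.
patterns = [re.compile(rf"spam{i}\.example", re.IGNORECASE) for i in range(4000)]

# A spam referer listed first is rejected after a single test...
early = patterns_tried(patterns, "https://spam0.example/")
# ...but a legitimate referer is checked against all 4000 patterns.
legit = patterns_tried(patterns, "https://www.example.com/")
```

In this model, ordinary visitors pay the full price on every request, which is exactly the traffic you want to serve quickly.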

Towards a better `map`

To make a map in NGINX faster, we should use as few regular expressions as possible.

In our particular case of blocking referral spam, we can extract the host name from the Referer header once, and then match only against host names using the special hostnames parameter of the map directive.

map $http_referer $http_referer_host {
    "~^(?:https?://)?([^/]+)" $1;
}

The above creates a special $http_referer_host variable, which will contain the domain name found in the Referer header’s URL.

The regular expression ^(?:https?://)?([^/]+) extracts the hostname by taking the part of the URL that follows the protocol, up to the first forward slash, which denotes the beginning of a URI.

For example, for a Referer with the value “https://www.example.com/test”, we get www.example.com.
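The extraction can be tried outside NGINX with a small Python sketch. The pattern is the one from the map above; the `referer_host` helper is a hypothetical name used only for this illustration. Returning an empty string when nothing matches is an assumption that mirrors a map without a default value.

```python
import re

# Same pattern as in the map: optional scheme, then everything up to the first "/".
REFERER_HOST_RE = re.compile(r"^(?:https?://)?([^/]+)")


def referer_host(referer: str) -> str:
    """Return the host part of a Referer value, or "" if nothing matches."""
    match = REFERER_HOST_RE.match(referer)
    return match.group(1) if match else ""


print(referer_host("https://www.example.com/test"))  # www.example.com
```

One thing this sketch makes visible: the capture group takes everything before the first slash, so a Referer such as “http://example.com:8080/page” yields example.com:8080, port included.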

Next, we can build a performance-friendly map and add as many entries as we want without a significant performance penalty:

map $http_referer_host $bad_referer {
    hostnames;

    default           0;

    .bad-one.com      1;
    .another-spam.com 1;
    # etc.
}

The leading dots ensure that we block both bare domains and their subdomains. Since we rely on our $http_referer_host variable, only one regular expression is ever evaluated to detect referrer spam.
That’s 1 regular expression versus up to 4 thousand.
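The effect of the leading dot can be approximated with a short Python sketch. This models the matching behavior only and is not how NGINX implements hostnames lookups; the `is_bad_referer` helper, the `BLOCKED` set, and the lowercasing step are assumptions made for illustration. Each check is a set lookup, so the cost depends on the number of labels in the host, not on the size of the blocklist.

```python
def is_bad_referer(host: str, blocked_domains: set[str]) -> bool:
    """Return True if host equals a blocked domain or is a subdomain of one."""
    labels = host.lower().rstrip(".").split(".")
    # Try the full host first, then each parent domain: a.b.c -> b.c -> c
    return any(".".join(labels[i:]) in blocked_domains for i in range(len(labels)))


# Domains from the map above, written without the leading dot.
BLOCKED = {"bad-one.com", "another-spam.com"}

print(is_bad_referer("bad-one.com", BLOCKED))      # True: bare domain
print(is_bad_referer("www.bad-one.com", BLOCKED))  # True: subdomain
print(is_bad_referer("notbad-one.com", BLOCKED))   # False: different domain
```

Matching happens on whole labels, so a domain that merely ends with the same characters as a blocked one is not caught by accident.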

Install and protect

You can find an implementation of this approach in the dvershinin/referrer-spam-blocker repository on GitHub.

Installation is straightforward:

Drop the file into a directory that NGINX includes, e.g. /etc/nginx/conf.d:

curl -o /etc/nginx/conf.d/referral-spam.conf https://raw.githubusercontent.com/dvershinin/referrer-spam-blocker/main/referral-spam.conf

We must also increase the map hash bucket size so that NGINX can work with larger maps. Create /etc/nginx/conf.d/custom.conf and specify:

map_hash_bucket_size 128;

See why and how NGINX must be tuned for large maps.

Add the following to each /etc/nginx/sites-available/example.com.conf that needs protection:

server {
    if ($bad_referer) {
        return 444;
    }
}

Check NGINX configuration for errors by running nginx -t. If all is good, reload NGINX:

systemctl reload nginx
