What is referrer spam?
Referrer spam is typically used to spam your Google Analytics account, as well as log monitoring software.
It is done by loading your website's pages while sending promoted URLs in the Referer HTTP header.
Since this traffic is completely useless, you want to block it in order to reduce log noise as well as CPU load.
Where the map has gone wrong
Unfortunately, many solutions online will direct you to maintain a map with multiple regular expressions, each representing an invalid referring domain name.
For a particularly bad example, see this configuration, which can run up to 4 thousand regular expressions on a single request to NGINX:
map $http_referer $bad_referer {
    default 0;
    "~*0\-0\.fr" 1;
    "~*000free\.us" 1;
    "~*00author\.com" 1;
    # 4 thousand regular expressions more below...
}
This is bad because it carries a substantial performance cost, in particular reducing the number of requests NGINX can handle concurrently.
Several bloggers have already measured the performance impact of huge maps. See, for example, “Use and performance test of map module in Nginx”.
Some NGINX forum members have also asked about the performance penalty of maps with exact strings versus maps with regular expressions.
One of the NGINX maintainers replied:
The regular expressions tested sequentially.
In other words, an NGINX map with hundreds of regular expressions is bad for you.
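To make the cost of sequential testing concrete, here is a rough Python model of a regex-only map. It is only an illustration of the "tested sequentially" behavior quoted above, not NGINX's actual implementation; the `REGEX_MAP` list and the `lookup` helper are hypothetical names invented for this sketch.

```python
import re

# Hypothetical stand-in for a regex-based map: (pattern, value) pairs in file order.
# re.IGNORECASE mirrors the case-insensitive "~*" prefix used in the map above.
REGEX_MAP = [
    (re.compile(r"0\-0\.fr", re.IGNORECASE), "1"),
    (re.compile(r"000free\.us", re.IGNORECASE), "1"),
    (re.compile(r"00author\.com", re.IGNORECASE), "1"),
]


def lookup(referer):
    """Return (value, regexes_tried), testing the patterns one after another."""
    tried = 0
    for pattern, value in REGEX_MAP:
        tried += 1
        if pattern.search(referer):
            return value, tried
    # No pattern matched: fall back to the map's default value.
    return "0", tried


print(lookup("http://000free.us/promo"))   # ('1', 2)
print(lookup("https://www.example.com/"))  # ('0', 3)
```

Note that the legitimate referrer is the expensive case: it matches nothing, so every pattern in the list gets tested before the default is returned.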
Towards a better map
To make a map in NGINX faster, we should use as few regular expressions as possible.
In our particular case of blocking referral spam, we can extract the domain from the Referer header once, and then match only against hostnames using the special hostnames parameter of the map directive.
map $http_referer $http_referer_host {
    "~^(?:https?://)?([^/]+)" $1;
}
The above creates a special $http_referer_host variable, which will contain the domain name found in the Referer header's URL.
The regular expression ^(?:https?://)?([^/]+) extracts the hostname by taking the part of the URL that follows the protocol, up to the first forward slash, which marks the beginning of the URI path.
For example, a Referer value of “https://www.example.com/test” yields www.example.com.
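If you want to see what this pattern captures before putting it into production, the same expression can be tried in Python, whose re module accepts this syntax unchanged. This is just a quick sketch: the `referer_host` helper is a hypothetical name, and the empty-string fallback mirrors the fact that a map without a default entry yields an empty value.

```python
import re

# The same pattern as in the map above (without the leading "~" that marks a regex in NGINX).
HOST_RE = re.compile(r"^(?:https?://)?([^/]+)")


def referer_host(referer):
    """Return the first capture group, or an empty string when nothing matches."""
    match = HOST_RE.match(referer)
    return match.group(1) if match else ""


print(referer_host("https://www.example.com/test"))   # www.example.com
print(referer_host("http://bad-one.com"))             # bad-one.com
# A port, if present, is captured too, since only "/" ends the capture group:
print(referer_host("https://www.example.com:8080/x")) # www.example.com:8080
print(repr(referer_host("")))                         # ''
```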
Next up, we can build out a performance-friendly map, and put as many entries as we want without significant performance penalties:
map $http_referer_host $bad_referer {
    hostnames;
    default 0;
    .bad-one.com 1;
    .another-spam.com 1;
    # etc.
}
The leading dots ensure we block both bare domains and their subdomains, and since we rely on our $http_referer_host variable, only one regular expression is ever evaluated per request in order to reduce spam.
That’s 1 regular expression against up to 4 thousand.
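The effect of the leading-dot entries can be modeled in a few lines of Python. This is a simplified sketch of the observable matching behavior only, assuming a plain set of blocked domains; NGINX's own hash-based lookup is implemented differently, and `BLOCKED` and `is_bad_referer_host` are hypothetical names.

```python
# Entries corresponding to ".bad-one.com" and ".another-spam.com" in the map above.
BLOCKED = {"bad-one.com", "another-spam.com"}


def is_bad_referer_host(host):
    """True if the host is a blocked domain or any subdomain of one."""
    # Lowercased on the understanding that map string matching ignores case.
    labels = host.lower().split(".")
    # Check the full host, then each parent domain, using hash lookups only.
    for i in range(len(labels)):
        if ".".join(labels[i:]) in BLOCKED:
            return True
    return False


print(is_bad_referer_host("bad-one.com"))      # True
print(is_bad_referer_host("www.bad-one.com"))  # True
print(is_bad_referer_host("notbad-one.com"))   # False
print(is_bad_referer_host("www.example.com"))  # False
```

The number of lookups here depends on how many labels the hostname has, not on how many domains are in the blocklist, which is why the list can grow without a matching slowdown.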
Install and protect
You can find the implementation of this approach at the following GitHub repository.
Installation instructions are pretty straightforward:
Drop the file into an included directory, e.g. /etc/nginx/conf.d:
curl -o /etc/nginx/conf.d/referral-spam.conf https://raw.githubusercontent.com/dvershinin/referrer-spam-blocker/main/referral-spam.conf
We must also increase the map hash bucket size so that NGINX can work with larger maps. Create /etc/nginx/conf.d/custom.conf and specify:
map_hash_bucket_size 128;
See why and how NGINX must be tuned for large maps.
Add the following to each site configuration that needs protection, e.g. /etc/nginx/sites-available/example.com.conf:
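server {
    # ... your existing directives for this site ...

    # 444 makes NGINX close the connection without sending a response.
    if ($bad_referer) {
        return 444;
    }
}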
Check the NGINX configuration for errors by running nginx -t. If all is good, reload NGINX:
systemctl reload nginx
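After the reload, you may want to confirm that a spammy referrer is actually rejected. Since return 444 closes the connection without sending any response, a client sees a dropped connection rather than an HTTP status. The following is a minimal sketch using Python's standard http.client; the `probe` helper is a hypothetical name, and the URL and referrer you pass to it are yours to choose.

```python
import http.client
from urllib.parse import urlsplit


def probe(url, referer, timeout=5.0):
    """Send a GET with the given Referer header.

    Returns the HTTP status code, or None if the server closed the
    connection without replying (which is what "return 444" does).
    """
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        conn.request("GET", parts.path or "/", headers={"Referer": referer})
        return conn.getresponse().status
    except ConnectionError:
        # Covers http.client.RemoteDisconnected and connection resets.
        return None
    finally:
        conn.close()
```

For example, with .bad-one.com in the blocklist, probe("https://example.com/", "https://www.bad-one.com/promo") should return None for a protected site, while the same call with a harmless referrer should return the usual status code, such as 200.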