Blocking bots by their IP addresses, the DIY version

This post was written by eli on August 16, 2022
Posted Under: Internet,Linux,perl,Server admin

Introduction

I had some really annoying bots on one of my websites. Of the sort that make a million requests (like really, a million) per month, identifying themselves as a browser.

So IP blocking it is. I went for a minimalistic DIY approach. There are plenty of tools out there, but my experience with things like this is that in the end, it’s me and the scripts. So I might as well write them myself.

The IP set feature

Iptables has an IP set module, which allows feeding it with a set of arbitrary IP addresses. Internally, it keeps these addresses in a hash, so it’s an efficient way to match packets against a large number of addresses.

IP sets have been in the kernel for ages, but the feature has to be enabled in the kernel’s configuration with CONFIG_IP_SET. Which it most likely is.
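
For those who want to check rather than trust the odds, here is a minimal sketch. The function name ip_set_status is my own invention, and it assumes that the distribution keeps the running kernel’s configuration as /boot/config-<version>, which is common on Debian-like systems but not guaranteed.

```shell
# ip_set_status FILE: report how CONFIG_IP_SET is set in a kernel config file
ip_set_status() {
  case "$(grep -E '^CONFIG_IP_SET=' "$1" 2>/dev/null)" in
    CONFIG_IP_SET=y) echo "built-in" ;;
    CONFIG_IP_SET=m) echo "module" ;;
    *)               echo "not enabled" ;;
  esac
}

# Assumed location of the config file; adjust to your distribution:
#   ip_set_status "/boot/config-$(uname -r)"
```

If the answer is "module", the kernel normally loads it by itself the first time ipset is used.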

The ipset utility may need to be installed, with something like

# apt install ipset

There seems to be a protocol mismatch issue between the ipset utility and the kernel, which apparently is a non-issue. But every time something goes wrong with ipset, it prints a warning message about this mismatch, which is misleading. So it looks something like this:

# ipset [ ... something stupid or malformed ... ]
ipset v6.23: Kernel support protocol versions 6-7 while userspace supports protocol versions 6-6
[ ... some error message related to the stupidity ... ]

So the important thing to be aware of is that odds are that the problem isn’t the version mismatch, but somewhere between the chair and the keyboard.

Hello, world

A quick session

# ipset create testset hash:ip
# ipset add testset 1.2.3.4
# iptables -I INPUT -m set --match-set testset src -j DROP
# ipset del testset 1.2.3.4

Attempting to add an IP address that is already in the set causes an error message (and a non-zero exit status), and the set remains unchanged. So there’s no need to check if the address is already there. Besides, there’s the -exist option, which silences this error, and that is really great.

List the members of the IP set:

# ipset -L

Timeout

An entry can have a timeout, which works exactly as one would expect: The entry vanishes from the set after the timeout expires. The timeout value shown by ipset -L counts down.

For this to work, the set must be created with a default timeout attribute. Zero means that timeout is disabled (which I chose as a default in this example).

# ipset create testset hash:ip timeout 0
# ipset add testset 1.2.3.4 timeout 10

The ‘-exist’ flag causes ipset to re-add an existing entry, which also resets its timeout. So this is the way to keep the list fresh.
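
As a sketch, this can be wrapped in a small helper. The function name ban_ip and the overridable IPSET variable are my own assumptions, not something that ipset provides. The point is only that calling it again for the same address doesn’t fail, but restarts the countdown.

```shell
# Path to the utility, as in the other scripts of this post
IPSET=${IPSET:-/sbin/ipset}

# ban_ip SET ADDRESS SECONDS: add the address to the set, or refresh
# its timeout if it is already listed (that's what -exist is for)
ban_ip() {
  $IPSET add -exist "$1" "$2" timeout "$3"
}

# To be run as root, on a set that was created with a timeout attribute:
#   ban_ip testset 1.2.3.4 10
#   ban_ip testset 1.2.3.4 10    # no error, the timeout is back at 10
```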

Don’t put the DROP rule first

It’s tempting to put the DROP rule with --match-set first, because hey, let’s give those intruders the boot right away. But doing that, there might be TCP connections lingering, because the last FIN packet is caught by the firewall as the new rule is added. Given that adding an IP address is the result of a flood of requests, this is a realistic scenario.

The solution is simple: There’s most likely a “state RELATED,ESTABLISHED” rule somewhere in the list. So push it to the top. The rationale is simple: If a connection has begun, don’t chop it in the middle in any case. It’s the first packet that we want killed.
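
A minimal sketch of this ordering follows. The function name install_rules and the IPTABLES variable are mine, and the rule positions assume that nothing else needs to come before these two rules in the INPUT chain. If the RELATED,ESTABLISHED rule is already at the top (check with iptables -L INPUT --line-numbers), only the second command is needed.

```shell
IPTABLES=${IPTABLES:-/sbin/iptables}

# install_rules SET: accept packets of ongoing connections first, and
# only then drop packets from addresses in the set
install_rules() {
  # Rule 1: never chop a connection that has already begun
  $IPTABLES -I INPUT 1 -m state --state RELATED,ESTABLISHED -j ACCEPT
  # Rule 2: packets that reach this point start something new, so
  # drop them if they come from a blacklisted address
  $IPTABLES -I INPUT 2 -m set --match-set "$1" src -j DROP
}

# To be run as root, after the set has been created:
#   install_rules mysiteset
```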

Persistence

A rule in iptables can only refer to an existing set. So if the rule that relies on the set is part of the persistent firewall rules, the set must be created before the script that brings up iptables runs.

This is easily done by adding an executable plugin script like this as /usr/share/netfilter-persistent/plugins.d/10-ipset:

#!/bin/sh

IPSET=/sbin/ipset
SET=mysiteset

case "$1" in
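start|restart|reload|force-reload)
	# Try to start with an empty set. Destroying fails if an iptables
	# rule still refers to the set, hence -exist on the creation.
	$IPSET destroy $SET 2>/dev/null
	$IPSET create -exist $SET hash:ip timeout 0
	;;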

save)
	echo "ipset-persistent: The save option does nothing"
	;;

stop|flush)
	$IPSET flush $SET
	;;
*)
	echo "Usage: $0 {start|restart|reload|force-reload|save|stop|flush}" >&2
	exit 1
	;;
esac

exit 0

The idea is that the index 10 in the file’s name is smaller than the index of the plugin that sets up iptables, so this script runs first.

This script is a dirty hack, but hey, it works. There’s a small project on this, for those who like to do it properly.

The operating system in question is systemd-based, but this old school style is still in effect.

Maybe block by country?

Since all offending requests came from the same country (cough, cough, China, from more than 4000 different IP addresses) I’m considering blocking them all in one go. A list of 4000+ IP addresses that I busted in August 2022 with aggressive bots (all from China) can be downloaded as a simple compressed text file.

So the idea is to go something like

ipset create foo hash:net
ipset add foo 192.168.0.0/24
ipset add foo 10.1.0.0/16
ipset add foo 192.168.0/24

and download the per-country IP ranges from IP deny. That’s a simple and crude tool for denial by geolocation. The only thing that puts me off a bit is that it’s > 7000 entries, so I wonder if that doesn’t put a load on the server. But what really counts is the number of different netmask (prefix) sizes in the set, because each prefix size calls for a hash lookup of its own. So if the list covers all possible sizes, from a full /32 down to say, /16, there are 17 hash lookups for each packet arriving.
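
To get a feeling for these numbers before committing to anything, here is a sketch with two helpers. Their names, the set name chinaset and the file name cn.zone are all hypothetical. The only assumption about the downloaded file is that it is plain text with one CIDR range per line.

```shell
# count_prefix_sizes: print the number of distinct prefix sizes among
# the CIDR ranges on stdin, that is, the number of lookups per packet
count_prefix_sizes() {
  awk -F/ 'NF == 2 { seen[$2] = 1 } END { n = 0; for (k in seen) n++; print n }'
}

# zone_to_restore SET: turn the CIDR ranges on stdin into input for
# "ipset restore", which is much faster than one ipset call per range
zone_to_restore() {
  echo "create $1 hash:net"
  awk -v name="$1" '/^[0-9]/ { print "add " name " " $1 }'
}

# Hypothetical usage, the last step as root:
#   count_prefix_sizes < cn.zone
#   zone_to_restore chinaset < cn.zone | ipset -exist restore
```

The -exist flag on the restore command makes it possible to repeat the load on an existing set without errors.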

On the other hand, since the rule should be after the “state RELATED,ESTABLISHED” rule, it only covers SYN packets. And if this whole thing is put as late as possible in the list of rules, it boils down to handling only packets that are intended for the web server’s ports, or those that are going to be dropped anyhow. So compared with the CPU cycles of handling the http request, even 17 hashes isn’t all that much.

The biggest caveat, however, is when other websites are hosted on the same server. It’s one thing to block offending IPs, but blocking a whole country from all sites, that’s a bit too much.

Note to self: In the end, I wrote a little Perl-XS module that says if the IP belongs to a group. Look for byip.pm.

The blacklisting script

The Perl script that performs the blacklisting is crude and inaccurate, but simple. This is the part to tweak and play with, and in particular adapt to each specific website. It’s all about detecting abnormal access.

Truth be told, I replaced this script with a more sophisticated mechanism pretty much right away on my own system. But what’s really interesting is the calls to ipset.

This script reads through Apache’s access log file, and analyzes each minute in time (as in 60 seconds). In other words, all accesses that have the same timestamp, with the seconds part ignored. Note that the regex part that captures $time in the script ignores the last part of :\d\d.

If the same IP address appears 50 times or more within such a minute, that address is blacklisted, with a timeout of 86400 seconds (24 hours). Log entries that correspond to page requisites and such (images, style files etc.) are skipped for this purpose. Otherwise, it’s easy to reach 50 accesses within a minute with legit web browsing.
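
For playing around with the threshold before blacklisting anyone for real, the same per-minute counting can be sketched as a shell pipeline. This is not the script I used: the function names are made up, it assumes Apache’s common or combined log format, and it doesn’t skip the page requisites.

```shell
# minute_key: reduce each log line on stdin to "IP timestamp", where
# the seconds and the timezone are cut off from the timestamp
minute_key() {
  sed -n 's/^\([^ ]*\) [^[]*\[\([^]]*\):[0-9][0-9] .*/\1 \2/p'
}

# offenders LIMIT: print the IP addresses that appear LIMIT times or
# more within a single minute of the log on stdin
offenders() {
  minute_key | sort | uniq -c |
    awk -v limit="$1" '$1 >= limit { print $2 }' | sort -u
}

# Usage, with the same log file as in the Perl script:
#   offenders 50 < /var/log/mysite.com/access.log
```

This only lists the addresses, so it’s harmless to run with different limits and see who gets caught.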

There are several imperfections about this script, among others:

  • Since it reads through the entire log file each time, it keeps relisting each IP address until the log file is rotated away, and a new one is started. This causes an update of the timeout, so effectively the blacklisting takes place for up to 48 hours.
  • Looking in segments of accesses that happen to have the same minute in the timestamp is quite inaccurate regarding which IPs are caught and which aren’t.

The script goes as follows:

#!/usr/bin/perl
use warnings;
use strict;

my $logfile = '/var/log/mysite.com/access.log';
my $limit = 50; # 50 accesses per minute
my $timeout = 86400;

open(my $in, "<", $logfile)
  or die "Can't open $logfile for read: $!\n";

my $current = '';
my $l;
my %h;
my %blacklist;

while (defined ($l = <$in>)) {
  my ($ip, $time, $req) = ($l =~ /^([^ ]+).*?\[(.+?):\d\d[ ].*?\"\w+[ ]+([^\"]+)/);
  unless (defined $ip) {
    #    warn("Failed to parse line $l\n");
    next;
  }

  next
    if ($req =~ /^\/(?:media\/|robots\.txt)/);

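  # A new minute has begun: Evaluate the counts of the previous one
  unless ($time eq $current) {
    foreach my $k (sort keys %h) {
      $blacklist{$k} = 1
        if ($h{$k} >= $limit);
    }

    %h = ();
    $current = $time;
  }
  $h{$ip}++;
}

close $in;

# The loop above never evaluates the last minute in the file, so do it here
foreach my $k (sort keys %h) {
  $blacklist{$k} = 1
    if ($h{$k} >= $limit);
}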

foreach my $k (sort keys %blacklist) {
  system('/sbin/ipset', 'add', '-exist', 'mysiteset', $k, 'timeout', $timeout);
}

It has to be run as root, of course. Most likely as a cron job.
