One thing I particularly enjoy about running my own blog, on my own platform, is seeing people use it. Posts on law libraries, data visualization, and other business topics get visits during the week. Late at night, it’s posts on games, fixing devices, crafts. They come from all over the world and I hope they find what they’re looking for. But as I’ve noted before, there are other visitors I’m not so glad to see. I learned about autonomous system numbers (ASNs) recently and decided to add them to my site’s approach to security and resource preservation.

I’m not entirely clear on every purpose an ASN serves, but what caught my attention was the “network operator” element. For my purposes, an ASN is attached to an internet service provider (ISP) or other network operator, like a web host or a large platform (Google, Facebook). I’d recently noticed that my Matomo analytics were showing a provider for each visitor and, for the most part, it was an ISP. Good to know.

But then I noticed a lot of traffic in my analytics from a couple of unusual providers: colocrossing and sprious. The latter hosts a product called “scraping robot” and the former appears to be a hosting company. I’m leery of scrapers, and I don’t like having my site crawled because of the operational and performance impact on my web host. There is also the copyright issue when a scraper is ingesting your content. My support for scrapers goes only as far as is needed to create an RSS feed.

I decided I’d see if I could use my Cloudflare firewall capability to block whatever these were. As I did so, I realized that the provider information could be used within Cloudflare to do things I couldn’t do before.

Block by Bot Provider

In the past, I have attempted to block by user-agent. This is part of the information you send when you visit a web site; you can see your own by searching “my user agent” on the DuckDuckGo search engine. It tells the sites you visit about your software: browser, operating system, and so on. Search engines will often indicate their identity in the user-agent. So, for example, the Chinese Sogou crawler user-agent says this:

Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)

and Semrush, a digital marketing platform, says this:

Mozilla/5.0 (compatible; SemrushBot-BA; +http://www.semrush.com/bot.html)

If the visitor you want to block is a bot, and it clearly designates itself as one, you can use the user-agent information to enable that block. One approach I have used is to incorporate the block into the .htaccess configuration file on my web server. An example of what I’ve put in .htaccess looked like this:

RewriteEngine On
# Match an empty user-agent, or one containing a crawler keyword
RewriteCond %{HTTP_USER_AGENT} ^$ [OR]
RewriteCond %{HTTP_USER_AGENT} (bot|crawl|robot|sogou|dataprovider) [NC]
# ... unless the user-agent also matches one of these allowed strings
RewriteCond %{HTTP_USER_AGENT} !(bing|google|googlebot|twitterbot|twitter|mobile|uptime) [NC]
# Redirect anything that matched to a dead end
RewriteRule ^/?.*$ http://127.0.0.1 [R,L]

These are conditional statements that check the user-agent and, depending on what they find, either allow the bot to continue or stop it. Ideally, you are also using a robots.txt file to designate who you want to crawl your site; unfortunately, some crawlers do not observe the file’s restrictions. Forwarding bots to 127.0.0.1, as in the snippet above, sends them to a localhost address, not your site. It’s a dead end.
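For crawlers that do observe it, a minimal robots.txt covering the two crawlers named earlier might look like the sketch below; the user-agent tokens are taken from the user-agent strings shown above, so check each crawler’s documentation for the exact token it honours.

User-agent: Sogou web spider
Disallow: /

User-agent: SemrushBot
Disallow: /

User-agent: *
Disallow: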

Cloudflare Instead of .htaccess

But the Cloudflare firewall also supports rules. And, frankly, I’d rather have this sort of activity happen before it gets to my web server. Even a check to see if the bot should be allowed uses my web hosting resources. If I can stop it at Cloudflare, that’s one less thing to worry about.

Cloudflare has a free plan, and I think it’s plenty for most small law libraries. It’s easy to set up, requiring only a DNS change (pointing your domain’s name servers at Cloudflare) so that traffic goes through Cloudflare before it reaches your site.

If you go to the Firewall application in your dashboard, you’ll see an overview of what’s being blocked. There are two other important tabs: Firewall Rules and Tools. You can look at an entry on the Overview page to see its details, which can help you unblock things you want to allow through. Alternatively, you can use Firewall Rules and Tools to set up additional blocks beyond what Cloudflare blocks automatically.

So, for example, I started out with a challenge rule. I had noticed that a lot of bad requests were coming from specific countries. I’ve also written about using country restrictions on content, like a paywall. This is a blunter tool: any request coming from one of the listed countries is blocked unless the visitor completes a challenge. The rule looks like this and uses country codes (T1 is Cloudflare’s code for Tor traffic):

(ip.geoip.country eq "VN") or (ip.geoip.country eq "RU") or (ip.geoip.country eq "CN") or (ip.geoip.country eq "BR") or (ip.geoip.country eq "IN") or (ip.geoip.country eq "PH") or (ip.geoip.country eq "TR") or (ip.geoip.country eq "T1")

The reason I use a challenge rather than an outright block is that there is some legitimate traffic from those countries. The challenge lets that genuine traffic through without my having to play whack-a-mole. In general, most visits from these countries do not complete the challenge, which you can see in your Firewall Rules activity after setting one up.

Screenshot of Cloudflare Firewall Rules activity. The challenge rule, at the bottom, is highlighted with an oval to show that, of 90 challenges, only 1 was solved.

Cloudflare also has a built-in known bad bots rule, which is why, when you look at your Firewall Overview before adding any rules of your own, it is already blocking things. You can exclude bots from that blocking or complement the list with bots you want to add to the block. In my case, there were a number I wanted to exclude, primarily bots for well-known web sites or platforms:

(cf.client.bot
and not http.user_agent contains "bingbot"
and not http.user_agent contains "Googlebot"
and not http.user_agent contains "Jetpack"
and not http.user_agent contains "Feedly"
and not http.user_agent contains "BingPreview"
and not http.user_agent contains "duckduckgo"
and not http.user_agent contains "Pinterestbot"
and not http.user_agent contains "applebot"
and not http.user_agent contains "Google-AMPHTML"
and not http.user_agent contains "CloudflareDiagnostics"
and not http.user_agent contains "FeedFetcher-Google"
and ip.geoip.asnum ne 2635)

You will recognize a lot of familiar sites in that list. One frustrating aspect is that some services send multiple bots with different user-agents, each with its own purpose. This is a WordPress site, and Automattic, the WordPress developer, sends bots with user-agents such as photon, wp.com, and wordpress.com. A rule like this has to take that into account.

Which is why that last line may be so useful. If you do a web search for asn automattic, you will find that its ASN is 2635. Now I can select the AS Num option in the Firewall Rules and cover a bunch of Automattic’s bots with one condition. I have not done this with Google, because some of the bad requests come from Google Cloud customers. The same goes for Microsoft and Azure.
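To see the idea in isolation, here is a stripped-down sketch built only from fields that already appear in the rule above: it blocks known bots unless the request originates from Automattic’s network (AS 2635), covering photon, wp.com, wordpress.com and the rest with a single condition instead of a user-agent line for each.

(cf.client.bot and ip.geoip.asnum ne 2635)

In practice you would keep the user-agent exemptions too, since search engines and the other services listed above arrive from their own networks.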

The rules don’t have to be complicated. In fact, most law librarians will be able to read this pretty easily, since it’s just nested boolean logic (A and (B or C)). Rules are processed in order, so my first rule blocks attempts to access my site’s login page, one of the URIs bad actors request most often. My rule is this:

ip.geoip.country ne "CA" and 
(http.request.uri contains "login.php" or http.request.uri contains "/wp-admin")

This rule isn’t how I secure my site, though. I have two-factor authentication, and I use the .htaccess file to provide additional restrictions. You might hide your login page entirely. But the Cloudflare rule stops those probing requests before they ever reach the protection that is closer to home.
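What those .htaccess restrictions look like will vary with your setup. As a minimal sketch only, assuming Apache 2.4 and a placeholder address range (203.0.113.0/24 is a reserved documentation range, not mine), this limits the WordPress login script to known addresses:

# Hypothetical: only allow the WordPress login script from a trusted range
<Files wp-login.php>
    Require ip 203.0.113.0/24
</Files>

Anyone outside that range gets a 403 from the web server before WordPress even loads.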

Cloudflare Firewall Rules require little technical knowledge to use. I’ve included the rule expressions above, but you build them with drop-down menus. They allow you to be very specific about what you allow in and what you don’t. You just need to come armed with a bit of information, either from your own web site analytics or from the Firewall Overview, about what is being blocked.

A Deeper Moat

You only get 5 firewall rules with Cloudflare’s free plan, so you have to use them wisely. But there are many ways to achieve the same goal in Cloudflare. Under the Tools option, you can designate specific IP addresses or ASNs that you want to block or allow. Firewall Rules give you fine-tuned control over your protection, if you want it; the AS Number blocking under Tools is the bluntest instrument.

As I said at the start of this post, Matomo, my web site analytics, gives me provider information. When I saw the crawling from the two services, I did a web search to identify their ASNs. Then I blocked them in Cloudflare.

But I subsequently saw the traffic restart. In some cases, the provider name was the same but the traffic was now coming from a different ASN; in others, the operation had shifted to a different provider. Any visitors from these network operators are now blocked.

Screenshot showing the Cloudflare Firewall Tools section that enables blocking by AS Number. Unlike Firewall Rules, you can apply these blocks to all web sites you manage at Cloudflare.

If it’s not clear what ASN to use, you can always look up the IP address of your visitor using a whois service. Take a look at the image below. It shows 4 visits in a span of one minute from 3 different cities. It’s all one scraper bot.

A screenshot of Matomo Analytics for this blog, showing visitor information on the left (time of visit, IP address, city, technology) and the URIs visited on the right.

What made me suspicious? A provider called unknown. A web site, visitartisticplaces, that doesn’t exist. A visitor looking at two pages that are not connected. My intrinsic paranoia!

Now, if you plug the IP addresses shown in that screenshot for those four visitors – 192.186.159.0, 192.186.177.0, 23.229.65.0, and 198.245.67.0 – into a whois lookup, you’ll get some overlapping information:

OriginAS:       AS55286
Organization:   B2 Net Solutions Inc. (BNS-34)

Now that I have that AS Number, I can add it to my Firewall Tools list of ASNs to block entirely.
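You could also spend one of the five Firewall Rules on it instead of using the Tools list; the equivalent expression is just a match on the AS Number, with the action set to Block:

(ip.geoip.asnum eq 55286)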

There is a grey area. Services like Manzama and Ozmosys (?) and Meltwater (user-agent: MeltwaterNews) must be crawling or acquiring their content from somewhere. Majestic, a UK crawler serving commercial clients, is one of the ones I block. There is a Java-based app, from Italy and the US, that doesn’t provide enough information to know what it’s doing, although I think it’s benign. So there are probably unintended consequences, and you will need to balance your blocks against the crawlers used by the tools or platforms your audience relies on.
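For a named crawler like Majestic, one user-agent condition in a rule is enough. Its crawler identifies itself as MJ12bot (that token comes from Majestic’s own published crawler information, not from my rules above), so a sketch of such a block would be:

(http.user_agent contains "MJ12bot")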

I’m still playing around with it. Filtering by ASN should mean I need fewer specific items in my rules when one provider sends multiple bots. But if you make a change, be sure to check back after 30 minutes or a couple of hours to make sure it’s working the way you expected.

I learned a lot about Cloudflare’s firewall options that I probably would have missed without a specific reason to look. The firewall gives me options that don’t require the technical skill to hack on .htaccess files. And it means I’ve moved even more of this unwanted traffic away from my server’s resources.