Blocking Search Engine Spiders and Malicious Crawlers

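Heavy crawler traffic consumes server resources, and some tool-based crawlers probe sites for weaknesses, which threatens site security. This article lists the User-Agent strings of these crawlers and shows how to block them.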

Most sites now use a CDN for acceleration, so the recommended approach is to configure a User-Agent blacklist directly in the CDN.

For Alibaba Cloud DCDN, enter the following value in the User-Agent field of the blacklist setting, as shown in the figure:

*dotbot*|*Go-http-client*|*CensysInspect*|*okhttp*|*MegaIndex*|*MegaIndex.ru*|*BLEXBot*|*Qwantify*|*qwantify*|*semrush*|*Semrush*|*serpstatbot*|*hubspot*|*python*|*Bytespider*|*Go-http-client*|*Java*|*PhantomJS*|*SemrushBot*|*Scrapy*|*Webdup*|*AcoonBot*|*AhrefsBot*|*Ezooms*|*EdisterBot*|*EC2LinkFinder*|*jikespider*|*Purebot*|*MJ12bot*|*WangIDSpider*|*WBSearchBot*|*Wotbox*|*xbfMozilla*|*Yottaa*|*YandexBot*|*Jorgee*|*SWEBot*|*spbot*|*TurnitinBot-Agent*|*mail.RU*|*Perl*|*Python*|*Wget*|*Xenu*|*ZmEu*

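For Cloudflare, the setup is shown in the figure. Adding the entries one by one in the expression builder takes too long, so switch to editing the expression directly and paste in the following: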

(http.user_agent contains "Go-http-client") or (http.user_agent contains "CensysInspect") or (http.user_agent contains "okhttp") or (http.user_agent contains "MegaIndex") or (http.user_agent contains "MegaIndex.ru") or (http.user_agent contains "BLEXBot") or (http.user_agent contains "Qwantify") or (http.user_agent contains "qwantify") or (http.user_agent contains "semrush") or (http.user_agent contains "Semrush") or (http.user_agent contains "serpstatbot") or (http.user_agent contains "hubspot") or (http.user_agent contains "python") or (http.user_agent contains "Bytespider") or (http.user_agent contains "Go-http-client") or (http.user_agent contains "Java") or (http.user_agent contains "PhantomJS") or (http.user_agent contains "SemrushBot") or (http.user_agent contains "Scrapy") or (http.user_agent contains "Webdup") or (http.user_agent contains "AcoonBot") or (http.user_agent contains "AhrefsBot") or (http.user_agent contains "Ezooms") or (http.user_agent contains "EdisterBot") or (http.user_agent contains "EC2LinkFinder") or (http.user_agent contains "jikespider") or (http.user_agent contains "Purebot") or (http.user_agent contains "MJ12bot") or (http.user_agent contains "WangIDSpider") or (http.user_agent contains "WBSearchBot") or (http.user_agent contains "Wotbox") or (http.user_agent contains "xbfMozilla") or (http.user_agent contains "Yottaa") or (http.user_agent contains "YandexBot") or (http.user_agent contains "Jorgee") or (http.user_agent contains "SWEBot") or (http.user_agent contains "spbot") or (http.user_agent contains "TurnitinBot-Agent") or (http.user_agent contains "mail.RU") or (http.user_agent contains "perl") or (http.user_agent contains "Python") or (http.user_agent contains "Wget") or (http.user_agent contains "Xenu") or (http.user_agent contains "ZmEu")

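If a WAF is installed on the server itself, such as the BT Panel (宝塔) WAF, the list can be imported directly. The expected format is one UA per line.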

An AI tool can be used to convert the list into that format.

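The equivalent Nginx configuration is shown below. The trailing `^$` alternative also matches requests with an empty User-Agent, and `return 444` is an Nginx-specific code that closes the connection without sending a response: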

    if ($http_user_agent ~ "MegaIndex|MegaIndex.ru|BLEXBot|Qwantify|qwantify|semrush|Semrush|serpstatbot|hubspot|python|Bytespider|Go-http-client|Java|PhantomJS|SemrushBot|Scrapy|Webdup|AcoonBot|AhrefsBot|Ezooms|EdisterBot|EC2LinkFinder|jikespider|Purebot|MJ12bot|WangIDSpider|WBSearchBot|Wotbox|xbfMozilla|Yottaa|YandexBot|Jorgee|SWEBot|spbot|TurnitinBot-Agent|mail.RU|perl|Python|Wget|Xenu|ZmEu|^$" ) {
        return 444;
    }
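
The same check can also be written with `map`, which keeps the pattern list outside the `server` block and lets one case-insensitive pattern replace the upper- and lower-case duplicates. This is only a sketch under stated assumptions: the variable name `$blocked_ua` and the `server_name` are hypothetical, and the pattern list is shortened for illustration rather than being the full list above.

```nginx
# Goes in the http {} context.
# $blocked_ua is a hypothetical variable name: 1 means "block this request".
map $http_user_agent $blocked_ua {
    default 0;
    # Empty User-Agent, the same case the ^$ alternative covers
    ""      1;
    # ~* makes the match case-insensitive; shortened list for illustration
    ~*(semrush|ahrefsbot|mj12bot|scrapy|python|wget|zmeu) 1;
}

server {
    listen       80;
    server_name  example.com;   # hypothetical

    # Close the connection without a response for blocked agents
    if ($blocked_ua) {
        return 444;
    }
}
```

Because the match is case-insensitive, entries such as `semrush` and `Semrush` collapse into one. It is worth running `nginx -t` and replaying a few real User-Agent strings before relying on a shortened list.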

Configuring Nginx to Serve One Site with Multiple PHP Versions

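Many frameworks and systems have a plugin or app marketplace (Discuz!, for example). Some plugin developers stop updating and maintaining their plugins for various reasons, so a plugin may not support PHP 7 or PHP 8 even though the framework itself already supports the newer PHP. The reverse also happens: the system does not yet support a newer PHP, but a plugin needs it to run. In either case, Nginx can be configured to use several PHP versions at once. The example below sends requests whose URI contains `archives` to the PHP 7.2 socket and everything else to the PHP 5.6 socket.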

    location ~ [^/]\.php(/|$)
    {
        # URIs containing "archives" (case-insensitive) go to PHP 7.2
        if ($request_uri ~* "archives") {
            fastcgi_pass unix:/tmp/php-cgi-72.sock;
        }
        # Everything else goes to PHP 5.6
        fastcgi_pass  unix:/tmp/php-cgi-56.sock;
        fastcgi_index index.php;
        include fastcgi.conf;
        include pathinfo.conf;
    }
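
An alternative that avoids `if` inside `location` is to pick the socket with `map` and pass the result to `fastcgi_pass`, which accepts variables. The following is a sketch only: `$php_socket`, the `server_name`, and the `root` path are hypothetical, while the socket paths and includes are taken from the configuration above.

```nginx
# Goes in the http {} context.
# $php_socket is a hypothetical variable name holding the chosen PHP-FPM socket.
map $request_uri $php_socket {
    default       unix:/tmp/php-cgi-56.sock;   # PHP 5.6 for everything else
    ~*archives    unix:/tmp/php-cgi-72.sock;   # PHP 7.2 for URIs containing "archives"
}

server {
    listen       80;
    server_name  example.com;      # hypothetical
    root         /var/www/html;    # hypothetical

    location ~ [^/]\.php(/|$) {
        fastcgi_pass  $php_socket;
        fastcgi_index index.php;
        include fastcgi.conf;
        include pathinfo.conf;
    }
}
```

Adding a third PHP version then only needs one more line in the `map`. As with any change to `fastcgi_pass`, check the result with `nginx -t` and a real request to each path.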

Configuring Nginx to Reject Direct IP Access and Prevent Malicious Domain Pointing

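Edit the configuration file of the default site (the one Nginx serves when the request matches no other site) and add the following line below `server_name`: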

return 408;

Other status codes such as 502 or 403 work just as well. To send this traffic to a specific site instead, add the following line:

rewrite ^(.*) https://www.jiangdefu.com permanent;

To block HTTPS access as well, add a listener on port 443 and configure any SSL certificate.
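
Put together, a catch-all block covering both ports might look like the sketch below. The certificate paths are placeholders (any certificate works here, as noted above), and `default_server` with `server_name _;` is one common way to mark the block as the default, which is an assumption about how the default site is set up.

```nginx
server {
    # Catch requests that match no other server block, on both ports
    listen 80  default_server;
    listen 443 ssl default_server;
    server_name _;

    # Placeholder paths: any certificate will do
    ssl_certificate     /etc/nginx/ssl/placeholder.crt;
    ssl_certificate_key /etc/nginx/ssl/placeholder.key;

    # Same status code as in the text; 403, 502, etc. work too
    return 408;
}
```

On Nginx 1.19.4 or later, `ssl_reject_handshake on;` in this block may be an alternative: it rejects the TLS handshake outright, so no certificate is needed.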

Configuring Nginx for Automatic Two-Way Redirects Between Mobile and Desktop

Scenario

Client     Domain             Description
Desktop    www.example.com    Official site for desktop visitors
Mobile     m.example.com      Site for mobile visitors

Requirements

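On a desktop device, requests to either www.example.com or m.example.com are redirected to www.example.com.
On a mobile device, requests to either www.example.com or m.example.com are redirected to m.example.com.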

Implementation

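One way to implement the redirect is front-end JavaScript in the page that inspects the UA and redirects accordingly. This approach has three drawbacks:
a) For users: it adds latency caused by client-side redirection, because the client must first download the page, then parse and execute the JavaScript before the redirect fires. A 301 or 302 does not have this delay.
b) For search engines: a crawler can discover the redirect only if it supports JavaScript rendering.
c) Two-way redirection is impossible or fragile: the author tested several publicly available snippets, and they only handled one-way redirection. Attempting two-way redirection caused an infinite loop.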

Baidu's official recommendations on mobile adaptation:
https://ziyuan.baidu.com/college/courseinfo?id=156

To be friendlier to both users and search engines, we configure the redirects in Nginx.

Code

Desktop: www.example.com

    server {
        listen       80;
        server_name  www.example.com;

        #charset koi8-r;
        #access_log  logs/host.access.log  main;

        # Redirect any other Host to the desktop domain
        if ($http_host !~ "^www.example.com$") {
            rewrite  ^(.*)    http://www.example.com$1 permanent;
        }
        # Redirect mobile User-Agents to the mobile site
        if ($http_user_agent ~* (mobile|nokia|iphone|ipad|android|samsung|htc|blackberry)) {
            rewrite  ^(.*)    http://m.example.com$1 permanent;
        }

        location / {
            root   /home/build/rampage-home-front/dist/html;
            index  index.html index.htm;
        }
    }

The lines that do the work are:

    if ($http_host !~ "^www.example.com$") {
        rewrite  ^(.*)    http://www.example.com$1 permanent;
    }
    if ($http_user_agent ~* (mobile|nokia|iphone|ipad|android|samsung|htc|blackberry)) {
        rewrite  ^(.*)    http://m.example.com$1 permanent;
    }

Mobile: m.example.com

    server {
        listen       80;
        server_name  m.example.com;

        #charset koi8-r;
        #access_log  logs/host.access.log  main;

        # Redirect non-mobile clients to www.example.com
        if ($http_user_agent !~* (mobile|nokia|iphone|ipad|android|samsung|htc|blackberry)) {
            rewrite  ^(.*)    http://www.example.com$1 permanent;
        }

        location / {
            root   /home/build/rampage-mobile-front/dist;
            index  index.html index.htm;
        }
    }

The lines that do the work are:

    if ($http_user_agent !~* (mobile|nokia|iphone|ipad|android|samsung|htc|blackberry)) {
        rewrite  ^(.*)    http://www.example.com$1 permanent;
    }

If an SSL certificate is configured, the same rules must also be added to the server block listening on port 443.
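
As a rough illustration, the desktop site's port-443 block could look like the sketch below. The certificate paths are hypothetical, the redirect targets switch to `https://` on the assumption that both domains serve HTTPS, and `return 301 ...$request_uri` is swapped in for the `rewrite ... permanent` form used above (both issue a 301 and keep the path and query string).

```nginx
server {
    listen       443 ssl;
    server_name  www.example.com;

    # Hypothetical certificate paths
    ssl_certificate     /etc/nginx/ssl/www.example.com.crt;
    ssl_certificate_key /etc/nginx/ssl/www.example.com.key;

    # Mobile User-Agents go to the mobile site over HTTPS
    if ($http_user_agent ~* (mobile|nokia|iphone|ipad|android|samsung|htc|blackberry)) {
        return 301 https://m.example.com$request_uri;
    }

    location / {
        root   /home/build/rampage-home-front/dist/html;
        index  index.html index.htm;
    }
}
```

The mobile site's port-443 block would mirror this with the negated match (`!~*`) and `https://www.example.com` as the target.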