
              Traffic spikes caused by short bursts of intensive bot visits: what do they look like, and how can they be resolved?

              Source: CSDN    Date: 2023-03-31 08:01:11

              Early on a weekend morning I received an alert email. My first guess was that the site was under attack, or that it was a cache/log/memory problem. A look at access.log showed that, during that time window, a large wave of bots (bot: a computer program that performs a particular task again and again many times) had been hitting my site.

              http://ltx71.com  http://mj12bot.com http://www.bing.com/bingbot.htm http://ahrefs.com/robot/ http://yandex.com/bots



              website.com (AWS) - Monitor is Down

              Down since Mar 25, 2017 1:38:58 AM CET

              Site Monitored:  website.com
              Resolved IP:     54.171.32.xx
              Reason:          Service Unavailable.
              Monitor Group:   XX Applications

              Outage Details

              Location: London - UK (5.77.35.xx)
              Resolved IP: 54.171.32.xx
              Reason: Service Unavailable.
              Headers:
              HTTP/1.1 503 Service Unavailable: Back-end server is at capacity
              Content-Length: 0
              Connection: keep-alive
              GET / HTTP/1.1
              Cache-Control: no-cache
              Accept: */*
              Connection: Keep-Alive
              Accept-Encoding: gzip
              User-Agent: Site24x7
              Host: xxx

              Location: Seattle - US (104.140.20.xx)
              Resolved IP: 54.171.32.xx
              Reason: Service Unavailable.
              Headers:
              HTTP/1.1 503 Service Unavailable: Back-end server is at capacity
              Content-Length: 0
              Connection: keep-alive
              GET / HTTP/1.1
              Cache-Control: no-cache
              Accept: */*
              Connection: Keep-Alive
              Accept-Encoding: gzip
              User-Agent: Site24x7
              Host: xxx

              A quick search showed that many webmasters have run into the same problem: a traffic spike caused by a short burst of intensive bot visits leaves the server unable to serve other clients. Based on the analysis in that article, there are several ways to block these web bots.

              1.      robots.txt

              Many web crawlers fetch robots.txt first, as shown below:

              "199.58.86.206" - - [25/Mar/2017:01:26:50 +0000] "GET /robots.txt HTTP/1.1" 404 341 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)" "199.58.86.206" - - [25/Mar/2017:01:26:54 +0000] "GET / HTTP/1.1" 200 129989 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)" "162.210.196.98" - - [25/Mar/2017:01:39:18 +0000] "GET /robots.txt HTTP/1.1" 404 341 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)"

              Many bot publishers also explain what to do if you do not want to be crawled. Take MJ12bot as an example:

              How can I block MJ12bot?

              MJ12bot adheres to the robots.txt standard. If you want to prevent the bot from crawling your website then add the following text to your robots.txt:

              User-agent: MJ12bot

              Disallow: /

              Please do not waste your time trying to block the bot via IP in htaccess - we do not use any consecutive IP blocks so your efforts will be in vain. Also please make sure the bot can actually retrieve robots.txt itself - if it can't then it will assume (this is the industry practice) that it's okay to crawl your site.

              If you have reason to believe that MJ12bot did NOT obey your robots.txt commands, then please let us know via email: bot@majestic12.co.uk. Please provide the URL of your website and log entries showing the bot trying to retrieve pages that it was not supposed to.

              How can I slow down MJ12bot?

              You can easily slow down bot by adding the following to your robots.txt file:

              User-Agent: MJ12bot

              Crawl-Delay:   5

              Crawl-Delay should be an integer number and it signifies the number of seconds to wait between requests. MJ12bot will delay up to 20 seconds between requests to your site - note however that, while it is unlikely, it is still possible your site may be crawled by multiple MJ12bots at the same time. Setting a high Crawl-Delay should minimise the impact on your site. This Crawl-Delay parameter will also be honoured if it was set for the * wildcard.

              If our bot detects that you used Crawl-Delay for any other bot then it will automatically crawl slower even though MJ12bot specifically was not asked to do so.

              So we can write something like the following:

              User-agent: YisouSpider

              Disallow: /

              User-agent: EasouSpider

              Disallow: /

              User-agent: EtaoSpider

              Disallow: /

              User-agent: MJ12bot

              Disallow: /
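
              For a bot you would rather throttle than block outright, a Crawl-delay entry can sit alongside the blocks; bingbot, which appears among the bots listed above, documents support for it, though not all crawlers honour the directive. A minimal sketch, where the 10-second value is only an illustration:

              # Block outright
              User-agent: MJ12bot
              Disallow: /

              # Throttle instead of block (delay in seconds; not all crawlers honour Crawl-delay)
              User-agent: bingbot
              Crawl-delay: 10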

              In addition, given that many bots also hit these paths:

              /wp-login.php
              /wp-admin/

              /trackback/

              /?replytocom=

              Many WordPress sites really do rely on these paths, so how do we adjust robots.txt without breaking functionality?

              robots.txt before the change:

              User-agent: *
              Disallow: /wp-admin
              Disallow: /wp-content/plugins
              Disallow: /wp-content/themes
              Disallow: /wp-includes
              Disallow: /?s=

              robots.txt after the change:

              User-agent: *
              Disallow: /wp-admin
              Disallow: /wp-*
              Allow: /wp-content/uploads/
              Disallow: /wp-content
              Disallow: /wp-login.php
              Disallow: /comments
              Disallow: /wp-includes
              Disallow: /*/trackback
              Disallow: /*?replytocom*
              Disallow: /?p=*&preview=true
              Disallow: /?s=

              That said, plenty of crawlers simply ignore robots.txt. This one, for example, never requested robots.txt at all:

              "10.70.8.30, 163.172.65.40" - - [25/Mar/2017:02:13:36 +0000] "GET / HTTP/1.1" 200 129989 "-" "Mozilla/5.0 (compatible; AhrefsBot/5.2; +http://ahrefs.com/robot/)" "178.63.23.67, 163.172.65.40" - - [25/Mar/2017:02:13:42 +0000] "GET / HTTP/1.1" 200 129989 "-" "Mozilla/5.0 (compatible; AhrefsBot/5.2; +http://ahrefs.com/robot/)" "178.63.23.67, 163.172.65.40" - - [25/Mar/2017:02:14:17 +0000] "GET /static/js/utils.js HTTP/1.1" 200 5345 "http://iatatravelcentre.com/" "Mozilla/5.0 (compatible; AhrefsBot/5.2; +http://ahrefs.com/robot/)" "178.63.23.67, 163.172.65.40" - - [25/Mar/2017:02:14:17 +0000] "GET /static/css/home.css HTTP/1.1" 200 8511 "http://iatatravelcentre.com/" "Mozilla/5.0 (compatible; AhrefsBot/5.2; +http://ahrefs.com/robot/)"

              When that happens, you have to try one of the other approaches.

              2.      .htaccess

              The idea is URL rewriting: as soon as a request is seen to come from one of these agents, deny it. This article by the author "~吉爾伽美什" describes many uses of .htaccess; the relevant sections are quoted below.

              5. Blocking users by IP

              order allow,deny
              deny from 123.45.6.7
              deny from 12.34.5.      (an entire class-C range)
              allow from all

              6. Blocking users/sites by referrer (requires mod_rewrite)

              Example 1. Block a single referrer: badsite.com

              RewriteEngine on
              # Options +FollowSymlinks
              RewriteCond %{HTTP_REFERER} badsite\.com [NC]
              RewriteRule .* - [F]

              Example 2. Block multiple referrers: badsite1.com, badsite2.com

              RewriteEngine on
              # Options +FollowSymlinks
              RewriteCond %{HTTP_REFERER} badsite1\.com [NC,OR]
              RewriteCond %{HTTP_REFERER} badsite2\.com
              RewriteRule .* - [F]

              [NC] - case-insensitive
              [F] - 403 Forbidden

              Note that the "Options +FollowSymlinks" statement above is commented out. If the server has not set FollowSymLinks in the corresponding section of httpd.conf, you need to enable that line, otherwise you will get a "500 Internal Server Error".

              7. Blocking bad bots and site rippers (aka offline browsers) (requires mod_rewrite)

              Bad bots? For example, crawlers that harvest email addresses and crawlers that ignore robots.txt (baidu?). They can be identified by HTTP_USER_AGENT. (Even more shameless ones, such as "中搜 zhongsou.com", set their agent to "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)", and nothing can be done about those.)

              RewriteEngine On
              RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [OR]
              RewriteCond %{HTTP_USER_AGENT} ^Bot\ mailto:craftbot@yahoo.com [OR]
              RewriteCond %{HTTP_USER_AGENT} ^ChinaClaw [OR]
              RewriteCond %{HTTP_USER_AGENT} ^Custo [OR]
              RewriteCond %{HTTP_USER_AGENT} ^DISCo [OR]
              RewriteCond %{HTTP_USER_AGENT} ^Download\ Demon [OR]
              RewriteCond %{HTTP_USER_AGENT} ^eCatch [OR]
              RewriteCond %{HTTP_USER_AGENT} ^EirGrabber [OR]
              RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
              RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
              RewriteCond %{HTTP_USER_AGENT} ^Express\ WebPictures [OR]
              RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
              RewriteCond %{HTTP_USER_AGENT} ^EyeNetIE [OR]
              RewriteCond %{HTTP_USER_AGENT} ^FlashGet [OR]
              RewriteCond %{HTTP_USER_AGENT} ^GetRight [OR]
              RewriteCond %{HTTP_USER_AGENT} ^GetWeb! [OR]
              RewriteCond %{HTTP_USER_AGENT} ^Go!Zilla [OR]
              RewriteCond %{HTTP_USER_AGENT} ^Go-Ahead-Got-It [OR]
              RewriteCond %{HTTP_USER_AGENT} ^GrabNet [OR]
              RewriteCond %{HTTP_USER_AGENT} ^Grafula [OR]
              RewriteCond %{HTTP_USER_AGENT} ^HMView [OR]
              RewriteCond %{HTTP_USER_AGENT} HTTrack [NC,OR]
              RewriteCond %{HTTP_USER_AGENT} ^Image\ Stripper [OR]
              RewriteCond %{HTTP_USER_AGENT} ^Image\ Sucker [OR]
              RewriteCond %{HTTP_USER_AGENT} Indy\ Library [NC,OR]
              RewriteCond %{HTTP_USER_AGENT} ^InterGET [OR]
              RewriteCond %{HTTP_USER_AGENT} ^Internet\ Ninja [OR]
              RewriteCond %{HTTP_USER_AGENT} ^JetCar [OR]
              RewriteCond %{HTTP_USER_AGENT} ^JOC\ Web\ Spider [OR]
              RewriteCond %{HTTP_USER_AGENT} ^larbin [OR]
              RewriteCond %{HTTP_USER_AGENT} ^LeechFTP [OR]
              RewriteCond %{HTTP_USER_AGENT} ^Mass\ Downloader [OR]
              RewriteCond %{HTTP_USER_AGENT} ^MIDown\ tool [OR]
              RewriteCond %{HTTP_USER_AGENT} ^Mister\ PiX [OR]
              RewriteCond %{HTTP_USER_AGENT} ^Navroad [OR]
              RewriteCond %{HTTP_USER_AGENT} ^NearSite [OR]
              RewriteCond %{HTTP_USER_AGENT} ^NetAnts [OR]
              RewriteCond %{HTTP_USER_AGENT} ^NetSpider [OR]
              RewriteCond %{HTTP_USER_AGENT} ^Net\ Vampire [OR]
              RewriteCond %{HTTP_USER_AGENT} ^NetZIP [OR]
              RewriteCond %{HTTP_USER_AGENT} ^Octopus [OR]
              RewriteCond %{HTTP_USER_AGENT} ^Offline\ Explorer [OR]
              RewriteCond %{HTTP_USER_AGENT} ^Offline\ Navigator [OR]
              RewriteCond %{HTTP_USER_AGENT} ^PageGrabber [OR]
              RewriteCond %{HTTP_USER_AGENT} ^Papa\ Foto [OR]
              RewriteCond %{HTTP_USER_AGENT} ^pavuk [OR]
              RewriteCond %{HTTP_USER_AGENT} ^pcBrowser [OR]
              RewriteCond %{HTTP_USER_AGENT} ^RealDownload [OR]
              RewriteCond %{HTTP_USER_AGENT} ^ReGet [OR]
              RewriteCond %{HTTP_USER_AGENT} ^SiteSnagger [OR]
              RewriteCond %{HTTP_USER_AGENT} ^SmartDownload [OR]
              RewriteCond %{HTTP_USER_AGENT} ^SuperBot [OR]
              RewriteCond %{HTTP_USER_AGENT} ^SuperHTTP [OR]
              RewriteCond %{HTTP_USER_AGENT} ^Surfbot [OR]
              RewriteCond %{HTTP_USER_AGENT} ^tAkeOut [OR]
              RewriteCond %{HTTP_USER_AGENT} ^Teleport\ Pro [OR]
              RewriteCond %{HTTP_USER_AGENT} ^VoidEYE [OR]
              RewriteCond %{HTTP_USER_AGENT} ^Web\ Image\ Collector [OR]
              RewriteCond %{HTTP_USER_AGENT} ^Web\ Sucker [OR]
              RewriteCond %{HTTP_USER_AGENT} ^WebAuto [OR]
              RewriteCond %{HTTP_USER_AGENT} ^WebCopier [OR]
              RewriteCond %{HTTP_USER_AGENT} ^WebFetch [OR]
              RewriteCond %{HTTP_USER_AGENT} ^WebGo\ IS [OR]
              RewriteCond %{HTTP_USER_AGENT} ^WebLeacher [OR]
              RewriteCond %{HTTP_USER_AGENT} ^WebReaper [OR]
              RewriteCond %{HTTP_USER_AGENT} ^WebSauger [OR]
              RewriteCond %{HTTP_USER_AGENT} ^Website\ eXtractor [OR]
              RewriteCond %{HTTP_USER_AGENT} ^Website\ Quester [OR]
              RewriteCond %{HTTP_USER_AGENT} ^WebStripper [OR]
              RewriteCond %{HTTP_USER_AGENT} ^WebWhacker [OR]
              RewriteCond %{HTTP_USER_AGENT} ^WebZIP [OR]
              RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
              RewriteCond %{HTTP_USER_AGENT} ^Widow [OR]
              RewriteCond %{HTTP_USER_AGENT} ^WWWOFFLE [OR]
              RewriteCond %{HTTP_USER_AGENT} ^Xaldon\ WebSpider [OR]
              RewriteCond %{HTTP_USER_AGENT} ^Zeus
              RewriteRule ^.* - [F,L]

              [F] - 403 Forbidden
              [L] - Last rule (stop processing further rewrite rules)

              8. Change your default directory page

              DirectoryIndex index.html index.php index.cgi index.pl

              9. Redirects

              A single file:

              Redirect /old_dir/old_file.html http://yoursite.com/new_dir/new_file.html

              An entire directory:

              Redirect /old_dir http://yoursite.com/new_dir

              Effect: as if the directory had been moved:

              http://yoursite.com/old_dir -> http://yoursite.com/new_dir
              http://yoursite.com/old_dir/dir1/test.html -> http://yoursite.com/new_dir/dir1/test.html

              Tip: Redirect does not fire under Apache's default user directories. With a user directory such as http://mysite.com/~windix, if you want to redirect http://mysite.com/~windix/jump, you will find that the following Redirect does not work:

              Redirect /jump http://www.google.com

              The correct form is:

              Redirect /~windix/jump http://www.google.com

              (source: .htaccess Redirect in "Sites" not redirecting: why?)

              10. Prevent viewing of .htaccess file

              <Files .htaccess>
              order allow,deny
              deny from all
              </Files>
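
              Applying the same technique to this incident, a minimal .htaccess sketch against the bots seen in the access.log excerpts above might look like the following. It assumes mod_rewrite is enabled and that AllowOverride permits it; the agent names are only an illustration and should be adjusted to what your own logs show:

              RewriteEngine On
              # Match the user agents observed in the log excerpts above (case-insensitive)
              RewriteCond %{HTTP_USER_AGENT} (MJ12bot|AhrefsBot|ltx71) [NC]
              # Return 403 Forbidden and stop processing further rules
              RewriteRule .* - [F,L]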

              3.      Denying access by IP

              You can specify in the Apache configuration file httpd.conf that access from certain IPs should be refused.

              Order allow,deny

              Allow from all

              Deny from 5.9.26.210

              Deny from 162.243.213.131

              However, these IPs are often not fixed, so this approach is inconvenient, and changes to httpd.conf only take effect after restarting Apache. Modifying .htaccess instead is therefore recommended.
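
              As a side note, Order/Allow/Deny is the Apache 2.2 syntax; on Apache 2.4 the equivalent uses Require directives from mod_authz_core, and the same block can be placed in .htaccess (avoiding the restart) provided AllowOverride allows it. A minimal sketch, reusing the two example IPs above:

              <RequireAll>
                  # Allow everyone by default
                  Require all granted
                  # Then exclude the offending addresses
                  Require not ip 5.9.26.210
                  Require not ip 162.243.213.131
              </RequireAll>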
