If you haven’t heard of Cuil, odds are you soon will.  They are a new search engine that has just launched with a massive index of over 120 billion pages.  They are being touted as the new Google.  But those of us who are familiar with Cuil’s spider, the Twiceler bot, know this guy is one bad boy trained not to play fair with the rest of us.

This bot began crawling one of my web sites around a year ago.  At first I was happy that it was indexing the 300,000-plus pages on the site so quickly.  But then it went out of control and began hacking URLs.  My log files began to look like I was being attacked by a script kiddie desperately searching for some kind of security vulnerability.  Eventually, certain malformed requests from Cuil’s bot crashed one of my web servers several times.  That server runs Windows 2003 SP2 on dual Intel P4 3.2 GHz processors with 4 GB of memory (just pointing out that it has the hardware to support some traffic).

Cuil’s Twiceler bot would not obey my robots.txt file.  I tried to make it go away by sending it blank responses with 404 (Page Not Found), 500 (Internal Server Error), and even 403 (Access Denied) status codes, but it ignored them all.
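For anyone unsure what "obeying robots.txt" means in practice, here is a minimal sketch in Python of the check a well-behaved crawler is expected to make before every request. The robots.txt rules below are an assumption for illustration (my actual file is not reproduced here), and `is_allowed` is a hypothetical helper name; the parser itself is Python's standard `urllib.robotparser`.

```python
from urllib.robotparser import RobotFileParser

# Assumed example rules: ban the Twiceler bot from the whole site.
ROBOTS_TXT = """\
User-agent: Twiceler
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(user_agent: str, url: str) -> bool:
    """Return True if robots.txt lets this user agent fetch the URL."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "http://blog.alexanderhiggins.com/topics/blogging"
    for agent in ("Twiceler", "Googlebot"):
        print(agent, "allowed:", is_allowed(agent, url))
```

A polite crawler simply skips any URL for which this check fails. Nothing on the server side enforces it, which is why a bot that ignores the file has to be blocked some other way.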

I then banned the Cuil spider’s IP address.  Then it started using different IPs and cloaked its identity by not sending its usual User-Agent header (Mozilla/5.0+(Twiceler-0.9+http://www.cuill.com/twiceler/robot.html)).

However, I could still tell it was the same bot by the way it made its requests and by its systematic search of my site for directories that just did not exist.  Basically, it would find a link to a given URL, for example http://blog.alexanderhiggins.com/topics/blogging, and would begin hacking the URL into different parts looking for hidden directories.  Using the above URL as an example, it would begin crawling the /topics directory like this:

/topics/b
/topics/bl
/topics/blo
/topics/blog
etc…

Eventually, it would repeat the process, chopping up /topics itself.  This bot was so bad that I had to programmatically listen for malformed requests and ban its IP address on the fly to prevent it from crashing my server again.  I programmed an email alert to notify me when this guy came by to say hello and started misbehaving.
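To give an idea of how this kind of URL chopping can be spotted, here is a rough Python sketch. It is not the code running on my server; `is_truncation_probe` and `KNOWN_PATHS` are hypothetical names, and it assumes you have a list of the paths that really exist on the site. A request is flagged when it is a real path cut off in the middle of a segment.

```python
from typing import Iterable

# Assumed example: the set of paths that actually exist on the site.
KNOWN_PATHS = {"/topics/blogging"}


def is_truncation_probe(requested: str, known_paths: Iterable[str]) -> bool:
    """Return True if `requested` looks like a real path chopped mid-segment."""
    known = set(known_paths)
    if requested in known:
        return False  # a legitimate page
    for path in known:
        # Only strict prefixes of a real path are interesting.
        if not path.startswith(requested) or len(requested) >= len(path):
            continue
        # A cut that lands on a "/" boundary is an ordinary parent
        # directory (e.g. /topics or /topics/), not a chopped segment.
        if requested.endswith("/") or path[len(requested)] == "/":
            continue
        return True
    return False


if __name__ == "__main__":
    for candidate in ("/topics/b", "/topics/blo", "/top", "/topics", "/about"):
        print(candidate, is_truncation_probe(candidate, KNOWN_PATHS))
```

A single hit can be an innocent typo, so a check like this is best used to count strikes against an IP rather than to ban on the first match.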

I still get several email alerts on a daily basis with “Banned IPs” from this bot maliciously trying to brute-force my server.  The conditions I set include too many requests for bad pages and massive amounts of requests within a given period of time (like a DoS attack).  Apparently, I am only one of several webmasters who have had to write custom code to block Cuil’s Twiceler bot because it doesn’t listen to their robots.txt files either.
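The two conditions above can be sketched as a small sliding-window counter in Python. This is an illustration under my own assumptions, not my production code: the `BanTracker` class, the 60-second window, and the thresholds are all made-up values, and "bad page" is taken to mean any response with a status of 400 or higher.

```python
from collections import defaultdict, deque


class BanTracker:
    """Ban an IP that sends too many requests, or too many bad ones,
    within a sliding time window. All thresholds are assumed values."""

    def __init__(self, window_seconds=60.0, max_requests=100, max_bad_requests=20):
        self.window_seconds = window_seconds
        self.max_requests = max_requests
        self.max_bad_requests = max_bad_requests
        self._hits = defaultdict(deque)  # ip -> deque of (timestamp, is_bad)
        self.banned = set()

    def record(self, ip, status, now):
        """Log one request; return True if the IP is (now) banned."""
        if ip in self.banned:
            return True
        hits = self._hits[ip]
        hits.append((now, status >= 400))
        # Drop requests that have fallen out of the window.
        while hits and now - hits[0][0] > self.window_seconds:
            hits.popleft()
        bad = sum(1 for _, is_bad in hits if is_bad)
        if len(hits) > self.max_requests or bad > self.max_bad_requests:
            self.banned.add(ip)
            del self._hits[ip]
            return True
        return False


if __name__ == "__main__":
    tracker = BanTracker(max_requests=5, max_bad_requests=2)
    # 203.0.113.5 is a documentation-only address used as a stand-in.
    for second in range(4):
        print(second, tracker.record("203.0.113.5", 404, float(second)))
```

On a real server, `record` would be called once per request with the client IP, the response status, and the current time, and a ban would be the point at which to send the email alert.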

Just take a look at my IIS log files showing Cuil’s Twiceler hacking at my server despite being sent 403 Access Denied errors.

Cuil's Twiceler Bot Maliciously Indexes Content, Disrespecting Webmaster Guidelines
