Cuil’s Twiceler Bot is a bad apple.
If you haven’t heard of Cuil, odds are you soon will. They are a new search engine that has just launched with a massive index of over 120 billion pages, and they are being touted as the new Google. But those of us who are familiar with Cuil’s spider, the Twiceler bot, know this guy is one bad boy trained not to play fair with the rest of us.
This bot began crawling one of my web sites around a year ago. At first I was happy that it was indexing the 300,000-plus pages on the site so quickly. But then it went out of control and began hacking URLs. My log files began to look like I was being attacked by a script kiddie desperately searching for some kind of security vulnerability. Eventually Cuil’s bot crashed one of my web servers several times with certain malformed requests. The server runs Windows 2003 SP2 on dual Intel P4 3.2 GHz processors with 4 GB of memory; I point this out only to show that it has the hardware to support some traffic.
Cuil’s Twiceler bot would not obey my robots.txt file. I tried to make it go away by sending it blank responses with 404 (Page Not Found), 500 (Internal Server Error), and even 403 (Access Denied) status codes, but it ignored all of them.
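For comparison, this is roughly the check a well-behaved crawler runs before every request. It is only a minimal sketch using Python’s standard `urllib.robotparser`; the robots.txt rules shown are a hypothetical example, not the actual file from this site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that shuts one crawler out of the whole site.
ROBOTS_TXT = """\
User-agent: Twiceler
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "http://blog.alexanderhiggins.com/topics/blogging"

# A compliant crawler asks this question first and skips the URL on False.
twiceler_allowed = parser.can_fetch("Twiceler", url)
other_allowed = parser.can_fetch("SomeOtherBot", url)
```

With these example rules, `twiceler_allowed` comes out `False` while a crawler with no matching rule is still allowed. Nothing enforces the file, though: it only works when the crawler chooses to honor it.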
I then banned the Cuil spider’s IP address. Then it started using different IPs and cloaked its identity by not sending its usual User-Agent header (Mozilla/5.0+(Twiceler-0.9+http://www.cuill.com/twiceler/robot.html)).
However, I could still tell it was the same spider by the way it made its requests and by its systematic search of my site for directories that simply did not exist. Basically, it would find a link to a given URL, for example http://blog.alexanderhiggins.com/topics/blogging, and would begin hacking the URL into pieces, looking for hidden directories. Using the above URL as an example, it would begin crawling the /topics directory like this:
/topics/b
/topics/bl
/topics/blo
/topics/blog
etc…
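The probing pattern above is regular enough to recognize in code. Below is a minimal sketch, assuming the site can supply a set of its real paths; the function names and the `known_paths` set are hypothetical and are not taken from the code actually running on this server.

```python
def truncated_prefixes(path):
    """Yield each proper prefix of the last path segment: /topics/b, /topics/bl, ..."""
    parent, _, leaf = path.rstrip("/").rpartition("/")
    for i in range(1, len(leaf)):
        yield f"{parent}/{leaf[:i]}"


def is_truncated_probe(requested, known_paths):
    """Return True if `requested` is not a real path but is a real path cut short
    somewhere inside its final segment."""
    requested = requested.rstrip("/")
    if requested in known_paths:
        return False
    return any(
        known.startswith(requested) and "/" not in known[len(requested):]
        for known in known_paths
    )


# Hypothetical set of paths that really exist on the site.
known_paths = {"/topics", "/topics/blogging"}
probes = list(truncated_prefixes("/topics/blogging"))
```

Here `probes` begins with the same four paths listed above, and `is_truncated_probe` flags each of them, as well as chopped-up versions of `/topics` itself such as `/top`. A request for an unrelated missing page such as `/about` is not flagged, so ordinary 404s from broken links are left alone.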
Eventually, it would repeat the process by chopping up /topics itself. This bot was so bad that I had to programmatically listen for malformed requests and ban its IP address on the fly to prevent it from crashing my server again. I also programmed an email alert to notify me whenever this guy came by to say hello and started misbehaving.
I still get several email alerts every day listing “Banned IPs” from this bot maliciously trying to brute-force my server. The conditions I set include too many requests for bad pages and a massive number of requests within a given period of time (like a DoS attack). Apparently, I am only one of several webmasters who have had to write custom code to block Cuil’s Twiceler bot because it doesn’t listen to their robots.txt files either.
Just take a look at my IIS log files showing Cuil’s Twiceler hacking at my server despite being sent 403 Access Denied errors.
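The two ban conditions can be sketched as a sliding-window counter per IP address. This is an illustration only, not the code running on my server: the class name, the thresholds, and the choice to count any 4xx/5xx response as a "bad page" are all assumptions.

```python
from collections import defaultdict, deque


class BanTracker:
    """Ban an IP that sends too many bad requests, or too many requests overall,
    within a sliding time window. All thresholds are hypothetical defaults."""

    def __init__(self, max_bad=20, max_requests=100, window_seconds=60.0):
        self.max_bad = max_bad
        self.max_requests = max_requests
        self.window = window_seconds
        self.requests = defaultdict(deque)  # ip -> timestamps of all requests
        self.bad = defaultdict(deque)       # ip -> timestamps of 4xx/5xx responses
        self.banned = set()

    def _trim(self, stamps, now):
        # Drop timestamps that have fallen out of the window.
        while stamps and now - stamps[0] > self.window:
            stamps.popleft()

    def record(self, ip, status, now):
        """Record one request; return True if the IP is banned afterwards."""
        if ip in self.banned:
            return True
        hits = self.requests[ip]
        hits.append(now)
        self._trim(hits, now)
        errors = self.bad[ip]
        if status >= 400:
            errors.append(now)
        self._trim(errors, now)
        if len(errors) > self.max_bad or len(hits) > self.max_requests:
            self.banned.add(ip)  # an email alert could be sent at this point
        return ip in self.banned
```

A request handler would call `record(ip, status, time.time())` once per request and refuse to serve an address as soon as it returns `True`. Keeping the windows per IP means a slow, polite visitor who hits the occasional broken link never gets near the limits.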

[Image: IIS log file]

July 30th, 2008 at 2:49 am
This just gets better…
I searched for twiceler on cuil.com, and among the results are, obviously, several forums and blogs with users complaining about the bot. Its index even includes a 403 Access Denied error page from a government website; see: http://xxx.lanl.gov/denied.html
Their results are full of duplicate content, sometimes 4 or 5 results from the same domain, and I have even seen results showing different sections of the exact same page.
They are obviously doing absolutely no spam filtering whatsoever, and they even crawl links on HTTP web proxy servers… what idiots. And check this page out: http://www.cuil.com/search?q=alex%20hggins&sl=long. I intentionally mistyped my name, and what do I get? A bunch of spam. I could never imagine getting such results from Google. What’s even sadder is that the 2 spammers in the results each get three different listings for duplicated pages with misspellings in the URL.
July 30th, 2008 at 6:30 pm
[...] The Twiceler Bot is a bad apple. [...]
July 31st, 2008 at 4:38 pm
I second that notion… I even contacted the Cuil tech people and they shrugged me off… see my post here: How CUIL Lost Me as a Customer. What a lousy spider.
August 1st, 2008 at 1:24 am
Jazzy, great post about CUIL. Also props for FlixPulse.com. Using twitter and an email spam linguistics algorithm to create movie reviews…. Absolute Genius!!
August 1st, 2008 at 8:08 am
[...] Their spider, the “Twiceler Bot”,
August 3rd, 2008 at 11:09 am
Too bad for a search engine. It should be banned.
August 3rd, 2008 at 7:59 pm
I totally agree. My logs actually show a fair amount of traffic for people looking to prevent this bot from crawling their site. I guess they are still up to their old tactics.
August 12th, 2008 at 9:54 pm
I found your site on technorati and read a few of your other posts. Keep up the good work. I just added your RSS feed to my Google News Reader. Looking forward to reading more from you down the road!
September 4th, 2008 at 2:31 am
[...] the largest index in the world, it was hurting site performance, sucking up bandwidth, and causing crashes. Some people were seeing such a huge number of visits that it almost looked like a DoS attack, [...]
September 12th, 2008 at 10:57 pm
Well, really, the logs don’t show any sort of desperate script kiddie attack. The logs I have seen in blog posts just look like the bot doesn’t have a good way of telling whether it is on one of those sites that return content for any URL you put in, or on a search engine. New bot stuff will need to get ironed out. Not listening to robots.txt is a very bad thing for a crawler to do. But the rest doesn’t really seem so bad.
September 15th, 2008 at 5:24 pm
The logs do show script kiddie stuff, but the point of the logs I posted was to show that it still continued to crawl URLs after being sent a 403 Access Denied error.
Look at the logs again. Thousands of requests to urls that didn’t exist. It was putting together random strings. If I hadn’t modified the pages to send 403 errors they would have all been 404 page not found errors.
And the rate of the requests was absolutely absurd.
September 24th, 2008 at 10:48 pm
Reading between the lines, it’s quite obvious Twiceler’s screwy design wasn’t an accident. You have an Irish techie who convinced VCs to dump 33 million in small change into his concept, which was based mainly on its massive index size. But guess what: when the VC auditors came to see the index size for real, Mr. Tom had to fill it with something. So he came up with a spur-of-the-moment idea to run the system dictionary against every site’s web directory, and presto, Cuil has now generated millions if not billions of indexes to unique IP addresses. Of course they are all 40(1,2,3) errors, but who cares, since he was selling index size to his VCs, not content. He passes the auditors’ test, they open up the bank account, and it’s muffins and chocolates forever (or at least until the money runs out and the VCs try to sell a goldmine full of fool’s gold).