
Bot Behavior

BOT BEHAVIOR 101

Representatives from Google, Yahoo and Become.com examined problems with search engine spiders, or “bots,” at the Search Engine Strategies Conference, held August 7 – 10 in San Jose, California.

The topic, “bot obedience,” brought up issues related to lost uptime, wasted bandwidth and stolen content, all the result of unscrupulous spiders invading virtual domains. As creative engine marketers know, it’s easy to design a search-friendly site that a search engine spider or bot can read. But what happens when those bots start causing problems?

Jon Glick of the Become.com search engine opened the session with a brief overview of the topic. Glick said, “Robots are great at finding links and content, but they are dumb. They can’t process cookies or JavaScript.” Glick focused on how to create bot-friendly sites using hyperlink navigation, meaning navigation that uses “pure” links with no JavaScript or Flash elements. He hinted that web authors should avoid excessive parameters in dynamic URLs when trying to build bot-friendly sites.
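
To make Glick's point concrete, here is a minimal Python sketch of what a simple bot can and cannot see on a page. It is not how any particular search engine works. The sample page, the `crawlable_links` and `has_excessive_parameters` helpers, and the limit of two query parameters are all assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qsl, urlsplit


class PlainLinkCollector(HTMLParser):
    """Collects href values from <a> tags, roughly as a simple bot would."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Skip javascript: pseudo-links, which a script-blind bot cannot follow.
            if name == "href" and value and not value.lower().startswith("javascript:"):
                self.links.append(value)


def crawlable_links(html):
    """Return the plain links found in an HTML string."""
    collector = PlainLinkCollector()
    collector.feed(html)
    return collector.links


def has_excessive_parameters(url, limit=2):
    """Return True if the URL has more query parameters than `limit` (an arbitrary cutoff)."""
    return len(parse_qsl(urlsplit(url).query)) > limit


# Hypothetical page: one pure link, one script-only link, one onclick handler.
SAMPLE_PAGE = """
<a href="/products/widgets">Widgets</a>
<a href="javascript:openMenu()">Menu</a>
<span onclick="location='/hidden'">Hidden</span>
"""

print(crawlable_links(SAMPLE_PAGE))  # ['/products/widgets']
print(has_excessive_parameters("/item?id=5&cat=3&session=abc&sort=asc"))  # True
```

Only the pure link survives. The script-driven menu and the onclick target are invisible to a bot that does not run JavaScript.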

Dan Thies of SEO Research Labs raised another topic: the problems that duplicate content can cause, a consequence of untamed bots. He referred to mischievous bots that target high-performing web pages and scrape their content, placing it in thousands of other pages all over the web. Not only does this confuse consumers, but the duplicate content is eventually filtered out of the search engine’s main results and reassigned to supplemental results. In extreme circumstances, the content can be booted out of the search engine’s index entirely. Thies noted that he has seen clients’ sites lose “considerable revenue” after they were scraped and unauthorized duplicate content was posted throughout the web. Thies said, “One of my clients saw a drop of 50% in SEO referrals. This would be disastrous for most sites; however, since the client was good at PPC and other online advertising, the effect was noticeable, but not devastating.”
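
A site owner who suspects scraping could compare a suspect page against the original. The sketch below uses word shingles and Jaccard similarity, a common near-duplicate measure. It is an illustration only: the session did not describe how search engines detect duplicates, and the function names and three-word shingle size are assumptions.

```python
def shingles(text, size=3):
    """Return the set of overlapping word sequences of length `size`."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard_similarity(text_a, text_b, size=3):
    """Return shared shingles divided by total distinct shingles (0.0 to 1.0)."""
    set_a, set_b = shingles(text_a, size), shingles(text_b, size)
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


# Hypothetical texts for illustration.
original = "bots can scrape a high performing page and repost it elsewhere"
scraped = "bots can scrape a high performing page and repost it on spam sites"
unrelated = "completely different words appear in this other sentence"

print(jaccard_similarity(original, original))   # 1.0
print(jaccard_similarity(original, unrelated))  # 0.0
print(jaccard_similarity(original, scraped) > 0.5)
```

A score near 1.0 suggests the suspect page is largely a copy. What threshold counts as "scraped" is a judgment call for the site owner.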

Bill Atchison of Crawlwall.com got the audience’s attention when he said, “Most bots steal from me. I’ve had bad bots hit my site so hard, they took it down. Before I took action, my website was under constant attack.” Before he took steps to alleviate the problem, 10% of his site’s page views came from “bad bots,” meaning bots that ignored all instructions from robots.txt files or meta information. “I have had copyrighted material stolen from my web site and my servers have been overloaded and brought down,” said Atchison. “What was happening with the bad bots was unacceptable and had to be stopped.” According to Atchison, the first step toward bot obedience is to differentiate good bots from bad ones. He uses a list of characteristics to identify whether a bot is good or bad. Good Bot characteristics include:

  • Obey Internet standards like robots.txt
  • Do not crawl a server abusively fast
  • Return for fresh content in a reasonable timeframe
  • Provide traffic in return for crawling your site

Bad Bot characteristics include:

  • Go to great lengths to get content
  • Ignore Internet standards like robots.txt
  • Randomly change the user agent to avoid filters
  • Pose as humans (stealth) to completely bypass filters
  • Crawl as fast as possible to avoid being stopped
  • Crawl as slowly as possible to slide under the radar
  • Crawl from as many IPs as possible to avoid detection
  • Return often to get new content and get indexed first
  • Violate copyrights and repackage sites
  • Hijack search engine positions
  • Provide no value in return for crawling
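
Of the traits above, abusively fast crawling is the easiest to spot on the server side. The following Python sketch counts requests per IP address within a sliding time window. The `RateMonitor` class and its thresholds are hypothetical, and, as the list itself warns, bots that crawl slowly or from many IPs would slip past a check like this.

```python
from collections import defaultdict, deque


class RateMonitor:
    """Flags clients that exceed `max_requests` within `window_seconds`."""

    def __init__(self, max_requests=10, window_seconds=1.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client IP -> recent request timestamps

    def record(self, client_ip, timestamp):
        """Record one request; return True if the client is over the limit."""
        hits = self._hits[client_ip]
        hits.append(timestamp)
        # Discard requests that have aged out of the window.
        while hits and timestamp - hits[0] > self.window_seconds:
            hits.popleft()
        return len(hits) > self.max_requests


# Example with a deliberately low limit: 3 requests per second.
monitor = RateMonitor(max_requests=3, window_seconds=1.0)
flags = [monitor.record("203.0.113.5", t) for t in (0.0, 0.1, 0.2, 0.3)]
print(flags)  # [False, False, False, True]
```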

Atchison outlined failed approaches that are commonly used to control rogue bots on websites. One approach, the “opt-out,” uses robots.txt files, meta information and user-agent blacklists to tell bots where they can and cannot go and to keep blacklisted bots away. “This approach fails most of the time, because rogue bots ignore this type of data and mask their true identity by falsifying their user-agent data,” said Atchison. He went on to say that rogue bots can turn the tables on site owners by using the list of prohibited files in robots.txt as a road map to places they want to go. The approach Atchison uses is the “opt-in”: only bots known to be good are allowed into the site, and all other bots are blocked. The approach is tricky, because some bots that provide quality traffic can be blocked from accessing the site, but Atchison believes that the benefits outweigh the risks.
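
The opt-in idea can be sketched in a few lines of Python. This is not Atchison's actual implementation, which the session did not detail. The allowlist, the bot hint words and the `classify_request` function are assumptions. Because user-agent strings can be falsified, a real filter would need more than string matching.

```python
# Assumption: an example allowlist of crawler tokens the site owner trusts.
ALLOWED_BOT_TOKENS = ("googlebot", "slurp")

# Assumption: substrings that suggest a user agent is an automated client.
BOT_HINTS = ("bot", "crawler", "spider", "slurp")


def classify_request(user_agent):
    """Return 'allow-bot', 'allow-visitor' or 'block' for a user-agent string."""
    ua = (user_agent or "").lower()
    if any(token in ua for token in ALLOWED_BOT_TOKENS):
        return "allow-bot"
    # Opt-in: anything that looks automated and is not on the list is blocked.
    if not ua or any(hint in ua for hint in BOT_HINTS):
        return "block"
    return "allow-visitor"


print(classify_request("Mozilla/5.0 (compatible; Googlebot/2.1)"))       # allow-bot
print(classify_request("EvilScraper-bot/1.0"))                           # block
print(classify_request("Mozilla/5.0 (Windows NT 10.0) Firefox/120.0"))   # allow-visitor
```

The trade-off Atchison describes shows up directly: a legitimate crawler missing from `ALLOWED_BOT_TOKENS` is blocked along with the rogue ones.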

Rajat Mukherjee of Yahoo and Vanessa Fox of Google agreed that the best way to control their bots is to use the robots.txt protocol. Mukherjee pointed out that Slurp (the name of Yahoo’s bot) is obedient. “I always recommend reading the information at robotstxt.org in order to control Slurp,” said Mukherjee. He also took time to introduce the revamped Yahoo Site Explorer program, which allows site owners to authenticate their sites and get more detailed information about how Yahoo views them. The program allows users to manage site feeds (other than paid inclusion feeds) and export information in .CSV format. Fox and Mukherjee stated that Yahoo and Google want to be contacted if their bots are causing problems.
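
For readers who have not worked with robots.txt, the Python sketch below shows how an obedient bot reads one, using the standard library's `urllib.robotparser`. The rules and paths are invented for illustration. Only the Slurp name comes from the session.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: Slurp is kept out of /private/ only,
# while every other bot is also kept out of /drafts/.
ROBOTS_TXT = """\
User-agent: Slurp
Disallow: /private/

User-agent: *
Disallow: /private/
Disallow: /drafts/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# An obedient bot asks before fetching each URL.
print(parser.can_fetch("Slurp", "/private/report.html"))      # False
print(parser.can_fetch("Slurp", "/drafts/post.html"))         # True
print(parser.can_fetch("SomeOtherBot", "/drafts/post.html"))  # False
print(parser.can_fetch("SomeOtherBot", "/index.html"))        # True
```

Nothing enforces these answers. The file only works for bots that choose to check it, which is why the panelists drew a line between obedient and rogue bots.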

During the Q&A, one member of the audience suggested that responsible bots identify and implement a program to certify that a bot is actually what its user-agent says it is. This suggestion was met with enthusiasm from all.
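
One technique that addresses this kind of verification, though not one the session described, is a reverse-then-forward DNS check: look up the hostname for the visiting IP, confirm it belongs to the crawler's domain, then confirm that hostname resolves back to the same IP. The Python sketch below assumes that technique. The `verify_bot_host` function, the example domain suffix and the lookup tables are hypothetical, and the lookups are injectable so the example runs without network access.

```python
import socket


def verify_bot_host(ip_address, allowed_suffixes, reverse_lookup=None, forward_lookup=None):
    """Return True if the IP reverse-resolves to an allowed domain and back again."""
    # Default to real DNS lookups; callers may pass substitutes for testing.
    reverse_lookup = reverse_lookup or (lambda ip: socket.gethostbyaddr(ip)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        hostname = reverse_lookup(ip_address).lower().rstrip(".")
        if not hostname.endswith(tuple(allowed_suffixes)):
            return False
        # The forward lookup stops someone who merely controls reverse DNS.
        return ip_address in forward_lookup(hostname)
    except OSError:
        # Covers socket.herror and socket.gaierror (failed lookups).
        return False


# Hypothetical DNS data standing in for real lookups.
REVERSE = {"192.0.2.10": "crawl-1.bot.example.com", "192.0.2.99": "crawl-9.bot.example.com"}
FORWARD = {"crawl-1.bot.example.com": ["192.0.2.10"], "crawl-9.bot.example.com": ["192.0.2.50"]}


def check(ip):
    return verify_bot_host(ip, [".bot.example.com"],
                           reverse_lookup=REVERSE.__getitem__,
                           forward_lookup=FORWARD.__getitem__)


print(check("192.0.2.10"))  # True: reverse and forward lookups agree
print(check("192.0.2.99"))  # False: hostname resolves to a different IP
```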
