
1 Sep 16, 2005 19:58    

http://forums.b2evolution.net/viewtopic.php?t=5362 offers an updated aggregator list. Since I use my heavy hitters hack, I don't like unidentified robots incrementing my hit counter, so every now and then I find one and add it to my list. Since I'm in the middle of something entirely unrelated, I figured now was a good time to finally grab kwa's aggregator list and offer up my robots list.

$user_agents = array(
// Robots:
array('robot', 'aggregator:Rojo', 'aggregator Rojo (http://rojo.com)' ),
array('robot', 'Baiduspider', 'Chinese Search Engine' ),
array('robot', 'blogging ecosystem crawler', 'Blogging ecosystem'),
array('robot', 'Blogosphere/', 'Blogosphere' ),
array('robot', 'BlogPulse', 'BlogPulse - ISSpider 3.0' ),
array('robot', 'Blogshares Spiders', 'Blogshares Spiders Wolferized V1.39' ),
array('robot', 'blogsnowbot', 'blogsnow.com' ),
array('robot', 'ConveraCrawler/', 'ConveraCrawler 0.5' ),
array('robot', 'daypopbot/ ', 'DayPop' ),
array('robot', 'FAST-WebCrawler/', 'Fast' ),
array('robot', 'Frontier/', 'Userland (Frontier)' ),
array('robot', 'Gigabot/', 'Gigabot - gigablast.com' ),
array('robot', 'Googlebot/', 'Google (Googlebot)' ),
array('robot', 'heritrix/', 'Heritrix / crawler.archive.org' ),
array('robot', 'ia_archiver', 'ia_archiver' ),
array('robot', 'Jakarta Commons-HttpClient/3.0-rc3', 'Jakarta Commons' ),
array('robot', 'Java/1.4', 'Java 1.4 blah blah' ),
array('robot', 'Java/1.5', 'Java 1.5 blah blah' ),
array('robot', 'larbin_2.6.3', 'larbin_2.6.3 - bogus crap?' ),
array('robot', 'libwww-perl/5.76', 'bogus perl 5.76' ),
array('robot', 'Ask Jeeves/Teoma', 'Ask Jeeves/Teoma' ),
array('robot', 'Mozilla/4.0 (0000000000; 0000 000', 'Mozilla 0000000000' ),
array('robot', 'Mozilla/4.0 (compatible; grub-client', 'Mozilla grub client' ),
array('robot', 'Mozilla/4.0 (compatible; Lotus-Notes', 'Mozilla Lotus Notes' ),
array('robot', 'msnbot/', 'MSN (msnbot)' ),
array('robot', 'NITLE Blog Spider/', 'NITLE' ),
array('robot', 'NP/0.1', 'NameProtect - nameprotect.com' ),
array('robot', 'organica/','Organica' ),
array('robot', 'PHP/4.0.5RC1', 'PHP something or other' ),
array('robot', 'ping.blo.gs/', 'blo.gs' ),
array('robot', 'Slurp/', 'Inktomi (Slurp)' ),
array('robot', 'Syndic8/', 'syndic8.com' ),
array('robot', 'Technoratibot/', 'Technorati' ),
array('robot', 'The World as a Blog ', 'The World as a Blog' ),
array('robot', 'timboBot/', 'Breaking Blogs (timboBot)' ),
array('robot', 'TurnitinBot/', 'TurnitinBot - turnitin.com' ),
array('robot', 'W3C_CSS_Validator', 'CSS Validator' ),
array('robot', 'W3C_Validator/1.', 'XHTML Validator' ),
array('robot', 'Waypath development crawler', 'Waypath (a blog thing)' ),
array('robot', 'www.a2b.cc', 'Location Search' ),
array('robot', 'Yahoo! Slurp;', 'Yahoo (Slurp)' ),
array('robot', 'ZyBorg/1.0', 'ZyBorg (wisenutbot)' ),
array('robot', 'NG/2.0', 'exalead POS' ),
array('robot', 'http://career.drecom.jp/bot.html', 'drecom something or other' ),
array('robot', 'ichiro/1.0 (ichiro@nttr.co.jp)', 'ichiro - nttr.co.jp' ),
array('robot', 'Mozilla(IE Compatible)', 'something new 2005-08-10' ),
array('robot', 'PipeLine Spider', 'something new 2005-08-17' ),
array('robot', 'DTAAgent', 'something new 2005-08-18' ),
array('robot', 'Missigua Locator', 'something new 2005-08-19' ),
array('robot', 'ISC Systems', 'something new 2005-08-20' ),
array('robot', 'Nutch', 'something new 2005-09-16' ),
// Aggregators:


As you can see, towards the end of the list I stopped caring about the second part of each entry (the description) - I just noted the date when I found it.
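
For anyone wondering how a list like this gets used, here is a minimal sketch. It assumes each entry's second field is matched as a case-sensitive substring of the visitor's User-Agent header; that is my assumption, not a copy of b2evolution's own code. The function name find_robot and the $sample_agents excerpt are made up for the example.

```php
<?php
// Hypothetical helper, not part of b2evolution: returns the first entry
// whose signature appears in the User-Agent string, or null if none does.
function find_robot(string $agent, array $user_agents): ?array
{
    foreach ($user_agents as $entry) {
        list($type, $signature, $label) = $entry;
        // Assumption: signatures are plain, case-sensitive substrings.
        if ($type === 'robot' && strpos($agent, $signature) !== false) {
            return $entry;
        }
    }
    return null;
}

// Three entries copied from the list above, just for the demo.
$sample_agents = array(
    array('robot', 'Googlebot/', 'Google (Googlebot)'),
    array('robot', 'msnbot/', 'MSN (msnbot)'),
    array('robot', 'Nutch', 'something new 2005-09-16'),
);

$agent = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)';
$hit = find_robot($agent, $sample_agents);

// A matched robot is skipped; anything else would count as a real hit.
echo $hit === null ? "count the hit\n" : "skip: {$hit[2]}\n";
```

Because this is a plain substring test, trailing slashes and spaces in a signature matter: 'daypopbot/ ' only matches if the agent string really has that space after the slash.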

YMMV...

2 Apr 24, 2006 11:11

EdB,

Have you added the Google Mediabot to this list? Since the Google Bigdaddy update, the Mediabot has taken a bigger role in crawling sites. If not, would you mind posting your latest list?

Thanks,

Jeff

