
1 Sep 08, 2005 07:22    

I was wondering if in Phoenix you might be able to squeeze in an easier way to update the RSS news aggregators? Right now the different aggregators must be listed in /conf/_stats.php. Is there going to be a way to add any new ones that come around?

2 Sep 08, 2005 16:51

I have recently updated my $user_agents array in conf/_stats.php based on my blog's hit log:

/**
 * UserAgent identifiers for logging/statistics
 *
 * The following substrings will be looked up in the user_agent http header
 *
 * @global array $user_agents
 */
$user_agents = array(
	// Robots:
	array('robot', 'Googlebot/', 'Google (Googlebot)' ),
	array('robot', 'Slurp/', 'Inktomi (Slurp)' ),
	array('robot', 'Yahoo! Slurp;', 'Yahoo (Slurp)' ),
	array('robot', 'Frontier/',	'Userland (Frontier)' ),
	array('robot', 'ping.blo.gs/', 'blo.gs' ),
	array('robot', 'organica/',	'Organica' ),
	array('robot', 'Blogosphere/', 'Blogosphere' ),
	array('robot', 'blogging ecosystem crawler',	'Blogging ecosystem'),
	array('robot', 'FAST-WebCrawler/', 'Fast' ),			// http://fast.no/support/crawler.asp
	array('robot', 'timboBot/', 'Breaking Blogs (timboBot)' ),
	array('robot', 'NITLE Blog Spider/', 'NITLE' ),
	array('robot', 'The World as a Blog ', 'The World as a Blog' ),
	array('robot', 'daypopbot/ ', 'DayPop' ),
	array('robot', 'larbin_', 'larbin' ),
	array('robot', 'msnbot/', 'MSNBot' ),
	array('robot', 'PSBot', 'Picsearch' ),

	// Aggregators:
	array('aggregator', 'Feedreader', 			'Feedreader' ),
	array('aggregator', 'Syndirella/',			'Syndirella' ),
	array('aggregator', 'rssSearch Harvester/', 'rssSearch Harvester' ),
	array('aggregator', 'Newz Crawler',			'Newz Crawler' ),
	array('aggregator', 'MagpieRSS/', 			'Magpie RSS' ),
	array('aggregator', 'CoologFeedSpider', 	'CoologFeedSpider' ),
	array('aggregator', 'Pompos/', 				'Pompos' ),
	array('aggregator', 'SharpReader/',			'SharpReader'),
	array('aggregator', 'Straw ',				'Straw'),
	array('aggregator', 'AppleSyndication/', 	'AppleSyndication'),
	array('aggregator', 'Bloglines/',			'Bloglines'),
	array('aggregator', 'Blogonautes.com',		'Blogonautes'),
	array('aggregator', 'BlogSearch',			'Blogonautes'),
	array('aggregator', 'Blogslive',			'Blogslive'),
	array('aggregator', 'FeedOnFeeds',			'FeedOnFeeds'),
	array('aggregator', 'Feedreader',			'Feedreader'),
	array('aggregator', 'Feedster Crawler/',	'Feedster'),
	array('aggregator', 'metaRSS/',				'metaRSS'),
	array('aggregator', 'Moreoverbot/',			'More Over'),
	array('aggregator', 'NetNewsWire/',			'NetNewsWire'),
	array('aggregator', 'PubSub-RSS-Reader/',	'PubSub'),
	array('aggregator', 'RSS Xpress',			'RSS Xpress'),
	array('aggregator', 'SharpReader/',			'Sharp Reader'),
	array('aggregator', 'Shrook/',				'Shrook'),
	array('aggregator', 'Technoratibot/',		'Technorati'),
	array('aggregator', 'topicblogs/',			'Topic Blogs'),
	array('aggregator', 'UniversalFeedParser/',	'Universal Feed Parser'),
	array('aggregator', 'YahooFeedSeeker/',		'Yahoo Feed Seeker'),
	array('aggregator', 'YahooSeeker/',			'Yahoo Seeker'),
);


However, it appears the backoffice user agent summary groups the robots and aggregators by the full user agent string and not by its short version. After updating your $user_agents array, you are going to see several entries for Yahoo Feed Seeker and the like, since each version has its own user agent string...
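To illustrate what grouping by the short version could look like, here is a minimal sketch. It is not b2evolution code: short_agent_name() is a hypothetical helper, the sample hits are made up, and it assumes the first matching substring in the array wins.

```php
<?php
// Hypothetical helper (not part of b2evolution): map a full User-Agent
// header to the short name taken from a $user_agents-style array.
function short_agent_name($full_agent, $user_agents)
{
	foreach ($user_agents as $entry) {
		// $entry = array(type, substring to look for, short name)
		if (strpos($full_agent, $entry[1]) !== false) {
			return $entry[2]; // assumption: first match wins
		}
	}
	return $full_agent; // unknown agent: fall back to the full string
}

// Trimmed-down copy of the array above, for illustration only.
$user_agents = array(
	array('robot', 'Googlebot/', 'Google (Googlebot)'),
	array('aggregator', 'YahooFeedSeeker/', 'Yahoo Feed Seeker'),
);

// Made-up sample hits; real strings would come from the hitlog table.
$hits = array(
	'YahooFeedSeeker/1.0 (compatible; Mozilla 4.0; MSIE 5.5)',
	'YahooFeedSeeker/2.0 (compatible; Mozilla 4.0; MSIE 5.5)',
	'Googlebot/2.1 (+http://www.google.com/bot.html)',
);

// Count hits per short name instead of per full user agent string.
$counts = array();
foreach ($hits as $agent) {
	$name = short_agent_name($agent, $user_agents);
	$counts[$name] = isset($counts[$name]) ? $counts[$name] + 1 : 1;
}
print_r($counts);
```

With this kind of lookup both Yahoo Feed Seeker versions end up in a single row of the summary, at the cost of running the substring search over every hit.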

It might be possible to group all the user agents by their short user agent version, as well as to update that list automatically. However, that implies the user agent version becomes a database table of its own. That also implies refactoring the whole hit log management, since that table grows a lot when you have a popular blog. My blogs' hitlog table is 13 MB and it keeps only the last 7 days of visits!

It would be very interesting to split that hitlog table into several smaller ones:

  • a hitlog table (referencing the following tables to make it shorter, avoiding redundant information);

  • a referrer table (referencing a new base domain referrer table);

  • a user agent table (referencing a new short description user agent table).

In addition to that, the hit_remote_addr entry of the hitlog table might become a 32-bit or 128-bit numeric entry, making it shorter and quicker to manipulate.
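As a rough sketch of the numeric address idea: PHP's ip2long() turns a dotted IPv4 address into a 32-bit number, and inet_pton() packs an IPv6 address into 16 bytes (128 bits). The ip_to_number() wrapper is a hypothetical name, and how the value would actually be stored in the hitlog table is left open.

```php
<?php
// Hypothetical helper: convert a dotted-quad IPv4 address into an
// unsigned 32-bit number, returned as a string of digits.
function ip_to_number($ip)
{
	$n = ip2long($ip);
	if ($n === false) {
		return false; // not a valid IPv4 address
	}
	// ip2long() can return a negative value on 32-bit builds;
	// '%u' formats it as an unsigned number.
	return sprintf('%u', $n);
}

echo ip_to_number('192.168.0.1'), "\n"; // 3232235521

// IPv6 addresses do not fit in 32 bits; inet_pton() gives the
// 16-byte (128-bit) packed form instead.
$packed = inet_pton('::1');
echo strlen($packed), " bytes\n";
```

The numeric form takes 4 bytes per IPv4 address instead of up to 15 characters of text, which is where the saving would come from.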

However, even if those modifications would be useful, other features would be more useful still.

3 Sep 08, 2005 17:37

Thanks. I sort of figured there were more aggregators out there than the few listed. I'm not sure which forum it belongs in (probably not this one), but since I've always been on top of logging robots as robots to not muck up my "heavy hitters" section I can post a very full list of robots to add to that array. BTW since I don't care what they really are I often fill the text side with "something new on DATE". All I care is that robots get logged as robots. Anyway I'll post my list somewhere, and maybe with some helpful SQL strings for those who want to make their hitlog table accurate.

I guess I should be wondering if some of the things I'm calling robots were actually aggregators. Probably I would have noticed if a large number of "direct" hits were logged against xml-type pages? Hmmm... Since those are the only ones I would have corrected, I probably have lots of hits from aggregators that aren't counted in the "by aggregator" section.

AARGH! Too much thinking!!!

