1 Dec 08, 2009 11:42    

I'm trying to understand how the stats pages work, particularly around the user agents tab, though I'm assuming it reflects the other tabs too. At the moment, I'm seeing the following stats on my blog:

Agent Type: Agents Defined / Agents with Hits Recorded Against Them

Browsers: 2107/241
RSS: 531/222
Robots: 82/5
Unknowns: 324/115
Total: 3044/583

Given the above, I'm trying to understand:

1. How are user agents defined? Is this a static or database driven list?

2. Under what circumstances is the user agent list amended?

3. Why would some user agents appear under more than one filter? For example, Mozilla/5.0 (Twiceler-0.9 http://www.cuil.com/twiceler/robot.html) appears on both the robots and unknown lists.

4. Recognising that this question is rather dependent on 1 & 2 above: why do the user agent categories display "agent signatures" when they have no hits against them? Can this be put to any useful purpose?

5. At the moment, my _stats.php file lists around 95 agent definitions for browsers, feeds and robots. Even taking into consideration that agents are constantly changing, this is significantly short of what the stats themselves are showing: around 3% of the agents defined and 16% of those with hits recorded against them. Is there any significance in what I'm seeing here? For example, does the stats data need a clear-out, or is _stats.php significantly short of all the definitions it needs?

6. Is there a need for _stats.php to be drawing on a dynamic data source, rather than its current hard-coded list? I understand that may have an impact on server resources, though how much of an impact I wouldn't hazard a guess.

7. In the event that 6 was possible and then developed, would it be possible to use this to blacklist undesirables in the same way as can be done for spam commenters or banned words?

8. Am I asking too many daft questions?

Any thoughts or background on this little lot would be much appreciated.

2 Dec 08, 2009 18:19

Chris,

Quite frankly, the User Agents tab is a half-baked feature that was started and never really finished, because we realized along the way that the original design was flawed. This tab will probably go away in version 4, as part of removing code that slows down every page display (trying to match the user agent with something that's already in the database).

1. How are user agents defined? Is this a static or database driven list?

Database.

2. Under what circumstances is the user agent list amended?

Every time a new user agent is seen, it's added to the database. This results in a huge list that serves no real purpose.

3. Why would some user agents appear under more than one filter? For example, Mozilla/5.0 (Twiceler-0.9 http://www.cuil.com/twiceler/robot.html) appears on both the robots and unknown lists.

Maybe you added cuil.com to the robots definitions after some hits were already logged as browsers.

4. Recognising that this question is rather dependent on 1 & 2 above: why do the user agent categories display "agent signatures" when they have no hits against them? Can this be put to any useful purpose?

Because the stats have been pruned to keep only, for example, the last 15 days, and there have been no more hits by those agents since then.

5. At the moment, my _stats.php file lists around 95 agent definitions for browsers, feeds and robots. Even taking into consideration that agents are constantly changing, this is significantly short of what the stats themselves are showing: around 3% of the agents defined and 16% of those with hits recorded against them. Is there any significance in what I'm seeing here? For example, does the stats data need a clear-out, or is _stats.php significantly short of all the definitions it needs?

I'm not sure I understand your question here, but it doesn't really matter. The agents table is definitely flawed.

6. Is there a need for _stats.php to be drawing on a dynamic data source, rather than its current hard-coded list? I understand that may have an impact on server resources, though how much of an impact I wouldn't hazard a guess.

I'm not sure I understand this either.

7. In the event that 6 was possible and then developed, would it be possible to use this to blacklist undesirables in the same way as can be done for spam commenters or banned words?

eh...

8. Am I asking too many daft questions?

Nope :)

The conclusion is: there should be some serious cleaning up of this feature involved in b2evo v4.
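
To illustrate the answer to question 3, here is a minimal sketch of how a user agent string can end up under two types when the definitions change between hits. It is written in Python rather than b2evolution's PHP, and it is not the actual b2evo matching code: the `classify` function, the substring-matching rule, and the signature lists are all assumptions made up for the example.

```python
def classify(user_agent, definitions):
    """Return the first agent type whose signature occurs in the user agent.

    Hypothetical matching rule: case-insensitive substring search.
    """
    ua = user_agent.lower()
    for agent_type, signatures in definitions.items():
        if any(sig.lower() in ua for sig in signatures):
            return agent_type
    return "unknown"


twiceler = "Mozilla/5.0 (Twiceler-0.9 http://www.cuil.com/twiceler/robot.html)"

# Made-up definitions standing in for the hard-coded list in _stats.php.
old_definitions = {
    "rss": ["feedreader"],
    "robot": ["googlebot"],
    "browser": ["firefox", "msie"],
}
# The same list after a (hypothetical) Twiceler signature has been added.
new_definitions = {**old_definitions, "robot": ["googlebot", "twiceler"]}

before = classify(twiceler, old_definitions)  # logged before the definition existed
after = classify(twiceler, new_definitions)   # logged after it was added

print(before, after)
```

Hits logged before the signature was added keep their old type, so the same string shows up under both filters until the older hits are pruned.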

3 Dec 08, 2009 19:40

Thanks for the reply François. After your answers to 1-4, the rest of the questions became increasingly meaningless.

Simply put, I'd like to be able to blacklist some of the spambots in much the same way as any other sort of spam can be - recognising of course that this is targeting a rather different entity for attention.

I think I'm just trying to see a way round playing with config files and .htaccess to lock out some of the more persistent pests I'm seeing.

Thanks for letting me know which way things are heading.

4 Dec 08, 2009 21:31

We originally wanted to use the useragent for filtering too. However we found out that the useragent is becoming more and more meaningless.

At this time, the hardest thing to spoof is the IP address. Unfortunately, it's also pretty hard to have lists of clean / grey / black IPs :p

Work in progress, though...
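
As a rough illustration of the clean / grey / black idea, here is a sketch in Python of checking a visitor's IP against lists of networks. This is not b2evolution code: the `ip_status` function and the list contents are hypothetical, and the addresses are reserved documentation ranges rather than real offenders.

```python
import ipaddress

# Hypothetical lists; the hard part in practice is building and maintaining them.
BLACK = [ipaddress.ip_network("203.0.113.0/24")]
GREY = [ipaddress.ip_network("198.51.100.0/24")]


def ip_status(ip):
    """Return 'black', 'grey' or 'clean' for a dotted-quad IP string."""
    addr = ipaddress.ip_address(ip)
    if any(addr in net for net in BLACK):
        return "black"
    if any(addr in net for net in GREY):
        return "grey"
    return "clean"
```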

5 Dec 08, 2009 23:44

OK, the user agents table is now officially dead in b2evo v4/cvs.

The user agent type is now saved directly into the hitlog table.

hit_agent_type ENUM('rss','robot','browser','unknown') DEFAULT 'unknown' NOT NULL

I expect this to speed up logging as well as stats queries. The only loss is the useragents tab... which was basically useless :p
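
A small sketch of what storing the type on each hit makes possible, using Python's built-in sqlite3 module instead of MySQL. Only `hit_agent_type` and its four values come from the definition above; the table name `hitlog`, the `hit_ID` column and the `log_hit` helper are assumptions for the example, and a CHECK constraint stands in for the ENUM because SQLite has no ENUM type.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical, cut-down hit log: one row per hit, type stored on the row.
conn.execute(
    """
    CREATE TABLE hitlog (
        hit_ID INTEGER PRIMARY KEY,
        hit_agent_type TEXT NOT NULL DEFAULT 'unknown'
            CHECK (hit_agent_type IN ('rss', 'robot', 'browser', 'unknown'))
    )
    """
)


def log_hit(agent_type=None):
    """Insert one hit; with no type given, the column default applies."""
    if agent_type is None:
        conn.execute("INSERT INTO hitlog DEFAULT VALUES")
    else:
        conn.execute(
            "INSERT INTO hitlog (hit_agent_type) VALUES (?)", (agent_type,)
        )


for hit_type in ("browser", "browser", "rss", "robot", None):
    log_hit(hit_type)

# Per-type totals need no join against a separate user agents table.
counts = dict(
    conn.execute(
        "SELECT hit_agent_type, COUNT(*) FROM hitlog GROUP BY hit_agent_type"
    )
)
print(counts)
```

Logging becomes a single insert with no lookup of the agent string, and the per-type stats are a plain GROUP BY on the hit log.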

6 Dec 09, 2009 07:34

Sounds good. I'll be looking forward to this. It'll be interesting to see whether/how this can be used beyond the purely statistical purpose.

7 Dec 09, 2009 20:30

Btw, everything currently called "stats" I'm planning on renaming to "Analytics". Does that make sense to you?

8 Dec 09, 2009 23:06

I can live with it, though I do wonder whether it implies something beyond what's available at the moment - bear in mind I'm on 2.4.7 though, so I don't know whether 3.x and therefore v4 is taking it beyond that.
