Paola Kathuria
2006
Log analysers are not accurate. They over-report visits and over-count
some browsers while under-counting other browsers. They cannot accurately
distinguish spiders and robots from human visitors and they do not use
fool-proof techniques for counting visits and visitors.
Spiders and robots are programs which are sent out to web sites
to index them, check links or fetch content.
Your web browser is also a program: spiders and robots are
essentially no different to a web browser in terms of
what they can do. And, as it happens, some browsers can be set up
to send out link-checking robots.
There are thousands of spiders and robots visiting web sites. We've
been compiling a database of them for the past 10 years. We use it
to filter out spiders and robots from our own site stats.
A staggering 90% of page requests to this web site (www.limov.com), for
example, are by spiders and robots.
There are a handful of legitimate spiders but many are programs
created to harvest e-mail addresses, copy content or to look for
vulnerabilities in your web server. It is in the originator's interests
that their programs look like regular visitors so that they gain
full access to your site.
I use examples from actual logs in this article, mostly from
this web site.
Spiders visit web sites and follow links on a page, normally to collect
content so that it can be indexed by search engines. They can also
collect specific content such as e-mail addresses, images, and PDFs.
Robot is a term I use to describe programs which make
single-hit visits, often hitting the same page at regular intervals.
Robots don't follow links.
You don't need to be a computer expert to have your own spider.
Source code is freely available online.
I will use 'bot' for the remainder of this article to refer to both
spiders and robots.
In this section, I'll be covering server log structure, how bots
are supposed to identify themselves and how you can find rogue bots
in logs. If you know all this, skip to the next
section.
You need to know the difference between requests and hits to be able
to interpret web logs and stats. Requests refer to pages. Most
web pages include a mixture of text and images. The images are included
in the page as links to files on the server. If a web page includes
10 graphics, accessing the page will result in 1 request and 11 hits,
with one log line per hit.
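The distinction can be sketched in a few lines of Python. This is only an illustration: the list of file extensions treated as supporting files is my assumption, and the function names are made up for this example.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit

# Assumption: these extensions mark supporting files rather than pages.
ASSET_EXTENSIONS = {".css", ".gif", ".jpg", ".jpeg", ".png", ".ico", ".js"}

def is_page_request(url: str) -> bool:
    """Return True if the URL looks like a page rather than a supporting file."""
    path = urlsplit(url).path  # drops any ?query part
    return PurePosixPath(path).suffix.lower() not in ASSET_EXTENSIONS

def count_hits_and_requests(urls):
    """Every logged URL is a hit; only page-like URLs count as requests."""
    hits = len(urls)
    requests = sum(1 for url in urls if is_page_request(url))
    return hits, requests

urls = ["/colour/", "/css/screen.css", "/images/g-libr.jpg", "/p.gif"]
print(count_hits_and_requests(urls))  # (4, 1)
```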
Every hit made to a web site is logged. A hit can vary from finding out
whether a page or file has been updated to fetching web pages, style sheets,
images and other files, such as PDFs.
Failed requests are also logged, for example when a page has been
removed or when it's password-protected. Failed hits also include
hacking attempts to invoke vulnerabilities in (mostly) Microsoft Windows
servers.
This information is logged for each hit:
- IP address or host - where the request comes from
- username - the username of an authenticated user (via .htaccess)
- date/time - date and time of access
- request - request type (GET / POST / HEAD)
- URL - what was requested
- version - HTTP version
- status code - success/failure code
- size - number of bytes downloaded
- referrer - URL of referring page
- user-agent - how the browser identifies itself
'Agent' is a term used for tools sent out to act on your behalf.
Browsers and bots are agents.
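As a rough sketch, the fields listed above can be pulled out of a log line with a regular expression. The pattern assumes the common "combined" log layout; your server may be configured differently, and the names LOG_PATTERN and parse_line are invented for this example.

```python
import re

# Assumption: the server writes the widely used "combined" log layout.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<version>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line: str):
    """Return a dict of named fields, or None if the line doesn't match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

sample = (
    '255.60.45.22 - - [04/Mar/2006:00:40:59 +0000] '
    '"GET /css/screen.css HTTP/1.1" 200 6600 '
    '"http://www.limov.com/colour/" '
    '"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) '
    'Gecko/20060111 Firefox/1.5.0.1"'
)
fields = parse_line(sample)
print(fields["host"], fields["url"], fields["status"], fields["size"])
```

All values come back as strings; the size may be "-" when no bytes were sent.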
Here are actual log lines resulting from a visitor displaying one web
page, the Colour Selector entrance page (I've changed the IP address):
- 255.60.45.22 - - [04/Mar/2006:00:40:56 +0000] "GET /colour/ HTTP/1.1" 302 - "http://www.google.de/search?hl=de&client=firefox-a&rls=org.mozilla:en-US:official&q=color+scheme+library&spell=1" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1"
- 255.60.145.22 - - [04/Mar/2006:00:40:57 +0000] "GET /colour/?ID=WQVZHB7F7N30D00 HTTP/1.1" 302 - "http://www.google.de/search?hl=de&client=firefox-a&rls=org.mozilla:en-US:official&q=color+scheme+library&spell=1" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1"
- 255.60.45.22 - - [04/Mar/2006:00:40:59 +0000] "GET /colour/ HTTP/1.1" 200 - "http://www.google.de/search?hl=de&client=firefox-a&rls=org.mozilla:en-US:official&q=color+scheme+library&spell=1" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1"
- 255.60.45.22 - - [04/Mar/2006:00:40:59 +0000] "GET /css/screen.css HTTP/1.1" 200 6600 "http://www.limov.com/colour/" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1"
- 255.60.45.22 - - [04/Mar/2006:00:40:59 +0000] "GET /images/icons/colour-favicon.ico HTTP/1.1" 200 318 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1"
- 255.60.45.22 - - [04/Mar/2006:00:40:59 +0000] "GET /css/screen-libr.css HTTP/1.1" 200 1087 "http://www.limov.com/colour/" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1"
- 255.60.45.22 - - [04/Mar/2006:00:40:59 +0000] "GET /css/screen-nav.css HTTP/1.1" 200 1560 "http://www.limov.com/colour/" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1"
- 255.60.45.22 - - [04/Mar/2006:00:40:59 +0000] "GET /css/print.css HTTP/1.1" 200 2167 "http://www.limov.com/colour/" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1"
- 255.60.45.22 - - [04/Mar/2006:00:40:59 +0000] "GET /images/g-libr.jpg HTTP/1.1" 200 4841 "http://www.limov.com/colour/" "Mozilla/5.0 (Windows; U;Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1"
- 255.60.45.22 - - [04/Mar/2006:00:40:59 +0000] "GET /images/l-limov.gif HTTP/1.1" 200 3613 "http://www.limov.com/colour/" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1"
- 255.60.45.22 - - [04/Mar/2006:00:40:59 +0000] "GET /p.gif HTTP/1.1" 200 49 "http://www.limov.com/colour/" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1"
- 255.60.45.22 - - [04/Mar/2006:00:40:59 +0000] "GET /images/tb-serv.gif HTTP/1.1" 200 751 "http://www.limov.com/css/screen-nav.css" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1"
- 255.60.45.22 - - [04/Mar/2006:00:40:59 +0000] "GET /images/tb-port.gif HTTP/1.1" 200 565 "http://www.limov.com/css/screen-nav.css" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1"
- 255.60.45.22 - - [04/Mar/2006:00:40:59 +0000] "GET /images/tb-abou.gif HTTP/1.1" 200 950 "http://www.limov.com/css/screen-nav.css" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1"
- 255.60.45.22 - - [04/Mar/2006:00:40:59 +0000] "GET /images/tb-cont.gif HTTP/1.1" 200 932 "http://www.limov.com/css/screen-nav.css" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1"
- 255.60.45.22 - - [04/Mar/2006:00:41:00 +0000] "GET /images/colour/p-my-yc-cm.gif HTTP/1.1" 200 2280 "http://www.limov.com/colour/" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1"
- 255.60.45.22 - - [04/Mar/2006:00:41:00 +0000] "GET /images/colour/b00-cs.gif HTTP/1.1" 200 106 "http://www.limov.com/colour/" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1"
- 255.60.45.22 - - [04/Mar/2006:00:41:00 +0000] "GET /images/colour/b00-nv.gif HTTP/1.1" 200 70 "http://www.limov.com/colour/" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1"
- 255.60.45.22 - - [04/Mar/2006:00:41:00 +0000] "GET /images/colour/b00-sw.gif HTTP/1.1" 200 133 "http://www.limov.com/colour/" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1"
- 255.60.45.22 - - [04/Mar/2006:00:41:00 +0000] "GET /images/colour/b00-bg.gif HTTP/1.1" 200 877 "http://www.limov.com/colour/" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1"
- By the user-agent
- If it fetches robots.txt
- By its behaviour
Spiders and robots usually identify themselves in the user-agent.
However, there is no standard text that they're supposed to add to
the user-agent (such as "I'm a spider") so that they can be found.
This means that there is no automatic way that programs which process
logs to produce stats can detect them. A list of spider and robot
user-agents and IP addresses must be maintained on an on-going basis
so that these visitors are not included in your regular web site stats.
This does not happen automatically.
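A hand-maintained filter of this kind might look like the following sketch. The marker list is a tiny hypothetical sample built from bot names that appear in this article; a real list would be far longer and needs regular updating.

```python
# Hypothetical, hand-maintained list of substrings seen in bot user-agents.
KNOWN_BOT_MARKERS = ["googlebot", "baiduspider", "linkwalker"]

def is_known_bot(user_agent: str, markers=KNOWN_BOT_MARKERS) -> bool:
    """Case-insensitive substring match against the maintained list."""
    agent = user_agent.lower()
    return any(marker in agent for marker in markers)

print(is_known_bot("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"))
print(is_known_bot("Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1"))
```

A bot that sends a browser-like user-agent passes straight through a filter like this, which is why the list alone is not enough.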
Web developers can put instructions in text files for spiders to
make some parts of the site off limits. This could be for performance
reasons. The instructions are put on the server in a file called
robots.txt. Spiders and robots are supposed to read this
file at the start of each visit but there's no way to enforce that they
do. Most bots ignore the file.
However, if a visitor does access robots.txt, it's most likely
a spider.
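That heuristic is easy to sketch. The function below assumes the log has already been reduced to (host, URL) pairs; the function name is made up for this example.

```python
def hosts_fetching_robots_txt(records):
    """records: iterable of (host, url) pairs. Return hosts that asked for robots.txt."""
    return {host for host, url in records if url.split("?", 1)[0] == "/robots.txt"}

records = [
    ("66.249.71.53", "/robots.txt"),
    ("66.249.71.53", "/contact.lml"),
    ("202.108.22.72", "/"),
]
print(hosts_fetching_robots_txt(records))  # {'66.249.71.53'}
```

It only flags the host that made the robots.txt request. A spider that spreads its requests over several IP addresses will still slip through under its other addresses.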
A visit from Google's Googlebot spider. It fetches robots.txt and has a helpful user-agent. Notice how the IP address and user-agent change.
host | date/time | requested file | user-agent |
66.249.71.53 | 05/Mar/2006 @ 00:29:14 | /robots.txt | Googlebot/2.1 (+http://www.google.com/bot.html) |
66.249.71.53 | 05/Mar/2006 @ 00:29:15 | /contact.lml | Googlebot/2.1 (+http://www.google.com/bot.html) |
66.249.65.5 | 05/Mar/2006 @ 00:29:28 | /projects.lml?w=0&wo=1&p=bbr-5 | Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) |
66.249.65.5 | 05/Mar/2006 @ 00:34:33 | /other-work.lml?p=disney-1 | Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) |
66.249.71.69 | 05/Mar/2006 @ 00:39:31 | /projects.lml?w=0&wo=1&p=oup-6 | Googlebot/2.1 (+http://www.google.com/bot.html) |
66.249.64.42 | 05/Mar/2006 @ 01:06:50 | /other-work.lml?p=hp-1 | Googlebot/2.1 (+http://www.google.com/bot.html) |
66.249.71.32 | 05/Mar/2006 @ 01:20:26 | /projects.lml?p=crash-1 | Googlebot/2.1 (+http://www.google.com/bot.html) |
66.249.65.5 | 05/Mar/2006 @ 01:56:53 | / | Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) |
66.249.64.30 | 05/Mar/2006 @ 02:18:11 | /projects.lml?p=whm-1 | Googlebot/2.1 (+http://www.google.com/bot.html) |
66.249.71.45 | 05/Mar/2006 @ 02:19:14 | /projects.lml?p=bbr-4 | Googlebot/2.1 (+http://www.google.com/bot.html) |
A visit from Baiduspider. It usefully identifies itself in the user-agent. It accesses our home page but doesn't fetch robots.txt.
host | date/time | requested file | user-agent |
202.108.22.72 | 05/Mar/2006 @ 03:30:17 | / | Baiduspider+(+http://www.baidu.com/search/spider.htm) |
202.108.22.72 | 05/Mar/2006 @ 07:32:58 | / | Baiduspider+(+http://www.baidu.com/search/spider.htm) |
202.108.22.72 | 05/Mar/2006 @ 12:10:02 | / | Baiduspider+(+http://www.baidu.com/search/spider.htm) |
There are long gaps between visits. If Baiduspider hasn't been added
to your web site stat program's filter file for spiders (assuming one
exists), then this spider's visits will show up as regular one-page
visits in your web site stats.
Next is a suspect series of visits. The log lines appear together
in an uninterrupted block in the log file. The log lines of human
visitors are usually interleaved as people take longer between
requests compared to spiders.
The accesses shown below are from different IP addresses but they
all refer to the same session id. The session id is a unique visitor
id we add to the URL if we can't put it in a cookie. The IP addresses
are from different countries.
A suspect visit: same session id but each IP address is in
a different country. Stylesheets and images are only fetched for one page.
host | date/time | requested file | gap |
213.61.13.68 | 5/Mar/2006 @ 04:37:56 | /?ID=X46ZB0H3N5C00B4 | |
213.61.13.68 | 5/Mar/2006 @ 04:37:57 | /whatsnew.lml?ID=X46ZB0H3N5C00B4 | 1s |
213.61.13.68 | 5/Mar/2006 @ 04:37:58 | /projects.lml?ID=X46ZB0H3N5C00B4 | 1s |
213.61.13.68 | 5/Mar/2006 @ 04:38:01 | /journal/?ID=X46ZB0H3N5C00B4 | 3s |
221.45.136.41 | 5/Mar/2006 @ 04:38:06 | /description.lml?sm=1&w=8&ID=X46ZB0H3N5C00B4 | 5s |
196.40.26.246 | 5/Mar/2006 @ 04:38:11 | /contents.lml?ID=X46ZB0H3N5C00B4 | 5s |
192.138.77.36 | 5/Mar/2006 @ 04:38:12 | /preferences.lml?ID=X46ZB0H3N5C00B4 | 1s |
192.138.77.36 | 5/Mar/2006 @ 04:38:13 | /ico/pref-favicon.ico |
192.138.77.36 | 5/Mar/2006 @ 04:38:13 | /css/screen-nav.css |
192.138.77.36 | 5/Mar/2006 @ 04:38:13 | /css/screen-site.css |
192.138.77.36 | 5/Mar/2006 @ 04:38:13 | /css/print.css |
192.138.77.36 | 5/Mar/2006 @ 04:38:13 | /p.gif |
192.138.77.36 | 5/Mar/2006 @ 04:38:13 | /images/l-limov.gif |
192.138.77.36 | 5/Mar/2006 @ 04:38:13 | /css/screen.css |
192.138.77.36 | 5/Mar/2006 @ 04:38:13 | /images/p-os-01.gif |
192.138.77.36 | 5/Mar/2006 @ 04:38:13 | /images/site-offsite.gif |
192.138.77.36 | 5/Mar/2006 @ 04:38:13 | /images/p-os-2.gif |
192.138.77.36 | 5/Mar/2006 @ 04:38:13 | /images/p-os-1.gif |
192.138.77.36 | 5/Mar/2006 @ 04:38:13 | /images/p-fs-s.gif |
192.138.77.36 | 5/Mar/2006 @ 04:38:13 | /images/p-lh-s.gif |
192.138.77.36 | 5/Mar/2006 @ 04:38:13 | /images/p-fs-l.gif |
192.138.77.36 | 5/Mar/2006 @ 04:38:13 | /images/p-lh-t.gif |
195.113.171.76 | 5/Mar/2006 @ 04:38:15 | /colour/tips.lml?ID=X46ZB0H3N5C00B4 | 3s |
203.148.194.131 | 5/Mar/2006 @ 04:38:28 | /contact.lml?ID=X46ZB0H3N5C00B4 | 13s |
211.106.21.155 | 5/Mar/2006 @ 04:39:20 | /description.lml?sm=1&w=1&ID=X46ZB0H3N5C00B4 | 52s |
216.41.76.34 | 5/Mar/2006 @ 04:39:40 | /projects.lml?w=0&p=oup-7&ID=X46ZB0H3N5C00B4 | 20s |
219.24.170.3 | 5/Mar/2006 @ 04:39:49 | /projects.lml?wm=1&p=oup-7&ID=X46ZB0H3N5C00B4 | 9s |
220.84.214.190 | 5/Mar/2006 @ 04:39:59 | /about.lml?ID=X46ZB0H3N5C00B4 | 10s |
Is it a coincidence that different people in different countries
happened to visit this web site using the same session id in the URL
within a few seconds of each other, each only fetching web pages and
not the images and style sheets?
I'd say this was a spider. There is nothing in the host or user-agent
information which allows us to recognise it. Only its odd behaviour
gives it away.
To filter out this visitor from my custom stats in future, I have
to block it by the session id and/or all the IP addresses it used.
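One way to look for this pattern is to list every session id that turns up from more than one IP address. The sketch below assumes the session id travels in an ID query parameter, as it does on this site; the function names and the max_ips threshold are assumptions for illustration.

```python
from collections import defaultdict
from urllib.parse import parse_qs, urlsplit

def ips_per_session(records):
    """Map each session id (the ID query parameter) to the set of IPs that used it."""
    seen = defaultdict(set)
    for ip, url in records:
        ids = parse_qs(urlsplit(url).query).get("ID")
        if ids:
            seen[ids[0]].add(ip)
    return seen

def suspect_sessions(records, max_ips=1):
    """Sessions seen from more IP addresses than max_ips (an assumed threshold)."""
    return {sid: ips for sid, ips in ips_per_session(records).items()
            if len(ips) > max_ips}

records = [
    ("213.61.13.68", "/?ID=X46ZB0H3N5C00B4"),
    ("221.45.136.41", "/description.lml?sm=1&w=8&ID=X46ZB0H3N5C00B4"),
    ("196.40.26.246", "/contents.lml?ID=X46ZB0H3N5C00B4"),
]
print(suspect_sessions(records))
```

Human visitors behind proxies can also change IP address during a visit, so a match is a prompt to inspect the log lines rather than proof of a spider.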
In addition to using algorithms to process standard server logs, people
can develop custom logs with extra information. They track visitors by
putting a generated unique session id in the URL or writing it to a cookie.
The id is read back at every request so that the request can be logged
against the session id.
If you don't use session ids, you can make some guesses about which
requests are from the same visitor by looking at the server logs.
- The same host IP address in a short period
- The referring page is from another web site
- The user-agent looks like a browser
However, any of these might change within a visit.
This is what I've found from reviewing server logs regularly.
Spiders can accept cookies and allow a site to read them back
on subsequent visits.
Actual accesses from the same IP address, recognised as a repeat
visitor because a cookie with a visit count was being maintained.
visit count | IP | date/time | requested file | user-agent | referring page |
1 | 209.167.50.22 | 21-Oct-2005 @ 15:19:34 | / | LinkWalker | www.emlc.org.uk/Links.htm |
2 | 209.167.50.22 | 24-Oct-2005 @ 12:22:36 | / | LinkWalker | www.emlc.org.uk/Links.htm |
3 | 209.167.50.22 | 25-Oct-2005 @ 16:09:39 | / | LinkWalker | www.emlc.org.uk/Links.htm |
1 | 209.167.50.22 | 26-Oct-2005 @ 14:05:53 | / | LinkWalker | www.emlc.org.uk/Links.htm |
| 209.167.50.22 | 27-Oct-2005 @ 15:19:10 | / | LinkWalker | www.emlc.org.uk/Links.htm |
1 | 209.167.50.22 | 28-Oct-2005 @ 14:57:17 | / | LinkWalker | www.emlc.org.uk/Links.htm |
2 | 209.167.50.22 | 31-Oct-2005 @ 12:26:23 | / | LinkWalker | www.emlc.org.uk/Links.htm |
3 | 209.167.50.22 | 01-Nov-2005 @ 12:57:48 | / | LinkWalker | www.emlc.org.uk/Links.htm |
1 | 209.167.50.22 | 02-Nov-2005 @ 14:34:34 | / | LinkWalker | www.emlc.org.uk/Links.htm |
When examining raw logs, it is common to see a single visit in which
each page access is from a different host. This is how visitors appear
in logs when their connection is via a caching proxy.
Here is a visitor to the Colour Selector who made 17 page requests
from ten different hosts during a single three-minute visit.
Most log analysers will interpret these page requests as ten separate
visitors.
Log lines from a visitor showing up from a variety of hosts
host address | date/time | requested file | referring page |
anchovy.ulcc.wwwcache.ja.net | 10-Aug-2002 @ 14:35:56 | /colour/colour.html | - |
mozzarella.ulcc.wwwcache.ja.net | 10-Aug-2002 @ 14:36:17 | /colour/216.html | /colour/colour.html |
ham.ulcc.wwwcache.ja.net | 10-Aug-2002 @ 14:36:46 | /colour/216/33ccff.html | /colour/216.html |
anchovy.ulcc.wwwcache.ja.net | 10-Aug-2002 @ 14:37:24 | /colour/216/3399ff.html | /colour/216.html |
fides.ulcc.wwwcache.ja.net | 10-Aug-2002 @ 14:37:30 | /colour/216/33ffff.html | /colour/216.html |
pineapple.ulcc.wwwcache.ja.net | 10-Aug-2002 @ 14:37:35 | /colour/216/66ffff.html | /colour/216.html |
thyme.cant.ac.uk | 10-Aug-2002 @ 14:37:40 | /colour/216/66ccff.html | /colour/216.html |
basil.ulcc.wwwcache.ja.net | 10-Aug-2002 @ 14:37:45 | /colour/216/6699ff.html | /colour/216.html |
tomato.ulcc.wwwcache.ja.net | 10-Aug-2002 @ 14:37:49 | /colour/216/0099ff.html | /colour/216.html |
anchovy.ulcc.wwwcache.ja.net | 10-Aug-2002 @ 14:38:04 | /colour/216/ffccff.html | /colour/216.html |
mozzarella.ulcc.wwwcache.ja.net | 10-Aug-2002 @ 14:38:12 | /colour/216/ffcc33.html | /colour/216.html |
anchovy.ulcc.wwwcache.ja.net | 10-Aug-2002 @ 14:38:17 | /colour/216/6600ff.html | /colour/216.html |
thyme.cant.ac.uk | 10-Aug-2002 @ 14:38:27 | /colour/216/ccffcc.html | /colour/216.html |
mozzarella.ulcc.wwwcache.ja.net | 10-Aug-2002 @ 14:38:33 | /colour/216/ccff66.html | /colour/216.html |
tomato.ulcc.wwwcache.ja.net | 10-Aug-2002 @ 14:38:37 | /colour/216/ccffff.html | /colour/216.html |
jalapeno.ulcc.wwwcache.ja.net | 10-Aug-2002 @ 14:38:52 | /colour/216/ffffff.html | /colour/216.html |
oregano.ulcc.wwwcache.ja.net | 10-Aug-2002 @ 14:39:03 | /colour/216bg.html | /colour/216.html |
This second example is from an AOL user. Four images were viewed
during a visit, each from a different host (and IP) address.
A visitor where the host address is different for every
access. The source log lines of this visit were logged on the same day.
host address | time | requested file | user-agent |
cache-mtc-aa09.proxy.aol.com | 00:24:13 | /workshops/14th/lindsay.jpg | Mozilla/4.0 (compatible; MSIE 6.0; AOL 7.0; Windows NT 5.1; .NET CLR 1.0.3705) |
cache-mtc-ak07.proxy.aol.com | 00:24:38 | /workshops/14th/frank-lindsay.jpg | Mozilla/4.0 (compatible; MSIE 6.0; AOL 7.0; Windows NT 5.1; .NET CLR 1.0.3705) |
cache-mtc-am07.proxy.aol.com | 00:25:27 | /workshops/14th/rosa2-2002-06-18.jpg | Mozilla/4.0 (compatible; MSIE 6.0; AOL 7.0; Windows NT 5.1; .NET CLR 1.0.3705) |
cache-mtc-ak03.proxy.aol.com | 00:25:38 | /workshops/14th/lindsay-size.jpg | Mozilla/4.0 (compatible; MSIE 6.0; AOL 7.0; Windows NT 5.1; .NET CLR 1.0.3705) |
And, because of caching proxies or people who use services such
as AOL, different visitors can look like the same visitor if you look
only at their host address.
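The effect on visitor counts can be shown with a small sketch. The host names are taken from the caching-proxy example; the session id "S1" is hypothetical, standing in for whatever custom visitor id a site might log.

```python
def count_visitors(records, key):
    """Count distinct visitors, where `key` names the identifying field."""
    return len({record[key] for record in records})

# Hosts are from the ja.net example; the session id "S1" is hypothetical.
records = [
    {"host": "anchovy.ulcc.wwwcache.ja.net", "session": "S1"},
    {"host": "mozzarella.ulcc.wwwcache.ja.net", "session": "S1"},
    {"host": "ham.ulcc.wwwcache.ja.net", "session": "S1"},
    {"host": "anchovy.ulcc.wwwcache.ja.net", "session": "S1"},
]
print(count_visitors(records, "host"))     # 3
print(count_visitors(records, "session"))  # 1
```

Counting by host turns one visitor into three; the same approach would merge several different visitors who happen to share one proxy host.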
The referring URL is information that is sometimes made
available to the web server. It is the URL of the page which included
a link to your web site which was followed by the visitor (for example,
a page on your site might get listed in a search engine).
For the first page request of a visit, the referring URL will tell
you the address of the page on another web site from where a link
was followed. For subsequent pages, referrers will be pages within
the site.
It isn't always the case that the referring site will only appear as
the referrer for the first page in a visit because people use
the browser's Back button a lot.
You might expect log lines for a visit to follow the trend of each
page being the referrer of the next page accessed. The log lines of a
made-up visit are shown next.
Example log of a possible visit, where each page
is the referrer of the next page requested
time | requested file | referring page |
12:44:05 | /colour/ | http://www.site.com/links.html |
12:47:00 | /colour/tips.lml | /colour/ |
12:47:16 | /colour/colour.lml | /colour/tips.lml |
12:47:29 | /colour/browse-palettes.lml | /colour/colour.lml |
12:50:05 | /library/ | /colour/browse-palettes.lml |
12:50:12 | /projects.lml | /library/ |
12:50:24 | /colour/ | /projects.lml |
12:50:32 | /services.lml | /colour/ |
12:50:45 | /colour/colour.lml | /services.lml |
12:51:10 | /colour/tools.lml | /colour/colour.lml |
However, when you look at your web server logs, you will see
that people manage to get to pages from a page which doesn't include
any links to the new page.
That is because they got to the new page from a link on an
earlier page that was cached. When a visitor uses their browser
to go back to a cached earlier page, the page isn't logged at
the web server; the redisplay of a cached page means the browser gets
the page from cache and not from the server.
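To see how often this happens in a visit, each request can be labelled by where its referrer sits in the visit so far. This is a minimal sketch; the short visit used here is illustrative and the function name and labels are my own.

```python
def classify_referrers(visit):
    """visit: list of (page, referrer) pairs in time order.

    Label each request:
      'previous' - referrer is the page requested immediately before
      'earlier'  - referrer is some older page in the same visit
      'external' - referrer was not requested in this visit
    """
    labels = []
    seen = []
    for page, referrer in visit:
        if seen and referrer == seen[-1]:
            labels.append("previous")
        elif referrer in seen:
            labels.append("earlier")
        else:
            labels.append("external")
        seen.append(page)
    return labels

visit = [
    ("/colour/", "http://www.site.com/links.html"),
    ("/colour/tips.lml", "/colour/"),
    ("/colour/colour.lml", "/colour/"),
    ("/colour/browse-palettes.lml", "/colour/"),
]
print(classify_referrers(visit))
# ['external', 'previous', 'earlier', 'earlier']
```

Every 'earlier' label marks a request made from a page the browser redisplayed from its cache without the server being asked.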
A visit in which the referring page can be a page earlier
than the last page requested. The source log lines for this visit were logged
on the same date, with the same session id and user-agent.
time | requested file | referring page |
12:44:05 | /colour/ | http://www.web-graphics.com/feature-002.php |
12:47:00 | /colour/tips.lml | /colour/ |
12:47:16 | /colour/colour.lml | /colour/ |
12:47:29 | /colour/browse-palettes.lml | /colour/ |
12:50:05 | /library/ | /colour/browse-palettes.lml |
12:50:12 | /projects.lml | /colour/colour.lml |
12:50:24 | /colour/ | /library/ |
12:50:32 | /services.lml | /projects.lml |
12:50:45 | /colour/colour.lml | /colour/ |
12:51:10 | /colour/tools.lml | /colour/colour.lml |
Given this, it is possible that someone can reach your site from
a link on another site, explore your site for a bit but then return to
the entry page through the browser's Back button.
In this event, the first page request in the visit would have a
referrer of another web site. Subsequent pages would have internal
referrers but then the original outside referrer would reappear in the
logs.
A visit in which the referring site reappears as
the referrer for a later page. This was logged on the same day, with
the same session id and user-agent.
time | requested file | referring page |
14:57:15 | /colour/ | http://uk.google.yahoo.com/bin/query_uk?p=216+colours |
15:06:55 | /library/ | /colour/ |
15:07:06 | /projects.lml | /library/ |
15:07:50 | /services.lml | /projects.lml |
15:07:58 | /colour/ | http://uk.google.yahoo.com/bin/query_uk?p=216+colours |
15:08:14 | /colour/tips.lml | /colour/ |
Another scenario that explains such a visit pattern is when
a web site is accessed through multiple browsers. Because the visitor
can choose between earlier and later pages on screen, the log shows
no coherent path through the site.
In addition, anecdotal evidence suggests that people with screen
resolutions higher than 800x600 browse with multiple windows. In the case
of a site like this, where the session id is carried around in the URL,
this behaviour becomes apparent when the log includes visits from the
same referrer and from the same host address and user-agent.
The source log lines of this visit were logged on
the same day and with the same user-agent: Mozilla/4.0 (compatible;
MSIE 5.5; Windows 98; Win 9x 4.90)
session id | time | page requested | referring page |
7D9JHV75DE9H9G6 | 15:06:48 | /inetuk/notice.lml | - |
GEL8G594EHL5SH4 | 15:07:06 | /inetuk/notice.lml | - |
| 15:07:55 | /ico/hide1-favicon.ico | - |
GEL8G594EHL5SH4 | 15:08:05 | /inetuk/links.lml | /inetuk/notice.lml |
GEL8G594EHL5SH4 | 15:08:21 | /services.lml | /inetuk/links.lml |
7D9JHV75DE9H9G6 | 15:08:28 | /inetuk/about.lml | /inetuk/notice.lml |
GEL8G594EHL5SH4 | 15:08:48 | /projects.lml | /services.lml |
7D9JHV75DE9H9G6 | 15:11:04 | /projects.lml | /inetuk/about.lml |
| 15:11:14 | /ico/port-favicon.ico | - |
GEL8G594EHL5SH4 | 15:11:43 | /projects.lml | /services.lml |
GEL8G594EHL5SH4 | 15:12:17 | /projects.lml?s=t | /projects.lml |
7D9JHV75DE9H9G6 | 15:12:22 | /inetuk/about.lml | /inetuk/notice.lml |
7D9JHV75DE9H9G6 | 15:14:00 | /inetuk/notice.lml | /inetuk/about.lml |
It is possible for a different referring site to appear
during a visit.
The next visit is of a visitor to this site via Google but with a
different referring site for the sixth page request.
A visit with a referring site for the first page access but
a different referring site later on in the visit. This was logged on the
same day, with the same session id and user-agent.
time | page requested | referring page |
01:17:36 | /colour/ | http://www.google.com/search?q=color+palettes |
01:18:07 | /library/guidelines.lml | /colour/ |
01:18:39 | /library/promotion.lml | /library/guidelines.lml |
01:18:43 | /journal/ | /library/promotion.lml |
01:22:12 | /library/promotion.lml | /library/guidelines.lml |
01:22:27 | /journal/ | http://www.thestudyofdesign.com/links_magazines_l.asp |
During their visit, they created a link to our site from theirs
(complete with the session id in the URL) and then presumably tested the
link which explains the appearance of the second outside referrer.
You can't rely on time between requests to decide if it's a new visit.
This is because people might start something at work in the afternoon -
go home without closing their browser - then come back and expect to
carry on with whatever's in their browser. In this event, a gap between
page requests could easily be 17 hours. It is not uncommon to see gaps
of an hour or two in logs.
If accesses from the same host have a gap of more than 30 mins,
WebTrends counts them as coming from different visitors.
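A timeout rule of this kind can be sketched as follows. This is my own illustration of the general technique, not WebTrends' actual code; the request times are illustrative and the function names are made up.

```python
from datetime import datetime, timedelta

def split_into_visits(times, timeout=timedelta(minutes=30)):
    """Group request times into visits, starting a new one after a gap > timeout."""
    visits = []
    for t in sorted(times):
        if visits and t - visits[-1][-1] <= timeout:
            visits[-1].append(t)
        else:
            visits.append([t])
    return visits

def parse_time(hms: str) -> datetime:
    return datetime.strptime(hms, "%H:%M:%S")

# Illustrative times from one session, with a gap of 34 min 33 s in the middle.
times = [parse_time(t) for t in ["10:42:12", "10:42:55", "11:17:28", "11:18:01"]]
visits = split_into_visits(times)
print(len(visits))  # 2
```

One person who paused for just over half an hour is reported as two.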
A visit which includes several long gaps. This was logged
on the same date, with the same session id and user-agent.
time | gap | requested file | referring page |
10:41:13 | | /colour/ | http://www.google.com/search?q=color+selector |
10:41:23 | 0:00:10 | /colour/mix.lml?c=9CF | /colour/ |
10:41:41 | 0:00:18 | /colour/mix.lml?c=3CF | /colour/mix.lml?c=9CF |
10:41:48 | 0:00:07 | /colour/mix.lml?c=6FF | /colour/mix.lml?c=3CF |
10:41:55 | 0:00:07 | /colour/mix.lml?c=0FF | /colour/mix.lml?c=6FF |
10:42:01 | 0:00:06 | /colour/mix.lml?c=F93 | /colour/mix.lml?c=0FF |
10:42:12 | 0:00:11 | /colour/mix.lml?c=FC6 | /colour/mix.lml?c=F93 |
10:42:55 | 0:00:43 | /colour/mix.lml?c=F66 | /colour/mix.lml?c=FC6 |
11:17:28 | 0:34:33 | /colour/mix.lml?c=F63 | /colour/mix.lml?c=F66 |
11:18:01 | 0:00:33 | /colour/mix.lml?c=F60 | /colour/mix.lml?c=F63 |
50 page requests not shown - gap range: 2 secs - 11 mins (average: 1 min) |
12:18:08 | 0:00:09 | /colour/swatch.lml?c=3F9 | /colour/swatch.lml?c=6F6 |
12:42:41 | 0:24:33 | /colour/swatch.lml?c=3F6 | /colour/swatch.lml?c=3F9 |
12:42:47 | 0:00:06 | /colour/swatch.lml?c=3F3 | /colour/swatch.lml?c=3F6 |
12:47:42 | 0:04:55 | /colour/swatch.lml?c=3C3 | /colour/swatch.lml?c=3F3 |
12:48:39 | 0:00:57 | /colour/swatch.lml?c=3C6 | /colour/swatch.lml?c=3C3 |
It is possible for the user-agent to change within a visit. When
it happens, it's usually a robot visitor but it can also happen with
human visitors.
The log lines below are from a 183-page visit from the same IP address.
This can be recognised as a spider by the quick requests in a short
space of time.
A single visit in which the user-agent changes.
host IP | date/time | requested file | user-agent |
63.144.65.58 | 18/Apr/2001 @ 01:00:23 | /inetuk/providers.html | Mozilla/4.03 [en] (Win95; I) |
63.144.65.58 | 18/Apr/2001 @ 01:02:16 | /inetuk/providers/akhter.html | Mozilla/4.03 [en] (Win95; I) |
63.144.65.58 | 18/Apr/2001 @ 01:02:19 | /inetuk/providers/agent-cd.html | Mozilla/3.01Gold (Win95; I; 16bit) |
63.144.65.58 | 18/Apr/2001 @ 01:02:21 | /inetuk/providers/andover.html | Mozilla/3.01Gold (Win95; I; 16bit) |
63.144.65.58 | 18/Apr/2001 @ 01:02:21 | /inetuk/providers/angel.html | Mozilla/2.0 (compatible; MSIE 3.02; Windows 95) |
63.144.65.58 | 18/Apr/2001 @ 01:02:22 | /inetuk/providers/aladdin.html | Mozilla/4.0 (compatible; MSIE 4.0; Windows NT) |
63.144.65.58 | 18/Apr/2001 @ 01:02:22 | /inetuk/providers/apanet.html | Mozilla/3.0 (Win16; I) |
63.144.65.58 | 18/Apr/2001 @ 01:02:22 | /inetuk/providers/amity.html | Mozilla/4.03 [en] (Win95; I) |
63.144.65.58 | 18/Apr/2001 @ 01:02:22 | /inetuk/notify.html | Mozilla/3.0 (Win16; I) |
63.144.65.58 | 18/Apr/2001 @ 01:02:22 | /inetuk/catch/ | Mozilla/2.0 (compatible; MSIE 3.02; Windows 95) |
A short visit with a changing user-agent; the first references MSIE.
host IP | date/time | requested file | user-agent |
njproxy4.avaya.com | 30/Apr/2001 @ 14:34:18 | /colour/navigate.lml | Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; DigExt; WebSite-Watcher (unreg.) http://aignes.net) |
njproxy4.avaya.com | 30/Apr/2001 @ 14:36:48 | /colour/navigate.lml | Mozilla/3.01 (compatible;) |
Log lines from a 324-hit visit, all from agent.lisco.com.
The user-agent changes within the visit.
hit # | date/time | requested file | user-agent |
1 | 29/May/2001 @ 14:59:58 | /innovations/images/b-home.gif | Mozilla/3.01 (compatible;) |
4 | 29/May/2001 @ 14:59:58 | /innovations/library/requirements.html | Mozilla/4.77 [en] (Win95; U) |
5 | 29/May/2001 @ 14:59:58 | /innovations/images/d-structure.gif | Mozilla/3.01 (compatible;) |
9 | 29/May/2001 @ 15:00:02 | /innovations/innovate.css | Mozilla/4.77 [en] (Win95; U) |
10 | 29/May/2001 @ 15:00:02 | /innovations/images/g-libr.jpg | Mozilla/3.01 (compatible;) |
14 | 29/May/2001 @ 15:43:55 | /innovations/favicon.ico | Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; Win 9x 4.90) |
15 | 29/May/2001 @ 16:12:41 | /~paola/ | Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; Win 9x 4.90) |
16 | 29/May/2001 @ 16:12:41 | /~paola/pictures/icons/paola.jpg | Mozilla/3.01 (compatible;) |
23 | 29/May/2001 @ 16:12:41 | /~paola/paola.css | Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; Win 9x 4.90) |
24 | 29/May/2001 @ 16:12:42 | /~paola/pictures/icons/contents.gif | Mozilla/3.01 (compatible;) |
Users can change the user-agent sent to web servers in browsers (e.g.,
Internet Explorer, Mozilla, Opera, Konqueror and Lynx) and spiders.
If these visitors aren't known by stats programs, their visits
will be counted in browser stats.
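A quick way to find candidates for review is to list hosts that sent more than one user-agent. This is a sketch under the assumption that the log has been reduced to (host, user-agent) pairs; the function names are invented for the example.

```python
from collections import defaultdict

def agents_per_host(records):
    """Map each host to the set of user-agent strings it sent."""
    agents = defaultdict(set)
    for host, agent in records:
        agents[host].add(agent)
    return agents

def hosts_with_changing_agent(records):
    """Hosts that sent more than one distinct user-agent."""
    return {host for host, seen in agents_per_host(records).items() if len(seen) > 1}

records = [
    ("63.144.65.58", "Mozilla/4.03 [en] (Win95; I)"),
    ("63.144.65.58", "Mozilla/3.01Gold (Win95; I; 16bit)"),
    ("202.108.22.72", "Baiduspider+(+http://www.baidu.com/search/spider.htm)"),
]
print(hosts_with_changing_agent(records))  # {'63.144.65.58'}
```

A shared proxy host will also show several user-agents, so the result is a list to inspect by hand rather than a list of bots.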
On average, 10% of human visitors to this web site can't or won't accept cookies.
With a cookie-enabled browser, web users have control of how cookies are
used.
- They can accept all cookies
- They can only accept cookies from certain domains
- They can only accept certain cookies from certain domains
- They can reject third-party cookies, those from a domain different to the current site
- They can remove one or more cookies
- They can edit the cookie contents
Regardless of what browser is used, it is always possible to remove
cookies within a visit. Cookies may be removed deliberately, perhaps in
a big clearout, or become corrupted or lost during a disk crash.
One can't assume that a) cookies get written or that
b) they'll remain on visitors' computers.
StatMarket's HitBox counts visitors by use of third-party cookies.
People can easily configure their browsers to reject third-party
cookies - those not originating from the web site they're visiting.
If someone visits a site with a HitBox counter and their browser
rejects the cookie, HitBox will count every page request
as a new visit. HitBox over-counts visits.
Example: imagine a site that had 2,000 actual visits in one day with
three requests on average. The true visit count is 2,000. If our
10% non-cookie figure is typical, HitBox would correctly count 1,800
(90% of 2,000) of the visits. However, it would process the 600 (10% of
2,000 visits x 3 pages) page requests as visits, producing an incorrect
total of 2,400 visits.
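The same arithmetic as a small function, so you can try your own figures. It models only the behaviour described above (every page request without a cookie is counted as a visit); the function and parameter names are mine.

```python
def reported_visits(actual_visits: int, pages_per_visit: int, no_cookie_share: float) -> int:
    """Visits a cookie-based counter would report if every cookieless
    page request is treated as a new visit."""
    no_cookie_visits = round(actual_visits * no_cookie_share)
    cookie_visits = actual_visits - no_cookie_visits
    # Cookieless visitors contribute one "visit" per page request.
    return cookie_visits + no_cookie_visits * pages_per_visit

print(reported_visits(2000, 3, 0.10))  # 2400
```

With these figures the counter reports 20% more visits than actually took place.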
A lot of web sites are optimised for Internet Explorer because it's
easier for developers to ignore other browsers; until a couple of years
ago, the Marks & Spencer web site turned away Mozilla users, telling
them to get a better browser.
To get around this problem, modern browsers let users set the
user-agent to something else. This is usually MSIE, since so many sites are
optimised for IE.
Sometimes the only indication that a visitor is a robot is the time
between accesses. Our sites have repeat visitors which accept cookies
and have a user-agent that looks like a normal browser.
What gives them away as robots are:
- They visit at regular intervals and access the same pages, such as all the links on the home page
- They access all the links on a web page and in the order they appear
- They access 5-10 pages within a second
The last behaviour is how they can be spotted in the logs as their
log lines will appear in clumps.
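Clumps like these can be found by counting page requests per host per second. The sketch below buckets by calendar second, which is a simplification of "within a second", and the threshold of five pages is an assumption taken from the list above. The sample times are from a spider visit shown earlier in this article.

```python
from collections import Counter
from datetime import datetime

def burst_seconds(records, min_pages=5):
    """records: (host, timestamp) pairs for page requests only.

    Return {(host, second): count} where count >= min_pages.
    """
    counts = Counter((host, ts.replace(microsecond=0)) for host, ts in records)
    return {key: n for key, n in counts.items() if n >= min_pages}

host = "63.144.65.58"
records = (
    [(host, datetime(2001, 4, 18, 1, 2, 21))] * 2
    + [(host, datetime(2001, 4, 18, 1, 2, 22))] * 5
)
print(burst_seconds(records))
```

Filter out images and style sheets first; a browser fetching the files for one page also produces many hits in the same second.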
Because so many sites are optimised for Microsoft Internet Explorer (MSIE), bots send
an MSIE user-agent. If they aren't detected as bots, they'll over-represent the
proportion of IE users, misleading the site's developers into thinking
they made the right decision to turn away other browsers.
Below are log lines from a single IP address to this site. It's a
robot specifically designed to show in the logs with certain referring
sites. This has become a trend in robots since blogs, for example, started
publishing trackback links to referring sites. These bots are basically
getting other sites to publish links to their sites.
All the accesses came from 166-82-31-14.quickclick.ctc.net with the
user-agent
Mozilla/4.0 (compatible; MSIE 5.01; Windows 98). I've edited the referrers.
date/time | requested file | referring page |
30/Dec/2005 @ 18:20:50 | /journal/?ID=H64SBVK4NKN00F8&jm=1&e=1061 | http://www.adsense-xpress.falling.net/forex777.htm |
30/Dec/2005 @ 18:20:51 | /journal/?ID=XQKSBV74NKM00DR&jm=1&e=1061 | http://www.adsense-xpress.falling.net/swapclix.htm |
24/Jan/2006 @ 12:54:45 | /inetuk/interop96.lml | http://www.tvinfomercials.com/ |
24/Jan/2006 @ 12:54:45 | /inetuk/interop96.lml | http://www.7dayplan.war-q.com |
17/Feb/2006 @ 06:30:53 | /inetuk/interop96.lml | http://www.bugtraininginfo.com/ |
20/Feb/2006 @ 11:51:05 | /inetuk/interop96.lml | http://www.phoneconferences247.com/ |
20/Feb/2006 @ 11:51:06 | /inetuk/interop96.lml | http://www.bugtraininginfo.com |
26/Feb/2006 @ 06:32:18 | /inetuk/interop96.lml | http://www.catcast2006.com/ |
03/Mar/2006 @ 14:21:49 | /inetuk/interop96.lml | http://www.war-q.com |
03/Mar/2006 @ 14:21:51 | /inetuk/interop96.lml | http://www.200-free-4resale-products.numbers.com |
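A pattern like this can be picked out by counting how many different outside sites one host claims to have arrived from. This is a rough sketch: the threshold of three domains is an arbitrary assumption, and the function names are made up for the example.

```python
from collections import defaultdict
from urllib.parse import urlsplit

def external_referrer_domains(records, own_domain="www.limov.com"):
    """Map each host to the set of outside domains it gave as referrers."""
    domains = defaultdict(set)
    for host, referrer in records:
        netloc = urlsplit(referrer).netloc.lower()
        if netloc and netloc != own_domain:
            domains[host].add(netloc)
    return domains

def likely_referrer_spammers(records, min_domains=3):
    """Hosts claiming at least min_domains different outside referrers (assumed threshold)."""
    return {host for host, seen in external_referrer_domains(records).items()
            if len(seen) >= min_domains}

host = "166-82-31-14.quickclick.ctc.net"
records = [
    (host, "http://www.tvinfomercials.com/"),
    (host, "http://www.7dayplan.war-q.com"),
    (host, "http://www.bugtraininginfo.com/"),
    (host, "http://www.bugtraininginfo.com"),
]
print(likely_referrer_spammers(records))
```

Run it over a long enough period; these accesses were spread across several weeks.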
I've spent more time than I care to admit looking at server logs and
discovered unexpected behaviour by both human and spider visitors.
Because unique visitors can't be accurately detected, some browsers
end up being over- or under-counted by stats.
There are still more complications. For example, many hit counters
collect stats via accesses to a GIF placed on your web pages. Text-only
browsers and screen-readers don't access the GIF and so are never
included in the browser stats. This means that some disabled visitors
are not included at all in browser stats.
I've concluded that you mustn't believe your web stats if they're
based on log analysis - they'll tend to tell you good news when the reality
is likely to be discouraging.