Grepping your web logs
Linux, Web Development. July 12th, 2007
If you’re anything like me, you spend far too much time checking your web server stats, and not enough time actually creating content and coding. Thankfully, my logs are always close at hand since I work almost exclusively on the command line. With the help of a few common Unix filters, you can quickly gauge how things are going on your site. These commands work with Apache, or Apache-compatible, log files, and can probably be tweaked to work with other log file formats pretty easily.
Here’s what a line from an Apache log file looks like (I added the line break):
71.206.3.109 - - [12/Jul/2007:09:16:31 -0500] "GET / HTTP/1.1"
200 33545 "-" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en)
AppleWebKit/419 (KHTML, like Gecko) Safari/419.3"
For a quick feel for how many visitors I’ve received for the day, I use grep to find all of today’s requests, awk to extract the IP address, then sort and uniq to eliminate duplicate IPs (sort is necessary because uniq only removes adjacent duplicate lines). Piping the result through wc -l gives the number of unique IP addresses that have made requests:
# grep "12/Jul" immike.net-access.log | awk '{print $1}' | \
> sort | uniq | wc -l
1188
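One caveat: grep "12/Jul" matches that string anywhere on the line (a URL or referrer containing it would count too), and in any year if the log spans more than one. Here is a minimal sketch of a stricter count, assuming the combined log format shown above, where the timestamp is the fourth whitespace-separated field. The function name unique_visitors is made up for illustration:

```shell
# unique_visitors DATE LOGFILE
# Print the number of distinct client IPs whose timestamp starts with DATE.
# DATE is a prefix such as "12/Jul" or "12/Jul/2007".
# Assumption: field 4 looks like "[12/Jul/2007:09:16:31".
unique_visitors() {
    awk -v day="$1" '
        # Only look at the timestamp field, and count each IP once.
        index($4, "[" day) == 1 && !seen[$1]++ { n++ }
        END { print n + 0 }
    ' "$2"
}
```

Called as unique_visitors 12/Jul/2007 immike.net-access.log, it does the work of the grep, awk, sort, uniq and wc stages in a single awk pass.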
To determine how many people visited an individual page, you can add a grep for that page’s URL before sending the logs through awk:
# grep "12/Jul" immike.net-access.log | \
> grep "interview-with-leah-culver" | \
> awk '{print $1}' | sort | uniq | wc -l
261
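Note that the second grep matches the page name anywhere on the line, including the referrer field, so requests for images or stylesheets made from that page are matched as well. Here is a sketch that only looks at the requested path, assuming the path is field 7 as in the sample line above. The name page_visitors and its argument order are hypothetical:

```shell
# page_visitors DATE PATTERN LOGFILE
# Print the number of distinct IPs that requested a path containing PATTERN
# on the given DATE prefix (e.g. "12/Jul/2007").
# Assumptions: timestamp in field 4, request path in field 7.
page_visitors() {
    awk -v day="$1" -v pat="$2" '
        # PATTERN is treated as a plain substring of the path, not a regex.
        index($4, "[" day) == 1 && index($7, pat) > 0 && !seen[$1]++ { n++ }
        END { print n + 0 }
    ' "$3"
}
```

For example: page_visitors 12/Jul/2007 interview-with-leah-culver immike.net-access.log. The count may come out slightly lower than the grep version, since lines that mention the page only as a referrer are no longer included.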
For a bit more detail, the following command will determine the 10 most requested pages (excluding css, js, gif, ico, png, and jpg files) and list them in order:
# awk '{print $7}' immike.net-access.log | \
> grep -ivE '(\.gif|\.jpg|\.png|\.ico|\.css|\.js)' | \
> sed 's/\/$//g' | sort | uniq -c | sort -rn | head -10
2972 /blog/feed
2590 /blog/2007/07/06/interview-with-leah-culver-the-making...
712 /blog/2007/04/06/5-regular-expressions-every-web...
648 /robots.txt
588 /blog/2007/04/06/the-absolute-bare-minimum-every...
467 /blog
321
280 /blog/2007/06/21/extreme-regex-foo-what-you-need-to...
279 /blog/2007/07/03/full-text-search-with-apache-lucene
195 /blog/2007/07/06/interview-with-leah-culver-the-making...
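The blank entry in the list above is most likely the site root: stripping the trailing slash from "/" leaves an empty string. The grep pattern is also unanchored, so a page whose URL merely contains ".js" or ".css" somewhere would be dropped too. Here is a rough sketch that handles both cases, assuming the request path is field 7. top_pages is a made-up name, and the extension list is the same one used above:

```shell
# top_pages LOGFILE [N]
# List the N most requested paths (default 10), skipping static assets.
# Assumption: the request path is field 7 of each log line.
top_pages() {
    awk '{
        path = $7
        # Skip assets only when the extension ends the path (or precedes "?").
        if (tolower(path) ~ /\.(gif|jpg|png|ico|css|js)([?]|$)/) next
        # Strip one trailing slash, but leave the root "/" alone.
        if (length(path) > 1) sub(/\/$/, "", path)
        print path
    }' "$1" | sort | uniq -c | sort -rn | head -n "${2:-10}"
}
```

Usage would be top_pages immike.net-access.log, or top_pages immike.net-access.log 20 for a longer list. Doing the filtering inside awk is a design choice, not a requirement; the same rules could be written as separate grep and sed stages.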
A similar command can be used to list the top 10 referrers. I’ve added an additional filter (the first grep) to remove any pages from my own site, since I’m only interested in counting referrals from external sites:
# awk '{print $11}' immike.net-access.log | grep -v 'immike.net' | \
> grep -v '"-"' | sort | uniq -c | sort -rn | head -10
341 "http://www.google.com/reader/view/"
328 "http://simonwillison.net/2007/Jul/7/interview/"
184 "http://feeds.feedburner.com/ImMike"
114 "http://www.djangoproject.com/weblog/2007/jul/08/django...
112 "http://www.dzone.com/rsslinks/interview_with_leah...
112 "http://www.dzone.com/links/interview_with_leah...
74 "http://www.planetpython.org/"
57 "http://www.santosj.name/programming/php-related/php...
57 "http://blog.assembleron.com/2007/07/10/competition...
51 "http://agiletesting.blogspot.com/2007/07/another-django...
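Two things to keep in mind with the command above: $11 is only the referrer as long as the request line contains no extra spaces, and grep -v 'immike.net' also discards external referrers that merely mention the domain somewhere in their URL. Here is a sketch of a more defensive variant, assuming the combined log format and that no logged field contains an embedded double quote, so that splitting on " puts the referrer in field 4. top_referrers and its arguments are hypothetical names:

```shell
# top_referrers LOGFILE OWN_HOST [N]
# List the N most common external referrers (default 10).
# Assumption: splitting each line on double quotes leaves the referrer in field 4.
top_referrers() {
    awk -F'"' -v own="$2" '
        {
            ref = $4
            if (ref == "-" || ref == "") next      # no referrer recorded
            host = ref
            sub(/^[a-zA-Z]+:\/\//, "", host)       # drop the scheme
            sub(/[\/:?#].*$/, "", host)            # keep only the hostname
            if (host == own) next                  # own site
            suffix = "." own                       # own subdomains, e.g. www
            if (length(host) > length(own) &&
                substr(host, length(host) - length(own)) == suffix) next
            print ref
        }' "$1" | sort | uniq -c | sort -rn | head -n "${3:-10}"
}
```

For this site the call would be top_referrers immike.net-access.log immike.net. Unlike the original output, the referrers are printed without the surrounding quotes.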
So, there’s a wealth of information sitting in Apache log files, and you don’t need fancy log analyzers to get at it. Just don’t spend too much time grokking it, or you’ll never get anything done!
July 13th, 2007 at 10:50 am
I get sort: command not found. Is sort part of some special package?
July 13th, 2007 at 10:58 am
Never mind. My bad. Typo. However, no typo on this one: when I run your top 10 requested pages command I get
sed: -e expression #1, char 6: unknown option to `s'
Any ideas?
July 14th, 2007 at 3:21 pm
Hey, yeah. There was a little error in the sed expression caused by some escaping issues with WordPress. Try it now and let me know if it doesn’t work.
July 15th, 2007 at 10:45 am
Worked perfectly. Thanks!
July 16th, 2007 at 2:28 am
I feel your stats obsessiveness. Too bad my server doesn’t have Apache?
July 18th, 2007 at 9:40 pm
You have some formatting issues … ah, unless I put the font back down to uber-tiny.
You can save yourself a bit of typing. Top ten hits to your blog:
awk '$7 ~ "^/blog/" {print $7}'
July 28th, 2007 at 2:35 pm
Great post again. I made a little script to show me hits so far today, but my logs are rotated daily at 6am EST so I’m not sure how accurate they are.
#!/bin/bash
TODAY=$(date +%d/%b)
sudo grep "$TODAY" /path/to/my/access_log | awk '{print $1}' | sort | uniq | wc -l
July 28th, 2007 at 2:38 pm
@Micahville - your server has Apache. I’d go so far as to say 80%+ of the servers online use Apache.
“Apache/2.0.54 (Unix) PHP/4.4.7 mod_ssl/2.0.54 OpenSSL/0.9.7e mod_fastcgi/2.4.2 DAV/2 SVN/1.4.2” (your server)
August 1st, 2007 at 12:53 pm
You definitely need to couple this with RRDTool http://oss.oetiker.ch/rrdtool/ at, say, 5-minute intervals to see the fluctuations of top ten pages and possibly top ten referrers, though that may be a problem to do ;)
August 7th, 2007 at 7:02 am
Personally, I’ve given up using GREP to parse my logs. Between Google Analytics and Webalizer I get most of what I need to know. I do use it for most things non-Apache though.
November 2nd, 2007 at 2:23 pm
I’ve always wanted a logfile grep that spits out the last login of each unique authuser (htuser), i.e. a table of the most recent time each user in the log logged in (authuser, date, return). But I keep giving up in frustration. Can anyone figure this out?