Google and sitemap.xml
I finally got around to adding a robots.txt and sitemap.xml to my site. But I had some issues along the way…
First, I wanted to use Google’s sitemap.xml generator to generate the XML file. Here are the steps I took to get this working with my host, 34sp. The instructions on Google’s site are very good, though there was one minor point of confusion.
I have a mirrored setup that allows me to test changes to the site on my iBook before uploading to the website. These are the 3 files I used:
- sitemap_gen.py – the script, you shouldn’t need to modify this
- config.xml – could be called anything, but I stuck with the default
- urllist.txt – a list of the URLs I want the script to use
The last two are the ones you need to change. The docs say to delete the sections (in config.xml) that you don’t need – however, I’ve been commenting them out as, no doubt, I’ll start using the other sections. For the moment I’m only using the urllist.txt, as this seemed the easiest approach.
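Since urllist.txt is just a plain text file with one URL per line, it could be generated rather than typed by hand. Here is a rough sketch in Python 3, assuming you already have a list of site paths. The function name build_urllist and the example paths are made up for illustration and are not part of sitemap_gen.py:

```python
def build_urllist(base_url, paths):
    """Return the text of a urllist.txt: one absolute URL per line."""
    base = base_url.rstrip("/") + "/"
    urls = []
    for path in paths:
        url = base + path.lstrip("/")
        if url not in urls:  # drop duplicates but keep the original order
            urls.append(url)
    return "\n".join(urls) + "\n"
```

Calling build_urllist("http://rexy.co.uk", ["/", "/about/"]) returns text that could be written straight to urllist.txt; the "/about/" path is only a placeholder.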
Here are my sections (without the comments, and with the paths removed):
<site
  base_url="http://rexy.co.uk"
  store_into="/absolute/path/to/httpdocs/sitemap.xml.gz"
  verbose="1"
>
  <urllist path="/absolute/path/to/private/sitemap/urllist.txt" encoding="UTF-8" />
</site>
Notice here that I’ve stuck all 3 files in a location that isn’t web-accessible. It’s probably best not to put them in your document root. (On 34sp, the private directory should already be set up for you.)
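The paths in the config are absolute, which matters once the script is run from somewhere other than its own directory. As a quick sanity check, something like the following Python 3 sketch could list any paths in config.xml that are not absolute. It only looks at the store_into attribute and the urllist path attribute shown above; the function name relative_paths_in_config is my own invention:

```python
import os
import xml.etree.ElementTree as ET

def relative_paths_in_config(xml_data):
    """Return (label, path) pairs for config paths that are missing or not absolute."""
    site = ET.fromstring(xml_data)
    candidates = [("site store_into", site.get("store_into"))]
    for node in site.iter("urllist"):
        candidates.append(("urllist path", node.get("path")))
    # A path starting with "~" counts as relative: nothing expands it here.
    return [(label, p) for label, p in candidates
            if p is None or not os.path.isabs(p)]
```

Reading config.xml in binary mode and passing the bytes in should be enough; an empty list means every path it checked is absolute.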
Now, I don’t have SSH access, but running the script
python sitemap_gen.py --config=config.xml --testing
locally worked fine (remember to remove the --testing switch when it’s all working).
Now to get this working for my site, I’ve set up a cron job, and here’s where things went odd. Initially I had it looking like:
0 0 * * * /usr/local/bin/python ~/private/sitemap/sitemap_gen.py --config=config.xml
This is how some docs have it listed. It failed with the Python error ‘ValueError: unknown url type: ~/private/sitemap/config.xml’. So I tried setting the path:
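0 0 * * * /usr/local/bin/python ~/private/sitemap/sitemap_gen.py --config=~/private/sitemap/config.xml
Again this failed. I then realised that the cron jobs were probably being run as a different user, so using ‘~’ in the path is not going to work. (A more likely explanation, since ‘~’ still works for the script path: the shell only expands ‘~’ at the start of a word, so after --config= it is passed through literally.) Switching to the absolute path worked a treat: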
0 0 * * * /usr/local/bin/python ~/private/sitemap/sitemap_gen.py --config=/absolute/path/to/private/sitemap/config.xml
This generates a sitemap.xml.gz in the root, so at http://rexy.co.uk/sitemap.xml.gz. Now Google needs to be told about the sitemap. You CAN ‘upload’ the gz file, it’s fine.
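To avoid the ‘~’ problem in the cron entry altogether, the line can be built with every path already expanded. This is only a sketch in Python 3: build_cron_line is a hypothetical helper, the defaults simply mirror the cron entry above, and it needs to be run as the same user the cron job belongs to so that ‘~’ resolves to the right home directory:

```python
import os

def build_cron_line(script, config,
                    python="/usr/local/bin/python",
                    schedule="0 0 * * *"):
    """Return a crontab line with '~' expanded and all paths made absolute."""
    def absolute(path):
        # expanduser turns a leading "~" into the current user's home directory
        return os.path.abspath(os.path.expanduser(path))
    return f"{schedule} {python} {absolute(script)} --config={absolute(config)}"
```

Printing build_cron_line("~/private/sitemap/sitemap_gen.py", "~/private/sitemap/config.xml") gives a line that can be pasted into the crontab without relying on the shell to expand anything.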
However, in my case, even having done all this, it still wasn’t working; I was getting an ‘unsupported format’ error.
In the end I tracked this down to the rewrite rules in my root .htaccess file not behaving. So what I thought should have been the sitemap file was actually ending up as a blosxom post.
This, as it turned out, was also affecting my robots.txt, but I hadn’t realised it.
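A check along these lines would have caught the problem sooner: fetch the URL yourself and see whether what comes back is really a sitemap rather than a rewritten page. A minimal sketch in Python 3, assuming the response body is already in hand as bytes; looks_like_sitemap is a made-up name, and it only checks the gzip wrapper and the root element, not the full sitemap schema:

```python
import gzip
import zlib
import xml.etree.ElementTree as ET

def looks_like_sitemap(body):
    """Return True if body (bytes) is a sitemap, gzip-compressed or not."""
    if body[:2] == b"\x1f\x8b":  # gzip magic number
        try:
            body = gzip.decompress(body)
        except (OSError, EOFError, zlib.error):
            return False
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        return False
    # Tags are namespaced, e.g. "{http://...}urlset"; compare the local name only.
    return root.tag.rsplit("}", 1)[-1] in ("urlset", "sitemapindex")
```

Feeding it the bytes returned by urllib.request.urlopen(...).read() for the sitemap URL should give True; a blog post served in its place, or a plain robots.txt, gives False.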
