Ticket #1646 (new enhancement)

Opened 13 months ago

Last modified 10 months ago

sitemaps.xml: Allow gzip support, cache?

Reported by: smagnusson
Owned by: wscott
Priority: minor
Milestone:
Component: CMS - General
Version:
Severity: medium effort / impact
Keywords:
Cc:
Hours:

Description

Google's suggestion in the sitemaps protocol is to provide it as a .gz file, and not just as an .xml file carried over mod_deflate.

Enable support for accessing /sitemap.gz (or is it sitemap.xml.gz?)

This will increase the CPU load for producing the sitemap, and so it would be best for this to be cached, say to /assets/.sitemap.xml.gz

E.g.

if (cachefile is missing OR cachefile > 1 hour old OR ?flush=1 ) create-xml-file-and-save-to-gzipped-cache-file()

header temporary redirect to cache file()
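A rough Python sketch of the caching idea above, for illustration only. It reads the condition as "regenerate when the cache file is missing, older than one hour, or a flush is requested". The names cache_is_stale, get_sitemap_cache and build_xml are hypothetical and not part of the CMS; the redirect step is left to the caller.

```python
import gzip
import os
import time

CACHE_MAX_AGE = 3600  # seconds; the "1 hour" from the pseudocode


def cache_is_stale(path, max_age=CACHE_MAX_AGE, now=None):
    """Return True if the cache file is missing or older than max_age."""
    if not os.path.exists(path):
        return True
    now = time.time() if now is None else now
    return now - os.path.getmtime(path) > max_age


def get_sitemap_cache(path, build_xml, flush=False):
    """Regenerate the gzipped cache if needed and return its path.

    build_xml is a hypothetical callable returning the sitemap XML as str.
    """
    if flush or cache_is_stale(path):
        tmp = path + ".tmp"
        with gzip.open(tmp, "wb") as fh:
            fh.write(build_xml().encode("utf-8"))
        # Swap the finished file into place so readers never see a partial file.
        os.replace(tmp, path)
    return path
```

Repeated requests within the hour only touch the file system, which is the property that matters for the load concern.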

Does the protocol support putting a comment inside the xml file to say the generation time?

Change History

Changed 13 months ago by wscott

The Google document on the specification is here:

Sitemap Protocol

Google says you can compress your Sitemap, but doesn't mention any benefit to doing so. The maximum file size restriction applies to the uncompressed version. Additionally, a single Sitemap file will not be accessed regularly. (You will likely modify the file by adding a page to your site more often than it is queried by a search engine.)
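To illustrate the point about the size restriction, here is a small Python sketch that measures both sizes but checks the limit against the uncompressed bytes only. The function names are made up for this example, and the limit is passed in as a parameter rather than hard-coded; take the actual figure from the protocol document.

```python
import gzip


def sitemap_sizes(xml_bytes):
    """Return (uncompressed_size, gzipped_size) in bytes."""
    return len(xml_bytes), len(gzip.compress(xml_bytes))


def within_limit(xml_bytes, max_uncompressed_bytes):
    """Check the size limit against the uncompressed document."""
    return len(xml_bytes) <= max_uncompressed_bytes
```

Compressing shrinks what goes over the wire, but it does not let a larger sitemap fit under the limit.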

Changed 13 months ago by smagnusson

Will Scott mentions:

The sitemap isn't a .gz file, but that shouldn't be an issue. In the initial negotiation with Apache, Google's servers will say they can handle gzip-compressed data, so Apache ought to send out a gzip-compressed version of the data. (To my knowledge, serving the file as .gz will be beneficial only if you have the gzip PHP module and have content negotiation turned off in Apache.)
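The negotiation described above comes down to the client's Accept-Encoding request header. A minimal Python sketch of that decision follows. It is not how Apache or mod_deflate implement it, and accepts_gzip and negotiate_body are hypothetical names used only for this example.

```python
import gzip


def accepts_gzip(accept_encoding):
    """Return True if an Accept-Encoding header value allows gzip."""
    qvalues = {}
    for part in accept_encoding.split(","):
        token, _, params = part.strip().partition(";")
        token = token.strip().lower()
        if not token:
            continue
        q = 1.0  # no q parameter means full preference
        params = params.replace(" ", "").lower()
        if params.startswith("q="):
            try:
                q = float(params[2:])
            except ValueError:
                q = 0.0  # treat an unparsable q-value as "not acceptable"
        qvalues[token] = q
    # An explicit gzip entry wins over the "*" wildcard.
    return qvalues.get("gzip", qvalues.get("*", 0.0)) > 0


def negotiate_body(xml_bytes, accept_encoding):
    """Return (body, headers), compressing only when the client allows it."""
    if accepts_gzip(accept_encoding):
        headers = {"Content-Encoding": "gzip", "Vary": "Accept-Encoding"}
        return gzip.compress(xml_bytes), headers
    return xml_bytes, {"Vary": "Accept-Encoding"}
```

With this kind of negotiation the URL stays sitemap.xml and only the transfer is compressed, which is the case wscott describes.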

The file is also not cached, and again I'm not sure how much benefit would be gained by caching it. There are only a limited number of search engines that actually check the sitemap file, and they check it very infrequently. SilverStripe is set up to ping Google when it's updated, and that's going to happen more often than Google will get around to checking the file. The file will probably only get requested a couple of times per version at most, so it's unlikely to need caching.

Sig concludes:

Let's do this anyway, at some point, to prevent a DoS, which could occur if someone accessed the file over and over again.

First of all we will see how slow it is. Tomorrow I will benchmark one, and I will close or keep the issue as appropriate.

Changed 13 months ago by smagnusson

  • owner changed from aoneil to wscott

Page performance is alright for the moment, but the sitemap will need to be cached so that as sites grow and become popular, they won't be brought to their knees by continuous requests to sitemap.xml.

Benchmark: 2 seconds for a 350-page website (www.silverstripe.com) on a Pentium 4 2.6 GHz with 1.5 GB RAM.

Also, as the site grows to 1000+ pages, I assume the memory needs of the website will become very large.
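One possible way to keep memory flat as the page count grows is to stream entries straight into the gzipped file instead of building the whole document first. The Python sketch below shows that technique only; it is not how the CMS generates its sitemap, and write_sitemap is a hypothetical name.

```python
import gzip
from xml.sax.saxutils import escape


def write_sitemap(urls, path):
    """Stream <url> entries to a gzipped file and return the number written.

    urls may be any iterable (for example a generator over database rows),
    so the full page list is never held in memory at once.
    """
    count = 0
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        fh.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        fh.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
        for loc in urls:
            # escape() handles &, < and > so the output stays well-formed XML.
            fh.write("  <url><loc>%s</loc></url>\n" % escape(loc))
            count += 1
        fh.write("</urlset>\n")
    return count
```

Written this way, memory use depends on one entry at a time rather than on the size of the site.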

Changed 13 months ago by sminnee

  • priority changed from medium to minor

Changed 10 months ago by sminnee

  • type changed from defect to enhancement