Tuesday 13 May 2008

XML Feeds - Made to make you work?

I decided last night to have a play with the OMG XML feed for their editorials, intending to use it to display merchant information on my mortgage rates website. Basically, I'm starting a financial directory that I can add to later. Earlier in the day I'd noticed another site that ranked quite well doing this, but it was updating the displayed information manually (the offer details were out of date).

Now OMG provide their editorials in 3 formats - JavaScript, which I usually use, I-Frame, which I've never tried, and XML, which I'd not tried before from them. I've used XML from other providers plenty of times - other affiliate schemes, news readers etc. So I've got plenty of the basic code about.

The problem with JavaScript and I-Frames is that they don't add anything to your webpage as far as the search engines are concerned - to my knowledge, they just ignore these parts of the code. At most, they will actually follow the I-Frame (I have seen that in some of the sites I've built), but then the benefits of the content lie with the provider. I don't end up with lots of pages of text that I can create and add further bits to.

So taking the XML feed in real time seems a good idea. Why not just copy the text? Well, as I mentioned above, the site I saw yesterday did this and said that the insurance offer was a 15% discount, whereas it is now 20%. OMG don't like affiliates taking the text alone, because out-of-date offers and rates end up being displayed. And I don't want to create a large nightmare for myself of continually having to update text - it's bad enough when they don't update the dynamically served content and email me to complain (you know who you are if you are reading this!!!).

So last night I started to build the OMG feed into my standard code. It looked as though it would be easy - it's not exactly a hard layout to use. But I was playing with it for hours as it just wasn't working. 'ROOT' was appearing at the end of the text - it's actually from the closing tag (</ROOT>) - and control characters were appearing where they shouldn't be - something like &#xA; instead of line breaks. I thought it should be simple: change &#xA; to <br>. But it just didn't work. I could remove the & on its own and the #, but the string as a whole just wouldn't go.
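For what it's worth, here is a minimal Python sketch of how both symptoms might be avoided by handing the feed to a real XML parser rather than picking at the raw text. The sample feed and the element name `editorial` are made up for illustration (I don't have the actual OMG layout to hand), and Python stands in for whatever language your own standard code uses. A parser treats </ROOT> as markup, so 'ROOT' never reaches the page, and it turns &#xA; into an ordinary newline that is then easy to swap for <br>.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical sample: the real OMG feed's element names are not known here.
SAMPLE_FEED = "<ROOT><editorial>Save 20% today.&#xA;Offer ends soon.</editorial></ROOT>"


def editorial_to_html(feed_xml, tag="editorial"):
    """Return the text of each <tag> element as an HTML-safe string."""
    root = ET.fromstring(feed_xml)
    parts = []
    for node in root.iter(tag):
        # The parser has already decoded &#xA; into a newline character.
        text = node.text or ""
        # Escape any stray <, > or & first, then turn newlines into <br>.
        parts.append(html.escape(text).replace("\n", "<br>"))
    return parts


html_parts = editorial_to_html(SAMPLE_FEED)
```

If the parser itself chokes on the feed, that usually points at an encoding problem rather than at the layout.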

Then when I looked at the output in Notepad it appeared that the text was full of non-printable characters. Between every displayed character there was a space in Notepad, which didn't appear on the screen. Not knowing what these were made it very difficult to remove them. I suppose some sort of regular expression could have done it - just thought of that now!
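One possible explanation, and it is only a guess, is that the feed was served as UTF-16 but read as single-byte text. In UTF-16 every plain English character is stored with an extra zero byte, which would look like a gap between each letter and would also stop a search for the string &#xA; from ever matching. The sketch below, in Python, assumes that is what happened; `clean_feed_bytes` is a made-up helper name, and the regular expression is the fallback idea mentioned above for when there is no byte order mark to go on.

```python
import re

# Control characters other than tab, line feed and carriage return.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def clean_feed_bytes(raw):
    """Decode raw feed bytes and strip stray non-printable characters."""
    # Assumption: a UTF-16 feed starts with a byte order mark (BOM).
    if raw.startswith((b"\xff\xfe", b"\xfe\xff")):
        text = raw.decode("utf-16")
    else:
        text = raw.decode("utf-8", errors="replace")
    # Fallback: remove any leftover control characters, including the
    # zero bytes left behind by BOM-less UTF-16 holding plain ASCII text.
    return CONTROL_CHARS.sub("", text)


# Simulate a UTF-16 feed fragment as it might arrive over the wire.
sample = "15% off&#xA;now 20%".encode("utf-16")
cleaned = clean_feed_bytes(sample)
with_breaks = cleaned.replace("&#xA;", "<br>")
```

Once the text is decoded properly, the plain string replacement of &#xA; with <br> behaves as expected.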

What their purpose is and whether it's something I was doing wrong, I just don't know. But something wasn't right. It could be that they are there intentionally to make sure that the text isn't read by the search engines - either to stop them caching text which goes out of date or to prevent problems with Google's duplicate content filter. Either way, it was a pain I was trying to sort for about 4 hours. I had hoped to get the feed going in 30 minutes and spend the rest of the time getting most of the pages up. It wasn't to be.

Would appreciate any more thoughts on this problem and hearing what others have done.
