Scrubyt perks and rules
-It will skip over tbody for xpaths. Don’t include it in your xpath.
# correct
content '//body/table/tr/td'
# incorrect
content '//body/table/tbody/tr/td'
-It will skip over font[@size=n] but not over just plain font
-extractor.to_xml will output to the xml you specify within the:
extractor = Scrubyt::Extractor.define do .. end
-extractor.to_text: I’m not sure how this one works.

Peter Szinek | July 5, 2008
Well, the first problem is in fact an Hpricot problem… Firefox tries (quite successfully) to turn the document into well-formed, valid (X)HTML while building the DOM, and though Hpricot also has its own bag of tricks, the two produce a different DOM.
Therefore you can’t use the XPaths directly from Firebug in scRUBYt! (well, unless the source *really* contains a tbody tag, and it’s not inserted by FF later). So it’s incorrect to say you *always* need to remove the tbody tag, because if it’s there in the page source, Hpricot will also parse it.
The second problem: scRUBYt! guesses whether an example is an XPath or not - and the regular expression that tells XPaths apart from non-XPaths is quite weak, so in these cases scRUBYt! thinks the example is just text, and of course this doesn’t work.
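The rule above (drop tbody only when the browser added it) can be sketched in plain Ruby. This is a minimal illustration, not something scRUBYt! or Hpricot provides: the helper name `firebug_xpath_for_source` is hypothetical, and it assumes you have the raw page source as a string.

```ruby
# Hypothetical helper, not part of scRUBYt! or Hpricot.
# Removes "tbody" steps from an XPath copied out of the browser,
# unless the raw page source really contains a <tbody> tag.
def firebug_xpath_for_source(xpath, page_source)
  # A tbody that is present in the markup will also be parsed, so keep the XPath as is.
  return xpath if page_source =~ /<tbody[\s>]/i

  # Otherwise drop every "tbody" step, including an optional positional predicate such as [1].
  xpath.gsub(%r{/tbody(\[\d+\])?(?=/|\z)}, '')
end

source_without_tbody = '<html><body><table><tr><td>x</td></tr></table></body></html>'
source_with_tbody    = '<html><body><table><tbody><tr><td>x</td></tr></tbody></table></body></html>'
firebug_xpath        = '/html/body/table/tbody/tr/td'

puts firebug_xpath_for_source(firebug_xpath, source_without_tbody)
puts firebug_xpath_for_source(firebug_xpath, source_with_tbody)
```

The check on the source is deliberately crude (a case-insensitive search for an opening tbody tag), which should be enough to decide whether the tbody came from the page or from the browser.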
In these cases you should force the example type:
stuff "font[@size=n]", :example_type => :xpath
and if you do this, scRUBYt! will do an XPath matching.
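To see why a loose guess can go wrong and why forcing the type helps, here is a toy model in plain Ruby. It is not scRUBYt!’s actual detection code: the slash-prefix rule, the method names `looks_like_xpath?` and `example_type_for`, and the `:string` symbol are all assumptions made up for illustration.

```ruby
# Toy heuristic only: NOT scRUBYt!'s real detection code.
# It treats an example as an XPath only when it starts with a slash.
def looks_like_xpath?(example)
  !!(example =~ %r{\A/})
end

# Hypothetical wrapper: an explicitly forced type always wins over the guess.
def example_type_for(example, forced_type = nil)
  return forced_type if forced_type

  looks_like_xpath?(example) ? :xpath : :string
end

puts example_type_for('//body/table/tr/td')     # guessed as an XPath
puts example_type_for('font[@size=n]')          # guessed as plain text
puts example_type_for('font[@size=n]', :xpath)  # forced, so the guess is skipped
```

Under this toy rule an example without a leading slash is taken for text, which is the kind of misclassification that forcing the example type sidesteps.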
Let me know if this works.
Harry | July 16, 2008
Oh, I think this is the real reason I haven’t successfully scraped any site with the tbody thing.