Scrubyt perks and rules

-It will skip over tbody in XPaths. Don’t include it in your XPath.

# correct
content '//body/table/tr/td'

# incorrect
content '//body/table/tbody/tr/td'

-It will skip over font[@size=n] but not over just plain font

-extractor.to_xml will output XML that follows the structure you specify within the define block:

extractor = Scrubyt::Extractor.define do
..
end

-extractor.to_text: I’m not sure how this one works.

Comments

  1. Peter Szinek | July 5, 2008

    Well, the first problem is in fact an Hpricot problem… Firefox tries (quite successfully) to turn the document into well-formed, valid (X)HTML while building the DOM, and though Hpricot also has its own bag of tricks, the two produce different DOMs.

    Therefore you can’t use the XPaths from FireBug directly in scRUBYt! (well, unless the source *really* contains a tbody tag, and it’s not inserted by FF later). So it’s incorrect to say you *always* need to remove the tbody tag, because if it’s there in the page source, Hpricot will also parse it.
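
    The rule above can be sketched in plain Ruby. This is only an illustration under stated assumptions: normalize_firebug_xpath is a hypothetical helper, not part of scRUBYt! or Hpricot, and it decides with a simple regular expression check on the raw page source rather than by real parsing.

```ruby
# Hypothetical helper (not part of scRUBYt! or Hpricot).
# Drops tbody steps from a FireBug XPath, but only when the raw page
# source has no <tbody> tag of its own.
def normalize_firebug_xpath(xpath, page_source)
  # A tbody that is really in the source will be parsed too, so keep it.
  return xpath if page_source =~ /<tbody[\s>]/i

  # Otherwise assume the tbody was inserted by the browser and remove
  # every "/tbody" or "/tbody[n]" step.
  xpath.gsub(%r{/tbody(\[\d+\])?(?=/|\z)}, '')
end

without_tbody = '<table><tr><td>x</td></tr></table>'
with_tbody    = '<table><tbody><tr><td>x</td></tr></tbody></table>'

puts normalize_firebug_xpath('//body/table/tbody/tr/td', without_tbody)
puts normalize_firebug_xpath('//body/table/tbody/tr/td', with_tbody)
```

    The first call prints the XPath without the tbody step, the second prints it unchanged.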

    The second problem: scRUBYt! guesses whether an example is an XPath or not, and the regular expression that tells XPaths apart from non-XPaths is quite weak. In these cases scRUBYt! thinks the example is just text, and this doesn’t work, of course.

    In these cases you should force the example type:

    stuff "font[@size=n]", :example_type => :xpath

    and if you do this, scRUBYt! will do an XPath matching.
    Let me know if this works.
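
    To see why forcing the type helps, here is a toy version of the guessing step. The rule used (an example is an XPath only if it starts with a slash) is an assumption made up for illustration, not scRUBYt!'s actual regular expression, and guess_example_type is a hypothetical name.

```ruby
# Toy illustration only: the "starts with a slash" rule is an assumed
# stand-in, NOT the regular expression scRUBYt! really uses.
def guess_example_type(example, forced_type = nil)
  # An explicitly forced type always wins over the guess.
  return forced_type if forced_type

  # Weak heuristic: treat anything that begins with "/" as an XPath.
  example =~ %r{\A/} ? :xpath : :text
end

p guess_example_type('//body/table/tr/td')     # guessed as an XPath
p guess_example_type('font[@size=n]')          # misread as plain text
p guess_example_type('font[@size=n]', :xpath)  # forced, so an XPath
```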

  2. Harry | July 16, 2008

    Oh, I think this is the real reason I haven’t been able to successfully scrape any site with the tb.tbody thing.