Scrubyt screen scraping tutorials

I am starting to use the scrubyt gem, and in my opinion the documentation and examples are pretty bad. The name isn’t the greatest either. It’s difficult to type, so be careful in the following examples. (They should have just called it scruby or scrubby or something. The t is annoying.)

**However**, I am starting to understand how scrubyt works, and it really is one heck of an awesome and powerful gem.

I hope the following small scripts and basic tutorial will help you learn scrubyt faster.

Install Scrubyt

sudo gem install scrubyt

Create your first scrubyt script
Create a .rb file to test scrubyt. Call it scrubyt_example.rb and save it in your documents/scripts folder or documents/ruby folder or wherever you like.

Edit that file with the following code

require 'rubygems'
require 'scrubyt'

data = Scrubyt::Extractor.define do
  fetch 'http://google.com'
  title '//head/title'
end

data.to_xml.write($stdout, 1)

Test run the file
In your terminal cd to the folder where you saved the scrubyt_example.rb

cd ~/documents/code/ruby

Then run the file with the ruby command from inside that folder:

ruby scrubyt_example.rb

Ok, way to go, you just learned how to use an XPath expression (//head/title) to select the part of the page you want. You can go deeper and deeper into the page content with just about any XPath you like (e.g. //body/div/table/tr/td).
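If you want to try out an XPath expression without hitting a website, here is a minimal sketch using REXML (part of Ruby's standard library, shipped as a bundled gem in newer Rubies) rather than scrubyt itself. The first_text helper and the sample markup are made up for illustration, and REXML only parses well-formed markup, so treat this as a way to get a feel for XPath and not as a replacement for the scrubyt script.

```ruby
require 'rexml/document'

# Hypothetical helper: return the text of the first node matching xpath,
# or nil if nothing matches.
def first_text(markup, xpath)
  doc = REXML::Document.new(markup)
  node = REXML::XPath.first(doc, xpath)
  node && node.text
end

# Made-up stand-in for a fetched page (well-formed so REXML can parse it).
sample_page = <<~HTML
  <html>
    <head><title>Example Page</title></head>
    <body>
      <div><table><tr><td>first cell</td></tr></table></div>
    </body>
  </html>
HTML

puts first_text(sample_page, '//head/title')            # the page title
puts first_text(sample_page, '//body/div/table/tr/td')  # a deeper path
```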

NOTE: I am using Firebug in Firefox 3 to view the path to these HTML elements. You could just as easily view source, but there is a good chance you’ll make a mistake that way, so I recommend Firebug.

Now let’s get a little more complicated by basically creating a for loop with scrubyt.

require 'rubygems'
require 'scrubyt'

extractor = Scrubyt::Extractor.define do
  # go to website
  fetch 'http://scottmotte.com/archive'
  # move down to the ul and call it content and for each ul *do* the following
  content '//body/div/div/div/ul' do
    # grab the content of the li element and call it post
    post '/li'
  end
end
puts extractor.to_xml

This script loops through all the li items on my archive page at http://scottmotte.com/archive and outputs each one as a post.
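To see what "for each ul, grab every li" amounts to, here is a rough equivalent written with plain REXML. This is only a sketch of the idea and not how scrubyt does it internally. The list_items helper and the sample markup are invented for the example.

```ruby
require 'rexml/document'

# Hypothetical helper: collect the text of every li found directly under
# each node matched by list_xpath.
def list_items(markup, list_xpath)
  doc = REXML::Document.new(markup)
  items = []
  REXML::XPath.each(doc, list_xpath) do |list|
    # 'li' is evaluated relative to the current list element.
    REXML::XPath.each(list, 'li') do |li|
      items << li.text.to_s.strip
    end
  end
  items
end

# Made-up stand-in for an archive page.
archive = <<~HTML
  <html><body><div><ul>
    <li>First post</li>
    <li>Second post</li>
  </ul></div></body></html>
HTML

list_items(archive, '//body/div/ul').each { |post| puts post }
```

The outer XPath plays the role of the content pattern and the inner one the role of the post pattern.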

Ok, way to go. Now let’s learn how to log in somewhere. This is actually pretty darn easy with scrubyt. We are also going to click a link to change pages and print the url of the page so that we know we have actually changed pages.

Of course, you’ll need to change this to a website you have login access to. Just don’t try Gmail: it uses heavy JavaScript, so there is another trick to doing that which I haven’t tried yet. Scrubyt is not good at handling JavaScript because WWW::Mechanize is not. Also, your fill_textfield names will be different depending on the website. For example, on YouTube they are simply ‘username’ and ‘password’. View source or use Firebug to find out.

# example from http://scrubyt.rubyforge.org/files/README.html
require "rubygems"
require "scrubyt"

data = Scrubyt::Extractor.define do

  fetch 'http://www.investors.com/logOffConfirm.aspx?REG=un'
  fill_textfield 'htmUserName', '***your username***'
  fill_textfield 'htmPassword', '***your password***'
  submit

  click_link 'Today In IBD'
  click_link 'Today In IBD'

  click_link 'The Big Picture'

  url "href", :type => :attribute #this part isn't technically correct
end

data.to_xml.write($stdout, 1)

Note: For some reason, I had to call click_link ‘Today In IBD’ twice. This seems to be a bug or a quirk of investors.com. If you run into a similar problem, try calling click_link twice.

Note 2: The url “href”, :type => :attribute line isn’t technically correct, but it will print the current url in your terminal. Just don’t use it in production or in a Rails app yet. I’m still learning and figuring out how to do this correctly.
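If you have saved a copy of a login form and want to list the field names to pass to fill_textfield, something like the following might help. It is a sketch using REXML and not a scrubyt feature. The input_names helper and the form markup are made up (only the two field names come from the example above), and it assumes the markup is well-formed, which many real pages are not.

```ruby
require 'rexml/document'

# Hypothetical helper: list the name attribute of every input element
# that has one.
def input_names(form_markup)
  doc = REXML::Document.new(form_markup)
  REXML::XPath.match(doc, '//input')
              .map { |input| input.attributes['name'] }
              .compact
end

# Made-up login form using the field names from the script above.
login_form = <<~FORM
  <form action="/login" method="post">
    <input type="text" name="htmUserName" />
    <input type="password" name="htmPassword" />
    <input type="submit" value="Log in" />
  </form>
FORM

puts input_names(login_form)
```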

Comments

  1. Peter Szinek | July 5, 2008

    Hi Scott,

    Unfortunately you are right about the documentation - it sucks :-( It was fairly good when the very first version was released, because scRUBYt! didn’t know much at that time, but since then even some of that changed and a lot of new stuff was added, sometimes breaking backwards compatibility… and I have to agree that because of this, the docs/tutorials are basically meaningless right now (or at least would need a lot of revamping, beefing up, mashing up with all the info in the forum, filling in the gaps etc).

    Unfortunately I am pretty caught up with my new startup and don’t have time for all this right now. There have been a few guys who got enthusiastic about scRUBYt!, wrote a few posts, started answering questions on the forum or even started to work on a documentation - unfortunately all these efforts ended too quick without a real result.

    We even have a trac up (https://code.adelao.it/trac/scrubyt) but not enough time to edit it :(

    Today scRUBYt! knows about 30% of the final product I am imagining right now, so even though it was and is used in production already to scrape large sites with tens of thousands of records in some cases, it’s still pretty rough around the edges.

    Anyways… have fun and don’t hesitate to ask questions directly or on the forum.

    Cheers,
    Peter

  2. scott | July 5, 2008

    Thanks for the reply Peter and thanks for the link to the trac. All the examples on there are very very helpful.

    For anyone interested here’s the link: https://code.adelao.it/trac/scrubyt/browser/trunk/examples?rev=2

    (Go ahead and trust the SSL certificate. I think it’s just self-signed.)

    Best of luck with your startup.