Scrubyt screen scraping tutorials
I am beginning to use the scrubyt gem, and in my opinion the documentation and examples are pretty bad. The name isn’t the greatest either. It’s difficult to type, so be careful in the following examples. (They should have just called it scruby or scrubby or something. The t is annoying.)
**However**, I am starting to understand how scrubyt works, and it really is one heck of an awesome and powerful gem.
I hope the following small scripts and basic tutorial will help speed your learning process for scrubyt.
Install Scrubyt
sudo gem install scrubyt
Create your first scrubyt script
Create a .rb file to test scrubyt. Call it scrubyt_example.rb and save it in your documents/scripts folder or documents/ruby folder or wherever you like.
Edit that file with the following code
require 'rubygems'
require 'scrubyt'

data = Scrubyt::Extractor.define do
  fetch 'http://google.com'
  title '//head/title'
end

data.to_xml.write($stdout, 1)
Test run the file
In your terminal, cd to the folder where you saved scrubyt_example.rb
cd ~/documents/code/ruby
Then run the file with the ruby command in your terminal from inside that folder
ruby scrubyt_example.rb
Ok, way to go. You just learned how to use an XPath (//head/title) to define the part of the web page you wanted. You can get deeper and deeper into the page content by writing just about any XPath you want (e.g. //body/table/tr/td).
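If you want to play with XPath expressions without hitting a live site, here is a minimal sketch that uses REXML (which ships with Ruby) instead of scrubyt. The inline page and the variable names are made up for illustration. REXML only parses well-formed markup, which real-world HTML often is not, so treat this as a way to experiment with paths and not as how scrubyt works internally.

```ruby
require 'rexml/document'

# Hypothetical stand-in for a fetched page (kept well-formed so REXML can parse it).
html = <<~HTML
  <html>
    <head><title>Example page</title></head>
    <body>
      <table><tr><td>first cell</td><td>second cell</td></tr></table>
    </body>
  </html>
HTML

doc = REXML::Document.new(html)

# Same XPath as in the scrubyt script: the title element inside head.
page_title = REXML::XPath.first(doc, '//head/title').text

# A deeper path: every td inside a table row in the body.
cells = REXML::XPath.match(doc, '//body/table/tr/td').map(&:text)

puts page_title
puts cells.inspect
```

Changing the path string and rerunning is a quick way to check that a path copied from Firebug selects what you expect.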
NOTE: I am using Firebug in Firefox 3 to view the path to these HTML elements. You could just as easily view source, but there is a good chance you’ll make a mistake. I recommend using Firebug.
Now let’s get a little more complicated by basically creating a for loop with scrubyt.
require 'rubygems'
require 'scrubyt'

extractor = Scrubyt::Extractor.define do
  # go to website
  fetch 'http://scottmotte.com/archive'
  # move down to the ul and call it content and for each ul *do* the following
  content '//body/div/div/div/ul' do
    # grab the content of the li element and call it post
    post '/li'
  end
end

puts extractor.to_xml
This script loops through all the li items on my archive page at http://scottmotte.com/archive
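The nesting idea (an outer pattern for each ul, an inner pattern for each li inside it) can be sketched with plain REXML as well. This is an illustration only: the markup below is a made-up stand-in for the archive page, and the name archive is hypothetical. One difference to watch for: in standard XPath a leading slash means the document root, so the sketch uses the relative path li where the scrubyt script writes '/li'.

```ruby
require 'rexml/document'

# Hypothetical archive markup; the real page structure will differ.
html = <<~HTML
  <html><body>
    <ul>
      <li>First post</li>
      <li>Second post</li>
    </ul>
    <ul>
      <li>Third post</li>
    </ul>
  </body></html>
HTML

doc = REXML::Document.new(html)

# Outer pattern: one result per ul.
# Inner pattern: each li, evaluated relative to that ul.
archive = REXML::XPath.match(doc, '//body/ul').map do |ul|
  REXML::XPath.match(ul, 'li').map(&:text)
end

archive.each_with_index do |posts, i|
  puts "content #{i + 1}:"
  posts.each { |post| puts "  post: #{post}" }
end
```

Each inner array holds only the li items of its own ul, which is the grouping the nested do block expresses in the scrubyt script.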
Ok, way to go. Now let’s learn how to log in somewhere. This is actually pretty darn easy with scrubyt. We are also going to click a link to change pages and spit out the URL of the page so that we know we have actually changed pages. Of course, you’ll need to change this to a website you have login access to. Just don’t try Gmail: it uses heavy JavaScript, and there is another trick to handling that which I haven’t tried yet. Scrubyt is not good at handling JavaScript because WWW::Mechanize is not. Also, your fill_textfield names will be different depending on the website. For example, on YouTube they are simply ‘username’ and ‘password’. View source or use Firebug to find out.
# example from http://scrubyt.rubyforge.org/files/README.html
require "rubygems"
require "scrubyt"

data = Scrubyt::Extractor.define do
  fetch 'http://www.investors.com/logOffConfirm.aspx?REG=un'
  fill_textfield 'htmUserName', '***your username***'
  fill_textfield 'htmPassword', '***your password***'
  submit
  click_link 'Today In IBD'
  click_link 'Today In IBD'
  click_link 'The Big Picture'
  url "href", :type => :attribute # this part isn't technically correct
end

data.to_xml.write($stdout, 1)
Note: for some reason, I had to do the click_link ‘Today In IBD’ twice. This seems to be a bug or something weird with investors.com. Either way, you might have a similar problem so try doing the click_link twice.
Note 2: the url “href”, :type => :attribute isn’t technically correct, but it will spit out the present url in your terminal. Just don’t use it for production or in a rails app yet. I’m still learning and figuring out how to make this correct.
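On why the fill_textfield names matter: in an ordinary HTML form, the name attribute of each input becomes the key that is sent to the server when the form is submitted. The sketch below only shows that encoding step with Ruby’s standard URI module. It does not log in anywhere, it is not how scrubyt or WWW::Mechanize is implemented, and the values are placeholders.

```ruby
require 'uri'

# Field names copied from the login example; the values are placeholders.
fields = {
  'htmUserName' => 'your-username',
  'htmPassword' => 'your-password'
}

# The kind of request body a browser sends for a URL-encoded form POST.
body = URI.encode_www_form(fields)
puts body

# Decoding gives back the name/value pairs the server reads.
puts URI.decode_www_form(body).inspect
```

If a field name is wrong, the server never receives the key it is looking for, which is why it pays to check the names in the page source or Firebug first.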

Peter Szinek | July 5, 2008
Hi Scott,
Unfortunately you are right about the documentation - it sucks.
It was fairly good when the very first version was released, because scRUBYt! didn’t know much at that time, but since then even some of that changed and a lot of new stuff was added, sometimes breaking backwards compatibility… and I have to agree that because of this, the docs/tutorials are basically meaningless right now (or at least would need a lot of revamping, beefing up, mashing up with all the info in the forum, filling in the gaps etc).
Unfortunately I am pretty caught up with my new startup and don’t have time for all this right now. There have been a few guys who got enthusiastic about scRUBYt!, wrote a few posts, started answering questions on the forum or even started to work on documentation - unfortunately all these efforts ended too quickly without a real result.
We even have a trac up (https://code.adelao.it/trac/scrubyt) but not enough time to edit it.
Today scRUBYt! knows about 30% of the final product I am imagining right now, so even though it was and is used in production already to scrape large sites with tens of thousands of records in some cases, it’s still pretty rough around the edges.
Anyways… have fun and don’t hesitate to ask questions directly or on the forum.
Cheers,
Peter
scott | July 5, 2008
Thanks for the reply Peter and thanks for the link to the trac. All the examples on there are very very helpful.
For anyone interested here’s the link: https://code.adelao.it/trac/scrubyt/browser/trunk/examples?rev=2
(Go ahead and trust the SSL certificate. I think it’s just self-signed.)
Best of luck with your startup.