Scraping Gmail with Mechanize and Hpricot

This quick tutorial will show you how to use Mechanize and Hpricot to log in to Gmail and return a list of unread emails.

Installation of required tools

gem install mechanize --include-dependencies

This will install both mechanize and hpricot.

Usage

Using Mechanize to log in to Gmail

Before we can scrape our Gmail account, we need to log in. Mechanize is a lib for “automating interaction with websites”. It also stores and sends cookies, so once we log in our script has a session to putter around in as if it were a web browser. AWESOME!

require 'rubygems'
require 'mechanize'

agent = WWW::Mechanize.new
page = agent.get 'http://www.gmail.com'

form = page.forms.first
form.Email = '***your gmail account***'
form.Passwd = '***your password***'

page = agent.submit form

Above you can see we have instantiated a Mechanize object. It can be thought of as the user agent, which can get web pages, click links, and fill out and submit forms. We can use Hpricot methods on our page object to parse the HTML it contains.
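To picture what the cookie handling mentioned earlier amounts to, here is a toy cookie jar in plain Ruby. It is only a sketch: ToyCookieJar and the cookie names and values are made up for illustration, and Mechanize's real jar does more (it also tracks domains, paths and expiry).

```ruby
# A toy cookie jar: NOT Mechanize's implementation, just the idea behind it.
class ToyCookieJar
  def initialize
    @cookies = {}
  end

  # Record the name=value pair from one Set-Cookie response header,
  # ignoring attributes such as Path or Secure.
  def store(set_cookie_header)
    pair = set_cookie_header.split(';').first
    name, value = pair.split('=', 2)
    @cookies[name.strip] = value.to_s.strip
  end

  # Build the Cookie request header to send with the next request.
  def header
    @cookies.map { |name, value| "#{name}=#{value}" }.join('; ')
  end
end

jar = ToyCookieJar.new
jar.store('SID=abc123; Path=/; Secure')      # hypothetical login cookie
jar.store('LSID=xyz789; Path=/accounts')     # hypothetical second cookie
puts jar.header
```

Every request after the login carries that header, which is what lets the script stay logged in.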

Forcing Gmail into basic mode

Gmail uses a lot of fancy JavaScript and Ajax functionality, and as such is one of the premier Web 2.0 sites on the net. Our little script doesn’t have a built-in JavaScript engine, so it won’t understand any of the crazy JS that’s thrown at it. Instead we need to force Gmail into Basic Mode, which is HTML only.

After logging in, Gmail will try to redirect us to http://mail.google.com/mail?ui&auth=DC8F…. We need to follow this link. Using Hpricot we can search for the meta redirect and grab the href attribute, then have Mechanize follow the link.

page = agent.get page.search("//meta").first.attributes['href'].gsub(/'/,'')

Note that we need to strip the single quotes from around the URL; I used gsub for this.
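The quote stripping on its own looks like this. It is a minimal sketch on a made-up value: the auth token is a placeholder and the real attribute value may look different.

```ruby
require 'uri'

# Hypothetical attribute value, wrapped in single quotes as described above.
raw = "'http://mail.google.com/mail?ui=1&auth=EXAMPLE'"

# gsub removes every single quote in the string.
clean = raw.gsub(/'/, '')

# Parsing the result is a cheap sanity check that we ended up with a usable URL.
uri = URI.parse(clean)
puts uri.host
puts uri.query
```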

The returned page will try to use JavaScript to load the interface, but that will not work for us. Thankfully a noscript tag is included in the source and contains a helpful clue.

<noscript><font face="arial">JavaScript must be enabled in order for you to use Gmail in standard view. 
However, it seems JavaScript is either disabled or not supported by your browser.
To use standard view, enable JavaScript by changing your browser options, then <a href="">try again</a>.

<p>To use Gmail's basic HTML view, which does not require JavaScript, 
<a href="?ui=html&zy=n">click here</a>.</p></font>

<p><font face="arial">If you want to view Gmail on a mobile phone or similar device 
<a href="?ui=mobile&zyp=n">click here</a>.</font></p></noscript>

Notice the line ‘To use Gmail’s basic HTML view, which does not require JavaScript’: it supplies a link with the GET vars ?ui=html&zy=n

The next step is to pass the above GET vars to the current URL, and we are in basic mode where we can scrape to our heart’s content.

page = agent.get page.uri.to_s.sub(/\?.*$/, "?ui=html&zy=n")
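To see what that sub call does, here it is on plain strings. The URLs are made up and stand in for page.uri.to_s; this is a sketch of the string handling only, with no request involved.

```ruby
# Made-up URL standing in for page.uri.to_s after the redirect.
current = 'http://mail.google.com/mail/?ui=1&auth=EXAMPLE'

# Replace everything from the first "?" to the end of the string.
basic = current.sub(/\?.*$/, '?ui=html&zy=n')
puts basic

# If the URL has no query string the pattern does not match, so sub
# returns the string unchanged and the GET vars are never added.
no_query = 'http://mail.google.com/mail/'
puts no_query.sub(/\?.*$/, '?ui=html&zy=n')
```

So the one-liner relies on the current URL already having a query string, which is the case right after the redirect described above.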

A simple puts page.root should show us the HTML output of our Gmail account.

Scrape!

Want to get a list of all your unread emails? This quick snippet will do the job.

page.search("//tr[@bgcolor='#ffffff']").each do |row|
  from, subject = *row.search("//b/text()")
  url = page.uri.to_s.sub(/ui.*$/, row.search("//a").first.attributes["href"])
  puts "From: #{from}\nSubject: #{subject}\nLink: #{url}\n\n"

  email = agent.get url # have the agent follow the email link for further parsing
end
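The two string tricks inside the loop can be tried without logging in. This is a sketch on made-up data: the href, sender and subject are placeholders, and the plain array stands in for the text nodes Hpricot returns.

```ruby
# Hypothetical values: the inbox URL in basic mode and the href of one
# message link (assumed here to have no leading "?").
inbox = 'http://mail.google.com/mail/?ui=html&zy=n'
href  = 'v=c&th=EXAMPLE'

# Everything from the first "ui" to the end is replaced by the link's href.
url = inbox.sub(/ui.*$/, href)
puts url

# Multiple assignment takes the first two elements and ignores the rest,
# which is how the loop pulls the sender and subject out of the bold text.
from, subject = *['Alice', 'Lunch on Friday?', 'extra']
puts "From: #{from}\nSubject: #{subject}"
```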

Full source

require 'rubygems'
require 'mechanize'

agent = WWW::Mechanize.new

page = agent.get 'http://www.gmail.com'
form = page.forms.first
form.Email = '***your gmail account***'
form.Passwd = '***your password***'
page = agent.submit form

page = agent.get page.search("//meta").first.attributes['href'].gsub(/'/,'')
page = agent.get page.uri.to_s.sub(/\?.*$/, "?ui=html&zy=n")
page.search("//tr[@bgcolor='#ffffff']").each do |row|
  from, subject = *row.search("//b/text()")
  url = page.uri.to_s.sub(/ui.*$/, row.search("//a").first.attributes["href"])
  puts "From: #{from}\nSubject: #{subject}\nLink: #{url}\n\n"

  email = agent.get url
  # ..
end

Enjoy.



Comments

  1. Chuck Bergeron

    Posted: 1 day later:

    Thanks Corban! This was a great introduction to a tool I had no idea existed.

  2. Matthew M. Boedicker

    Posted: 1 day later:

    Gmail provides an atom feed for your inbox. There is no need to scrape it.

    Just go to

    https://mail.google.com/mail/feed/atom

    and log in using basic HTTP auth to get your unread mail in a nice parseable format.

  3. Corban Brook

    Posted: 1 day later:

    I also use the feed to view my incoming mail. This was more of an example of how you can log in to Gmail with Mechanize. Once logged in to your Gmail account you can scrape anything: get your address book, check your Gtalk log transcripts, change your Gmail options.

    Gmail also has a webservice API, so there are many ways of getting at the information. My hope was that this tutorial would show you some basics with Mechanize. I can think of many sites I subscribe to which don’t offer RSS or WS APIs but contain helpful data I could scrape to create fun and interesting mashups.

  4. Peter

    Posted: 1 day later:

    Hi Corban,

    Maybe you are interested in my web-scraping toolkit, scRUBYt!, which is based on Mechanize and Hpricot. scRUBYt! is much easier to use because you don’t have to deal with all the ugly stuff (i.e. XPaths, HTML tag attributes etc.); the system learns them from examples. Check it out at http://scrubyt.org.

    I would like to ask whether I can use your plain-old-HTML trick in scRUBYt! to log in to Gmail. I am planning to replace Mechanize with FireWatir in the future, which can handle JavaScript, but until then I would like to work around this with your method. What do you think?

  5. Chip

    Posted: 9 days later:

    Corban,

    I have been looking for a long time for a way to extract (scrape) information from gmail. Over the months, I have sent myself links to websites, the subject header describes the link, and the link is in the body of the email. There are now hundreds of emails, and I want to pull them out to create bookmarks in firefox with them.

    I have no idea where to start, since my only programming experience comes from autohotkey…

    Any help or pointers?

  6. Nitai

    Posted: 11 days later:

    Thanks Corban! It opened my ways to think, but I have a bigger problem. Can you help me scrape Orkut, another service provided by Google? It’s not as simple as Gmail, because in the first step, ‘login to orkut’, the mech agent returns a page with a JavaScript redirection plus a JS function with a state machine. At this point we should write Ruby code that does the same as that JS function, but I’m not sure I wrote it right because a ‘bad request’ is thrown.

    Any Help??

  7. Corban Brook

    Posted: 11 days later:

    Well seems there has been some interest in this basic example.

    I think I will expand on it and write up another example. Nitai, you were having trouble with orkut. I will try to tackle that. Stay tuned.

  8. Corban Brook

    Posted: 11 days later:

    I haven’t logged into my Orkut account in over a year… seems there are about 100 messages from Brazilian women. Let me find a good example:

    “hello I look you in comunnity I add you kissss sorry but I don’t speak englis very well”

    and

    “FROM: Renata TO: Corban MY name is Renata, is part of its community “EBM”, added you in my list of friends! Kisses!”

    Fun.

  9. Nitai

    Posted: 11 days later:

    Hello Corban, I’m Brazilian, so if you need some help with Portuguese I’ll do it.

    :D

  10. Corban Brook

    Posted: 11 days later:

    Nitai

    Go figure, that’s freaking brilliant.

    Well, I stayed up last night and figured out how to log in to Orkut. When I have a chance today I’ll post a new article… noon my time, EST.

  11. Nitai

    Posted: 11 days later:

    Corban, I think a piece of my code could help you>>>

    agent = WWW::Mechanize.new
    agent.user_agent_alias = 'Linux Mozilla'
    page = agent.get('https://www.orkut.com/')

    # the real logon form is inside an internal frame
    page = agent.get(page.iframes[0].src)

    form = page.forms.first
    form.Email = '#email#'
    form.Passwd = '#passwd#'

    page = agent.submit form

    At this point we get the page with the JS function I mentioned before.

    way to go…

  12. Nitai

    Posted: 11 days later:

    Sorry about the code formatting…

  13. Corban Brook

    Posted: 11 days later:

    Ok Nitai, I posted the new article: http://schf.uc.org/articles/2007/02/26/breaking-into-orkut-with-mechanize

About

    Buildingsky.net is comprised of Corban Brook and Maciek Adwent. We build experimental web applications.

    We are interested in computer science, ruby-lang, javascript, web technologies, audio synthesis, finance/economics.
