Scraping data from the web

Nothing is more annoying than wanting to change web services and not being able to export your data. You would think that in this digital day and age exporting your data would be a default feature. Think again.

The first thing you can do is check whether there is an API you can use. When one is available, you can write a small client (or check if one already exists) and start fetching your data that way.
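To give an idea of how little code that takes, here is a minimal sketch of such a client using nothing but Ruby's standard library. The endpoint, the token and the JSON shape are all made up for illustration; a real service will document its own.

require 'net/http'
require 'json'

# Hypothetical endpoint and token, just for illustration.
uri = URI('https://saas-app.com/api/v1/contacts')
request = Net::HTTP::Get.new(uri)
request['Authorization'] = "Bearer #{ENV['API_TOKEN']}"

response = Net::HTTP.start(uri.host, uri.port, use_ssl: true) do |http|
  http.request(request)
end

contacts = JSON.parse(response.body)
puts "Fetched #{contacts.length} contacts"

But what if there isn't an API? What do you do then?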

Enter Mechanize

As the mechanize repo states, it is a library used for automating interactions with websites. It automatically stores and sends cookies, follows redirects, and can follow links and submit forms.
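To give you a quick taste (the URL here is just a placeholder), a minimal session looks like this:

require 'mechanize'

client = Mechanize.new
page = client.get('https://example.com')

# The cookie jar and redirects are handled for us; from here
# we can inspect or follow any link found on the page.
page.links.each { |link| puts link.href }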

So basically, it is a scriptable browser! Let's say we are using a SaaS platform that we need to log in to. The data we need is displayed in a table.

So let's get cracking.

Let's log in

First we need to install the Mechanize library. It runs on Ruby 1.9.2 or greater, and we'll need Nokogiri as well:

gem install nokogiri
gem install mechanize

Once everything is installed we can start with the coding part. Open up your favorite editor / IDE.

require 'mechanize'
require 'csv'

client = Mechanize.new
client.get('https://saas-app.com/login') do |page|
  # Find the form named "login-form", fill in the
  # credentials and submit it.
  page.form_with(name: 'login-form') do |f|
    f.login_user = 'user'
    f.login_pass = 'pass'
  end.submit
end

So what did we just do? We first required the mechanize library as well as csv, so we can later store the scraped data directly in a CSV document. We then initialized a new Mechanize instance that we call client.

Now that we have a Mechanize instance, we can retrieve the login page of the application we need to log in to. We tell Mechanize to search for a form named "login-form", fill in the login_user and login_pass form fields, and then submit the form. login_user and login_pass are the names of the form fields.
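By the way, if you don't know the form or field names up front, you can have Mechanize list them for you. A quick sketch, run against the same login page:

client.get('https://saas-app.com/login') do |page|
  page.forms.each do |form|
    puts form.name
    # Print every field name so we know what to fill in.
    form.fields.each { |field| puts "  #{field.name}" }
  end
end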

Fetching the data

Now that we are logged in, we can start retrieving our data. The data at hand can be found at "https://saas-app.com/my-data". This page contains the data table we need, but a single page is limited to 20 items and we currently hold 200 items. That means we have 10 pages to process.

On each page you will find a table containing data. The table also contains some columns that we don't need to scrape, so we will omit those. We want to collect the following data:

  • first name: column 4
  • last name: column 5
  • phone number: column 7
  • e-mail address: column 9

Let's take a look at the following snippet:

CSV.open("/tmp/data.csv", "w") do |csv|
  1.upto(10) do |page_number|
    client.get("https://saas-app.com/my-data/page/#{page_number}") do |page|
      document = Nokogiri::HTML::Document.parse(page.body)

      # Grab every row of the data table; drop the first
      # row because it only contains the table headers.
      rows = document.xpath('//table[@class="data-list"]/tr')
      rows.drop(1).each do |row|
        first_name = row.at_xpath('td[4]/text()').text
        last_name = row.at_xpath('td[5]/text()').text
        phone = row.at_xpath('td[7]/text()').text
        email = row.at_xpath('td[9]/text()').text

        # Write one CSV line per table row.
        csv << [first_name, last_name, phone, email]
      end
    end
  end
end

We start off by opening our CSV file so we can write our data into it. Then we start iterating: the outer loop walks over the pagination, from page 1 up to page 10.

Inside the page iteration, we tell Mechanize to grab the appropriate page. From there, Nokogiri takes over: we need it to parse the HTML content. (Mechanize actually parses HTML pages for you already, and page.parser exposes the same Nokogiri document, but we parse explicitly here to keep the steps visible.)

So we parse the response body with Nokogiri and collect all the rows from the table with the class name "data-list". From that collection of rows, we drop the first one. Why, you ask? Because the first row contains the table headers.

Then it is time to extract the data from the desired table cells. The first and last name are displayed in the 4th and 5th cells, while the phone number and e-mail address live in the 7th and 9th.
Finally, we store each extracted row in the CSV file.
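One caveat: at_xpath returns nil when a cell (or its text node) is missing, so calling .text on the result would raise an error on an incomplete row. If your table is not perfectly regular, a small defensive helper like this hypothetical cell_text can save you:

# Return the stripped text of the nth cell of a row,
# or an empty string when the cell is missing.
def cell_text(row, index)
  node = row.at_xpath("td[#{index}]")
  node ? node.text.strip : ''
end

first_name = cell_text(row, 4)
last_name = cell_text(row, 5)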

Everything from selecting the table rows to extracting the table cells is done using XPath expressions. Even though this simple example limits itself to retrieving pages and extracting data from a table, you could easily expand it to visit a detail page and extract data from a form as well.
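As a rough sketch of that idea: assuming each row holds a link to its detail page in a 10th cell, and that page contains a form named "contact-form" with a notes field (all made-up names), the extension could look like this:

# Follow the link in the last cell of the row and read
# a value out of a form on the detail page.
detail_link = row.at_xpath('td[10]/a/@href')
if detail_link
  detail_page = client.get(detail_link.text)
  form = detail_page.form_with(name: 'contact-form')
  notes = form['notes'] if form
end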

As you can see, Mechanize is indeed a powerful library when you need to collect data from HTML pages in a dynamic way. So keep it in mind. You never know when you'll need it.