Finding dead links in PDF files

Some time ago, I was involved in a project that contained a vast number of documents (PDF files). Although the files were still mostly accurate, they contained a lot of links to external resources. Over time, many of these links changed, resulting in dead links.
If you don’t have a lot of PDF files, you could perhaps check them manually, but when the library holds more than 5000 PDF files, you should be leaning towards an automated approach 🙂.

To tackle this problem, I needed to figure out which tools I would use. Of course, my weapon of choice would be Ruby, in conjunction with some Rake tasks. Check out my previous post about some Rake basics.

Next, I needed a way to scan the PDF files’ content and extract the links, so I could test each URL for its response. To do this, I decided to use the Linux program “pdftohtml”. Pdftohtml can convert PDF files into HTML, XML and PNG images. You will most likely find the program in your distro’s package repository.

Converting the PDF files to HTML allows us to “read” the content and extract the links. The easiest way to do this in Ruby is with Nokogiri, a gem for parsing HTML/XML documents.
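As a small illustration of what Nokogiri gives us, extracting link elements boils down to a CSS selector. The HTML fragment below is just a made-up example:

require 'nokogiri'

# A tiny, made-up HTML fragment to illustrate the idea
html = <<~HTML
  <p>See <a href="https://example.com">example</a> and <a href="">this one</a>.</p>
HTML

doc = Nokogiri::HTML(html)

# Select every <a> element and print its href attribute
doc.css('a').each do |link|
  puts link['href']   # => "https://example.com", then an empty string
end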

So as a starter, make sure you have both the pdftohtml package installed on your computer / server and the Nokogiri gem installed.

Convert to HTML

Now that we have all the software installed, the first step is to convert the PDF files into HTML. We have gathered all the PDF files into one directory. Rake allows us to easily loop over all files in a specific directory and execute a shell command:

desc 'Convert PDF to HTML'
task :convert_to_html do
  mkdir_p 'html' # make sure the output directory exists

  Dir["pdfs/*.pdf"].each do |file|
    # Keep the original filename so the HTML file can be matched back to its PDF
    filename = File.basename(file)
    sh "pdftohtml -noframes 'pdfs/#{filename}' 'html/#{filename}.html'"
  end
end

As you can see, we loop over all PDF files inside the pdfs directory. We first take the basename of the file so we can name the converted HTML file in the same way.

Once we have the basename, we execute the pdftohtml program to convert that specific PDF file into an HTML document.
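To make the naming concrete, here is what that produces for a hypothetical file. Note that the .pdf part is kept, so the HTML files end up with a double extension:

file = 'pdfs/annual-report.pdf'   # hypothetical input path
filename = File.basename(file)    # => "annual-report.pdf"

# The converted document ends up at:
"html/#{filename}.html"           # => "html/annual-report.pdf.html"

The second task later strips the .html part again with File.basename(file, '.html'), which gives us back the original PDF filename for the CSV.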

Search the HTML for links

Once we have converted all the PDF files, we can start searching for all the links inside those HTML documents.

Of course we need to store the results somewhere. You could do this in a database, but for this example, let’s just store them in a simple CSV file.

So we first open a new CSV file and add the headers. For now, we will only store the filename and the URL; the status column is for the next step.

Then we iterate over all the HTML files, open each one and pass its content to Nokogiri.

We collect all link elements and then iterate over them to collect the href attribute values. We keep only the unique values and delete any that are empty.

Once we have all the URLs, we iterate over them once more, test each one’s response code and add the result to the CSV file.

desc 'Collect all links'
task :collect_links do
  require 'csv'
  require 'net/http'
  require 'nokogiri'

  # Start a fresh CSV file with a header row
  CSV.open('link-list.csv', 'w') do |csv|
    csv << %w(file url status)
    Dir["html/*.html"].each do |file|
      # Parse the converted HTML document
      doc = Nokogiri::HTML(File.read(file))
      # Collect the unique, non-empty href values of all <a> elements
      hrefs = doc.css('a').map { |link| link.attribute('href').to_s }
                 .uniq.sort.delete_if { |href| href.empty? }
      hrefs.each do |href|
        # Request each URL and record the HTTP status code;
        # unreachable hosts raise an exception, so record the error class instead
        begin
          status = Net::HTTP.get_response(URI.parse(href)).code
        rescue StandardError => e
          status = e.class.name
        end
        csv << [File.basename(file, '.html'), href, status]
      end
    end
  end
end

So at the end, you hold a list of all the links found in your PDF documents, together with their status codes. You can always filter out the links that don’t work anymore (40X and 50X response codes) so you can focus more easily on the dead links, but the principle remains the same.
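If you want to do that filtering in Ruby as well, a minimal sketch could look like this. It assumes the link-list.csv produced above and treats 40X/50X responses, as well as rows where the request failed entirely, as dead links:

require 'csv'

# Read the CSV produced by the collect_links task (headers: file, url, status)
CSV.foreach('link-list.csv', headers: true) do |row|
  status = row['status']
  # 4xx/5xx responses and non-numeric statuses (failed requests) are likely dead
  dead = status.to_i >= 400 || status !~ /\A\d+\z/
  puts "#{row['file']}: #{row['url']} (#{status})" if dead
end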