From PDF to Image

over 3 years ago by Alex

There are situations when we want to convert a PDF into a set of images that we can then display nicely on a web page or on mobile devices. In Rails this is quite easy to do, but there are a few things you should know.

In your Gemfile, add the following gem:

gem "docsplit"

Docsplit is a command-line utility and Ruby library for splitting documents into their component parts: plain text (via OCR if necessary), page images or thumbnails in any format, PDFs, single pages and document metadata (title, author, number of pages and so on).

Dependencies for Docsplit

brew install graphicsmagick
brew install poppler

GraphicsMagick provides a robust and efficient collection of tools and libraries that support reading, writing and manipulating images in over 88 major formats, including popular ones such as JPEG, PNG and PDF.

Poppler is a PDF rendering library based on the xpdf-3.0 code base.

Dependencies for Poppler

Download and install XQuartz on your MacBook.

XQuartz is an open-source effort to develop a version of the X.Org X Window System that runs on OS X, together with supporting libraries and applications.

Install Ghostscript with the following command:

brew install ghostscript

Ghostscript is an interpreter for the PostScript language and for PDF.
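
Before moving on, it can help to confirm that the command-line tools from these packages are actually on your PATH. The snippet below is a minimal sketch in plain Ruby. The helper name missing_tools and the list of binaries are assumptions made for illustration: gm ships with GraphicsMagick, pdftotext with Poppler and gs with Ghostscript, but the exact set Docsplit calls depends on which features you use.

```ruby
# Hypothetical helper (not part of Docsplit): returns the names from `tools`
# that cannot be found as executables in any directory listed in `path`.
def missing_tools(tools, path = ENV.fetch("PATH", ""))
  dirs = path.split(File::PATH_SEPARATOR)
  tools.reject do |tool|
    dirs.any? do |dir|
      candidate = File.join(dir, tool)
      File.file?(candidate) && File.executable?(candidate)
    end
  end
end

# Assumed binaries: graphicsmagick -> gm, poppler -> pdftotext, ghostscript -> gs.
REQUIRED_TOOLS = %w[gm pdftotext gs].freeze

missing = missing_tools(REQUIRED_TOOLS)
if missing.empty?
  puts "All required tools were found."
else
  puts "Missing tools: #{missing.join(', ')}"
end
```

If anything is reported as missing, rerun the corresponding brew install command and open a new terminal so that PATH is refreshed.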

Once the dependencies are installed, we can take a look at the code that converts a PDF into PNG images.

require 'open-uri'
require 'fileutils'

class ImageProcess
  include ActionView::Helpers::UrlHelper

  def self.convert_pdf_to_png(page_image_id, page_id)
    # Adapt this lookup to your own models; `file` must end up pointing at the uploaded PDF.
    page_image = PageImage.find(page_image_id)
    file = page_image.image

    # Create temporary folders for the downloaded PDF and the generated images.
    upload_path = "#{Rails.root}/tmp/files/#{Time.now.to_i}"
    output_path = "#{upload_path}/output"
    file_name = "#{upload_path}/#{file.file.filename}"
    
    FileUtils.mkdir_p upload_path

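    # Download the PDF to local disk. URI.open is needed on Ruby 3.0+, where Kernel#open no longer accepts URLs.
    File.open(file_name, "wb") { |f| f.write(URI.open(file.url).read) }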

    FileUtils.mkdir_p output_path

    # Split the PDF file into images.
    Docsplit.extract_images(file_name, size: '2000x', format: [:png], output: output_path)

    Dir.glob("#{output_path}/*.png") do |image_file|
      # Docsplit names each image "<basename>_<page number>.png".
      page_number = image_file[/_(\d+)\.\w+\z/, 1]

      # Store the newly created image files (adapt the line below to your own models).
      PageImage.create(page_id: page_id, image: File.new(image_file), image_type: "image", page_number: page_number)
    end

    # Remove the temporarily created files.        
    FileUtils.rm_rf(upload_path)
  end
end
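
Two details of the class above are worth a closer look: the working folder is named after the current second, so two jobs starting in the same second would share it, and Dir.glob does not promise to return pages in numeric order. The following is a minimal sketch of one way to handle both using only the Ruby standard library. The helper names with_workspace, page_number_from and sort_by_page are hypothetical and not part of Docsplit or Rails; the sketch assumes image names end in an underscore and the page number, as the regular expression above expects.

```ruby
require "tmpdir"
require "fileutils"

# Extract the page number from an image name such as "report_12.png".
# Returns nil when the name has no numeric suffix.
def page_number_from(path)
  match = File.basename(path).match(/_(\d+)\.\w+\z/)
  match && match[1].to_i
end

# Keep only paths that carry a page number and order them numerically,
# so that page 10 comes after page 2 rather than before it.
def sort_by_page(paths)
  paths.select { |p| page_number_from(p) }.sort_by { |p| page_number_from(p) }
end

# Yield a unique scratch directory plus an "output" subfolder.
# Dir.mktmpdir removes the directory when the block exits, even on error.
def with_workspace(prefix = "pdf-to-png")
  Dir.mktmpdir(prefix) do |dir|
    output = File.join(dir, "output")
    FileUtils.mkdir_p(output)
    yield dir, output
  end
end

# Usage with placeholder files standing in for the generated images.
with_workspace do |_dir, output|
  %w[doc_10.png doc_2.png doc_1.png].each { |name| FileUtils.touch(File.join(output, name)) }
  pages = sort_by_page(Dir.glob(File.join(output, "*.png")))
  puts pages.map { |p| page_number_from(p) }.inspect # prints [1, 2, 10]
end
```

In the class above, the Docsplit.extract_images call and the PageImage.create loop would go inside the with_workspace block, which makes the explicit FileUtils.rm_rf at the end unnecessary.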

Heroku Deployment

Because we rely on GraphicsMagick to convert PDFs into images, deploying on Heroku requires adding a custom buildpack that provides this set of libraries. You can read more about this in the linked article, Docsplit on Heroku.

Useful resources

http://documentcloud.github.io/docsplit/
http://poppler.freedesktop.org/
http://www.graphicsmagick.org/
http://xquartz.macosforge.org/landing/
http://www.ghostscript.com/
