Friday, September 4, 2009

Convert PDF pages to Image files

Another common question we get by email is whether Solid Framework can convert PDF pages into image files. It can, and we use this feature to create the page thumbnail images and the main page view in PDF Navigator. Here is a diagram of how this works:


You can download the sample project [zip file] to see this in action yourself. The project contains both Visual Studio 2005 and Visual Studio 2008 solutions. Those without Microsoft Visual Studio can use Visual C# 2008 Express Edition for free to work with the sample project.

Earlier we talked about using a C# class library to allow you to use the scripting functionality of Solid PDF Tools Scan to PDF from the command line. We use this class again to parse out the command line arguments we need to convert the pages into image files:


  Arguments CommandLine = new Arguments(args);
  if (CommandLine["f"] == null)
  {
    ShowUsage();
    return -1;
  }
  else
    pdfFile = CommandLine["f"];

  if (CommandLine["p"] != null)
    password = CommandLine["p"];

  if (CommandLine["o"] == null)
  {
    ShowUsage();
    return -2;
  }
  else
    outputfolder = CommandLine["o"];

  // Note: We default to 96 dpi if the parameter was not provided.
  if (CommandLine["d"] != null)
    dpi = Convert.ToInt32(CommandLine["d"]);

  if (CommandLine["t"] != null)
  {
    switch (CommandLine["t"].ToUpper())
    {
      case "TIF":
      case "TIFF":
        imagetype = ImageType.TIFF;
        break;
      case "BMP":
        imagetype = ImageType.BMP;
        break;
      case "JPEG":
      case "JPG":
        imagetype = ImageType.JPG;
        break;
      case "PNG":
      default:
        imagetype = ImageType.PNG;
        break;
    }
  }

  if (CommandLine["r"] != null)
  {
    pagerange = CommandLine["r"];
  }

  DoConversion(pdfFile, password, outputfolder, dpi, pagerange, imagetype);

The code above sets up the arguments to hand off to DoConversion. So let's say we have a PDF file at c:\mypdfs\pdftest.pdf that is encrypted with a user password of "mypassword", and we want to make JPEG images of pages 1-5, 7, and 8 at 127 dpi and put these images in c:\myimages. The command line would look like this:

PDFtoImage.exe -f:c:\mypdfs\pdftest.pdf -p:mypassword -o:c:\myimages -d:127 -t:JPG -r:1-5,7,8


Note: -p, -d, -t, and -r are optional. No password is used if -p is missing, the DPI defaults to 96, and the image type defaults to PNG. If -r is missing, images are made for all pages.
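
For readers who do not have the Arguments class from the earlier post at hand, the -key:value syntax used above could be parsed roughly as follows. This is only a sketch: SimpleArgs, Parse, and GetDpi are hypothetical names, not the Arguments class that ships with the sample project, and only the -key:value form shown in this post is handled.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the Arguments class; handles only "-key:value".
public static class SimpleArgs
{
    public static Dictionary<string, string> Parse(string[] args)
    {
        Dictionary<string, string> result =
            new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);

        foreach (string arg in args)
        {
            // Skip anything that is not a switch.
            if (arg.Length < 2 || arg[0] != '-')
                continue;

            // Split on the FIRST colon only, so a value such as
            // c:\myimages keeps its own colon.
            int colon = arg.IndexOf(':');
            if (colon == 1)
                continue;                                   // "-:value" has no key
            if (colon < 0)
                result[arg.Substring(1)] = string.Empty;    // switch without a value
            else
                result[arg.Substring(1, colon - 1)] = arg.Substring(colon + 1);
        }
        return result;
    }

    // Reads -d as a positive integer, falling back to 96 when the switch
    // is missing or is not a valid number.
    public static int GetDpi(Dictionary<string, string> parsed)
    {
        string text;
        int dpi;
        if (parsed.TryGetValue("d", out text) && int.TryParse(text, out dpi) && dpi > 0)
            return dpi;
        return 96;
    }
}
```

Using int.TryParse here is a design choice of the sketch: a bad value such as -d:abc falls back to the default, whereas Convert.ToInt32 would throw a FormatException.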

The DoConversion function is the meat of the project. First we set the trial license:

  // Setup the license
  SolidFramework.License.ActivateDeveloperLicense();

Next we load the PDF file, with the password if one was supplied:

  // Load up the document
  SolidFramework.Pdf.PdfDocument doc =
    new SolidFramework.Pdf.PdfDocument(file, password);

  doc.Open();

After the document is open, we check to see if the output folder exists, and if it doesn't, we create it:

  // Setup the outputfolder
  if (!Directory.Exists(folder))
  {
    Directory.CreateDirectory(folder);
  }
  // Setup the file string.
  string filename = folder + Path.DirectorySeparatorChar +
    Path.GetFileNameWithoutExtension(file);
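
As an aside, the same kind of path can be built with Path.Combine, which avoids a doubled separator when the folder argument already ends in one. The helper below is a sketch only: OutputPaths and BuildPageImagePath are made-up names that are not part of the sample project, and the name-page.extension pattern is taken from the CreateImageFromPage code in this post.

```csharp
using System;
using System.IO;

public static class OutputPaths
{
    // Hypothetical helper: builds "<folder>/<pdf name>-<page>.<extension>".
    public static string BuildPageImagePath(string folder, string pdfFile,
        int pageIndex, string extension)
    {
        // Strip the directory and the ".pdf" extension from the input file.
        string baseName = Path.GetFileNameWithoutExtension(pdfFile);

        // Passing the name as an argument (rather than concatenating it into
        // the format string) keeps names containing '{' or '}' safe.
        string fileName = string.Format("{0}-{1}.{2}", baseName, pageIndex, extension);

        return Path.Combine(folder, fileName);
    }
}
```

With the example command line above, page 3 would end up as pdftest-3.jpg inside c:\myimages.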

Now we walk the Pages dictionary and find the page items by following the references:

  // Get our pages.
  List<SolidFramework.Pdf.Plumbing.PdfPage> Pages =
    new List<SolidFramework.Pdf.Plumbing.PdfPage>(doc.Catalog.Pages.PageCount);

  SolidFramework.Pdf.Catalog catalog =
    (SolidFramework.Pdf.Catalog)SolidFramework.Pdf.Catalog.Create(doc);

  SolidFramework.Pdf.Plumbing.PdfPages pages =
    (SolidFramework.Pdf.Plumbing.PdfPages)catalog.Pages;

  ProcessPages(ref pages, ref Pages);

Then, if a page range is specified, we parse the argument into integer page numbers. We create an image for each page in the range, or for every page if no range was given:

  // Check for page ranges
  PageRange ranges = null;
  bool bHaveRanges = false;
  if (!string.IsNullOrEmpty(pagerange))
  {
    bHaveRanges = PageRange.TryParse(pagerange, out ranges);
  }

  if (bHaveRanges)
  {
    int[] pageArray = ranges.ToArray();
    foreach (int number in pageArray)
    {
      CreateImageFromPage(Pages[number], dpi, filename, number, extension, format);
      Console.WriteLine(string.Format("Processed page {0} of {1}", number,
      Pages.Count));
    }
  }
  else
  {
    // For each page, save off a file.
    int pageIndex = 0;
    foreach (SolidFramework.Pdf.Plumbing.PdfPage page in Pages)
    {
      // Update the page number.
      pageIndex++;

      CreateImageFromPage(page, dpi, filename, pageIndex, extension, format);
      Console.WriteLine(string.Format("Processed page {0} of {1}", pageIndex,
      Pages.Count));
    }
  }
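
The PageRange class lives in the sample project and is not reproduced in this post. To give an idea of what parsing a string such as 1-5,7,8 involves, here is a rough sketch. SimplePageRange and its TryParse method are hypothetical names, not the project's implementation, and the sketch assumes 1-based page numbers with no open-ended ranges.

```csharp
using System;
using System.Collections.Generic;

public static class SimplePageRange
{
    // Parses text such as "1-5,7,8" into distinct, ascending, 1-based
    // page numbers. Returns false (and an empty array) on malformed input.
    public static bool TryParse(string text, out int[] pages)
    {
        pages = new int[0];
        if (string.IsNullOrEmpty(text))
            return false;

        List<int> found = new List<int>();
        foreach (string part in text.Split(','))
        {
            string[] bounds = part.Trim().Split('-');
            int first, last;
            if (bounds.Length == 1)
            {
                // Single page, e.g. "7".
                if (!int.TryParse(bounds[0], out first)) return false;
                last = first;
            }
            else if (bounds.Length == 2)
            {
                // Closed range, e.g. "1-5".
                if (!int.TryParse(bounds[0], out first) ||
                    !int.TryParse(bounds[1], out last)) return false;
            }
            else
            {
                return false;
            }

            if (first < 1 || last < first) return false;

            for (int page = first; page <= last; page++)
            {
                if (!found.Contains(page)) found.Add(page);
            }
        }

        found.Sort();
        pages = found.ToArray();
        return true;
    }
}
```

One thing to keep in mind with any parser like this: if the numbers it returns are 1-based, as in this sketch, a zero-based List has to be indexed with number - 1. Whether the project's PageRange returns 0-based or 1-based numbers is not shown in this post.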

We load each requested Page object and request a bitmap from that object. We then request that the bitmap object save itself to a file in the output directory with the requested ImageFormat type.

  private static void CreateImageFromPage(SolidFramework.Pdf.Plumbing.PdfPage page,
    int dpi, string filename, int pageIndex, string extension,
    System.Drawing.Imaging.ImageFormat format)
  {
    // Create a bitmap from the page with set dpi.
    Bitmap bm = page.DrawBitmap(dpi);

    // Setup the filename.
    // Pass filename as an argument so braces in the name cannot break the format string.
    string filepath = string.Format("{0}-{1}.{2}", filename, pageIndex, extension);
    // If the file exists already, delete it (i.e. overwrite it).
    if (File.Exists(filepath))
      File.Delete(filepath);

    // Save the file.
    bm.Save(filepath, format);

    // Cleanup.
    bm.Dispose();
  }
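
To estimate how large the resulting images will be, recall that PDF page sizes are expressed in points, with 72 points per inch by default, so the pixel size at a given DPI is points * dpi / 72. The helper below is illustrative only: PageSizeMath and PointsToPixels are made-up names, and how DrawBitmap itself rounds fractional pixels is an assumption (it may round differently).

```csharp
using System;

public static class PageSizeMath
{
    // Converts a length in PDF points (1/72 inch) to pixels at the given DPI,
    // rounded to the nearest whole pixel. The rounding mode is an assumption.
    public static int PointsToPixels(double points, int dpi)
    {
        return (int)Math.Round(points * dpi / 72.0, MidpointRounding.AwayFromZero);
    }
}
```

For example, a US Letter page is 612 x 792 points, so at the default 96 dpi the arithmetic gives 816 x 1056 pixels.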

And there you have it. The requested images should now be in the specified output directory. Since we are using the free developer trial license, each page image will have a watermark at the bottom of the page. To remove this watermark, read more about an annual license for the Solid Framework Tools Edition here ($250 or $500 per year depending on distribution, no royalties).

Have any thoughts that you'd like to share? Please contact us with your feedback.

2 comments:

Becker said...

FYI, it looks like there is something wrong with your project here since it appears to not grab the text on the first page, it creates a blank page (except for watermark). The second page on works fine, but page 1 seems to be blank.

Greg Greenaae said...

The project zip has been updated with the latest shipping build of SolidFramework dll. This resolves the first page blank issue.