Thursday, October 18, 2012

Using Java to Convert MS Office to PDF with OpenOffice

Introduction

I'm working on a project to convert MS Office documents, slides, spreadsheets, etc. into PDF.  The project actually involves reading email in .eml format, extracting the attachment and then converting the file to PDF, but I will skip that portion, since it is fairly straightforward with JavaMail (javax.mail). If there are enough requests, I could show it in a later post.
I first tried all the usual paths with some success and some frustration:
  • Apache POI
  • Apache FOP / XSL-FO
  • iText
  • Docx4j
Each solution seemed to have some different limitations such as:
  • Can only convert the newer OpenXML formats of docx, pptx, etc.
  • Cannot read older binary format of .doc, etc.
  • No direct path from PowerPoint to PDF (pptx > svg > pdf)
  • Or just plain annoying to code
I started reading about OpenOffice running as a service and using the JODConverter library to interface from Java to OpenOffice. I was able to mock up a prototype relatively quickly (a couple of hours), which was very exciting.  Then I wanted a webpage with a decent interface, so people could actually use it.  It's been a long time since I last wrote a Servlet, so I had to re-learn it and put it all together.

Requirements

If you want to do everything I did, then you are going to need everything on the list.  If you only need some of the pieces, feel free to pick those.  You may be able to get different versions of the software to work, but this is what I used:
  • Apache Tomcat Server 6 (6.0.24-45)
  • OpenJDK 1.6 (1.6.0_24 / IcedTea6 1.11.4)
  • JODConverter 2.2.2
  • OpenOffice (LibreOffice 3.4.5.2-16.1)
  • Apache Commons FileUpload / Commons IO (for handling file uploads)
Very Important
OpenOffice needs to be installed on the local machine and running as a service listening for connections.  It is similar on Windows and Unix, but we were using Unix for the prototype.  In a separate window, you can start OpenOffice listening on the localhost IP address and port 8100 with the following command:

soffice --headless --accept="socket,host=127.0.0.1,port=8100;urp;" --nofirststartwizard
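
Since nothing works until that listener is up, it can help to check the port before attempting a conversion. The following is a minimal sketch using only the standard library; the class and method names (OfficePortCheck, isListening) are made up for illustration, and the host and port simply mirror the command above. It only shows that something accepts TCP connections on the port, not that it is really OpenOffice.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper, not part of JODConverter or OpenOffice.
class OfficePortCheck {

    // Returns true if something accepts a TCP connection on host:port
    // within timeoutMillis. This proves a listener exists, nothing more.
    static boolean isListening(String host, int port, int timeoutMillis) {
        Socket socket = new Socket();
        try {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // refused, unreachable, or timed out
            return false;
        } finally {
            try {
                socket.close();
            } catch (IOException ignored) {
                // nothing useful to do if close fails
            }
        }
    }
}
```

A call such as OfficePortCheck.isListening("127.0.0.1", 8100, 2000) could run at servlet startup or just before each conversion, so the user gets a clear "OpenOffice is not running" message instead of a stack trace.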

The Code

Quick disclaimer...this code was just for a prototype and all of the necessary error checking has not been done. This was written just to get the basic functionality working.

Outline

package mypackage;
 
import java.io.*;
import java.net.ConnectException;
import java.util.List;
 

import org.apache.commons.fileupload.*;
import org.apache.commons.fileupload.servlet.*;
import org.apache.commons.fileupload.disk.*;

import org.apache.commons.io.FilenameUtils;

import javax.servlet.*;
import javax.servlet.http.*;
 
import com.artofsolving.jodconverter.*;
import com.artofsolving.jodconverter.openoffice.connection.*;
import com.artofsolving.jodconverter.openoffice.converter.*;


public class PDFConverter extends HttpServlet {
 
    public PDFConverter() { }

    //required for servlet, not used
    protected void doGet(HttpServletRequest request, HttpServletResponse response) throws
        ServletException, IOException

    //required for servlet, we will use to do the actual file upload process and converting
    protected void doPost(HttpServletRequest request, HttpServletResponse response) throws  
        ServletException, IOException 

    //our code to send the converted file back to the user
    protected void streamFile(File outFile, HttpServletResponse response)
  
    //our code to save the uploaded file to a place we can use and with correct extension
    protected File saveFile(InputStream input, String inFilename) throws IOException

    //our code that connects to OpenOffice and does the conversion
    protected boolean PDFConvert(File inFile, File outFile)

    //our code to create a temp file and return file handle
    protected File createTempFile(String inFilename)
  
    //another version where the extension is given since we have to change to pdf
    protected File createTempFile(String inFilename, String tmpFileExt)

    //our code to get the file extension based on the string
    protected String getFileExt(String fileName)
 
}

Upload HTML page

Just a simple HTML page that allows the user to browse for a file and upload it.  You can create any page you want, but this is the minimum needed to give the user an interface and tie the convert button to the servlet running at "/upload/PDFConverter".  Depending on how you deploy to your application server (such as Tomcat), this location may be different.
 
<html>
  <head><title>PDF Converter</title></head>
  <body>
    <form action="/upload/PDFConverter" method="post" enctype="multipart/form-data">
     Select file to convert:
    <input type="file" name="file" />
    <br/>
    <input type="submit" value="Convert to PDF"/>
    </form>
  </body>

</html>

doPost()

This is where it finally gets interesting.  The user uploads a file using the web page and it is processed by this routine in the servlet.  The doPost() function performs the following:
  1. Saves the upload and assigns it to the variable: tempInFile
  2. Creates a temp output file for the PDF: tempOutFile
  3. Converts to PDF with function PDFConvert() - the PDF file is now in tempOutFile
  4. Returns headers for PDF Content-Type and the filename
  5. Reads the PDF file from the disk and streams it with streamFile()

protected void doPost(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException  {
        
        try {
            List<FileItem> items = new ServletFileUpload(new DiskFileItemFactory()).parseRequest(request);
            for (FileItem item : items) {
                if (item.isFormField()) {
                    // Process regular form field (input type="text|radio|checkbox|etc", select, etc).
                } else {
                    // Process form file field (input type="file").
                    

                    // Save the uploaded file into a place OpenOffice will be able to read and with the right extension
                    File tempInFile = saveFile(item.getInputStream(),FilenameUtils.getName(item.getName()));

 
                    // Create a temp file with the correct extension of pdf so we can pass to openoffice
                    File tempOutFile = createTempFile(FilenameUtils.getName(item.getName()), ".pdf" );
                  
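                    //our wrapper code to convert to PDF with OpenOffice
                    //stop here if it fails, otherwise an empty temp file would be streamed back
                    if (!PDFConvert(tempInFile, tempOutFile)) {
                        throw new ServletException("Conversion to PDF failed.");
                    }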

                    //set response headers to PDF
                    response.setContentType("application/pdf");
                    response.addHeader("Content-Disposition", "attachment; filename=" + tempOutFile.getName());

                    //stream the output
                    streamFile(tempOutFile, response);
                }
            }
        } catch (FileUploadException e) {
            throw new ServletException("Cannot parse multipart request.", e);
        }

    }
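
One detail in doPost() worth tightening: the temp file name is built from the uploaded file name, so it can contain spaces or quotes, and it goes into the Content-Disposition header unquoted. Below is a minimal sketch of one way to build a safer header value. The class and method names (ContentDisposition, attachment) are hypothetical, and replacing every character outside a small ASCII set with '_' is a deliberately blunt choice, not the only option.

```java
// Hypothetical helper for building the header value used in doPost().
class ContentDisposition {

    // Builds an "attachment" Content-Disposition value with a quoted filename.
    // Characters outside a conservative ASCII set are replaced with '_' so that
    // quotes, control characters, or line breaks cannot break the header.
    static String attachment(String fileName) {
        StringBuilder safe = new StringBuilder();
        for (int i = 0; i < fileName.length(); i++) {
            char c = fileName.charAt(i);
            boolean ok = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
                    || (c >= '0' && c <= '9') || c == '.' || c == '-' || c == '_';
            safe.append(ok ? c : '_');
        }
        return "attachment; filename=\"" + safe + "\"";
    }
}
```

In doPost() this would be used as response.addHeader("Content-Disposition", ContentDisposition.attachment(tempOutFile.getName())).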

 

saveFile()

This function just takes the uploaded file and saves it into a location we can pass to OpenOffice.  When files are uploaded they are given an extension of .tmp.  OpenOffice uses the file extension to figure out what the file is, so we make sure it matches the extension of the file that was uploaded.  There is not much to explain here; we could probably have just moved and/or renamed the file.

protected File saveFile(InputStream input, String inFilename) throws IOException {
    //our code which requests a temp file

    File tmpFile = createTempFile(inFilename);

    FileOutputStream fos = new FileOutputStream(tmpFile);

    BufferedOutputStream bos = new BufferedOutputStream(fos);

    BufferedInputStream bis = new BufferedInputStream(input);
    int aByte;
    while((aByte = bis.read()) != -1) {
        bos.write(aByte);
    }

    bos.flush();
    bos.close();
    bis.close();

    return(tmpFile);
}



PDFConvert()

And finally, this is the interface to JODConverter.  Just a note: as mentioned earlier, you need to start OpenOffice as a service on the local machine before this process will work.  Version 3 of JODConverter takes care of this, but we were using version 2 here. The process is very simple:
  1. Make a socket connection to OpenOffice on port 8100
  2. Instantiate an OpenOfficeDocumentConverter()
  3. Convert the file, passing the original (inFile) and our target (outFile)
  4. The calling routine already has the outFile reference, so we only need to report whether it worked or not
protected boolean PDFConvert(File inFile, File outFile) {
   // connect to an OpenOffice.org instance running on port 8100
   OpenOfficeConnection connection = new SocketOpenOfficeConnection(8100);
   try {
       connection.connect();

       // convert
       DocumentConverter converter = new OpenOfficeDocumentConverter(connection);
       converter.convert(inFile, outFile);

       return(true);
   }
   catch(ConnectException ce) {
       ce.printStackTrace();
       return(false);
   }
   finally {
       // close the connection even if the conversion throws
       if (connection.isConnected()) {
           connection.disconnect();
       }
   }
}


The Rest of the Functions

The last three functions are included here for completeness. They do not relate directly to the PDF conversion, but they support it.

//calls the other createTempFile with extension set to null
protected File createTempFile(String inFilename) {
    return( createTempFile(inFilename, null));
}
   
protected File createTempFile(String inFilename, String tmpFileExt) {
    try {
        String tmpFileStr = "converted_" + inFilename;
           
        // if extension wasnt given figure it out
        if(tmpFileExt == null){
            tmpFileExt = "." + getFileExt(inFilename);
        }
           
        File tmpFile = File.createTempFile(tmpFileStr, tmpFileExt);
       
        return(tmpFile);
    }
    catch(IOException e) {
        e.printStackTrace();
        return(null);
    }
       
}
   
protected String getFileExt(String fileName) {
    int pos = fileName.lastIndexOf('.');
    String ext = fileName.substring(pos+1);
       
    return(ext);

}
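
The streamFile() routine from the outline is not listed in this post. What follows is only a sketch of what such a routine could look like, not the original code. To keep it runnable without the servlet API it takes a plain OutputStream; in the servlet you would pass response.getOutputStream(). The class name FileStreamer is made up, and try-with-resources assumes Java 7 or later (the prototype ran on Java 6, where a try/finally would be needed instead).

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical stand-in for the servlet's streamFile() method.
class FileStreamer {

    // Writes the contents of file to out in 8 KB chunks, flushes, and
    // returns the number of bytes written. The caller owns out and decides
    // when to close it (a servlet container normally closes the response
    // stream itself).
    static long streamFile(File file, OutputStream out) throws IOException {
        long total = 0;
        try (InputStream in = new FileInputStream(file)) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
                total += n;
            }
        }
        out.flush();
        return total;
    }
}
```

The returned byte count is handy for logging, and could also be used to set a Content-Length header if the file size is read before streaming.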

Wrapping It All Up

Here is the basic process flow of the program:
  1. When a user visits the starting page, they will be presented with a browse button or location input box to select a file on the local disk.  
  2. When the convert button is pressed, the Servlet executes doPost().  
  3. doPost() copies the uploaded file to another location with the same extension as the uploaded file.
  4. createTempFile() creates a new file with the extension of .pdf
  5. PDFConvert() is run with the uploaded file and the pdf file
  6. Content headers are returned to the browser
  7. The PDF file is streamed from the disk back through the response instance of the Servlet
Many additions can be made to the program, but this is the basic flow.  We really need a lot of error checking and good responses to any problems that will occur.  It would probably help to have some reasonable timeouts as well.
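
As a starting point for the timeout idea, here is a rough sketch that runs any task on a worker thread and gives up after a fixed wait. It uses only java.util.concurrent; the names TimedTask and runWithTimeout are invented for this example. One caveat: cancelling only interrupts the worker thread, and I have not verified that a JODConverter call in progress actually stops when interrupted, so treat this as a way to unblock the request, not a guaranteed way to stop the conversion.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper for putting a time limit on a conversion call.
class TimedTask {

    // Runs task on a worker thread and waits at most timeoutMillis for it.
    // Returns the task's result, or false if it timed out, threw, or
    // returned null.
    static boolean runWithTimeout(Callable<Boolean> task, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<Boolean> future = executor.submit(task);
        try {
            Boolean result = future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return result != null && result;
        } catch (TimeoutException e) {
            // interrupts the worker; the task only stops if it reacts to interruption
            future.cancel(true);
            return false;
        } catch (ExecutionException e) {
            // the task itself threw an exception
            return false;
        } catch (InterruptedException e) {
            // restore the interrupt flag for the calling thread
            Thread.currentThread().interrupt();
            future.cancel(true);
            return false;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

In doPost(), the call to PDFConvert(tempInFile, tempOutFile) could be wrapped in a Callable<Boolean> and passed to runWithTimeout with, say, a 60 second limit. Creating an executor per call keeps the sketch short; a shared pool would be the more usual choice in a real servlet.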

I hope this helps some people out.  This program was the culmination of several hours of researching on the web and pulling example code from many different locations.