Monday, December 5, 2016

Reading data from PDFs with CloverETL

Do you have PDF files that you need to read and process data from? In most cases, it is preferable to do this with your existing toolset rather than purchase another tool to translate PDFs into a machine-readable format. In this post I will show you how to read PDFs from within CloverETL, and hopefully you can apply that knowledge to any other data format that CloverETL does not support out of the box.

First, let's take a look at a portion of the PDF.

Looking at the PDF, you can see that there are 8 columns that we need to read: Description, Item#, Seg, Seq#, Len, DT, Rep, Table. If you have looked at something similar in the past, you know that you cannot read this file with any of CloverETL's out-of-the-box components. However, CloverETL ships with a CustomJavaReader component (as well as CustomJavaTransformer and CustomJavaWriter), which lets you extend the CloverETL Engine's capabilities with custom Java code that fits your requirements. There are a few prerequisites for doing this, and the details depend on the problem you are solving.

1. All .jar files must be accessible to the CloverETL Designer (and to the CloverETL Server if you have one). That means you must import all .jar files into your sandbox and add them to the build path so CloverETL knows where these libraries are. We recommend placing the external .jar files in the lib/ directory so that all developers and operators agree on where they are kept.

2. Rather than building your own Java class from scratch, I recommend starting with the template that CloverETL provides for reading, transforming, or writing data. Open the Algorithm property and CloverETL will create a Java class for you, which you can edit directly in that pop-up editor or copy and paste into a separate Java file within the Designer.

3. After you have developed your custom reader, configure the CustomJavaReader to use your newly created Java class. The configuration depends on which option you selected above. If you created your own class outside of the component, use the Algorithm class property to point the CustomJavaReader at it.

4. Create the rest of your graph as your requirements dictate (my example does little beyond this because I am only showcasing that you can read PDFs with CloverETL).
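To make the custom reader a little more concrete, here is a minimal sketch of the row-parsing logic such a class might contain. It is not the code from my graph, and it leaves out the CloverETL template and the PDF library entirely: it assumes some PDF text-extraction library has already turned each table row into one line of text. The class name PdfRowParser, its methods, and the sample lines are all made up for illustration. The sketch also assumes that only the Description column contains spaces and that every other column has a value, so the last seven tokens of a line map onto Item#, Seg, Seq#, Len, DT, Rep and Table.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: turns text lines extracted from a PDF into 8-column rows. */
final class PdfRowParser {

    // Column names as they appear in the PDF described in this post.
    static final String[] COLUMNS = {
        "Description", "Item#", "Seg", "Seq#", "Len", "DT", "Rep", "Table"
    };

    /**
     * Splits one line of text into 8 fields.
     * Returns null for blank lines, the header line, and lines with too few tokens.
     * Assumption: only Description may contain spaces, so the last 7 tokens
     * are the remaining columns and everything before them is the description.
     */
    static String[] parseRow(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return null;
        }
        String[] tokens = trimmed.split("\\s+");
        if (tokens.length < COLUMNS.length) {
            return null; // e.g. page numbers or footers
        }
        int descriptionTokens = tokens.length - (COLUMNS.length - 1);
        String[] fields = new String[COLUMNS.length];
        fields[0] = String.join(" ", Arrays.copyOfRange(tokens, 0, descriptionTokens));
        System.arraycopy(tokens, descriptionTokens, fields, 1, COLUMNS.length - 1);
        // Skip the table header if it was extracted as a regular line.
        return Arrays.equals(fields, COLUMNS) ? null : fields;
    }

    /** Parses every line and keeps only the ones that look like table rows. */
    static List<String[]> parseAll(List<String> lines) {
        List<String[]> rows = new ArrayList<>();
        for (String line : lines) {
            String[] fields = parseRow(line);
            if (fields != null) {
                rows.add(fields);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // Made-up sample lines standing in for text extracted from the PDF.
        List<String> lines = Arrays.asList(
            "Description Item# Seg Seq# Len DT Rep Table",
            "Customer Name 101 HDR 2 30 ST N T01",
            "",
            "Order Date 102 HDR 3 8 DT N T02");
        for (String[] row : parseAll(lines)) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

Inside a real CustomJavaReader you would copy each parsed field into the output record and send it to the output port, following whatever the generated template shows. If your PDF has empty cells, splitting on whitespace like this will not be enough, and you would need position-based extraction instead.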

Here is what my graph looks like:

The results of the last execution below show data from the PDF on the edge:

If you would like to see the custom Java code or the graph used to create this example, I would be more than happy to share it with you. Using this approach, you can quickly read any format that CloverETL cannot read natively and stream it into a graph or jobflow as you would any other input data.
