Thursday, December 15, 2016

Data Governance with CloverETL

Let's start by defining data governance. According to wikipedia (because wikipedia knows all), "Data governance is a control that ensures that the data entry by an operations team member or by an automated process meets precise standards, such as a business rule, a data definition and data integrity constraints in the data model. " So at the end of the day, an organization's data governance policy is specifically created to standardize the data model to improve operational efficiency. If you are given a project where you need to create a database schema, tables, columns, etc, you have your recipe in order to complete the task by following the data governance policy. However, the recipe is only as good as the cook who is creating it. 

Data governance isn't a 'glory' topic. In fact, it's quite the opposite! Developers (database, application, etc) and DevOps absolutely hate projects that involve implementing or following an organizations data governance policy. Why? It's time consuming and some standards they may not agree with (to name a couple). Would you rather have the policy up on your monitor while you are creating a new data feed or create the feed as quickly as possible? I would venture a guess that you go with the latter because it's more realistic in today's pace of business. What if you work for an organization where it's required by law? Some data governance policies are put in place to satisfy regulatory laws. Think about financial companies. We need a policy for maintaining data for banks, lending, and stock markets for handling sensitive data. 

Have you thought about trying to implement a data governance policy and failed? Perhaps you implemented your first data governance policy 5 years ago. I bet you that the policy isn't being followed for all of your data systems. There are decisions made everyday that break data governance policies and almost no way to detect these small violations. I have a perfect solution that you never would have thought of. Use CloverETL to define your data governance policy! Most people think of CloverETL as a data integration software tool and not something that CloverETL could handle. But think of this this way. Your data governance policy is the set of business rules that you create graphs for and the data that you are validating is actually the structure and syntax of your data systems. 

You can use CloverETL to validate all database schemas that were created on your operational systems and then report data governance violations for the structures. Better yet, run the solution on the development and QA environments so that you catch the governance violations before they even touch production. So what can CloverETL do for you? 

-Check all syntax for data systems including: database naming conventions and data types. 
-Validate the structure of your data based on your business rules for data. 
-Verify and report on user access to databases for each of your operational databases. CloverETL can -Monitor the quality of your data. 

The solution can run daily, weekly, monthly, yearly or any other interval you deem relevant. This is a great use case and policy to automate because once you spend the time upfront automating the process, you can enforce the policy without much effort. If you have any questions about data governance with CloverETL, please don't hesitate to reach out. 

Monday, December 5, 2016

Reading data from PDF's with CloverETL

Do you have PDF files that you need to be able to read and process data from? In most cases, this is a nice thing to be able to do with an existing toolset rather than have to purchase another tool to translate PDF's into a machine readable format. In this blog I will show you how to read PDF's from within CloverETL and hopefully you can apply that knowledge to any other data format that isn't supported as an 'out-of-the-box' feature with CloverETL.

First, let's take a look at a portion of the PDF.

Looking at the PDF, you can see that there's 8 columns that we need to be able to read: Description, Item#, Seg, Seq#, Len, DT, Rep, Table. If you have looked at something similar in the past, you know that you cannot read this file with an existing CloverETL out of the box component. However, CloverETL ships with a CustomJavaReader component (as well as CustomJavaTransformer and CustomJavaWriter) which easily extends the CloverETL Engine's capabilities by custom coding Java to fit your requirements. There are a few pre-prequisites for being able to do this that depend upon the problem that you are solving.

1. All .jar files must be accessible for the CloverETL Designer (and CloverETL Server if you have one). That means that you much import all .jar files into your sandbox and add the jar files to the
build path so CloverETL knows where these libraries exist. We recommend placing the external .jar files in the lib/ so that all developers and operators are in agreement of the placement of the jar files.

2. Rather than building your own java class, I would recommend starting with the template that CloverETL provides for reading, transforming, writing data. Open the Algorithm property for CloverETL to create a Java class for you that you can edit directly in that pop-up editor, or you can copy/paste the contents into a separate Java file within the designer.

3. After you have developed your custom reader, you can configure the CustomJavaReader to use your newly created Java class. The configuration depends upon which option you selected above. If you created your own class outside of the component, you can use the Algorithm class property to configure your CustomJavaReader.

4. Create the rest of your graph as your requirements dictate (my example is less useful because I am only showcasing that you can read PDF's with CloverETL).

Here is what my graph looks like:

The last execution below shows data on the edge from within the PDF:

If you would like to see the custom java code or graph used to create this example, I would be more than happy to share it with you. However, using this approach, you can quickly read any format that CloverETL cannot natively read and stream it into a graph or jobflow as you would for any other input data.