Tuesday, August 1, 2017

Java debugging in CloverETL

CloverETL allows you to create simple data transformations with just a drag-and-drop UI, but sometimes you need something more complex, and that's where writing code in CTL (Clover Transformation Language) comes in handy.

You could say that CTL is simplified Java with some limitations (no inheritance, no exceptions), focused on the standard problems found in data manipulation processes.

And when you are writing code, you want some debugging functionality. Before version 4.3.0 it was a bit old school: printLog calls all over the place and digging through log output in the Console view.

Version 4.3.0 introduced CTL debugging, which you can read about in this great blog post: http://blog.cloveretl.com/code-debugging-cloveretl-designer .

CloverETL also offers the option of writing your transformation in Java if needed. Common use cases are implementing functionality that CTL lacks or reusing existing Java code. A Java transformation can be used pretty much anywhere you can use CTL (the Reformat, Denormalizer, and Normalizer components, etc.).

Debugging functionality is implemented not only for CTL, but for Java code too. The whole debugging experience should feel really familiar if you have ever debugged Java code in Eclipse.

I won't repeat the content of the blog post mentioned above, which describes all the available functionality; this is just a short brain dump to remind me that CTL debugging works in CloverETL Designer out of the box: you just put a breakpoint into your CTL code and run your graph in Debug mode. What I keep forgetting is that to debug Java code, you need to enable Java debugging in Window > Preferences. (It communicates on port 4444 by default, so if you are running CloverETL on a tightly locked-down box, you will need to open that port on your firewall!)
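As an aside, standard Java remote debugging works by passing a JDWP agent option to the JVM. I have not verified exactly what CloverETL configures under the hood when you tick that preference, but if it follows the standard mechanism, a debug listener on port 4444 corresponds to a JVM option shaped like this (shown purely for orientation):

```
-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=4444
```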

Allow me to steal a picture worth a thousand words:

Tuesday, July 25, 2017

CloverETL Releases Version 4.7.0.M1

The version 4.7 milestone release is all about a new feature called Data Service. Unfortunately, you don't have access to it unless you have already purchased a CloverETL Server or Cluster (or are planning to). Let's talk about what exactly the Data Service is...

The developers have given end users the ability to expose any graph as a REST API endpoint so users can create smarter applications. The REST service will allow you to process data, perform business logic, or create custom transformations at runtime and expose the result set to your application.

When I think about this service, it's truly amazing. Let's take Salesforce as an example. If you want to get fancy, you can create a force.com application for Salesforce that's built using Apex (a proprietary Java-like programming language). Let's say you would like to integrate your customer support tickets with your Salesforce accounts, but don't want to store that information in Salesforce (because, well, it costs money). So, you can create a force.com application within Salesforce, expose a CloverETL graph as a REST API that goes to your customer support portal, gets the latest tickets for the account in Salesforce where the request was made, and displays all of the tickets for the customer on demand! That is just one example of where the Data Service would be relevant.
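Once a graph is exposed as a REST endpoint, calling it from application code is just an ordinary HTTP request. Here is a minimal sketch using the standard Java 11+ HTTP client; the host, path, and accountId parameter are hypothetical stand-ins for whatever your Data Service actually exposes:

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class DataServiceClient {

    // Build a GET request against a hypothetical Data Service endpoint.
    // The URL and the accountId query parameter are illustrative only.
    static HttpRequest buildRequest(String accountId) {
        return HttpRequest.newBuilder()
                .uri(URI.create("https://clover.example.com/data-service/tickets?accountId=" + accountId))
                .header("Accept", "application/json")
                .GET()
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = buildRequest("ACME-001");
        System.out.println(request.method() + " " + request.uri());
        // Actually sending it (which needs a live server) would be:
        //   HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    }
}
```

The response body would be whatever the graph produces, so the client stays completely decoupled from how the data was assembled.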

There are a number of other use cases where the Data Service could be relevant that I would be happy to share. Think big: IoT, Edge Computing, iPaaS.

One other improvement from this release:
-Graphs and jobflows are now automatically validated immediately after a change is made in the Designer. If you have ever executed a graph and received an error, you know that the error symbol used to remain on the component until you actually saved your graph again to reset the state. That has been improved.

Thursday, March 30, 2017

CloverETL 4.5.0 is released!

I haven't written in a while and wanted to update everyone on what is new in the latest CloverETL production release (4.5.0). I will order the new features from most functional and powerful to least exciting (in my opinion, of course).

1. Restartability of jobs/graphs from the CloverETL Server. Have you ever run a job and had it fail? I hope so, because you cannot truly develop and/or test a solution properly without failing. Now, you can restart jobflows and graphs from within your Server Console. If you passed parameters into the graph/jobflow, you can reuse the same values! This is a great accomplishment and something that users desperately needed. Just think: you are responsible for operations of a CloverETL solution, and your job fails due to a network outage while connecting to the database. You can easily restart the job and continue on with your day.
Restart a graph from CloverETL Server

2. Salesforce Wave Connector - If you are immersed in the Salesforce ecosystem, you have undoubtedly heard of Wave Analytics. CloverETL has now added support for writing to your Wave Analytics dashboard directly from CloverETL.

Salesforce Wave Analytics Component

3. Event Listeners UI change - The CloverETL developers have now consolidated all listeners into one main Event Listeners screen where you can see all of your listeners in one view.

4. Added batch size and concurrency mode to the Salesforce components. Now, you can limit how much data is sent per batch and choose whether to send everything in parallel or serial mode. You may not have noticed this being an issue for you, but if you are seeing errors when writing to Salesforce, try these new features out to control the data being written to Salesforce.

As always, make sure you plan for the upgrade! If you have a development environment, upgrade in your development environment before you try it out in production. 

Let me know if you have any questions or would like to see the full list of changes. 

Thursday, December 15, 2016

Data Governance with CloverETL

Let's start by defining data governance. According to Wikipedia (because Wikipedia knows all), "Data governance is a control that ensures that the data entry by an operations team member or by an automated process meets precise standards, such as a business rule, a data definition and data integrity constraints in the data model." So at the end of the day, an organization's data governance policy is specifically created to standardize the data model to improve operational efficiency. If you are given a project where you need to create a database schema, tables, columns, etc., you have your recipe for completing the task by following the data governance policy. However, the recipe is only as good as the cook who follows it.

Data governance isn't a 'glory' topic. In fact, it's quite the opposite! Developers (database, application, etc.) and DevOps absolutely hate projects that involve implementing or following an organization's data governance policy. Why? It's time-consuming, and they may not agree with some of the standards (to name a couple of reasons). Would you rather have the policy up on your monitor while you are creating a new data feed, or create the feed as quickly as possible? I would venture a guess that you go with the latter because it's more realistic at today's pace of business. But what if you work for an organization where it's required by law? Some data governance policies are put in place to satisfy regulatory requirements. Think about financial companies: banks, lenders, and stock markets need policies for maintaining and handling sensitive data.

Have you ever tried to implement a data governance policy and failed? Perhaps you implemented your first data governance policy 5 years ago. I bet you the policy isn't being followed across all of your data systems. Decisions are made every day that break data governance policies, with almost no way to detect these small violations. I have a solution that you might never have thought of: use CloverETL to define your data governance policy! Most people think of CloverETL as data integration software, not something that could handle governance. But think of it this way: your data governance policy is the set of business rules that you create graphs for, and the data that you are validating is actually the structure and syntax of your data systems.

You can use CloverETL to validate all database schemas that were created on your operational systems and then report data governance violations for the structures. Better yet, run the solution on the development and QA environments so that you catch the governance violations before they even touch production. So what can CloverETL do for you? 

-Check all syntax for your data systems, including database naming conventions and data types.
-Validate the structure of your data based on your business rules.
-Verify and report on user access for each of your operational databases.
-Monitor the quality of your data.
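To make the naming-convention check concrete, here is a small sketch of how such a rule could be expressed as a validation function. The lower_snake_case convention and the 30-character limit are my own assumed policy, not a standard, and in practice this logic would sit inside a graph fed by the schema catalog:

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class NamingConventionCheck {

    // Assumed policy: table names are lower_snake_case, at most 30 characters.
    private static final Pattern TABLE_NAME =
            Pattern.compile("^[a-z][a-z0-9_]{0,29}$");

    // Return the names that violate the policy so they can be reported.
    static List<String> violations(List<String> tableNames) {
        return tableNames.stream()
                .filter(name -> !TABLE_NAME.matcher(name).matches())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> tables = List.of("customer_order", "CustomerAddress", "tmp table");
        // Only the snake_case name passes; the other two are reported.
        System.out.println(violations(tables));
    }
}
```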

The solution can run daily, weekly, monthly, yearly or any other interval you deem relevant. This is a great use case and policy to automate because once you spend the time upfront automating the process, you can enforce the policy without much effort. If you have any questions about data governance with CloverETL, please don't hesitate to reach out. 

Monday, December 5, 2016

Reading data from PDFs with CloverETL

Do you have PDF files that you need to read and process data from? In most cases, it's nice to be able to do this with an existing toolset rather than having to purchase another tool to translate PDFs into a machine-readable format. In this blog I will show you how to read PDFs from within CloverETL, and hopefully you can apply that knowledge to any other data format that isn't supported as an out-of-the-box feature of CloverETL.

First, let's take a look at a portion of the PDF.

Looking at the PDF, you can see that there are 8 columns we need to be able to read: Description, Item#, Seg, Seq#, Len, DT, Rep, Table. If you have looked at something similar in the past, you know that you cannot read this file with an existing out-of-the-box CloverETL component. However, CloverETL ships with a CustomJavaReader component (as well as CustomJavaTransformer and CustomJavaWriter), which lets you extend the CloverETL engine's capabilities with custom Java code to fit your requirements. There are a few prerequisites, depending on the problem you are solving.

1. All .jar files must be accessible to the CloverETL Designer (and CloverETL Server if you have one). That means you must import all .jar files into your sandbox and add them to the build path so CloverETL knows where these libraries exist. We recommend placing the external .jar files in the lib/ directory so that all developers and operators agree on where the jar files live.

2. Rather than building your own Java class from scratch, I would recommend starting with the template that CloverETL provides for reading, transforming, and writing data. Open the Algorithm property to have CloverETL create a Java class for you, which you can edit directly in the pop-up editor, or you can copy/paste the contents into a separate Java file within the Designer.

3. After you have developed your custom reader, configure the CustomJavaReader to use your newly created Java class. The configuration depends on which option you selected above. If you created your own class outside of the component, use the Algorithm class property to configure your CustomJavaReader.

4. Create the rest of your graph as your requirements dictate (my example is less useful because I am only showcasing that you can read PDFs with CloverETL).
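To give a flavor of what the parsing step inside such a reader might look like, the sketch below splits a line of text already extracted from the PDF into the 8 columns named above (the text extraction itself would typically use a library such as Apache PDFBox). Everything here is an assumption for illustration: I treat the last 7 tokens as the short fields and whatever precedes them as the Description, since descriptions may contain spaces; a real PDF layout often needs position-based slicing instead:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

public class PdfLineParser {

    private static final String[] COLUMNS =
            {"Description", "Item#", "Seg", "Seq#", "Len", "DT", "Rep", "Table"};

    // Split one extracted text line into the 8 columns. The last 7 tokens are
    // taken as the short fields; everything before them becomes the Description.
    static Map<String, String> parse(String line) {
        String[] tokens = line.trim().split("\\s+");
        if (tokens.length < COLUMNS.length) {
            throw new IllegalArgumentException("Not a data line: " + line);
        }
        int firstShortField = tokens.length - (COLUMNS.length - 1);
        Map<String, String> record = new LinkedHashMap<>();
        record.put("Description",
                String.join(" ", Arrays.copyOfRange(tokens, 0, firstShortField)));
        for (int i = 1; i < COLUMNS.length; i++) {
            record.put(COLUMNS[i], tokens[firstShortField + i - 1]);
        }
        return record;
    }

    public static void main(String[] args) {
        // A made-up line in the assumed layout.
        System.out.println(parse("Patient Name 123 PID 5 60 ST 1 HL7"));
    }
}
```

Inside a CustomJavaReader, each parsed map would then be written field by field into the output record defined by your metadata.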

Here is what my graph looks like:

The last execution below shows data on the edge from within the PDF:

If you would like to see the custom Java code or graph used to create this example, I would be more than happy to share it with you. Using this approach, you can quickly read any format that CloverETL cannot natively read and stream it into a graph or jobflow as you would any other input data.

Tuesday, November 1, 2016

CloverETL Milestone Review

I couldn't resist updating the blog with the latest features that CloverETL is releasing in the coming months for the production release. We'll call this a review of CloverETL 4.4.M1, which is available to all existing customers as a test release only. It is not advisable to upgrade your production systems until the full 4.4.0 release is out.

Remote file event listener (out-of-the-box functionality added)!! Yes, it was possible to handle remote file event listening prior to this update, but now you can set up a file event listener for remote servers simply through the server configuration (and react to the event in any way necessary).

Listener for failed listeners - This is quite interesting and pretty easy to use. You can now configure one failed listener to listen for any failures in event handling/scheduling and react to the scenario. Wildcard matching of all configured events is allowed, or you can set up failed listeners for individual events configured on your CloverETL Server. For example, if you have an event set up to grab a file from a remote FTP site but your network becomes unresponsive, the failed listener can notify you that contacting the event source failed while the file event listener was attempting to retrieve file changes, allowing you to respond to the outage more quickly.

Salesforce Updates:
If you are not aware, CloverETL has created a Salesforce connector, which is available in the 4.3 production release; the latest milestone release brings a number of improvements to the Salesforce connector.

1. Included a new Salesforce writer using the SOAP API. This allows you to write data into Salesforce in what are considered micro-batches. However, if you are migrating large data sets into Salesforce, it is recommended that you use the Bulk API writer to limit the number of API calls made to Salesforce.

2. Updated - Milestone M2 will include the Salesforce SOAP API reader. This allows for subqueries as well as function calls directly in your SOQL query.

3. A more user friendly SOQL editor which will sort all objects and fields for the objects. This makes it much easier to search and find the appropriate object.

AWS Updates:
-Redshift driver bundled with CloverETL Designer. This now ships with the product, so you no longer have to set up your own driver in Designer.
-ParallelReader support has been added for S3 so you can now read data from an S3 bucket in parallel and improve performance.

Wednesday, September 28, 2016

CloverETL and Football - Are they related?

I was recently asked to compare CloverETL to football, and I thought this would be a fun exercise to share with a wider audience. Relating everyday things to technology is not uncommon; in fact, research shows that concepts are easier to learn if you can relate them to something the learner already understands. Why is this the case? Most research shows that if you speak the same language, it's easier to understand and comprehend the topic. That makes sense, doesn't it? But why aren't more people doing this? The answer: it's very difficult to find a commonality across your audience, so you have to make more generalizations than you would like. However, I don't have that problem here, because I am only talking to people who may not understand everything about CloverETL but do understand American football (America's most popular sport).

Web Application Container (Tomcat, JBOSS, etc) - A web application container will host web applications. When relating this term to football, I think of the stadium for the same reason. The stadium is hosting football games. Fairly straightforward and simple.

Web Application - A web application is a software application that runs in a web browser. When relating this term to football, I think of the actual act of playing football, because it can happen in many different formats: the NFL, Canadian football, soccer, rugby, and many others.

Java - A popular programming language used for creating distributed applications. I think this relates to the NFL, because the NFL is the governing body of American football and a particular brand of the game. Java is one of the most widely used programming languages today, and the NFL is the most popular professional sports brand in America.

CloverETL Designer - The CloverETL Designer is an engine-based ETL desktop application where a user can utilize the drag-and-drop interface to build data workflows. The CloverETL Designer can be thought of as the head coach, because, like a head coach building a game plan, the Designer is where all the data workflows are built from the ground up. Yes, the coach will utilize others to help with the process, but on the whole the CloverETL Designer acts as a head coach would for a football team.

CloverETL Server - The CloverETL server is a web server application which can schedule and orchestrate various CloverETL jobs. The server is in charge of managing the entire load of the system and defining all aspects of your data integration workflows. The analogy for the CloverETL Server in football terms, in my opinion, would be the General Manager. The general manager is in charge of all aspects of football operations: player management, coach management, contracts, salary cap, and many other areas.

CloverETL Cluster - The CloverETL cluster is a set of CloverETL server instances running concurrently to improve performance, add high availability, and allow for greater scalability. The CloverETL Cluster can be related to an NFL Franchise owner because they are always looking for ways to improve performance and efficiency throughout the entire organization. They are also responsible for managing all personnel as well as all financial implications that come with running a business.

CloverETL Engine - The CloverETL engine is an embedded application responsible for the execution of all graphs/jobflows. This closely resembles the quarterback of the team. The quarterback must interpret all plays on and off the field for his teammates, much like the engine does as it receives instructions from the Designer, Server, or Cluster.

Sandbox - The Sandbox is where the CloverETL project lives. This will store all jobflows, graphs, database connections, metadata, and data files. Essentially, the sandbox will contain all information that will be used for a particular project. I think this mostly resembles the playbook that is used by the coaching staff and players to execute on the field. The playbook contains all plays, options for plays, personnel, and scouting reports for a successful game.

Jobflow - The CloverETL jobflow is the orchestration layer for conditional job execution. I believe this relates to a game plan that is designed by the coaches for a successful game. This requires proper architecting and planning using the playbook to come up with the best possible game plan to be successful on the field.

Graph - The CloverETL graph is defined as a workflow that is designed for a specific business rule. This is typically where CloverETL interacts with data -- reading from data sources, transforming data, and writing to another data source to satisfy a business requirement. When you think of a CloverETL graph in terms of football, I would consider a graph as a play that is called by coaches and executed by the players. Each play must be carefully architected and executed as designed in order to be successful, but is only a small piece of the entire solution (or game plan).

Palette - The CloverETL Palette is a pre-defined list of components which are available for drag-and-drop use to design your CloverETL graphs/jobflows. I like to think about the palette as your roster where you can utilize available players for use in plays.

Component - A CloverETL component is out-of-the-box functionality that is pre-programmed to complete a specific task. Given what a component actually does, it's only natural to relate a component to an individual player. Each player has a position, a skill set, and a task they are given when they sign onto a team.

Edge - An edge in CloverETL connects components and is what data flows along. This is more of a conceptual topic than a physical one, which is why I am relating an edge to a coordinator, who lets each of the players know how they must interact with one another. A coordinator must connect two individual players (or components), but also be aware of the entire play (or graph). If you are looking for a more physical comparison, I would say an edge is like the football itself: the football directly relates to the data passed along the edge. It is the most precious item in the environment (the football for the game, the data for integration) and should be handled with care.

Metadata - The CloverETL definition of metadata is the structure of data that connects two components. This is probably the most difficult concept to relate to football, but I think the closest thing in football to this is the huddle. The huddle is where a play is called in from the sidelines to the quarterback and the quarterback must let the other players know the play call as well as be able to describe the play to individual players if they do not understand the call. This concept physically defines how the players will interact during the next play, and the huddle is where the understanding occurs.

Phases - CloverETL phasing defines the execution order of components in graphs/jobflows. I think this closely resembles a down in football, because downs must also go in order. A team has 4 downs to gain 10 yards before turning the ball over.

Successful Execution - A successful execution for CloverETL means that your graph and/or jobflow was successful in running. You did not have any errors or unforeseen consequences as a result of your design. In my opinion, this would be equivalent to scoring a touchdown on a drive. Some may think that this would mean winning the game, but you must remember that this is only a short term victory and you must continue to improve, expand and define other ways to be successful.

Failed Execution - A failed execution for CloverETL means that a graph and/or jobflow failed to execute as designed. This could be a problem with the design of the graph or an unforeseen consequence in your solution. This directly relates to a turnover in football as it was a result that you did not want or plan for. This is where you have an opportunity to improve, learn, and grow from this experience both on the football field and when designing your jobflows/graphs.

To sum this up: are all of my analogies perfect? Probably not. Would you like to argue against some of my logic? I hope so, because that means I sparked your interest and did my job.