Thursday, December 15, 2016

Data Governance with CloverETL

Let's start by defining data governance. According to Wikipedia (because Wikipedia knows all), "Data governance is a control that ensures that the data entry by an operations team member or by an automated process meets precise standards, such as a business rule, a data definition and data integrity constraints in the data model." So at the end of the day, an organization's data governance policy is created specifically to standardize the data model and improve operational efficiency. If you are given a project where you need to create a database schema, tables, columns, etc., you have a recipe for completing the task: follow the data governance policy. However, the recipe is only as good as the cook who created it. 

Data governance isn't a 'glory' topic. In fact, it's quite the opposite! Developers (database, application, etc.) and DevOps absolutely hate projects that involve implementing or following an organization's data governance policy. Why? It's time-consuming, and they may not agree with some of the standards (to name a couple of reasons). Would you rather have the policy up on your monitor while you are creating a new data feed, or create the feed as quickly as possible? I would venture a guess that you go with the latter, because it's more realistic at today's pace of business. But what if you work for an organization where it's required by law? Some data governance policies are put in place to satisfy regulatory requirements. Think about financial companies: banks, lenders, and stock markets all need policies for handling sensitive data. 

Have you ever tried to implement a data governance policy and failed? Perhaps you implemented your first data governance policy 5 years ago. I bet you that the policy isn't being followed across all of your data systems. Decisions are made every day that break data governance policies, with almost no way to detect these small violations. I have a solution that you probably never would have thought of: use CloverETL to define your data governance policy! Most people think of CloverETL as data integration software, not as something that could enforce a governance policy. But think of it this way: your data governance policy is the set of business rules that you build graphs for, and the data you are validating is actually the structure and syntax of your data systems. 

You can use CloverETL to validate all database schemas that were created on your operational systems and then report data governance violations in those structures. Better yet, run the solution on your development and QA environments so that you catch the governance violations before they even touch production. So what can CloverETL do for you? 

-Check all syntax for your data systems, including database naming conventions and data types (see the sketch after this list). 
-Validate the structure of your data based on your business rules. 
-Verify and report on user access for each of your operational databases. 
-Monitor the quality of your data. 
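
To make this concrete, here is a minimal JDBC sketch of a naming-convention check, written as plain Java rather than as a graph. The connection URL, credentials, and the snake_case rule are placeholder assumptions (a PostgreSQL JDBC driver is assumed on the classpath); inside CloverETL you would issue the same information_schema query from a DBInputTable component and apply the rule in a Validator or CTL transform.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.regex.Pattern;

public class NamingConventionCheck {
    // Hypothetical policy rule: all table names must be lower snake_case
    private static final Pattern SNAKE_CASE = Pattern.compile("^[a-z][a-z0-9_]*$");

    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/ops", "user", "pass"); // placeholder connection
             Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(
                 "SELECT table_schema, table_name FROM information_schema.tables " +
                 "WHERE table_schema NOT IN ('pg_catalog', 'information_schema')")) {
            while (rs.next()) {
                String name = rs.getString("table_name");
                if (!SNAKE_CASE.matcher(name).matches()) {
                    // Report every structure that violates the governance policy
                    System.out.println("VIOLATION: " + rs.getString("table_schema")
                            + "." + name + " does not match the naming policy");
                }
            }
        }
    }
}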

The solution can run daily, weekly, monthly, yearly, or at any other interval you deem relevant. This is a great use case to automate, because once you spend the time upfront automating the process, you can enforce the policy without much effort. If you have any questions about data governance with CloverETL, please don't hesitate to reach out. 





Monday, December 5, 2016

Reading data from PDFs with CloverETL

Do you have PDF files that you need to read and process data from? In most cases, it is nice to be able to do this with an existing toolset rather than purchase another tool to translate PDFs into a machine-readable format. In this blog I will show you how to read PDFs from within CloverETL, and hopefully you can apply that knowledge to any other data format that isn't supported as an out-of-the-box feature of CloverETL.

First, let's take a look at a portion of the PDF.




Looking at the PDF, you can see that there are 8 columns that we need to be able to read: Description, Item#, Seg, Seq#, Len, DT, Rep, Table. If you have looked at something similar in the past, you know that you cannot read this file with an existing out-of-the-box CloverETL component. However, CloverETL ships with a CustomJavaReader component (as well as CustomJavaTransformer and CustomJavaWriter) which easily extends the CloverETL Engine's capabilities with custom Java code to fit your requirements. There are a few prerequisites, which depend upon the problem you are solving.

1. All .jar files must be accessible to the CloverETL Designer (and CloverETL Server, if you have one). That means you must import all .jar files into your sandbox and add them to the
build path so CloverETL knows where these libraries exist. We recommend placing the external .jar files in the lib/ folder so that all developers and operators agree on the placement of the jar files.

2. Rather than building your own Java class from scratch, I would recommend starting with the template that CloverETL provides for reading, transforming, and writing data. Open the Algorithm property for CloverETL to create a Java class for you that you can edit directly in the pop-up editor, or copy/paste the contents into a separate Java file within the Designer (a minimal sketch of such a class appears after this list).

3. After you have developed your custom reader, you can configure the CustomJavaReader to use your newly created Java class. The configuration depends upon which option you selected above. If you created your own class outside of the component, you can use the Algorithm class property to configure your CustomJavaReader.

4. Create the rest of your graph as your requirements dictate (my example is minimal because I am only showcasing that you can read PDFs with CloverETL).
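
For reference, here is a rough sketch of what such a reader class might look like, assuming Apache PDFBox 2.x is on the build path. The file path and the output field name ("line") are illustrative, and the template method names may differ slightly between CloverETL versions, so treat this as a sketch rather than drop-in code.

import java.io.File;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.jetel.component.AbstractGenericTransform;
import org.jetel.data.DataRecord;

public class PdfLineReader extends AbstractGenericTransform {

    @Override
    public void execute() {
        // Output metadata is assumed to have a single string field named "line"
        DataRecord record = getOutputRecord(0);
        try (PDDocument doc = PDDocument.load(new File("data-in/sample.pdf"))) { // illustrative path
            String text = new PDFTextStripper().getText(doc); // plain-text extraction
            for (String line : text.split("\\r?\\n")) {
                record.getField("line").setValue(line); // one record per extracted text line
                writeRecordToPort(0, record);
            }
        } catch (Exception e) {
            throw new RuntimeException("Failed to read PDF", e);
        }
    }
}

From there, a downstream Reformat step can split each extracted line into the eight columns above.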

Here is what my graph looks like:



The last execution below shows data on the edge from within the PDF:




If you would like to see the custom Java code or the graph used to create this example, I would be more than happy to share it with you. Using this approach, you can quickly read any format that CloverETL cannot natively read and stream it into a graph or jobflow as you would any other input data.

Tuesday, November 1, 2016

CloverETL Milestone Review

I couldn't resist updating the blog with the latest features that CloverETL is releasing in the coming months. We'll call this a review of CloverETL 4.4.M1, which is available to all existing customers as a test release only. It is not advised to upgrade your production systems until the full 4.4.0 release is out.

Remote file event listener (out-of-the-box functionality added)!! Yes, it was possible to handle remote file event listening prior to this update, but now you can set up a file event listener for remote servers simply through the Server configuration (and react to the event in any way necessary).

Listener for failed listeners - This is quite interesting and pretty easy to use. You can now configure one failed listener to listen for any failures in event handling/scheduling and react to the scenario. Wildcards are allowed to match all configured events, or you can set up failed listeners for individual events configured on your CloverETL Server. For example, suppose you have an event set up to grab a file from a remote FTP site, but your network becomes unresponsive. The failed listener can notify you that the file event listener failed to contact its event source when attempting to retrieve file changes, allowing you to respond to the outage more quickly.

Salesforce Updates:
If you are not aware, CloverETL has created a connector to Salesforce, available in the 4.3 production release; the latest milestone release brings a number of improvements to the Salesforce connector.

1. A new Salesforce writer using the SOAP API has been included. This allows you to write data into Salesforce in what are considered micro-batches. However, if you are migrating large data sets into Salesforce, it is recommended that you use the Bulk API writer to limit the number of API calls made against Salesforce.

2. Updated - Milestone M2 will include the Salesforce SOAP API reader. This allows for subqueries as well as function calls directly in your SOQL query.

3. A more user-friendly SOQL editor which sorts all objects and their fields. This makes it much easier to search and find the appropriate object.

AWS Updates:
-Redshift driver bundled with CloverETL Designer. It now ships with the product, so you no longer have to set up your own driver in Designer.
-ParallelReader support has been added for S3 so you can now read data from an S3 bucket in parallel and improve performance.


Wednesday, September 28, 2016

CloverETL and Football - Are they related?


I was recently asked to compare CloverETL to football, and I thought this would be a fun exercise to share with a wider audience. This concept of relating normal, everyday things to technology is not uncommon; research suggests that concepts are easier to learn when you can relate them to something the person trying to comprehend the terms already knows. Why is this the case? Most research shows that if you speak the same language, it's easier to understand and comprehend the topic. That makes sense, doesn't it? So why aren't more people doing this? The answer: it's very difficult to find a commonality across your audience, so you have to make more generalizations than you would like. However, I don't have that problem here, because I am only talking to people who may not understand everything about CloverETL but do understand American football (America's most popular sport).

Web Application Container (Tomcat, JBOSS, etc) - A web application container will host web applications. When relating this term to football, I think of the stadium for the same reason. The stadium is hosting football games. Fairly straightforward and simple.

Web Application - A web application is a software application that runs in a web browser. When relating this term to football, I think of it as an actual act of playing football because this can happen in many different formats: NFL, Canadian Football, Soccer, Rugby, and many others could be considered in this.

Java - A popular programming language that is used for creating distributed applications. I think this relates to the NFL as an entity, because the NFL is the governing body of a particular brand of American football. Java is one of the most widely used programming languages today, and the NFL is the most popular professional sports league in the United States.

CloverETL Designer - The CloverETL Designer is an engine-based ETL desktop application where a user can utilize the drag-and-drop interface to build data workflows. The CloverETL Designer can be thought of as the Head Coach, because the head coach must build all the data workflows from the ground up. Yes, they will utilize others to help with the process, but the CloverETL Designer acts much as a head coach would for a football team.

CloverETL Server - The CloverETL server is a web server application which can schedule and orchestrate various CloverETL jobs. The server is in charge of managing the entire load of the system and defining all aspects of your data integration workflows. The analogy for the CloverETL Server in football terms, in my opinion, would be the General Manager. The general manager is in charge of all aspects of football operations: player management, coach management, contracts, salary cap, and many other areas.

CloverETL Cluster - The CloverETL cluster is a set of CloverETL server instances running concurrently to improve performance, add high availability, and allow for greater scalability. The CloverETL Cluster can be related to an NFL Franchise owner because they are always looking for ways to improve performance and efficiency throughout the entire organization. They are also responsible for managing all personnel as well as all financial implications that come with running a business.

CloverETL Engine - The CloverETL engine is an embedded application which is responsible for the execution of all graphs/jobflows. This closely resembles the Quarterback of the team. The quarterback must be the interpreter of all plays on/off the field for their teammates much like the engine would do as it receives instructions from both the Designer, Server, or Cluster.

Sandbox - The Sandbox is where the CloverETL project lives. This will store all jobflows, graphs, database connections, metadata, and data files. Essentially, the sandbox will contain all information that will be used for a particular project. I think this mostly resembles the playbook that is used by the coaching staff and players to execute on the field. The playbook contains all plays, options for plays, personnel, and scouting reports for a successful game.

Jobflow - The CloverETL jobflow is the orchestration layer for conditional job execution. I believe this relates to a game plan that is designed by the coaches for a successful game. This requires proper architecting and planning using the playbook to come up with the best possible game plan to be successful on the field.

Graph - The CloverETL graph is defined as a workflow that is designed for a specific business rule. This is typically where CloverETL interacts with data -- reading from data sources, transforming data, and writing to another data source to satisfy a business requirement. When you think of a CloverETL graph in terms of football, I would consider a graph as a play that is called by coaches and executed by the players. Each play must be carefully architected and executed as designed in order to be successful, but is only a small piece of the entire solution (or game plan).

Palette - The CloverETL Palette is a pre-defined list of components which are available for drag-and-drop use to design your CloverETL graphs/jobflows. I like to think about the palette as your roster where you can utilize available players for use in plays.

Component - A CloverETL component is an out of the box functionality that is pre-programmed to complete a specific task. Due to the nature of what a component actually does, it’s only natural to relate a component to an individual player. Each player has a position, skill, and task that they are given when they are signed onto a team.

Edge - An edge in CloverETL connects components and is what data flows along. This is more of a conceptual topic than a physical one, which is why I am relating an edge to a Coordinator, who lets each of the players know how they must interact with each other. A coordinator must connect two individual players (or components) but also has to be aware of the entire play (or graph). If you are looking for more of a physical comparison, I would say an edge is like the football itself. The football directly relates to the data that is passed along the edge. It is the most precious item in the environment (the football for the game, the data for integration) and should be handled with care.

Metadata - The CloverETL definition of metadata is the structure of data that connects two components. This is probably the most difficult concept to relate to football, but I think the closest thing in football to this is the huddle. The huddle is where a play is called in from the sidelines to the quarterback and the quarterback must let the other players know the play call as well as be able to describe the play to individual players if they do not understand the call. This concept physically defines how the players will interact during the next play, and the huddle is where the understanding occurs.

Phases - CloverETL phasing defines the execution order of components in graphs/jobflows. I think this closely resembles a down in football, because downs must also go in order. A team has 4 downs to gain 10 yards before they turn the ball over.

Successful Execution - A successful execution for CloverETL means that your graph and/or jobflow was successful in running. You did not have any errors or unforeseen consequences as a result of your design. In my opinion, this would be equivalent to scoring a touchdown on a drive. Some may think that this would mean winning the game, but you must remember that this is only a short term victory and you must continue to improve, expand and define other ways to be successful.

Failed Execution - A failed execution for CloverETL means that a graph and/or jobflow failed to execute as designed. This could be a problem with the design of the graph or an unforeseen consequence in your solution. This directly relates to a turnover in football as it was a result that you did not want or plan for. This is where you have an opportunity to improve, learn, and grow from this experience both on the football field and when designing your jobflows/graphs.

To sum this up: are all of my analogies perfect? Probably not. Would you like to argue against some of my logic? I hope so, because that means I sparked your interest and did my job.

Wednesday, September 14, 2016

Evernote to Apple Notes


Migrate from Evernote to Apple Notes and keep note tags, location, and other metadata


Do you want to migrate your notes from Evernote to Apple Notes and keep all note metadata such as tags, location, web page source, author, and other valuable information?
Until now, this was a problem, as Apple Notes does not support tags, location, note author, web page source, and other metadata. That information is lost when importing Evernote-exported notes into the Apple Notes desktop app.
In an effort to constantly improve the lives of our existing and potential users, we prepared an example CloverETL graph that will migrate your Evernote notes into the Apple Notes desktop app in four steps:

  1. Reading your exported Evernote notes (.enex file)
  2. Looking for notes metadata (note tags, location, author, note source, etc.)
  3. Copying that metadata into the note text in user- and machine-readable form
  4. Generating a new .enex file, which can be easily imported into Apple Notes desktop

Here is a detailed user manual on how to convert your notes:
  1. Download and install the free 45-day trial of CloverETL 4.3.0
  2. Download and import example CloverETL migration graph
  3. Export your notes from Evernote desktop
  4. Run example CloverETL migration graph
  5. Import your notes into the Apple Notes desktop app


1) DOWNLOAD THE FREE 45-DAY TRIAL OF CLOVERETL DESIGNER 4.3.0

You can download the 45-day trial of CloverETL 4.3.0 for OS X / macOS, Windows, or Linux here.

2) INSTALL CLOVERETL 4.3.0 DESIGNER

Installation of CloverETL Designer is pretty simple. On OS X, open the downloaded .dmg file and follow the installation instructions. Once you accept the user agreement, perform a typical OS X installation by copying the CloverETL Designer app into the Applications folder.




When you open the CloverETL Designer icon, you may receive a security notification; please click “Open”.


CloverETL Designer will ask you to choose the disk and folder for your workspace, where you will store:
  • CloverETL example graph
  • Notes you exported from Evernote (as .enex file)
  • Processed .enex file ready to be imported to Apple Notes
In this manual we use example folder /Users/YourUserName/Documents/CloverETLworkspace.

Once you register your copy, you can start working with CloverETL Designer.


3) DOWNLOAD AND IMPORT EXAMPLE CLOVERETL MIGRATION GRAPH
Download example CloverETL project files: Evernote Example.

In CloverETL Designer choose “File –> Import –> CloverETL –> Import external CloverETL projects” and press the “Next” button.


Select the downloaded example archive file and press the “Next” button. The example CloverETL graph is now imported into your CloverETL Designer. 



In your CloverETL workspace, you will see a new folder:
/Users/YourUserName/Documents/CloverETLworkspace/EvernoteExample

There are several sub-folders. Three of them are important for you:
  • /data-in/ – save all your exported Evernote .enex files here
  • /data-out/ – CloverETL will create the updated .enex files here
  • /graph/ – the CloverETL transformation graph is located here
After the import, go to the upper-left part of the CloverETL Designer screen, called “Navigator”. Go to the /graph/ folder and click on the graph “EvernoteTransformer”. You should see this screen:




4) CHANGE PROPERTIES OF CLOVERETL DESIGNER
In CloverETL Designer go to “CloverETL Designer -> Preferences -> CloverETL -> ETL Runtime” and:
  • Change the “Max heap size” parameter to 3000 (MB). This allows you to process large files, up to about 250 MB.
  • Set the CloverETL properties file to “EvernoteExample/CloverETL.properties”.
  • Click “Apply”.
  • The CloverETL Designer runtime will be restarted (approx. 10-15 sec.).



5) EXPORT YOUR NOTES FROM EVERNOTE DESKTOP
  1. Select a group of notes or a whole notebook (recommended) and select “Export Notes from” (important: the size of one note, including attachments, should not exceed 25 MB).
  2. Save your notes as an Evernote “.enex” export file into the /data-in/ directory of your CloverETL Designer workspace.
  3. You can save as many files as you want.
  4. Go back to CloverETL Designer.


6) RUN EXAMPLE CLOVERETL MIGRATION GRAPH
Click on the green “Play” button to execute the ETL graph.




You will be asked to enter a file name. Please enter the name of one of your exported files in the /data-in/ directory of your workspace. Press “OK” to run the graph.



Once the graph has executed successfully, the console window (bottom center) turns light green and you will see the console message:
“INFO  [JobFinalizer_5] Finished Status: FINISHED_OK”.




7) IMPORT YOUR NOTES INTO APPLE NOTES
  1. Open Apple Notes and choose “File -> Import to Notes ->”.
  2. Go to the /data-out/ directory in your CloverETL Designer workspace.
  3. Select the updated .enex file you want.

8) KNOWN LIMITATIONS

  • The graph was built in CloverETL Designer 4.2.0.
  • The size of one note, including attachments, should not exceed 25 MB.
  • Only one .enex file can be processed in one graph execution.
  • Encrypted Evernote notes are migrated blank.
  • The XSD schema of the Evernote .enex file is predefined and ready to use without any changes.
  • Changes in note design are made by the Apple Notes app, not by CloverETL.


9) TROUBLESHOOTING

Console message: “File is unreachable”
Solution: You entered a wrong file name for the Evernote file to be migrated.

Console message: “SAX parsing exception” / “Realocation of CloverBuffer failed. Requested capacity is 99900090 and maximum capacity is 80000000.”
Solution: Edit both values in the file “/EvernoteExample/CloverETL.properties”, then restart the CloverETL runtime or re-launch CloverETL Designer.

Console message: “Java heap space”
Solution: Go to “CloverETL Designer -> Preferences -> CloverETL -> ETL Runtime” and change the “Max heap size” parameter to a larger value. Please note that the maximum CloverETL runtime allocation depends on your Mac's real RAM size. Then restart the CloverETL runtime or re-launch CloverETL Designer.


FINAL REMARKS  
Did you enjoy the experience with Clover? Let us know in the comments.

Do you want to use Clover for something different? Visit the academy at http://www.cloveretl.com/learn/walkthrough/ to get a taste of how to create graphs yourself and start implementing your own solutions.
Remember, imagination is your only limit.

Monday, August 29, 2016

How to get unique values in a list with CTL

The last article in our "How to" series was about how to get sheet names from an Excel file. Today I will venture into CTL (Clover Transformation Language), the scripting language used in CloverETL, oriented toward data migration use cases.

You might know that if you need to get only unique records by some key value, you should use the Dedup component in CloverETL.
But what if you have a CTL list variable and you would like to get only its unique values?

Data types in CloverETL

CloverETL contains a number of standard data types that you can use in CTL (Clover Transformation Language) programming.
There are simple data types:

  • boolean
  • byte
  • cbyte
  • date
  • decimal
  • integer
  • long
  • number
  • string

There are also a couple of complex types:

  • record
  • list
  • map

List

Today I will focus on "list". A list is a container for multiple values of the same data type. E.g., you could have a list of strings or a list of integers, but not a list of strings AND integers.
To be more precise, a list is an ordered sequence of elements, where elements can be accessed by their position only.
The list container is omnipresent in almost all programming languages. In PHP it is called an indexed array, in Java you might be familiar with the ArrayList implementation of the AbstractList class, and you can find "list" in Python too.

Declaration and usage

You can declare and assign values to list CTL variable like this:
string[] myList = ['a', 'b', 'c'];
A Clover list is accessed with an index value:
myList[1] == 'b'; //index starts at 0, just like in Java
You can also use a list as a stack (or queue) with the well-known pop and poll functions:
string[] myList = ['a', 'b', 'c'];
string val = pop(myList);
//val == 'c' AND myList == ['a', 'b']
And there are a lot of helpful container functions.

Uniqueness

CloverETL's list doesn't have any uniqueness enforcement; it is not, for example, a Java Set.
So it may happen that the same value appears multiple times.
string[] myList = ['a', 'b', 'c', 'a', 'c', 'b'];
In CTL you can use a similar trick to get only the unique values from a list as you might know from other languages: convert the list to a map (a set of "key" => "value" pairs) and then take only the keys from that map. The conversion to a map takes care of duplicates automatically, as CloverETL's map can contain only one value per key.
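
For comparison, here is a minimal plain-Java sketch of the same dedup trick; the class and variable names are mine, and it is only meant to illustrate the analogy with the CTL approach shown below.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

public class UniqueListDemo {
    public static void main(String[] args) {
        List<String> myList = Arrays.asList("a", "b", "c", "a", "c", "b");
        // A LinkedHashSet keeps only the first occurrence of each value and
        // preserves insertion order - the same effect the CTL
        // list -> map -> getKeys() trick achieves.
        List<String> unique = new ArrayList<>(new LinkedHashSet<>(myList));
        System.out.println(unique); // prints [a, b, c]
    }
}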


Declaration and usage of map

map[string, string] urls;
urls['Google'] = 'www.google.com';
urls['Microsoft'] = 'www.microsoft.com';

toMap usage 

The toMap function in CTL has two signatures:
map[<type of key>,<type of value>] toMap(<type of key>[] keys, <type of value>[] values);
map[<type of key>,<type of value>] toMap(<type of key>[] keys, <type of value> value);
The first signature expects 2 lists: one is used as the keys, the second as the values. The mapping from keys to values is done automatically by order. (The lengths of both lists need to be the same!)
string[] companies = ['Google', 'Microsoft', 'Apple'];
string[] urls = ['www.google.com', 'www.Microsoft.com', 'www.apple.com'];
map[string, string] companyUrls = toMap(companies, urls);
//companyUrls['Microsoft'] == 'www.Microsoft.com'

The second signature expects 1 list and 1 value; this value is automatically assigned to all keys.
string[] addresses = ['john@example.com', 'john.doe@example.com'];
map[string, boolean] enabledAddresses = toMap(addresses, TRUE);
//enabledAddresses['john@example.com'] == TRUE;
So in our case:
 string[] myList = ['a', 'b', 'c', 'a', 'c', 'b'];
map[string, string] myListAsMap = toMap(myList, myList);//we don't care about values
string[] uniqueMyList = getKeys(myListAsMap);
//uniqueMyList == ['a', 'b', 'c']
You can use the printLog() function to check the contents of variables (used in the commented lines with '==', which denote what is expected to be in the variable) - the values will be printed into the log at runtime. Alternatively, you can use the new CloverETL 4.3.0 functionality called CTL debugging.


I hope this short blog post showed you something new and useful; all comments are welcome. See you at the next "How to" article :).

Tuesday, August 16, 2016

How to get all sheet names from Excel

Imagine this use case: you have an Excel file with a dynamically changing number of sheets. How could you get a list of the names of all sheets?

You might know that in CloverETL there is the SpreadsheetDataReader component that can read .xlsx and .xls files. You know that you can use its graphical view to map content from an Excel file to metadata. You might not know that you can read from multiple sheets with the same SpreadsheetDataReader. It has one requirement though: all sheets need to have the same record structure; there is no way to use a single SpreadsheetDataReader to parse records with different metadata! 
You can use the wildcards '?' and '*' in the Sheet property of a SpreadsheetDataReader to specify multiple sheets to be read. So a '*' character in Sheet will be the first step of our example.


The next step is creating the metadata and mapping. To get a list of all sheet names, you need to read at least one record from each sheet. The value is not important, so I created metadata called DummyRecord with a 'dummy' field. I use this metadata in the Mapping property editor, where I select cell A1 as the source, as I expect this cell will always contain something. I used the "Map by order" button to map the content of A1 to the "dummy" field and set "Data offset (global)" to 0 - this reads the value directly from cell A1; otherwise, the component would read data from A2. Notice the orange label of cell A1 and the yellow label of A2 before you set Data offset (global) to 0: orange means header, yellow denotes the first line of data.

After this step, we would read the content of column A from all sheets in our Excel file. But we need only one record per sheet. Luckily, there is an Advanced property just for this: "Max number of records per spreadsheet".


Now I have 1 record per sheet. But still no sheet names! 
You might notice that there is no functionality inside the Mapping editor to pull the sheet name. For that we will use the Autofilling property in the metadata.

The Autofilling property is a handy feature that creates 'virtual' fields that are not in the source. These fields are filled automatically from a list of possible configurations. Fields with a configured Autofilling property can carry information such as the order of the record in the file, the name of the source file, or the sheet name. And that is exactly what we need!
Let's go to the metadata editor behind the SpreadsheetDataReader and add a new field; I named it, originally enough, 'sheetName'. The name of the field is not important - the Autofilling property is. Select the newly created field, scroll in the right column to the Autofilling property, and choose sheet_name. 


Finally I have 1 record per sheet with its sheet name.

I decided to make this fancier, so I wrapped this functionality in a Subgraph and parameterized the File URL property of the SpreadsheetDataReader to make selecting the parsed file easier. I created new metadata without the dummy field in the subgraph and used a Reformat to convert records between these metadata.
The result is a generic subgraph component that produces a list of sheet names from a selected Excel file.
I highly recommend checking the CloverETL documentation on the Autofilling property: http://doc.cloveretl.com/documentation/UserGuide/topic/com.cloveretl.gui.docs/docs/autofilling.html - it is a lesser-known feature that can help you, if you know about it.

See attached project for your reference: Project

Wednesday, June 29, 2016

Connecting CloverETL and Amazon SQS

In a recent post on the CloverETL blog called Building own components in CloverETL, we saw how to build your own component, though not quite a production-ready solution. In this article, I'd like to introduce a plugin based on the things covered there – more specifically, a connector to Amazon SQS – AWS's implementation of a queuing service. Unfortunately, SQS does not come with a native JMS interface (although an adapter is available), so we cannot connect to it using the standard CloverETL JMS connector – I tried, but failed pretty badly, since this adapter requires some specific configuration which can be done on the Java code level only and does not even support JNDI. If you find any way to use SQS via JNDI, please share your findings in the comments below – it would be very helpful not only for me but for other people as well.
Good news for users of version 4.2.0 and above – the Amazon SDK (required for this connector to work) is bundled with the product itself, so you don't need to download it separately; others unfortunately need to download the SDK from Amazon's website along with some of its dependencies (all of these are provided in the attached project).

Introducing Amazon SQS connector family

Figure 1: Amazon SQS plugin components
There are two components:
  1. AmazonSqsReader – taps into the interface and retrieves a defined number of messages or all pending messages
    • The component provides two types of metadata: messages (first output port) and message attributes (second, optional output port)
  2. AmazonSqsWriter – pushes new messages into the queue
    • The first input port accepts any metadata but requires the Body attribute option to be set – the contents of that field are passed as the actual message;
    • If the second input port is used, both ports need to be sorted by the Join key, which becomes a mandatory attribute; the same applies to Attribute mapping
    • The second input port is used for message parameters – every message may contain up to 10 parameters (an Amazon SQS limit)

Plugin installation

Introducing this plugin into your project is fairly simple:
  1. Take the attached zip file and decompress it into your project’s root directory
  2. Open the workspace.prm file and add a new parameter PLUGINS_DIR with the value ${PROJECT}/plugins
  3. (Version 4.2.0 and newer) Add <classpathentry kind="lib" path="plugins/amazonsqs/lib/cloveretl-amazonsqs-connector-0.2a.jar"/> into your .classpath file
  4. (Version 4.1.x) Add all libraries from the plugins/amazonsqs/lib directory (covered in the previous article)
  5. Set the master password for your ETL Runtime in Window → Preferences (or in the Configuration section of your CloverETL Server)
Note that this plugin won’t work in any version prior to 4.1.0!
Figure 2: Setting up master password in CloverETL Developer

AmazonSqsReader

Does as its name suggests – connects to the SQS interface to get messages waiting in a queue. It can also delete the messages upon read.

Shared configuration options between both components:

  • Queue name – name of the queue to tap into (e.g., MyCloverQueue),
  • Queue endpoint – URL of the region where the queue resides (e.g., https://sqs.us-east-1.amazonaws.com),
  • AWS Access key – access key for the account used to access this interface (e.g., ZQRTP1NM0ZDWKEYRVC3U),
  • AWS Secret key – secret key for the credentials provided in the access key field.

Configuration specific to reader:

  • Message threshold – maximum number of messages which will be read from the interface
  • Remove read messages – whether or not messages read from the queue should be removed upon a successful read attempt

Output port data:

The reader uses metadata propagation to push pre-defined metadata out of the component – it is not possible to change them.
  • Out0 (mandatory): message output
  • Out1 (optional): message attribute output
    • The field messageId corresponds to the id on the message output – these are IDs assigned by SQS
This component works in batch mode, which means that it tries to read as many messages as possible through the interface until the threshold is reached or all messages are read.
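
Under the hood, the receive loop looks roughly like the following sketch, assuming the AWS SDK for Java v1 on the classpath. The queue name, endpoint, credentials, and threshold are placeholder values, and this is a simplified illustration of the batching logic, not the plugin’s actual code.

import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.sqs.AmazonSQSClient;
import com.amazonaws.services.sqs.model.Message;
import com.amazonaws.services.sqs.model.ReceiveMessageRequest;

public class SqsReceiveLoop {
    public static void main(String[] args) {
        AmazonSQSClient sqs = new AmazonSQSClient(
                new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY")); // placeholders
        sqs.setEndpoint("https://sqs.us-east-1.amazonaws.com");       // Queue endpoint
        String queueUrl = sqs.getQueueUrl("MyCloverQueue").getQueueUrl();

        int threshold = 100; // "Message threshold" equivalent
        int read = 0;
        while (read < threshold) {
            // SQS returns at most 10 messages per call, hence the batching loop
            ReceiveMessageRequest req = new ReceiveMessageRequest(queueUrl)
                    .withMaxNumberOfMessages(Math.min(10, threshold - read))
                    .withMessageAttributeNames("All");
            java.util.List<Message> batch = sqs.receiveMessage(req).getMessages();
            if (batch.isEmpty()) break; // queue drained
            for (Message m : batch) {
                System.out.println(m.getBody()); // here the component emits a record instead
                sqs.deleteMessage(queueUrl, m.getReceiptHandle()); // "Remove read messages"
            }
            read += batch.size();
        }
    }
}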

AmazonSqsWriter

Also does as its name suggests – flushes a data stream into an SQS queue. It supports both message and message-attribute data streams; the method of linking those two data streams is very similar to the algorithm used in the ExtMergeJoin component – this means that if the optional second input is connected to the component, both data streams need to be sorted!

Configuration specific to writer:

  • Body attribute – field from input port 0 containing the message body (the default field name is “body”)
  • Join key – (mandatory when input port 1 is connected) key used to link a message with its attributes
  • Attribute mapping – (optional when input port 1 is connected) maps input metadata to the SQS message attribute object
This component does NOT run in batch mode – every message is sent separately, which may slow down processing.
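
For reference, a single send with one attribute might look like the sketch below (again the AWS SDK for Java v1, with placeholder values; recall that the component really does send every message separately like this).

import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.sqs.AmazonSQSClient;
import com.amazonaws.services.sqs.model.MessageAttributeValue;
import com.amazonaws.services.sqs.model.SendMessageRequest;

public class SqsSendExample {
    public static void main(String[] args) {
        AmazonSQSClient sqs = new AmazonSQSClient(
                new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY")); // placeholders
        sqs.setEndpoint("https://sqs.us-east-1.amazonaws.com");
        String queueUrl = sqs.getQueueUrl("MyCloverQueue").getQueueUrl();

        // The body comes from the field named by the "Body attribute" option;
        // attributes come from records on input port 1, matched by the Join key.
        SendMessageRequest req = new SendMessageRequest(queueUrl, "message body here")
                .addMessageAttributesEntry("priority", new MessageAttributeValue()
                        .withDataType("String")
                        .withStringValue("high")); // up to 10 attributes per message
        sqs.sendMessage(req);
    }
}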

Conclusion

The plugin introduced here was created to satisfy needs I had during a project’s development, so it may not be optimized for every situation. Making the writer batch-mode-ready is my next to-do improvement, which I plan to implement over the summer if beer and hikes don’t keep me too busy :)

Download plugin version 0.2a

Wednesday, June 22, 2016

CloverETL - Data Partitioning with speed

CloverETL has released a new feature in their latest 4.2.x release that can improve your performance dramatically! Take a look at the video here: CloverETL video - Data Partitioning

This new improvement makes it much easier to partition data and process in parallel with a simplified solution.

Take a look at the solution below, showing how you would have had to handle data partitioning on your own:

And with the latest CloverETL release:

Benefits of this include:

1. Easier maintainability - you only have to make a change in one location instead of in every partitioned data flow. 

2. Performance - by allowing CloverETL to partition your data and process it in parallel, your performance will improve drastically. Based on the above example, I cut my processing time down by 75%! 


Tuesday, June 14, 2016

CloverETL - Topics for discussion



I am soliciting ideas for topics of discussion from all of our readers. What features of CloverETL are you most interested in? Which topics do you regularly struggle with? Are there challenges that you face on a day to day basis when developing your solution in CloverETL? We would like to hear from you! You can either comment below this blog post, or send an e-mail to cloveretltalks@gmail.com.

Monday, June 13, 2016

CloverETL - XML files without an XSD

Have you ever come across an XML file without an XSD that you need to process? I am sure you have! We encountered this first-hand a few weeks ago. We received a number of XML files (more than 100k, for those of you keeping track) where we had no idea whether the structure was the same throughout the entire data set. How can you make sense of each XML element without a proper XSD file? We built a clever little solution in CloverETL that reads each XML tag in the entire dataset along with its nesting level, and generates an XPath for each XML element that can be used for processing each XML file. Does this sound interesting yet?

We were able to accomplish this using the XSLTransformer.


Above is the main part of the graph we used to read the XML tags. Below is the XSLT definition that we used within CloverETL to parse through each XML tag.


Once the tags have been processed and an XPath has been created for each of your elements, you can process your XML files with the CloverETL XMLReader component.
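
If you would like to experiment with the idea outside of CloverETL, here is a minimal plain-Java SAX sketch that builds the same kind of tag inventory. It is an illustrative equivalent under my own naming, not the actual XSLT we used in the graph.

import java.util.LinkedHashSet;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class XPathInventory {
    public static void main(String[] args) throws Exception {
        Set<String> paths = new LinkedHashSet<>(); // unique element paths, in document order
        SAXParserFactory.newInstance().newSAXParser().parse(
            new java.io.File(args[0]),
            new DefaultHandler() {
                private final StringBuilder path = new StringBuilder();
                @Override
                public void startElement(String uri, String local, String qName, Attributes atts) {
                    path.append('/').append(qName); // push the current element onto the path
                    paths.add(path.toString());
                }
                @Override
                public void endElement(String uri, String local, String qName) {
                    path.setLength(path.length() - qName.length() - 1); // pop it again
                }
            });
        paths.forEach(System.out::println); // one XPath per distinct element
    }
}

Run it against one file per execution; collecting the output across all of your files gives you the full set of XPaths to feed into the XMLReader configuration.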

Wednesday, May 4, 2016

CloverETL - Replacing Legacy Software

What does legacy data software mean to you: old software that’s currently outdated, or existing software that works? Or, I should ask, are you a developer or a business stakeholder? No matter which side of the discussion you are on, replacing legacy software is always a difficult conversation between developers and business stakeholders. On one side, business stakeholders see little value in developing a solution for something that is already working and generally follow the motto ‘if it’s not broke, don’t fix it’. Developers, on the other hand, maintain the legacy software on a day-to-day basis (and most likely neither designed nor developed it) and are always looking for ways to improve it. As a developer, I always see the value in upgrading your software, because technology is constantly changing and evolving. The requirements and constraints your systems had 3 years ago are no longer constraints in today’s world. Today we’ll outline an existing legacy system process and software, and show that it can easily be migrated to a data integration solution using CloverETL.

Background
I used to work as a government contractor at the National Oceanic and Atmospheric Administration, where my job was maintaining and enhancing the data ingestion system. The data ingestion system, at a high level, gathers water level data from NOAA tide stations and transmits it to local read-out ground systems, where the data ingestion system would then acquire the water level data for processing. Once the data arrived on the data ingestion servers, software would decode the messages, calculate the water level value based upon the raw data, and insert the data into the database. Can you count the number of software programs that were needed for this process? Disregarding the upstream software, there is a special scripting language that acquires the data, a Fortran program to decode the messages, another Fortran program to quality-control the data and add offsets, and a C program to insert the data into the database. Do you know how to code in Fortran or compile Fortran code? Making a change to these programs takes weeks to fully compile and test. This is a huge problem with maintaining legacy software: the technology is outdated, the business requirements are not fully understood, and it costs too much money to enhance the existing software.

Wouldn’t it be nice to consolidate all of the software into a packaged solution that can be easily customized for your data needs? Using CloverETL, you can design your solution to follow the same process that’s already in place with the same check-pointing that exists in your system today.   

Existing Process Flow:




Using CloverETL, it’s possible to achieve the same results with the same process flow that you have already defined.

Benefits of using a CloverETL solution for your legacy data software needs:
  • Update your codebase to a modern architecture approach
  • Write custom logic components using Clover Transformation Language (CTL) or Java
  • Metadata propagation between your business objects and processes which can cut down on the amount of processing your systems are doing
  • Near real-time data handling (file event listeners and/or scheduled events)
  • Removes clunky, old programming languages
  • Allows you to update your business requirements in a timely fashion

This is one example from my previous experience where I know CloverETL would help to make a difference for business stakeholders and developers. Please let me know if you have any additional questions.