Monday, August 29, 2016

How to get unique values in list with CTL

Last article in our "How to" series was about how to get sheet names from Excel file. Today I will venture into CTL (Clover Tranformation Language - scripting language oriented on data migration use cases used in CloverETL)

You might know that if you need to get only unique records by some key value you should use Dedup component in CloverETL.
But what if you have CTL variable as a list and you would like to get only unique values?

Data types in CloverETL

CloverETL contains couple of standard data types that you can use in CTL (Clover Transformation Language) programming.
There are simple data types:

  • boolean
  • byte
  • cbyte
  • date
  • decimal
  • integer
  • long
  • number

There are also couple of complex types:

  • record
  • list
  • map

List

I will focus today on "list". List is a container for multiple values of same data type. Eg. you could have list of strings, list of integers, but not list of strings AND integers.
To be more precise, list is a ordered sequence of elements, where elements could be accessed by their position only.
List container is omnipresent in almost all programming languages. In PHP they call it Indexed Array, in Java you might be familiar with ArrayList implementation of AbstractList class and you can find "list" in Python too.

Declaration and usage

You can declare and assign values to list CTL variable like this:
string[] myList = ['a', 'b', 'c'];
Clover list is accessible with use of index value:
myList[1] == 'b';//we are in java, index starts at 0
You can also use list as a stack (or queue) and use known pop and  poll functions:
string[] myList = ['a', 'b', 'c'];
string val = pop(myList);
//val == 'c' AND myList == ['a', 'b']
And there is a lot of helpful Container functions.

Uniqueness

CloverETL's list doesn't have any uniqueness enforcement, it is not for example Java Set.
So it might happen that there would be multiple same values.
string[] myList = ['a', 'b', 'c', 'a', 'c', 'b'];
You can use in CTL similar trick to get only unique values from list as you might know from different languages. You will convert list to map (set of tuples, "key" => "value" pairs) and get only keys from this map. Conversion to map will take care of duplicates automatically as CloverETL's map can contain only one value per key.


Declaration and usage of map

map[string, string] urls;
urls['Google'] = 'www.google.com';
urls['Microsoft'] = 'www.microsoft.com';

toMap usage 

toMap function in CTL could have two signatures:
map[<type of key>,<type of value>] toMap(<type of key>[] keys, <type of value>[] values);
map[<type of key>,<type of value>] toMap(<type of key>[] keys, <type of value> value);
First signature expects 2 lists, one will be used as a keys, second as a values. Mapping from keys to values will be done automatically by order. (Length of both lists needs to be same!)
string[] companies = ['Google', 'Microsoft', 'Apple'];
string[] urls = ['www.google.com', 'www.Microsoft.com', 'www.apple.com'];
map[string, string] companyUrls = toMap(companies, urls);
//companyUrls['Microsoft'] == 'www.Microsoft.com'

Second signature expect 1 list and 1 value, this value would be automatically assigned to all keys.
string[] addresses = ['john@example.com', 'john.doe@example.com'];
map[string, boolean] enabledAddresses = toMap(addresses, TRUE);
//enabledAddresses['john@example.com'] == TRUE;
So in our case:
 string[] myList = ['a', 'b', 'c', 'a', 'c', 'b'];
map[string, string] myListAsMap = toMap(myList, myList);//we don't care about values
string[] uniqueMyList = getKeys(myListAsMap);
//uniqueMyList == ['a', 'b', 'c']
You can use printLog() funtion to check content of variables (used in commented lines with '==' which denotes what is expected to be in the variable) - values will be printed into log during runtime or new functionality in CloverETL 4.3.0 called CTL debugging.


I hope this short blogpost showed you something new and useful, all comments welcomed. See you at next "How to" article :).

Tuesday, August 16, 2016

How to get all sheet names from Excel

Imagine this use case: You have an Excel file with dynamically changing number of Sheets, how you could get list with names of all sheets?

You might know that in CloverETL there is SpreadsheetReader component that can read .xlsx, .xls files. You know that you can use its graphical view to map content from an Excel file to a metadata. You might not know that you can read from multiple sheets with same SpreadsheetDataReader. It has one requirement though, all sheets need to have same looking records, there is no way how to use single SpreadsheetDataReader to parse records with different metadata! 
You can use wildcards '?' and '*' in Sheet property of a SpreadsheedDataReader to specify multiple sheets to be read. So '*' character in Sheet will be first step of our example.


Next step would be creating metadata and mapping. To get list of all sheet names you need read at least one record from the sheet. The value is not important, so I created metadata called DummyRecord with 'dummy' field. I am using this metadata in Mapping property editor where I select A1 column as a source as I expect this cell will always contain something. I used "Map by order" button to map content of A1 to "dummy" field and "Data offset (global)" set to 0 - this is to read value directly from A1 cell, other way component would read data from A2. Notice orange label of A1 cell and yellow label of A2 before you set Data offset (global) to 0. Orange means header, yellow denotes first line of data.

After this step we would read content of A column from all sheets in our excel. But we need only one record per sheet. Luckily there is Advanced property just for this: "Max number of records per spreadsheet".


No I have 1 record per sheet. But still no sheet names! 
You might notice that there is no functionality inside of Mapping editor to pull sheet name. For that we will use Autofilling property in Metadata.

Autofilling property is handy feature that creates 'virtual' fields that are not in source. These fields are filled automatically from list of possible configurations. Fields with configured Autofilling property could carry information for example about order of the record in the file, name of the source file or sheet name. And that is exactly what we need!
Lets go to metadata editor behind SpreadsheetDataReader, add there new field, name it originally 'sheetName', name of the field is not important, Autofilling property is. For that select newly created field, scroll in right column to Autofilling property and choose sheet_name. 


Finally I have 1 record per sheet with its sheet name.

I decided to make this fancier, so I wrapped this functionality in Subgraph, parametrized File URL property of SpreadsheetDataReader to select parsed file more easily. I created new metadata without dummy field in the subgraph and used Reformat to convert records between these metadata.
Result is generic subgraph component that produces list of sheet names from selected excel file.
I highly recommend check CloverETL documentation about Autofilling property: http://doc.cloveretl.com/documentation/UserGuide/topic/com.cloveretl.gui.docs/docs/autofilling.html , it is not so known feature that could help you if you know about.

See attached project for your reference: Project