Thursday, July 19, 2018

Limiting user access to view logs only in CloverETL server

When you are running production installation of a CloverETL server you might have separation of responsibilities or different permissions for different people.

I was asked recently how to enable specific people to only access logs for job runs on the server. This is common use case, you have some support personnel that need to have access to logs if something goes wrong in any of the scheduled processes. You don't want to overwhelm them with too many options, you don't want them to modify any existing process etc.

CloverETL set permissions on Group level, not on specific user level, eg. you cannot give John and Amy different permissions, you need to separate them first into two groups.

Whenever you will create new group it will have all permissions removed by default. You could tell that by red cross icons on the permission tree.

In our use case, you want to enable only "Unlimited access to execution history" for this particular group.


This configuration will allow all user assigned to this group view only Execution History tab, check previous runs, see their Tracking information (how many records were processed) and see or download log for particular run.

You might wonder what is doing Limited access to execution history list. That one gives you more control, With unlimited access group members will see all content of execution history, for all sandboxes. 
Limited access allows you to show history only for sandboxes that group has read access to, eg. if you limit access to Sandbox A to Group A and John is not member of Group A, he won't be able to see any runs of processes from that project even if he has access to Execution history. 

By default all sandboxes are visible to all groups, if you want to change it, you need to do in Permissions section of Sandboxes tab.

Currently visibility granularity is on sandbox level, eg. you cannot limit visibility for a specific graph only.

And this a very limited view that John will have if he is only member of a group which has only permissions for Execution history:


As you could seen in one of the previous pictures, granularity of permissions on CloverETL server is pretty elaborate, so go check documentation page for more details how you can configure access permissions for your users to your liking.

Thursday, July 5, 2018

Auto start Tomcat and CloverETL on EC2 AWS Linux AMI

In one of the previous blog posts I installed evaluation CloverETL server on Amazon's EC2 instance. This installation is useful for evaluation, quick setup and I mentioned at the end of article that you might want to set CloverETL server to start up automatically if host gets restarted.

This article will show you one way how.

Prerequisites:

Last time we installed Tomcat and deployed server war to:

/clover/CloverETLServer.4.5.1.Tomcat-8.0.30/bin

(In retrospect keeping name of version in the directory name wasn't greatest idea, that's when you will try to upgrade, but again..evaluation installation.)

Steps:


First thing I will do is to create init script:
sudo vim /etc/init.d/clover

#!/bin/bash ### BEGIN INIT INFO # Provides:        tomcat8 # Required-Start:  $network # Required-Stop:   $network # Default-Start:   2 3 4 5 # Default-Stop:    0 1 6 # Short-Description: Start/Stop Tomcat server ### END INIT INFO
PATH=/sbin:/bin:/usr/sbin:/usr/bin
start() {  sh /clover/CloverETLServer.4.5.1.Tomcat-8.0.30/bin/startup.sh }
stop() {  sh /clover/CloverETLServer.4.5.1.Tomcat-8.0.30/bin/shutdown.sh }
case $1 in   start|stop) $1;;   restart) stop; start;;   *) echo "Run as $0 <start|stop|restart>"; exit 1;; esac


There might be different versions of init script, this one makes sure that you will run in only after network interface is set up on the instance.

Last step is to put this init script to be run after reboot. Common way to do is with update-rc.d command, but that one is not installed on Amazon Linux AMI for some reason.

Another way that worked for me was:
sudo chmod 755 /etc/init.d/clover
sudo chkconfig --level 345 clover on

After these steps Tomcat and CloverETL server should automatically after reboot of the host.

(I used heavily answers here https://askubuntu.com/questions/223944/how-to-automatically-restart-tomcat7-on-system-reboots )




Friday, June 29, 2018

Healthcheck for CloverETL server

In modern cloud oriented world there is a need for periodical healthchecks of servers or services. Let's imagine that you have CloverETL server deployed on AWS and you need to know that your host machine, application container and CloverETL server is running.

AWS has you covered for first 2 points with healthchecks on load balancer .

This might not cover edge cases when your host machine and application container (it might be Tomcat or any other supported application containers) are running, but CloverETL server is not.

There are multiple ways how to check health or "liveness" (if thats even a word).

Use HTTP API call

CloverETL server supports HTTP api for multiple operations, you are interested in 'cluster_status' .
Please ignore name, same operation will work on cluster or even on single installation.

You can use health check functionality of load balancer to ping that endpoint to get status. 

Disadvantage of this approach is that by default HTTP API is protected by HTTP Basic authentication , what might be a problem for some health check services.

Calling accessibility page

CloverETL server has page that could be used without any authentication.


Calling this page will result in one of 3 states:
  1. OK/200
  2. 500 
  3. 503
This way you can use any health check application or service to learn if your CloverETL is living.


Thursday, June 21, 2018

Flat files and newlines for different OS

CloverETL can read, as one of the many sources, flat files. Eg. files without hierarchical structure, data stored in human readable format. Simple example is still popular csv.

Csv means comma separated values, eg columns of data are separated by ',' (comma) delimiter.

CloverETL has for such files a FlatFileReader component which can read csv with different delimiters ('|' or ';' are another popular ones). This component can read a flat files with different delimiters, it can read them not only from local system but also from remote ones (ftp, sftp, S3).

For each file you want to read with a FlatFileReader you will need to have a metadata. Reader provide easy way how to create metadata for existing file via Extract metadata functionality.



This option will parse the file and produce metadata (description) of the file. Eg. list of fields, their datatypes etc.

One of the issues you might encounter in real world is that you created your metadata from one version of flat file, but in reality files could come from various sources, with various OS. Each operation system implements its own newline delimiters.


Extract metadata will get new line from that one file you triggered it on. But don't worry, there is an way how to be prepared for files from different OS.

You just need to:

  1.  edit created metadata (double click on the edge with the metadata)
  2. click on first row with name of metadata to get properties in right hand side column
  3. select last option in Record delimiter field

This option will allow you to read files from different OS without issues. (You can even write to that field and use delimiters which are not in the dropdown, just give it try!)

Friday, June 1, 2018

Connecting to Amazon Neptune with CloverETL


Connecting to Amazon Neptune with CloverETL


Just at the end of May 2018, AWS pushed their new service, Neptune, for general availability (as promised on reInvent 2017). I decided to play with it right away and try connect to Neptune with my “weapon of choice”, CloverETL.

What is AWS Neptune? Let me steal directly from AWS as they surely have better copywriters than I ever will be: “Amazon Neptune is a fast, reliable, fully managed graph database service that makes it easy to build and run applications that work with highly connected datasets.”

So Neptune is managed (as other RDS databases) graph (like Neo4j) database. Graph databases use structures like “nodes” (or "vertices") and “edges” to store and represent data and relationships between objects.
Figure 1 via Neo4j.com

Graph databases could be helpful whenever relationships between objects get “a bit complicated” or just plainly massive. Common use cases of usage of graph database would be social networks (you want to traverse through all followers and theirs follower of specific person), fraud detection (you want to uncover fraud rings) etc.

For this exercise I spun up Neptune cluster, following https://docs.aws.amazon.com/neptune/latest/userguide/quickstart.html , and CloverETL server on EC2, you can use this http://www.clovertalks.com/2018/01/how-to-install-evaluation-cloveretl.html for info.
(both Neptune and EC2 instance are living in same VPC and security group.)

Amazon Neptune supports two different graph query languages, Gremlin and SPARQL, I will be using Gremlin as I am more familiar with that one.

Check https://tinkerpop.apache.org/gremlin.html for more information and great tutorials for Gremlin.
Copyright by https://tinkerpop.apache.org

There are multiple ways how to use Gremlin for connecting to Neptune, you can write some code in Java or Groovy(for that we could use a CustomJavaComponent), Python (you can call Python scripts from CloverETL with a ExecuteScript component or use https://blog.cloveretl.com/python-scripts-cloveretl), you might want to use .Net or Node.js.

I will use another functionality of Amazon Neptune which is HTTP endpoints and REST interface. 


Document tells me that I can call Amazon Neptune with HTTP request, they are using ‘curl’ in the documentation. Same job in CloverETL is doing a HTTPConnector.

So lets do quick POST call like this one:

I created a graph on my CloverETL server in my CloverETL Designer, pasted a HTTPConnector from Palette on the right.
(Tip: Use Shift + Space for Add component dialog, this could make you a bit more efficient.)

I connect first output port of the HTTPConnector to a Trash component for now. 

In the HTTPConnector I am configuring URL property with value of HTTP endpoint of Amazon Neptune (https://docs.aws.amazon.com/neptune/latest/userguide/access-graph-endpoint.html) , Request method and payload in Request content.

You might notice that I created parameter for the endpoint as this one might change whenever I will recreate Neptune cluster.

I created simple graph (like "real graph in graph database" not CloverETL graph) in my Neptune cluster following https://docs.aws.amazon.com/neptune/latest/userguide/access-graph-gremlin-console.html .
I might be doing all these steps in CloverETL, but using Gremlin's REPL console is a bit faster. CloverETL will help us later to enhance and automatize data ingestion and can produce reports with data in different formats which is not really possible via Gremlin console.

I am using a json document "{"gremlin" : "g.V().limit(1)"}" to run query which will return me first vertex from the graph.

After I run this graph I can check response on the edge coming from HTTPConnector directly in the CloverETL Designer.

REST API is communicating via JSON format, record produced by HTTPConnector will contain response content and headers.

I will use functionality of  CloverETL which allows me to parse JSON documents returned from REST API with another components directly in the same process.

There are two, JSONReader (uses xpath) or JSONExtract (uses SAX parser and something like "json scheme").

JSONExtract can parse JSON document, build scheme for it and show user visual tree that will help him or her map elements out of JSON elements which are of interest.

For this I will copy value from content field of record (value will be available after successful run of the graph) and store it as file, NeptuneResponse.json in data-in subfolder. 

(Tip: Whenever I work with JSON I like to add some JSON plugin that will make JSON manipulation easier. I am using that CloverETL is build on top of a Eclipse platform, so there will be a bunch of plugins available. Currently I am using Json Tools.)


I will prepare for parsing of the response from HTTPConnector by placing JSONExtract component directly on edge of CloverETL graph, just before Trash component.

Now I need to point JSONExtract to read saved file by configuring its File URL property.
(Tip: Just drag the file from data-in folder onto JSONExtract component, that's the easiest way.)

Then I open Mapping dialog by clicking on ellipsis on Mapping property and let component produce visual tree by clicking on a Generate tree structure button.


In the mapping document I will map all field which are interested of me.

This will produce 2 connected output ports with appropriate metadata.

Last thing for JSONExtract is switching value of the File URL  property from physical file to different approach when JSON document will be provided dynamically on the CloverETL edge (by running the graph) and parsed automatically at the runtime.

Syntax for it is: port:$0.content:discrete (thats the value that should be in File URL now).

I mapped out 2 ports, first one is containing parsed value returned from Neptune, second port contains response type (200 for everything went OK, different code and error message other way) for possible error handling.

This combination of components could be easily reused by wrapping them into Subgraph, which would allow bigger re-usability, but that's question for some next article. 














Thursday, January 18, 2018

How to install evaluation CloverETL server on AWS in ten minutes or less

How to install evaluation CloverETL server on AWS in ten minutes or less
If you like to investigate and play with different tools and software like me you know that nothing is
more boring than spending too much time spinning up a tool for the first time just to play with.

In CloverETL you have two parts you can experiment with: CloverETL Designer, an IDE build
on the Eclipse platform and CloverETL Server, a Java web application that can be run in any
modern application container.


Designer installation should be pretty straight forward. we have bundled all the needed parts
in an installer, so you just need to point wizard to folder where you would like to install Designer.
a Next step is if you would like to include standalone installation of JDK (highly recommended) and
then just keep hitting Ok or Next button whenever you see one in the wizard to continue
with default installation.




Just a side note: you can also install Designer as an Eclipse plugin. This would be helpful in case  
you are familiar with Eclipse, you are running fine tuned deployment of Eclipse, and you would like
to enrich it with CloverETL functionality without the need to have two different IDEs and keep
switching between those two.
Check http://doc.cloveretl.com/documentation/UserGuide/topic/com.cloveretl.gui.docs/docs/
system-requirements-for-designer.html#plugin-installation for more information.


But let say, you have your Designer installed, you experimented with it to your heart's content and
you would like to investigate neat functionalities of a server as scheduling, event listening,
exposing graphs to outside application. This means installation of a CloverETL server.


There are couple of ways where and how to install, but I’ll focus in this article on an installation
on AWS.


This approach should allow you an easy deployment and easy modifications of server resources
for first steps, features investigation and performance experiments etc.


I will show you in next paragraphs detailed step by step of one installation. At the end of this article
you should have working a CloverETL server deployment. I would highlight additional steps for
recommended installation in the end.


Requirements:

AWS account - if you don’t have it, go to https://aws.amazon.com/ and create one
Licence and war file for CloverETL server -  if you don’t have it, go to
http://www1.cloveretl.com/try-designer-with-server and register there - or send an email to
sales@cloveretl.com where someone will help you get everything you need on a CloverETl side.


For an easy installation I would recommend using bundle of CloverETL server and Tomcat application
container which you can find on https://www.cloveretl.com/customer after you login with your
CloverETL account credentials. If you had this bundle and license already downloaded, you could
safely skip this section and continue directly to an AWS part.


CloverETL download

If you have a CloverETL account, you can go to https://www.cloveretl.com/customer/licenses
where you should see all available trial licences and downloads for Designer and Server installations.

You will see two options after clicking on ‘Download CloverETL Server’, you can select a current
Production release or if you are a maverick, an adventurer and you want to play with version that is
on the bleeding edge, you can choose Milestone release.
(Think about it as a release candidate before a production version, it's full of brand new features,
but we don’t recommend it for production deployment.)


I will select Production release.
There is an another menu where I can select an appropriate package. CloverETL server can run on
any J2EE application server as Apache Tomcat, JBoss, Jetty or other. You can only select an
installation file (.war) appropriate for your server, but in my article I am focusing on a fresh installation
so I would highly recommend using a ‘CloverETL Server All-In-One bundle’.
This bundle consists of Tomcat app container (at the time of writing this article it’s version 8),
deployed CloverETL server and a small Derby backend database ready to run.



In a next step I will just click on ‘Download as .zip archive’, which will download bundled installation file
together with a documentation and some helpful utilities. I will use ‘Copy to clipboard’ button later
to copy license text that I’ll use after installation to register my CloverETL server.


AWS part

I will login to https://console.aws.amazon.com with my AWS credentials, select an EC2 link in the
‘Services’ section. I will continue with clicking on a ‘Launch Instance’ button.



I can select operation system of virtual machine that will host CloverETL server in section 1.
You can select whichever OS you prefer, any flavour of Linux or Windows, for sake of our article
I picked official an Amazon version of Linux.



On the next page you will find list of all instance types AWS is offering you to use. You can select
a small machine that is part of a ‘Free tier’, but that would limit you in your experimentation with
CloverETL to basic use cases on limited set of data only.
You can check system requirements for server http://doc.cloveretl.com/documentation/UserGuide/
topic/com.cloveretl.server.docs/docs/system-requirements-for-cloveretl-server.html .


In our example I will pick one of ‘General purpose’ family, a m4.xlarge looks like a nice machine
that could have decent power to handle all my wild tests and examples.
(You can start with smaller machine, but I wouldn’t go for machines with a 4GB of memory or less
as CloverETL tries to run transformations  in memory as possible which allows us great performance.)





In next section, ‘3. Configure Instance’ you can select if you would like to start your EC2 instance
into  ‘EC2 Classic’ or specific VPC. If you expect working with additional cloud tools and sensitive
data I would highly recommend to create separate VPC. More information about VPC at
https://aws.amazon.com/vpc/


Section 4 lets you assign disk space for your EC2 instance. Installation of CloverETL and Tomcat
is not really big, it will be in lower hundreds of MB, but you need take into account a size of data
that you would be processing, some necessary place for temporary and log files. I will start
on the smaller side with 32GB. AWS allows me to attach additional volumes really easily later if
needed.
I can skip section 5 for now (you can name your instance later), I need to focus on section ‘6.
Configure Security Group’. This part configures security and allows application from outside world
communicate with your EC2 installation. By default you would get a port 22 open for SSH connection.
(Console access to your server, we will use later.)

CloverETL is an J2EE application, you can access it and communicate with it with use of an internet
browser. Bundled installator starts CloverETL by default on port 8083, so I need to open that one too.
Check info icon next to Source column for details, I would recommend to tight access a bit down,
to your IP, or subnet of IPs. ‘0.0.0.0/0, ::/0’ means that port is open for whole wide world, which might
be really dangerous.

You can add or edit security group rules later even on a running instance, but don’t forget about them!


You can just review your configuration on the next page  and click on ‘Launch’ button which will spin
up a server instance.
Last necessary step of this process is to select an existing SSH key to access your new started
EC2 instance. (Or you can create new pair. In that case you would need to download private key file
and store it securely.)


AWS will try its best to point you to importance of these ssh keys, so save a private key to
your machine, you will need it in next steps!
If you would miss this step, or misplace the private key, an instance will be practically unusable and
inaccessible. It wouldn’t be such a big deal, you would just need to start this setup process from
beginning.


At this time you should have new fresh EC2 instance starting up, you will find it under link
‘Running Instances’ of an EC2 Dashboard. (There might be direct link to view you new fresh instance
after pressing “Launch” button.




If you skip section 5, new instance would be without any name, you can click to a empty space in
‘Name’ column to name it properly. If you click on a selection rectangle in the list you will get
additional information about your instance.



Now you need to upload a CloverETL server bundle zip file to the server. For that you will need
to connect to the instance with a SSH client (do you remember open port 22?). There is great
documentation how to do it http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/
AccessingInstancesLinux.html
I would just highlight important points here.


I will use a Putty client. Documentation for using this tool is here
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/putty.html .
(If you will decide to use Putty, you would need to convert primary key you downloaded to appropriate
format via PuTTygen, consult higher mentioned link to the documentation.)


First you need to fill in an user and a hostname of the EC2 instance you want to connect to. User
should be ‘ec2-user’, hostname is value from ‘Public DNS (IPv4)’ from your instance (values need to
be separated by a ‘@’ character).


Second step is to provide private key. Go to ‘Connection > SSH > Auth’, locate .ppk file (you
downloaded .pem file during spinning up of the instance, see AWS documentation how to use
PuTTygen to get .ppk file -
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/putty.html#putty-private-key)

Hit Open (or click back to Session link in the left column to Save connection information if you would
like to reuse it).


Putty will connect you to your EC2 instance, just hit ‘Yes’ on ‘PuTTy Security Alert’ dialog to cache
server host key. If you configured everything correctly, you should be connected to the server’s ssh
console. If not, check http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/
TroubleshootingInstancesConnecting.html


My Amazon instance recommends me to trigger automatic update, so I will follow with triggering:
sudo yum update


Next step is to copy a CloverETL bundle to the instance itself, I would use my favourite WinSCP for
that (I’ll upload bundle to /tmp folder on the instance).
‘Host name’ is from ‘Public DNS (IPv4)’ again, user is ‘ec2-user’, click on Advanced and locate ssh
private key (same as in previous step with a Putty).


I’ll locate a zip file of the CloverETL bundle and will copy it into /tmp folder.


Now I can switch back to Putty console and go to /tmp folder.


Unzip the bundle with command:
unzip cloveretl-server-tomcat8_bundle-4.5.1.zip


This will unzip couple of files, I am interested only in a ‘CloverETLServer.4.5.1.Tomcat-8.0.30.zip’ for
now.


On production server I would now recommend to create specific user for clover and assign him
ownership to all folder I will create in later steps, for evaluation I will skip this.
I will create new folder ‘clover’ and unzip ‘CloverETLServer.4.5.1.Tomcat-8.0.30.zip’ it there.
You should have now folder ‘/clover/CloverETLServer.4.5.1.Tomcat-8.0.30’ with deployed files of
Tomcat and CloverETL server.


Java

If you remember System requirements for CloverETL server, http://doc.cloveretl.com/documentation/
UserGuide/topic/com.cloveretl.server.docs/docs/system-requirements-for-cloveretl-server.html , you can
see that we need a Oracle JDK, you can double check version of java on your instance with:
java -version
You may notice that we have unsupported OpenJDK currently . We need to install Oracle JDK instead.
This could be a bit cumbersome as Oracle forces you to go through their site and accept a License
agreement. I found nice workaround for that here
https://gist.github.com/rtfpessoa/17752cbf7156bdf32c59
(I assume that Oracle takes this as explicitly accepting License agreement, so don’t complain you
didn’t know about it ;).)


For current (Java8, build 131) steps would be:
sudo wget --no-cookies --header "Cookie: gpw_e24=xxx; oraclelicense=accept-
securebackup-cookie;"  
"https://edelivery.oracle.com/otn-pub/java/jdk/8u131-b11/
d54c1d3a095b4ff2b6607d096fa80163/jdk-8u131-linux-x64.rpm"
sudo rpm -i jdk-8u131-linux-x64.rpm
(notice cookie portion and apostrophes)
Check again with ‘java -version’ command if Java version got changed, if not continue with steps in
gist link.


Hold on, only couple steps left.


Now we need to configure memory available for a Tomcat (eg. CloverETL server). It should be
percentage of physical memory on host machine. We recommend something between 50-70% of
available physical memory, depending if the host is used for additional software or not.
In our case let’s make it easy, we have 16GB available, let’s give approximately 8GB to tomcat.
Type:
sudo vim /clover/CloverETLServer.4.5.1.Tomcat-8.0.30/bin/setenv.sh
And update Xmx property.
I like to explicitly set a JAVA_HOME environment property, so I will also add as a first line:
export JAVA_HOME="/usr/java/default"


Just for better feeling I would change owner of deployed files to ‘ec2-user’ with:
sudo chown ec2-user:ec2-user /clover -R


Finally I can start Tomcat and check if its running:
sudo /clover/CloverETLServer.4.5.1.Tomcat-8.0.30/bin/startup.sh
ps aux | grep clover


You should be seeing something like:


If you are seeing only one, much shorter line, consult logs:
cat /clover/CloverETLServer.4.5.1.Tomcat-8.0.30/logs/catalina.out


Now I can go back to my EC2 Dashboard and copy ‘Public DNS (IPv4)’ (same value we used as a
hostname in Putty, WinSCP), put it into a browser and add at the end “:8083/clover”.


You should be afterwards seeing front page of your freshly created CloverETL server on EC2.


You can click on ‘Activate server’ link, which will open a text field where you can copy license text
that you should have available from https://www.cloveretl.com/customer.


There might be additional configuration as:
  • Configure smtp (for emails)
  • Configure internal db (evaluation server runs on Derby DB, we recommend to use real DB for
  • production deployments)
  • Configure Tomcat as a service (or to make it start after instance reboot automatically)


These settings are out-of-scope of this article, but you should investigate them later.


I probably spent more than promised 10 minutes with all this screenshot taking and writing, but I hope
that you would be able to spin up a working CloverETL server for your testing and evaluation with this
blogpost much, much faster.