Migrating On-Premises Data Analytics Capabilities to Cloud – Myth or Reality?

Migrating On-Premises Data Analytics Capabilities to Cloud – Myth or Reality?

Data Analytics Capabilities
From an organizational point of view, apply data analytics to support decision making, supported by tools, technologies and well-defined business processes.

  • Data – Master Data, Operational and Analytical
  • Tools to process, analyze data and empower business units for decision making and evaluate continuously
  • Technologies to store, secure, manage and maintain data integrity
  • Business processes to define rules, data governance, control, and audit practices
  • People (Well not everything can be automated in the data analytics world!)

So what is actually being migrated to cloud?

  • People – NO
  • Business processes  – NO
  • Technologies – YES (partial to full-fledged)
  • Tools – YES (partial to full-fledged)
  • Data – YES (partial to full-fledged)

As far as the actual migration, in my opinion, it all depends on how we plan and execute. As organizations start growing in all directions, the complexity of the data and dependency on other data will only multiple.  This poses a great challenge for Cloud Migration but can be addressed with careful planning and data decoupling.

In theory and on paper migrating cloud looks good. So I thought lets put this theory into a proof-of-concept model and evaluate cloud features and offerings and how a simple cloud migration path looks like for an existing on-premises Data Analytics solution. To set the context, in this article I would like to focus on the Data Analytics solution. Cloud migration of an exciting Operational solution will be a different article.

For the purpose of proof-of-concept, I’m using Azure Cloud and Azure Data Factory.

What is the need to move on-premises Data Analytics solutions to Cloud?
Data is the most important asset of any organization. Data reflects how the organization is performing and where is it heading in the future. Data Analytics plays a crucial role in identifying any future opportunities apart from monitoring the current performance. With this in mind, imagine where Data Analytic Capabilities are up-to-date with the current trends rather than trying to catch up or running a few generations back.

With Big Data, Real-Time and Micro Batch, Data Analytics are almost running parallel to the Operational side of the business.  Organizations can now answer questions quickly, instead of “Wait and See”

Removing the Wait from the “Wait and See” is where Cloud comes into the picture. Most (not all of them) data professionals are very cautious thinkers and always concerned about the data availability and most importantly data movement.

Data movement is a valid concern and directly impacts the analytical work in progress. To clear the concerns, when Netflix is streaming at 4K, without the loss of frames, in my opinion, data movement won’t be a big factor anymore when we talk about Cloud Migration.

Here is a simple proof-of-concept I had validated using Azure Data Factory.

We will go more into Azure Data Factory later in this article but to keep it simple, my use case for this proof-of-concept is, lets say my organization have an operational on-premises SQL Server, on-premises Data Mart on a SQL Server and ETL investment is in Microsoft ETL stack (SSIS) with industry standard Analytical Tool Set for data analysis (Tableau, Excel etc)

This is use case has a very basic architecture but covers most of the aspects. Architecture gets complicated when we see a variety of non-Microsoft ETL tools and technologies.

From the use case perspective, scaling of this simple solution becomes a bottleneck when data starts growing, demanding more computing and storage space. This situation will have a direct impact on the day to day Data Analytics.

To remove the mental block about the data movement, I had to experiment with Azure Data Factory first. With a little bit of planning, I was able to establish a Copy Pipeline, which copies data from on-premises SQL server to the Azure SQL Server.  In simple terms, we have a Source and Target and in this case, both are identical with only one difference, Source is On-Premises and Target is in the Cloud.

If we are smart enough to architect it differently, there is an opportunity to decouple the app (Website or Internal Application) from using the on-premises database and point it to the database in the cloud. If we can achieve that, we not only eliminate the Copy Pipeline, we also solve the issue of Scaling.

Well, decoupling is not that simple, requires planning, identify and define all the dependencies. considerations.  As this article is about Data Analytics, decoupling apps from the on-premises database is a separate article.

Here is how Copy Pipeline looks like in Data Factory.


Visual representation in Azure Data Factory

Visual representation of Copy Pipeline Azure Data Factory

What is interesting Data Factory is, once the Copy Pipeline is defined, we can schedule it, monitor it and look at all the performance indicators about the Copy Pipeline, all in one place and without any additional effort to code it.

While I was building this I was thinking what if my on-premises server is very critical to the business app? What are the adverse effects of this Copy Pipeline which I just defined in Data Factory? To mitigate this, from an architecture perspective, we may be able to take advantage of the SQL Server Replication features and avoid Copy Pipeline copying data directly from the operation database.

By the way, if you have Code Eyes like me, here is the JSON behind the Copy Pipeline object. This gives me more comfort level because Copy Pipeline is a code snippet after all. The advantage I see here is, we can directly configure it and also understand the logic better.

{
    "name": "CopyPipeline_8y3",
    "properties": {
        "activities": [
            {
                "name": "Copy_8y3",
                "type": "Copy",
                "policy": {
                    "timeout": "7.00:00:00",
                    "retry": 0,
                    "retryIntervalInSeconds": 30,
                    "secureOutput": false,
                    "secureInput": false
                },
                "userProperties": [
                    {
                        "name": "Source",
                        "value": "[dbo].[Customer]"
                    },
                    {
                        "name": "Destination",
                        "value": "[dbo].[Customer]"
                    }
                ],
                "typeProperties": {
                    "source": {
                        "type": "SqlSource"
                    },
                    "sink": {
                        "type": "SqlSink",
                        "writeBatchSize": 10000
                    },
                    "enableStaging": false,
                    "dataIntegrationUnits": 0,
                    "translator": {
                        "type": "TabularTranslator",
                        "columnMappings": {
                            "Name": "Name"
                        }
                    }
                },
                "inputs": [
                    {
                        "referenceName": "SourceDataset_8y3",
                        "type": "DatasetReference"
                    }
                ],
                "outputs": [
                    {
                        "referenceName": "DestinationDataset_8y3",
                        "type": "DatasetReference"
                    }
                ]
            }
        ]
    },
    "type": "Microsoft.DataFactory/factories/pipelines"
}

After I was comfortable with Copy Pipeline, my next task is to spin up a Azure SQL Server Data Warehouse database in the Cloud.  By the way Copy Pipeline has many features like Sink tables and Staging etc to improve the performance of the actual data movement.

Here is a screenshot of the Azure SQL Date Warehouse database. This is just for our visual reference that my Data Warehouse database is now sitting in the Cloud. While I was creating this database, I have used the AdventureWorksDW as the model database. We can create an empty one and start building it from scratch or generate database scripts from the existing on-premises data warehouse database and execute them in the cloud.

Quick recap.

With completing first 2 steps,

  • I now have a Copy Pipeline to copy data from my source (on-premises) SQL Server database to the SQL Server database in Azure Cloud.
  • I also created an Azure SQL Data Warehouse database in the cloud

My next task is to implement the ETL and land the data in Azure SQL Data Warehouse for Analytics.

Before I go ahead and define the ETL, I had to set up and configure SSIS environment, similarly what we have to do now on-premises. The only difference is this SSIS environment will be sitting in the Cloud.

What is interesting about this step is I get to select what I need (like processing power, nodes etc) for my specific requirement rather than simply sharing or piggybacking on a variety of on-premises SSIS Server environments. I can choose Node Size, Number of Nodes etc.

 

Question?
Do we really need this Integration Runtime Setup? Remember everything in Cloud will cost you. It depends and mainly influenced by your overall ETL Architecture. As I was planning to use SSIS (SQL Server Integration Services) packages I had to set up an SSIS environment in the Cloud to execute SSIS packages.

My next step is to create an Execute SSIS Package Pipeline. Here is the screenshot and notice that I had set it up to use the SSIS environment I had created in the previous step.

This screenshot is showing all the properties for an SSIS Package. If you notice it closely, there is a folder on the Cloud where we can drop-in the dtsx package. Yes, dtsx packages.!!! This is a very crucial game changer for me from both time and money perspective.

If your organization had invested heavily in the SSIS for the ETL and if you are ready to move to Cloud, you can move all your existing SSIS packages to the Cloud and execute them in the Cloud Integration Services Environment. You can still keep developing the SSIS packages locally (on-premises) and move them to Cloud when they are tested and Prod Ready. How cool is that!

Also, I have noticed a very interesting setting on the SSIS Package Pipeline. We can identify the sensitivity of the data for both Output and Input and when checked, SSIS log will not have this information logged.

My use case at this point may look like oversimplified but if closely notice I was following the same pattern I normally practice on-premises with all the same tools, with one major difference is that I get to choose what I need and what I can afford and I can scale up or scale down based on my business needs any time.

We are into a different territory now. No “Wait and See” it is now “Just Do and See”

Here comes the last step in my use case i.e. doing actual the Data Analysis using tools like Tableau and Excel.

I’m using Excel for this purpose. Excel has additional Cloud-enabled features and in fact, I’m using Office 365 subscription, which is cloud-based and never have to worry about upgrading every time. Excel has native data connections to support Azure SQL Database and Azure SQL Data Warehouse, which I feel really cool.

Here I’m using the Power Pivot in Excel connecting to the Cloud database (Azure SQL Data Warehouse). If you recollect from this article when I created that Azure SQL Data Warehouse in the cloud, I had opted to model it after AdventureWorksDW.

 

Hope my use case helps you in understanding the bits and pieces involved in migrating your Data Analytics solution to the Cloud.

Let’s revisit the main question for this article. Migrating Data Analytics Capabilities to Cloud – Myth or Reality?

In my opinion, Data Tools and Technologies are rapidly changing and the requirements and expectations around Data Analytics are fast changing too. With the current Cloud ETL Tool Set (I have only evaluated Azure at this point) and with the speed of data movement, Data Analytics in the Cloud is not a myth anymore and in fact, it will be the reality sooner than expected for organizations.  With careful planning and thoughtful architecture, I am sure organizations can reap the benefits of Data Analytics in the Cloud!

May The Cloud Be With You 🙂