Polybase in SQL Server 2019

Polybase in SQL Server 2019

I had the opportunity to use Polybase back in 2015 as part of a data warehouse project using PDW. Yes, PDW not SQL Server. It was an obvious choice because data was hosted on Hadoop (HDFS) system and we made conscious decision not to stage the data in PDW.  In order to stage data in PDW data still needs to be moved out of HDFS and should be landed in PDW.

From Analytical and ETL/ELT perspective moving data from one system to other systems purely for the purpose of staging did not make sense and Polybase was a perfect choice. Polybase in PDW was limited to access to data from HDFS at that time. SQL Server 2016 made it available but still limited to HDFS.

SQL Server 2019 takes it to the next level. I think Microsoft made a huge step forward by not only investing in Polybase supported it even extended further

Polybase in SQL Server 2019 supports data access from a variety of external systems, including HDFS (Big Data),  Relational (RDBMS) database, for example, Oracle, NoSQL (MongoDB) and famous ODBC sources. Polybase now can access data from other SQL Server.

From an implementation point of view, Oracle JRE 7 or above should be installed on the same box where SQL Server 2019 is installed with necessary ports open.

Creating an External Table (Yes it is called an External Table and it uses Polybase behind the scenes) is pretty straightforward.

CREATE EXTERNAL TABLE [dbo].[ExternalTableName] (
   -- define table columns
)
WITH (LOCATION='hadoop file location',
      DATA_SOURCE = hadoop_datasource_name,
      FILE_FORMAT = compressed,
      REJECT_TYPE = VALUE,
      REJECT_VALUE = 0
)

PolyBase is designed to optimize and reduce Data Movement by using a concept called push-down computation. This forces Polybase to execute SQL operations to the source system. For example, a GROUP BY operation using an External Table will force the GROUP BY operation on Hadoop instead of bringing the data across. In this scenario, Polybase will initiate the operation and heavy lifting is done by Hadoop by spinning Map and Reduce operation as necessary. Once Hadoop work is done results will be sent back to Polybase which intern displays the results.

In SQL Server 2019, Big Data Clusters the SQL Server engine has gained the ability to read HDFS files natively, and by using SQL Server instances on the HDFS data nodes to filter and aggregate data locally.

There is another option in SQL Server 2019 too. We can scale-out SQL Server Instances and add them as Polybase Group. Conceptually, one SQL Server instance can act as a Head Node and rest of them as Work Nodes similar to the Hadoop Architecture. Here the idea is to improve parallel processing and distributing the workload.

SQL Server 2019 is in CTP 2.1 now and in my opinion, will go through a good number of iterations before production ready as some of these features need to evaluated in the production environment and real scenarios.