What is Analytical Base Table?
Analytical Base Table (ABT) is a flat table that is used for building analytical models and scoring (predicting) the future behavior of a subject. Wikipedia
Let me walk through a simple real-world scenario and see how best we can use this LAG function in SQL Server. By the way, I am a big fan of LEAD and LAG functions. When it comes to predicting, the most important exercise, in my opinion, is to build a model with sufficient features for every subject closely resembling the actual reality.
But in reality, we don’t have all that information on hand and in most cases, we have no way of capturing the same. For example, in a fictional Pizza Shop, imagine that we just have 2 pieces of information about an Order.
Order Date & Time and the Total Number of Pizzas Ordered.
Pizza Shop owner wants to predict what his load (Orders) looks like at least 4 hours ahead on any given day. So how we do we setup/create an Analytical Base Table (ABT) to help the predictions.
Remember we just have few sets of features and in most cases does not reflect all the reality supporting these features. For example, we might notice a large single order in the last six months of data. It appeared only once in your data set. It could be due to a High School home game, a maybe local high school team made it to the playoffs for the first time and they won and placed the order as part of the celebrations. We don’t know that and we don’t have that info. You can apply Standard Deviation exercise and eliminate this type of data.
A simple approach is to come with some derived features. Apply BIN technique. If you are familiar with BINNING you can skip this part. In a simple explanation, Binning is just creating groups with a range of values to minimize the data noise.
Take a look at the data. Order Date has both date and time.
select top 10 * from PizzaOrders
Try adding a derived feature named, OrderHour. Instead of Date Time, we will be using this new feature (OrderHour) as an independent variable to predict the Orders (dependent variable). But by having just an hour and the total number of orders are not going really help us.
select top 25 OrderDate, datepart(hour,OrderDate) OrderHour, Orders from PizzaOrders
Try adding another derived feature name, WeekDayNumber. This might help us in predicting the possible number of orders for any given day in the week and for any given hour.
select top 25 OrderDate, datepart(dw,OrderDate) WeekDayNumber, datepart(hour,OrderDate) OrderHour, Orders from PizzaOrders
When it comes to predicting and from Machine Learning point of view, more features add more value. In this example below, I am using the LAG function in SQL Server to add another feature showing the total number of orders placed an hour ago on the same day. If you look at the results, you would notice that the first row does not have value (0 – Zero) in this new column. Reason for this Pizza Shop operates from 8am till 8pm.
select top 25 OrderDate, datepart(dw,OrderDate) WeekDayNumber, datepart(hour,OrderDate) OrderHour, Orders, lag(Orders,1,0) over (order by OrderDate, datepart(hour,OrderDate)) Hour1 from PizzaOrders
We can add one more feature to show the total number of orders 2 hours back on the same day. Notice that now LAG is using a number 2.
select top 25 OrderDate, datepart(dw,OrderDate) WeekDayNumber, datepart(hour,OrderDate) OrderHour, Orders, lag(Orders,1,0) over (order by OrderDate, datepart(hour,OrderDate)) Hour1, lag(Orders,2,0) over (order by OrderDate, datepart(hour,OrderDate)) Hour2 from PizzaOrders
We can add another feature showing the total number of orders for a given day (not by Hour, by day) which will add more value from prediction modeling.