Tuesday, April 13, 2010

Cases lost in Time

CLT

This post was inspired by a question on the MSDN data mining forum that we knew would come to us one day.  When developing the SQL Server Data Mining platform, we had made one of those design decisions that was kind of wonky, but made sense if you turn your head sideways and squint a bit.  It all resolved to the fact that since our Time Series algorithm was based on Decision Trees, we could use the Decision Tree viewer to show more information about your time series model than anyone had ever seen before – you could see a piecewise linear regressions for each distinct pattern over time – it was one of those “OMG – it’s full of stars….” moments.

monolith Anyway, one of the things that you get to see when using the Decision Tree Viewer is the number of cases or facts or rows or however you want to call them.  This information shows up in the Mining Legend, like this:

image So, when you create a time series model, you get the same kind of information – Total Cases = some number.  Nobody really considered that number too harshly in SQL Server 2005, but then we greatly improved the Time Series algorithm in 2008, and things changed.  The most obvious change is that we supplemented our 2005 decision tree algorithm, ARTXP, with a (fairly) standard implementation of the ARIMA time series algorithm.  A user noticed that if they created a model using only the ARIMA algorithm, the “Total Cases” number was higher than when they used ARTXP or the default blended mode.

S0, is ARTXP eating cases?  Is it ignoring valuable slices of time lost to eternity?  No, not really – like I said, if you turn your head and squint it really does make sense that ARTXP will have less “cases” than ARIMA.  The part that doesn’t make sense is that to satisfy the devil of “consistency” we kind of overloaded the term “cases”.  ARIMA – Auto Regressive Integrated Moving Averages – is more of what you would naturally think of in a forecasting algorithm – it performs calculations on time slice values to determine patterns and make forecasts.  ARTXP – Auto Regressive Trees with cross (X) Predict - on the other hand, doesn’t work in a “way you would naturally think” kind of way.  ARTXP decomposes the time slices into a series of “cases” that it then feeds to the decision tree engine.

Let’s examine how this works.  Let’s take a simple series with 10 values – this one should do:

11, 12, 13, 14, 15, 16, 17, 18, 19, 20

If we assume AR(4), that is, using 4 values to predict our “target”, we get “cases” that look like this:

Case Input Input Input Input Predict
1 11 12 13 14 15
2 12 13 14 15 16
3 13 14 15 16 17
4 14 15 16 17 18
5 15 16 17 18 19
6 16 17 18 19 20

You see that for each time (t), we need to take the previous values (t-1), (t-2), (t-3), and (t-4).  This means that the first four values of the series aren’t available as case targets – they are preceded by nothing.  In the end, for 10 time slices using AR(4), you end up with only 6 “cases” to analyze.  Whereas if you used ARIMA, it would simply use all the slices and the “Total Cases” would be 10.

So, like I said – turn your head and squint and it makes sense.  Of course, once you understand this, the “Total Cases” for the ARIMA models doesn’t make sense.  (cue evil laughter).  Yeah yeah – it doesn’t make sense, but you know what it means.

Anyway, for other cases lost in time, I realized I missed an important series in my digest of postings of yore – the incredible Time Series Reporting Stored Procedure series – a three-part series in four parts – go figure – it’s kind of like that cases lost in time in reverse, I suppose.  This series shows how to create a report that contains both the historical data and predicted data from a Time Series model.

TS Reporting Sproc Part 1
TS Reporting Sproc Part 2
TS Reporting Sproc Part 3
TS Reporting Sproc Part 4

I do believe that is the last of the digested posts of yesteryear.  I’ll have some more coming up as Predixion motors on!

No comments:

Post a Comment