Jamie's Junk

Continuing whatever comes to mind on data mining and predictive analytics

Wednesday, August 18, 2010

Closer to liftoff…

This week we officially launched our public beta, and it was one of those moments where you’ve pushed so hard, running on adrenaline, that when you finally reach the summit you collapse because you can sleep a good sleep – if only for a moment.  With the beta launch we have people from around the world enjoying predictive analytics in the cloud via Predixion Insight.  It’s exciting watching from behind the scenes as customers launch asynchronous predictive tasks ranging in size from a few kilobytes to 100 megabytes.  The machinations of a Rube Goldberg contraption come to mind as the pieces of the system coordinate – a user presses a button, causing their data to be launched to the cloud while, simultaneously, they are automatically provisioned across an array of servers.  The data is shuttled seamlessly and invisibly between tasks on their behalf, dissected and analyzed before being dropped into a predictive report right back on their desktop.

The movement from development to beta deployment is really wonderful for me personally.  If you haven’t yet seen it, go to our website and click on play video to get an overview of the company.  If there’s one word I can say about that piece of “marketing,” it is that it is sincere.   Go ahead – go watch it.  This is something I’ve been working toward a long time.  We’re releasing a version 1 product and we have a lot to do to fully reach our goals, but right now, any user, anywhere, can access powerful, easy-to-use predictive analytics without having to jump through hoops for procurement, acquisition, installation, management, etc. etc. etc.   By creating an Excel-native, subscription-based predictive service, we’re taking the traditional barriers barring people from even opening the door to predictive analytics and slashing them to the ground.

So I was going to write a longer post explaining some more details about the product, but you should try it now (and anyway, Bogdan already wrote a great post with some feature details).  You can watch a demo that gives a lightning-fast overview of the product here and then go download and enroll for the “free trial” beta.  I have it on good authority that there may be some interesting beta events for accomplished users, and you still have the opportunity to get in early, so don’t wait!

Sunday, August 1, 2010

Predixion on the brink….

I officially started my career as “founding CTO” with Predixion on January 6, 2010, and now, just 7 months later, we are on the brink of launching the VIP beta of our new product and service, Predixion Insight, on August 2.  With a development team of only 5 people we’ve created what I think is a truly disruptive entry in the predictive analytics space, and we’re just getting started.

It’s been a very exciting time – meeting the co-founders of Predixion, deciding to venture off from Microsoft to start something new based on the ideas developed over the past several years, recruiting the best development team you could ask for, filming corporate videos at my house, meeting with customers, partners, and venture capitalists – there hasn’t been a boring day yet!

This last week we’ve moved to a new office space in Redmond and wrapped up the bits for our VIP beta.  This beta is limited to only 12 select people.  We ran two incredible online demos, and here is some of the feedback we received: “let me say that I loved it”, “Can't wait to play with the product!”, “Based on what I saw yesterday, Insight is more like a coral reef than a warm bath!”

Anyway, don’t be worried that you will be left out because you’re not part of our VIP beta – we are quickly filling up the next phase of the beta, offered on a first-come, first-served basis.   This phase will be launched on August 16th and you can sign up on our website.  We’ve been working hard and fast at making a product that you can use immediately, every day, without boundaries, and we’re on the brink of delivering it to you.  Over the next few weeks we will be creating collateral materials that make using Predixion Insight even easier.  Stay tuned, true believers, you’ll like what we have coming!

Friday, May 7, 2010

Bootstrapping Windows on GoGrid – getting your admin password on the box.

I spent a lot of time this week trying to get our service running on GoGrid as a potential alternative to Amazon’s EC2.  The jury’s still out, but they seem to offer better hardware for the price.  There are a lot of other pros and cons between the two services, but maybe that’s a subject for a future article – maybe after we make a final decision!  The nature of our service requires that we can perform on-demand machine requisitioning and provisioning.  Using Amazon’s EC2, certain aspects were easier than on GoGrid, due to the way they handle server images (“AMI” in Amazon lingo, “MyGSI” in GoGrid).  In short, the sysprep step performed on a newly provisioned machine at GoGrid causes some problems with certain services we need to run and user accounts we need to provision.
Part of the issue has to do with the way GoGrid provisions administrator passwords – on a newly provisioned machine, the administrator account will have a new password, which you would expect, but also, any additional administrators you create on the image aren’t valid after provisioning.  So the GoGrid-provisioned password is pretty much all you have.  This is OK if you can interactively log on to the machine after provisioning, but not so OK if you want to do this automatically.  To solve this problem, I came up with a method to fetch the admin password from GoGrid itself, from the machine, after launch.  We trigger this via a web service call after the machine is launched, but presumably you could do this on a startup event as well – I haven’t experimented with that yet, but it should work.
The difficulty in the solution is simply due to the limited information you have about your machine from your machine.  The basic approach is to call the GoGrid API to get the list of passwords from all your machines, and then find the password that matches the public IP of your machine.  In order to use this code, the first thing you need to do is to go to your GoGrid account page and add an API key which you will use to securely interact with the GoGrid service.  The type of API key should be System User, as that is required to fetch passwords.  This key will be embedded in your code on the GoGrid image, so you should take necessary steps to protect it.
In this solution I use the GoGridClient class from the GoGrid Wiki Documentation – copy that code and specify your api_key and shared secret.
The first task is to write a function to get the passwords from GoGrid (we wrote the GoGridIPType and GoGridIPState enums – they contain the values in the code):

public static string GetPasswordsRaw() // returns the raw XML as provided by GoGrid
{
    string returnValue = String.Empty; 
    try
    {
        GoGridClient grid = new GoGridClient(); 
        System.Collections.Hashtable parameters = new System.Collections.Hashtable();
        parameters.Add("format", "xml");
        string requestUrl = grid.getAPIRequestURL("/support/password/list", parameters);
        returnValue = grid.sendAPIRequest(requestUrl);
    }
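    catch(Exception)
    {
        // Errors are deliberately swallowed here: the caller gets an empty
        // string and must handle it (XmlDocument.LoadXml throws on empty input).
    }
    return returnValue;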
}

After you have this function, you need a function to get the list of IP addresses from your machine and compare it to the IP addresses from GoGrid.  The function first grabs all of the IP addresses from the local machine and then uses XPath queries to isolate and iterate the password objects from the GoGrid response.  Then it uses more XPath queries to grab the IP address and password from each object.  Finally it checks to see if the IP address matches any IP address on the machine and returns the associated password.

private string GetAdminPassword()
{
    // Fetch ip addresses for the local machine and store into a list
    List<string> ipaddresses = new List<string>();
    System.Net.IPHostEntry IPHost = System.Net.Dns.GetHostEntry(System.Net.Dns.GetHostName());
    foreach (System.Net.IPAddress ip in IPHost.AddressList)
    {
        // Only take the IPv4 addresses
        if (ip.AddressFamily == System.Net.Sockets.AddressFamily.InterNetwork)
        {
            Report("Found ip: {0}", ip.ToString());
            ipaddresses.Add(ip.ToString());
        }
    } 

    // Get the password information from GoGrid and load into an XML document
    string xml = GetPasswordsRaw();
    XmlDocument d = new XmlDocument();
    d.LoadXml(xml); 

    // Use Xpath to select the "password" objects
    string path = "/gogrid/response/list/object[@name='password']";
    XmlNodeList nodes = d.SelectNodes(path);
    foreach (XmlNode node in nodes)
    {
        // Extract the password and ipaddress from the password object
        XmlNode pwdnode = node.SelectSingleNode("attribute[@name='password']");
        XmlNode ipnode = node.SelectSingleNode
            ("attribute[@name='server']/object[@name='server']/attribute[@name='ip']" +
             "/object[@name='ip']/attribute[@name='ip']"); 

        // API Key passwords will not have an ipnode
        if (pwdnode == null || ipnode == null)
            continue; 

        string password = pwdnode.FirstChild.Value;
        string ipaddress = ipnode.FirstChild.Value;
        // Check to see if the ipaddress belongs to this machine
        if (ipaddresses.Contains(ipaddress))
            return password;
    }
    throw(new SystemException("Did not find password"));
}

Once you have the admin password, you can use it to impersonate the box admin as necessary to run additional code requiring such privileges.   It really helps in allowing us to automatically deploy boxes on GoGrid.  Given the creative commons license of the GoGrid API, the same technique should apply to other cloud providers as necessary.
Hope this helps with your cloud infrastructure deployments – love to hear your comments.

Friday, April 30, 2010

Cheers from the Predixion Dev Team!!

We’re assembled and ready to rock!  Have a great weekend!

-Jamie and the PX Devs

Tuesday, April 13, 2010

Cases lost in Time

This post was inspired by a question on the MSDN data mining forum that we knew would come to us one day.  When developing the SQL Server Data Mining platform, we had made one of those design decisions that was kind of wonky, but made sense if you turn your head sideways and squint a bit.  It all resolved to the fact that since our Time Series algorithm was based on Decision Trees, we could use the Decision Tree viewer to show more information about your time series model than anyone had ever seen before – you could see piecewise linear regressions for each distinct pattern over time – it was one of those “OMG – it’s full of stars….” moments.

Anyway, one of the things that you get to see when using the Decision Tree Viewer is the number of cases or facts or rows or whatever you want to call them.  This information shows up in the Mining Legend.

So, when you create a time series model, you get the same kind of information – Total Cases = some number.  Nobody really considered that number too harshly in SQL Server 2005, but then we greatly improved the Time Series algorithm in 2008, and things changed.  The most obvious change is that we supplemented our 2005 decision tree algorithm, ARTXP, with a (fairly) standard implementation of the ARIMA time series algorithm.  A user noticed that if they created a model using only the ARIMA algorithm, the “Total Cases” number was higher than when they used ARTXP or the default blended mode.

So, is ARTXP eating cases?  Is it ignoring valuable slices of time lost to eternity?  No, not really – like I said, if you turn your head and squint it really does make sense that ARTXP will have fewer “cases” than ARIMA.  The part that doesn’t make sense is that to satisfy the devil of “consistency” we kind of overloaded the term “cases”.  ARIMA – Auto Regressive Integrated Moving Average – is more of what you would naturally think of in a forecasting algorithm – it performs calculations on time slice values to determine patterns and make forecasts.  ARTXP – Auto Regressive Trees with cross (X) Predict – on the other hand, doesn’t work in a “way you would naturally think” kind of way.  ARTXP decomposes the time slices into a series of “cases” that it then feeds to the decision tree engine.

Let’s examine how this works.  Let’s take a simple series with 10 values – this one should do:

11, 12, 13, 14, 15, 16, 17, 18, 19, 20

If we assume AR(4), that is, using 4 values to predict our “target”, we get “cases” that look like this:

Case   Input (t-4)   Input (t-3)   Input (t-2)   Input (t-1)   Predict (t)
1      11            12            13            14            15
2      12            13            14            15            16
3      13            14            15            16            17
4      14            15            16            17            18
5      15            16            17            18            19
6      16            17            18            19            20

You see that for each time (t), we need to take the previous values (t-1), (t-2), (t-3), and (t-4).  This means that the first four values of the series aren’t available as case targets – they don’t have four preceding values to use as inputs.  In the end, for 10 time slices using AR(4), you end up with only 6 “cases” to analyze.  Whereas if you used ARIMA, it would simply use all the slices and the “Total Cases” would be 10.
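
The decomposition above can be sketched in a few lines of C#.  This is only an illustration of the windowing idea, not the actual ARTXP implementation; the class and method names (LagCases, Build) are made up for this example.

```csharp
using System;
using System.Collections.Generic;

public static class LagCases
{
    // Turns a series into rows of `order` lagged inputs followed by one target.
    // Hypothetical helper for illustration; not part of SQL Server Data Mining.
    public static List<double[]> Build(IList<double> series, int order)
    {
        if (series == null) throw new ArgumentNullException("series");
        if (order < 1) throw new ArgumentOutOfRangeException("order");

        var cases = new List<double[]>();

        // The first value that has `order` predecessors sits at index `order`,
        // so the first `order` values never become targets.
        for (int t = order; t < series.Count; t++)
        {
            var row = new double[order + 1];
            for (int lag = order; lag >= 1; lag--)
                row[order - lag] = series[t - lag];   // inputs: t-order ... t-1
            row[order] = series[t];                   // target: t
            cases.Add(row);
        }
        return cases;
    }

    public static void Main()
    {
        var series = new List<double> { 11, 12, 13, 14, 15, 16, 17, 18, 19, 20 };
        var cases = Build(series, 4);

        Console.WriteLine("Total cases: {0}", cases.Count);
        foreach (var row in cases)
            Console.WriteLine(string.Join(" ", row));
    }
}
```

Run on the 10-value series with an order of 4, this reports 6 cases, one per row of the table, while the series itself still has 10 slices.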

So, like I said – turn your head and squint and it makes sense.  Of course, once you understand this, the “Total Cases” for the ARIMA models doesn’t make sense.  (cue evil laughter).  Yeah yeah – it doesn’t make sense, but you know what it means.

Anyway, for other cases lost in time, I realized I missed an important series in my digest of postings of yore – the incredible Time Series Reporting Stored Procedure series – a three-part series in four parts – go figure – it’s kind of like that cases lost in time in reverse, I suppose.  This series shows how to create a report that contains both the historical data and predicted data from a Time Series model.

TS Reporting Sproc Part 1
TS Reporting Sproc Part 2
TS Reporting Sproc Part 3
TS Reporting Sproc Part 4

I do believe that is the last of the digested posts of yesteryear.  I’ll have some more coming up as Predixion motors on!

Tuesday, March 16, 2010

Executing DMX DDL from a linked server

Luckily, before I left MSFT, I had the foresight to change my contact email on that old web blog of mine that I’m no longer able to contribute to – no hard feelings.  I received a question about something that has come up frequently enough that it just needs to be dealt with, so for all future posts, you can just say “look at the Executing DMX DDL from a linked server post on Jamie’s new blog – the only blog that matters,” and be done with it.

So just for definition’s sake – DMX – Data Mining eXtensions to SQL, DDL – Data Definition Language, DMX DDL – DMX statements that create or modify objects!  You would think you can add two TLAs and get an SLA, but that stands for “service level agreement” which has nothing to do with this post.  This post could also have been named “how to execute non-rowset returning commands on Analysis Services from SQL Server”, but not only do I digress, I like the actual title better with the dual unpronounceable acronyms.

Anyway, in my DMX Digest post, I referenced this post which showed how to execute DMX statements from SQL and put the results in a SQL table.  In short (just in case you don’t want to click those links), you set up a linked server and then use OPENQUERY to make the DMX call.  One (well, at least one) adventurous reader saw fit to try other kinds of statements than queries – in particular a DROP MINING STRUCTURE statement.  The problem with DROP MINING STRUCTURE statements – and other DDL statements – is that they don’t return a rowset, which is a requirement for OPENQUERY – which really wants some output columns to bind to.

The nice way to do this would be to take advantage of the SQL EXECUTE command, which, at least in SQL Server 2008, has been extended to execute commands on linked servers.  Such a command would look very elegant, like this:

EXECUTE ( 'DROP MINING STRUCTURE [MyMiningStructure]' )
AT MyDataMiningServer


Wow – that would be nice!  If only it worked, that is.  If you endeavor to try such a thing you’ll get the pleasant response of “Server 'MyDataMiningServer' is not configured for RPC.”  What this means, evidently, is that the nice way of doing things isn’t going to happen.



But, never fear, we can take advantage of all that boundless flexibility built in to SQL Server Data Mining to make it happen.  All we need to do is to create some kind of statement that can be called from SQL Server’s OPENQUERY that executes a statement of our choosing.  And the way to do this is to write a stored procedure that executes a statement and returns some sort of table.  This is the really big hammer solution to the problem.



And what do you know, I happen to have that stored procedure right here…..



using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Data;
using Microsoft.AnalysisServices.AdomdServer;

namespace DMXecute
{
    public class Class1
    {
        // Executes an arbitrary statement on the Analysis Services server and
        // returns an empty table so that OPENQUERY has columns to bind to.
        [SafeToPrepare(true)]
        public DataTable DMXecute(string statement)
        {
            DataTable Result = new DataTable("Table");
            Result.Columns.Add("Column", typeof(int));

            // During prepare, return only the table shape without executing.
            if (Context.ExecuteForPrepare)
                return Result;

            AdomdCommand cmd = new AdomdCommand(statement);
            cmd.ExecuteNonQuery();

            return Result;
        }
    }
}



And calling it from SQL Server – easy



EXEC sp_addlinkedserver @server='MyDataMiningServer', -- local SQL name given to the linked server
@srvproduct='', -- not used
@provider='MSOLAP', -- OLE DB provider
@datasrc='localhost', -- analysis server name (machine name)
@catalog='MyDMDatabase' -- default catalog/database

GO

SELECT * FROM OPENQUERY(MyDataMiningServer,'CALL DMXecute.DMXecute("DROP MINING STRUCTURE [MyMiningStructure]")')

GO



Of course, you can execute any DMX or MDX statement you want there, so this is simply dangerous in general – you definitely shouldn’t be sending unvalidated user input through here for fear of SQL injection-style attacks.  A better way, in general, would be to write stored procedures that perform exactly the operations you need, taking just the object name as a parameter.
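
As a rough sketch of that narrower approach, the helper below accepts only an object name, checks it against a whitelist, and builds the one statement it is meant to build.  The class name, method name, and whitelist pattern are all assumptions made for illustration; adjust the pattern to your own naming rules.  Inside a real stored procedure you would pass the returned string to an AdomdCommand, the same way the DMXecute procedure does.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SafeDdl
{
    // Hypothetical whitelist: letters, digits, spaces and underscores, up to
    // 100 characters.  Brackets, quotes and semicolons are rejected outright.
    // \z (not $) is used so a trailing newline cannot slip through.
    private static readonly Regex ValidName =
        new Regex(@"^[A-Za-z0-9_ ]{1,100}\z");

    // Builds a DROP MINING STRUCTURE statement for a validated object name.
    public static string BuildDropMiningStructure(string structureName)
    {
        if (structureName == null || !ValidName.IsMatch(structureName))
            throw new ArgumentException("Invalid mining structure name.", "structureName");

        return "DROP MINING STRUCTURE [" + structureName + "]";
    }
}
```

With this in place, a caller can only ever drop a mining structure by name; anything that looks like an attempt to close the bracket and append another statement is refused before it reaches the server.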


Sunday, March 14, 2010

Gluten Free Waffles

OK, this posting has nothing to do with SQL Server, Data Mining or Predictive Analytics, or even Predixion.  It’s kind of a follow-up to my previous post – I’ve gotten some emails and other communiqués about my twins’ diet.

Every Sunday is family waffle day, and I’ve come up with a pretty good waffle recipe for the boys.  I have to make a “regular” batch for the older kids, April, and myself, and I make a gluten-free, casein-free batch for the boys.  Of course, I have to use a separate waffle iron to avoid contamination.

Anyway, the recipe I use is as follows:

Turn the waffle iron on to high to heat up while you prepare the ingredients.  In a medium-large bowl, mix all the dry ingredients.  In a separate medium bowl, beat the egg yolks up a little bit, and then add the vanilla, rice milk, and canola oil.   Pour the wet ingredients, save the egg whites, into the dry ingredients and mix well.

Using an electric mixer, beat the egg whites until stiff peaks form.  Gently fold the egg whites into the mixture so it is mixed but all the air doesn’t escape from the egg whites.

Pour 1/3 cup of mixture onto each waffle area of the iron.  Gluten-free waffles take a bit longer to cook than their glutinous counterparts – I usually increase the time by 1 minute, which means it takes 6 minutes for a batch on our waffle iron, but your mileage may vary.

NB:  I use a PAM cooking spray to keep the waffles from sticking.  PAM and all other cooking sprays contain soy lecithin.  Typically, we avoid soy, but it seems that my boys aren’t sensitive to small amounts of soy lecithin.  If your child is sensitive, you can brush on canola oil with a pastry brush or paper towel.

Makes 10-12 waffles.

Enjoy!