Saturday, November 12, 2011

Our own 11-11-11–the launch of Predixion Insight 2.0!

skyrimAlthough the weekend is likely to be more filled with Skyrim than Predixion Insight, today (being minutes before I posted this) we launched our groundbreaking Predixion Insight 2.0 cloud predictive analytics service.  Accompanied by the musical stylings of our own Matthew Meadows the dev team works out the last details in order to make sure we have a smooth deployment of the cloud platform.   As the final build draws to a close, the team chats about the distribution of Now & Laters , laughs about political gaffes, and the third thing, oops, I can’t remember the third thing.
The last few weeks as we’ve been able to step back and actually use the product we’ve built, it’s truly a delight.  Bogdan and I have both just stood in the hallway and related how fun it is.  Who would think that a predictive analytics business application could be “fun”?  The collaborative visualizations we’ve added to the product transform how we think about problems and how we talk about them.  The features even have changed how we conduct bug triages as we can link to the product directly from Visual Studio bug reports and have the issues presented right there.  No need to capture a bitmap or rely on the descriptive range of test prose – we can just link to the freakin’ issue!  Boom – there it is!
And it’s pretty too – overlaying tiles and animated graphics really transform the pattern discovery experience.  Being able to share your explorations with colleagues through commenting and actually see exactly what they were seeing when they made the comment is remarkable once you start using it.  I can’t imagine going back to a single analyst view of the world after using Insight 2.0 – it would just seem….stifling.
This release means a lot to me personally – my goal has always been to change the world through software, and with Predixion I believe I’m in the right place with the right team to make it happen.  We have an incredible development org right now that deserves all the credit in the world for building such an amazing product that will only get better.  I really feel honored to be working with this great team that made this possible:
  • Bogdan Crivat
  • Shuvro Mitra
  • Abdul Shameer
  • Duong Nguyen
  • Jeff Willis
  • Tatyana Yakushev
  • Matt David
  • Yimin Wu
  • Matthew Meadows
  • Raghu Ramachandran
Also – it goes without mentioning that the development team couldn’t make this happen in a vacuum and it’s the SoCal and roaming Predixion team members and of course our investors that create the environment that allows us to deliver our vision and technology to the world.  So without listing our company directory, I’ll just say that I appreciate all the pulls and tugs and support from all the other Predixionites that make this possible.
Anyway - 2.0 is really just the beginning from which we will continue to make dramatic changes in not just “predictive analytics” but how people value and appreciate their data and how data can make a difference in people’s lives.  And just to wrap it up, and since I made it today – here’s a tutorial video introducing the frame controls for Predixion Insight 2.0.
Predixion Insight 2.0 Viewer Controls Tutorial

Tuesday, November 8, 2011

Introducing Predixion Insight 2.0!


We are on the verge of releasing Predixion Insight 2.0 and I wanted to give everyone a heads up on some of the new capabilities.  It’s been a very exciting release and the product (if I may say so myself) is looking beautiful!  Every time I step back and actually use the product (rather than develop it) I’m stunned by what we’ve accomplished!

We have in short measure changed what it means to “do” predictive analytics.  Predixion Insight 2.0 provides real advances in discovery and collaboration that significantly improve how users will even be able to talk about their predictive work. 

In my mind there’s so much more road for us to travel, so much more we can – and will – be doing to transform predictive analytics to predictive intelligence – but this milestone puts quite a bit of that highway behind us.

I put together a short vblog showing some of the changes we’re releasing this week.  Later I’m sure we’ll have a nice polished version that’s messaged appropriately, but this one is just a fairly raw footage of me introducing 2.0 to you.  Enjoy and come back for more!.


A taste of Predixion Insight 2.0

Tuesday, August 16, 2011

Scoring R models against PowerPivot Data–Part 3–Evaluating and Scoring

This is part 2 of a three part series about integrating R and PowerPivot.
  • Part 1 – Intro (includes links to necessary components)
  • Part 2 – Creating a classification model in R
  • Part 3 – Evaluating and Scoring R models in Predixion
In my previous post, I walked through the creation of a classification model in R and imported the model into Predixion Insight.  The necessary files for this part are on my SkyDrive – in order to follow through with the steps in this phase you will need to import the PMML file into Predixion Insight if you don’t already have it.  (Go to Insight Analytics/My Stuff – click “Import”).  Also, open the Excel file “R Demo.xslx”.
In this installment I will show you how to create a profit chart using the R model against data in Excel, and then score the model against data in PowerPivot.
A profit chart allows you to determine how much profit (or loss) you will achieve by using a predictive model to select which cases – e.g. leads, customers, etc – to target.  The profit chart gives you information on how to apply the model to data in order to maximize your profit based on your cost metrics.
To launch the Profit Chart wizard, click on the Profit Chart button in the “Test” section of the Insight Analytics  ribbon.
Once the Profit Chart wizard appears (you will be prompted to log in to Predixion Insight if you haven’t already) skip past the intro screen and select the model you imported from R – it will be under the PMML Models node which appears after all of the Predixion datasets
Note that when you select your PMML model you will see information about the model in the description pane.  Not all PMML models support retrieving the probability of output states like the models created using Predixion Insight.  If a PMML model supports this functionality it will be listed in the Output section in the Description pane.  Retrieving the probability of the output states is required to use Profit Charts and Accuracy charts on PMML models.
The next step is to specify the business parameters of the profit chart – that is, the real factors that are important to how the model is expected to perform.  In this case, you need to choose how much the campaign is going to cost and how much you expect to receive from successful contacts.  Since there is a variable cost associated with contacting leads, you will need to click on the “…” button next to the individual cost field.  Specify your costs according to these screen shots and click next.
Clicking Next on the dialog will bring you to the Select Input Data page.  The R Demo Excel file has some prepared test data that you can use.  You can also use data from PowerPivot, or test data that is automatically separated in Predixion datasets.  If you were already looking at the test data when you launched the wizard, the table would already be selected – otherwise select the test data and click next to advance.
On the final page of the wizard, you need to specify the relationships between the model columns and the columns in your Excel spreadsheet.  Predixion Insight will automatically bind the columns based on the column names if possible.  At this point you can click the Finish button and be done with it, however, if you want to be able to easily rerun the profit chart with different parameters or even automate the process, you can click on the little arrow on the Finish button and select “Execute and Create a Visual Macro.”  This will not only start the Profit Chart, but also create a worksheet with a Profit Chart Visual Macro – that is, a worksheet with the executable description of the job you just performed.  This Visual Macro can be altered and re-run however you choose, and even executed from VBA or automated via SQL Server Integration Services
When the progress dialog comes up – it’s a good idea to dismiss it by clicking on the “Minimize to Task Pane” button which allows the operation to execute in the background.  When complete you can click on the “Results” button in the job chicklet which renders the Profit Chart into your workbook.  You can also click on the collaboration icon to send the job results directly to your friends and colleagues or publish to shared collections.
The actual Profit Chart itself shows us the maximum profit we can expect by using the R model to optimize our campaign.  Additionally it tells us the probability threshold we need to consider when selecting which leads to contact.  Specifically, in this case, we should only contact leads that have a probability of conversion, as dictated by the model, of 34.4% or greater.
Now that you know how to evaluate your R model – or more appropriately – how to evaluate the predictions from your R model, it’s time for the climax of this series – applying that R model to PowerPivot data.  Actually, it’s almost an anticlimax because it’s so simple, but let’s walk through the process and I’ll point out some interesting bits along the way.  To get started, click on the Query button on the Insight Analytics ribbon.
As before you select the model you want to query (the R model!) and then select the input data.  This time, you need to select that the Input Type is “Power Pivot Data.”  If you haven’t already, Predixion Insight will launch PowerPivot and then connect to PowerPivot to collect the list of tables.  One option you have with PowerPivot data that you don’t have with Excel data is that you can specify an arbitrary data filter.  This allows you to, for instance, only score the leads that were received within the last month or so rather than the entire dataset – very handy for continual deployment of your models!
After selecting your data, you have to map the model columns to the data columns as you did with the Profit Chart and then you get to the results page.  In the results page you choose what information you need from the model.  Since we are looking for leads with a probability to convert of at least 34.4%, all we need to do is to check the Probability of Yes checkbox.  Since we want to integrate the results directly into PowerPivot, we don’t need to check any of the boxes to copy source data.  You generally use that option when you need to move the results to a different destination.
Finally, you tell Predixion Insight what to do with the result.  In this case we are going to add the results from our R model back into PowerPivot.  Specify a table name and check the box with appends the result to the original table.  Predixion Insight automatically detects the best identifier column, so you don’t have to actually specify anything there.
Again, you can finish (even choose to create a Visual Macro) and either wait for the job complete or dismiss it and keep working while the job executes in the background.  Once the job finishes it will automatically update PowerPivot, or if you dismissed it, you click on the results to populate the PowerPivot table.  Like the ProfitChart result, you can share the results with colleagues by clicking on the collaboration button – in which case, it would probably be a good idea to add some of the additional source columns for context!  Note that you can also change the name of the job chicklet by selecting and typing over the default name.  This can be useful when sharing or if you want to revisit past job submissions.
Finally after you’ve fetched results you can see the results in your PowerPivot table.  In this screenshot you see that Predixion Insight created a new PowerPivot table called “R Query Results” and added a new calculated column to the source table that references the predicted result for each row.
Now you can analyze your PowerPivot data the typical way (PivotTables published to SharePoint!) integrating predictive results from R.  My first step would be to add a calculated column indicating which leads crossed the threshold indicated by the ProfitChart, but the possibilities are limitless!

Friday, May 20, 2011

Scoring R models against PowerPivot Data–Part 2–Creating an R classification model

This is part 2 of a three part series about integrating R and PowerPivot.

  • Part 1 – Intro (includes links to necessary components)
  • Part 2 – Creating a classification model in R
  • Part 3 – Evaluating and Scoring R models in Predixion

In this installment, I’m going to walk through the process of creating a predictive model in R, exporting the model as PMML and importing the model into Predixion Insight.  The demo requires packages from Togaware to create models and export them– you can go to the Togaware website and follow their instructions, or you can just launch the R console and enter:

> install.packages("RGtk2") 
> install.packages("rattle", dependencies=TRUE)

Now, you may be prompted to install additional packages and you may have to restart R after the library installation is complete – it’s probably a safe bet in any case.

After you’ve installed R and the correct libraries and are in the correct state of R preparedness, you can load the necessary libraries into R using these commands:

> library(rpart)
> library(pmml)

The next step is to load the training dataset into R.  As preparation for this article, I created a model (the same model as in our demo video) and used the technique in my previous post (for those who already read that, this was the point) to export the training set and saved the result as a .csv file.  I’ve put the training data file up on my SkyDrive if you just want to grab it.

This line reads the.csv file and puts the result into an object called “insurance”.  Note that R, due to it’s UNIX origins, uses forward slashes instead of backslashes as directory separators – also it’s case sensitive!

> insurance <- read.csv("Data/Insurance Demo/insurance.train.csv")

Next, I need to create a predictive model.  The rpart command creates a tree model on a dataset given a target column.  There are many additional parameters controlling how the tree is created, but for the purposes of this blog post, I’m just providing the shortest possible command.  This line creates a tree model and places it into an object called “TreeModel” that we can use later

> TreeModel <- rpart(Converted~.,insurance)

The pmml command converts the tree model to PMML, assigning the result to “TreePmml”.

> TreePmml<-pmml(TreeModel)

Finally, use write to output the pmml into a file – note again the forward slashes in the file path!

> write(toString(TreePmml),"Data/Insurance Demo/RModel.pmml")

This creates the file “RModel.pmml” that, if you’re curious, you can open up and examine in notepad or any other text editor.  Inspecting the text you’ll see that PMML is simply an XML representation of the patterns that were learned by the R algorithm – if you search for the word “node” you will find the structure of the tree itself.


Now that I have a model in PMML, I can launch Excel with Predixion Insight for Excel installed, switch to the Insight Analytics ribbon and click the My Stuff button.  In the My Stuff dialog, I need to click “Import PMML” to load the R model.  Note that PMML support is available in the Predixion Insight Free Trial and the Predixion Insight with PMML products.



After I click Import PMML and select and import my R model, it appears in the list of datasets and models under a special PMML node in all dialogs where you select a model, such as the My Stuff dialog or the accuracy chart and query wizards.  For most purposes PMML models are treated the same as native Predixion models in that you can import and export them, share them with other users, and publish them to shared collections.


In the next installment, I will show you how to use the R model in Predixion to generate accuracy charts, and, as promised, score data in PowerPivot!

Wednesday, May 18, 2011

Scoring R models against PowerPivot Data - Part 1 of 3–Introduction

This series is all about combining two completely different technologies that target two completely different audiences.   However, it turns out that those two audiences often have to serve the same master in looking for information and insight within their data.  And typically those two audiences don’t or can’t talk to each other, which makes driving common goals a bit difficult.

PowerPivot is a tool for preparing and creating data applications from a business analyst’s perspective.  You can bring data in from multiple sources, match them together, derive new information, create some calculations, and present the results in an attractive and meaningful business context.  You can download and find more information about PowerPivot at  Go ahead – it’s free!

R is a statistical language for preparing and creating data applications from a data scientist’s perspective.  You can import data, apply statistical tests and analysis, derive new information, and visualize the results in a scientific and statistical context.  You can download and find more information about R from  Go ahead – it’s free!

So the question remains, how can we make these two worlds collide in a meaningful way that takes the science performed in R and applies it to the business context of PowerPivot.  You knew I was going to say this, but Predixion Insight is the way!  Predixion Insight and Predixion Insight for Excel allows you to take predictive models created in R and apply them to data in PowerPivot (and Excel actually), thereby taking the scientific abstracts and providing them concrete business context.

In this three part series, I’m going to walk through the complete process of creating a model in R and then applying it to data in PowerPivot by means of Predixion Insight.  The four parts will cover the following topics:

  • Part 1 – Intro (you’re reading this now)
  • Part 2 – Creating a classification model in R
  • Part 3 – Evaluating and Scoring R models in Predixion

Friday, May 13, 2011

Querying the Training and Testing Sets

This article (as you could rather clearly surmise from the title) is about querying the training and testing sets of a Predixion Insight model.  This is a very useful technique and is handy in a variety of situations, but that is not the point of this article.  The point of this article will be revealed in a future posting, so for now you can sit back and learn how to retrieve training and testing sets.

Whenever you upload a rectangular dataset using Predixion Insight you have the option to automatically separate your data into training and testing sets.  The page of the wizard where you specify this information looks like this:


The parameters allow you to control how much data is set aside for testing up to a maximum.  Later this held out data is used to automatically create accuracy charts, profit charts and other fun stuff.

Once your data is uploaded to Predixion Insight, you can create as many models as you wish on the dataset with ease using any of the modeling tools or even visual macros for automation.  If, for any reason, you want to fetch the data from the Predixion Insight server back into your Excel workbook or PowerPivot table, you can use the Query tool on the Insight Analytics ribbon.


In the Query Wizard you select a dataset to query rather than a model.  Querying a model generally means fetching predictions or model patterns.  Querying a dataset simply means fetching the data that is cached in that dataset.


On the next page of the wizard you select which columns you want to fetch from the dataset.  You can click the checkbox at the top to retrieve all of them.  More importantly you can select a filter to limit which rows are returned.  This is where the magic comes in – there is a “special” filter clause that you can select called “Is Test Cases”.  Setting this filter clause to “true” causes only the test cases to be returned, whereas setting the clause to “false” returns only training cases.


And that’s it.  Although you now may have a new trick in handling your Predixion Insight datasets, you will have to wait until a future installment to learn why this detail is so important.  Until then – happy mining!

Wednesday, April 20, 2011

Profiling all PowerPivot tables at once using Predixion Insight VBA

In a previous post, I showed how to use Predixion Insight For Excel’s Data Profile feature to get descriptive statistics on PowerPivot tables – in that case using the great Practice Fusion data from the Windows Azure Data Marketplace.  In practice, however, this feature is so useful that I always find myself profiling every table every time I load up a dataset in PowerPivot, so it’s a bit cumbersome to go through the user interface (regardless of how attractive it really is) each time.
One thing that you may not be aware of is that Predixion Insight for Excel has a VBA API that allows you to not only interact with Predixion Insight objects, but with PowerPivot objects as well.  Using this API, I can easily create a simple VBA macro that executes a Data Profile for each attached PowerPivot table.
The first thing you need to do is to enable the developer ribbon in Excel since it’s hidden by default.  You can do this by going to File/Options and selecting “Customize the Ribbon” and checking the “Developer” option in the “Main Tabs” tree control.  Then switch to the developer ribbon and click the “Visual Basic” icon, which, in a throwback to turn of the century user interfaces, launches the VBA editor.
The VBA editor did not get the facelift the rest of Office 2007/10 did…
Once you’re in the VBA editor – you need to add a reference to the Predixion Insight VBA library.  Select Tools/References from the VBA editor and choose the Predixion Insight VBA API for Microsoft Excel.  (Hint – when the dialog appears, press “Q” and then scroll up – you will find it much quicker!)
To have a place to write some code, simply select the Insert menu and choose “Module”, which will create an editor that you can write code.  The code is very straightforward, you create your PowerPivot and Data Exploration objects from the Predixion VBA library, connect to PowerPivot to get the list of tables, and then get a data profile for each table that is returned.  You can copy the code below into the VBA editor as is.
Sub ProfileAllPowerPivotTables(AdvancedProfile As Boolean)
Dim PowerPivot As New PredixionVBA.PowerPivot
Dim DataExploration As New PredixionVBA.DataExploration

TableList = PowerPivot.GetPowerPivotTables()
For Each Table In TableList
Application.StatusBar = "Profiling PowerPivot Table: " + _
DataExploration.GenerateDataProfileForPowerPivot Table, "", _
AdvancedProfile, ""
Application.StatusBar = ""
End Sub

Sub ProfileAllPowerPivotTablesAdvanced()
ProfileAllPowerPivotTables (True)
End Sub

Sub ProfileAllPowerPivotTablesBasic() 
ProfileAllPowerPivotTables (False) 
End Sub 

As I stated, the code is fairly straightforward.  The other parameters for GenerateDataProfileForPowerPivot are an optional table filter and an optional list of columns, which you probably don’t need to get a quick profile of all the tables in a first shot.  Running either of the top level functions on the Predixion Sample Data workbook by clicking on the green “play” button generates all the profiles in about a minute with an index to boot.


Monday, April 18, 2011

Transactional Sampling with PowerPivot and Predixion Insight

Predixion Insight For Excel provides some excellent sampling tools that can be applied to data in Excel worksheets or in an embedded PowerPivot database.  If you believe that there are any analysis opportunities outside of pivot tables, sampling PowerPivot becomes very interesting.  Since very few tools outside of Predixion Insight can read PowerPivot data directly and you simply cannot move over 1 million rows from PowerPivot into an Excel worksheet, you are pretty much stuck if you want to perform varied analyses on that PowerPivot data.  By sampling PowerPivot data you can leverage any techniques that you know and love in native Excel for analysis as well as using any other analytical addins such as the Microsoft Solver Foundation.
Predixion Insight offers three kinds of sampling – percentage, row count, and balanced.  Percentage sampling allows you to specify a specific percentage and will produce a random sample with approximately that percentage of data rows from the source table.  Row count will produce a sample containing exactly the number of rows you request (assuming there are that many in the source table to begin with).  Balanced sampling is a special case since the purpose isn’t to reduce data volume, rather it is to accentuate a typically rare feature of the data for better pattern discovery – e.g. increase the balance of rows representing fraudulent vs non fraudulent cases.  Independent of the sampling type, the resulting sampled set can be created as a table in an Excel worksheet or in PowerPivot (ProTip: in a pinch, using Predixion Insight sampling is a great trick to move data back and forth between the two).
However, when you have transactional data, you really need to think about sampling a little differently.  In transactional data, information about a single transaction is contained across multiple rows.  Sampling individual rows would actually break up a transaction and change the nature of the data itself.  For example, a 20% sample of shopping cart data essential removes 8 items out of a 10 item basket – you would be very unlikely to find the mythical beer and diapers correlation that way!  Instead you need a way to sample that would take an entire transaction rather than individual rows.
In order to accomplish this, you need Excel and PowerPivot (obviously) as well as Predixion Insight for Excel and the source database from which you pull your transactional data. 
I’ll go through the entire sequence explicitly, but here are the high level steps:
  • Load transactional data into PowerPivot
  • Load distinct transactional id’s into PowerPivot
  • Create a sample of the transactional keys
  • Add a relationship to PowerPivot between the full data table and the sample
  • Add a calculated column indicating rows that are selected
  • Use that column as a table filter
In this example, I’m using some data from Last.FM regarding music artists and user submitted tags that I pulled from here.
You can see that the data is “transactional” with the “Artist ID” being the transaction ID with the “Tag” being the items in the transactions.
First get the data into PowerPivot by opening the PowerPivot window from Excel and selecting “From Database/From SQL Server” from the ribbon.  (If your data is in a different database, find it under “from other sources”)
This launches the table import wizard where you indicate the location and database containing the data to load, and then select the transactional table.
Next go back to the Table Import Wizard, but this time, select the “Write a query” option
The query is a simple “SELECT DISTINCT” of the transaction ID’s.  Note that the “Friendly Query Name” will end up as the table name in PowerPivot so it’s a good idea to set it here.
Then you go back to Excel, switch to the Insight Analytics ribbon and choose the “Sample Data” tool
Select “PowerPivot Data” as the input type and choose the result of the select distinct query as the table.
Choose the sampling method – in this case you will want to choose Random Sampling to reduce data volumes.
Then choose how many transactions you want in the output.  This is entirely up to you at this point.  For this example, I chose 10,000 transactions, but really it’s whatever you choose to be right for your scenario.
Finally, specify where you want the results – you need to add the results as a PowerPivot table, but you can change the name.  There’s no point in creating an unselected list as you will see later in the example.
When the sampling is complete you can open PowerPivot, switch to the sampled data page and see how many transactions you selected.  Since I chose 10,000 for this example, there are exactly 10,000 transaction ID’s in the result!
Now you need to create a relationship between the Selected Data table and the source table – in my example, that’s the LastFM table.  First, in the PowerPivot window, switch to the Design ribbon and click “Create Relationship.”
In the Create Relationship dialog, relate the transaction ID from the Selected Data table to the transaction ID of the full transaction table.
Next you need to add a calculated column to the full transaction table indicating for each row if that row has been selected or not.  There are many DAX expressions that will work to do this, I chose the straightforward “not blank related” approach. 
And that’s it!  You now have a sampled a transactional table – or at least labeled rows in a transactional table in such a way that you can easily select the sampled transactions.  In Predixion Insight for Excel, you can always filter a PowerPivot data source for any operation using the filter control on the data selection page.  Also, this is why you don’t need to keep the “unselected” samples since you can easily choose either the “selected” items or the “unselected” items in any filter.
Let me finish this off with one more ProTip.  It’s unlikely that you will want these extra tables – i.e. the distinct list and sampled list of transaction ids – to show up in your pivot table analyses.  While you can’t explicitly hide tables in PowerPivot, if you hide all the columns in a PowerPivot table it has the same effect.  So to effectively hide a table from pivot charts, go to the Design ribbon in PowerPivot and use the Hide and Unhide tool to hide all of the columns in the extra tables.

Thursday, March 3, 2011

Predixion Software on BITV

Bruno Aziza’s Business Intelligence TV spot on Predixion Software just aired on his blog, and despite the video thumbnail of me looking like I just sucked on a lemon, it came out pretty well.  During the video I discuss the application I built using the Practice Fusion Data from the Azure Marketplace.

The video below demonstrates the application and shows how I create the application starting with an almost empty workbook (there’s some graphics in there and a wee bit of VBA code). 

In case you didn’t know already, we have more videos on Predixion Insight at our YouTube channel that provide more information about Predixion and tutorials on how to use the product.
I’ll continue this series with some more discussion of the dataset as well as how to create the model and VBA code used in the video.

Wednesday, February 16, 2011

Prescription Finder Demo

This really isn’t a blog post – it’s just a post so you can find the related materials for the Prescription Finder demo that will appear in video links any time now.
Without a lot of additional explanations, just in case you’ve stumbled upon this accidentally, this is what the demo looks like.
You can access the demo files from the SkyDrive folder below.  To run the entire demo, you will need to import the included model into your Predixion Insight account (get a free trial if you don’t have one!)


Tuesday, February 8, 2011

Data Preparation Webinar Slide Deck

Here’s the deck that I presented at today’s webinar complete with the DAX formulas I added to the PracticeFusion PowerPivot workbook.  You can get the data from the Azure Marketplace or download one of my preloaded copies.
Included in the deck is the coupon code for 50% off the Predixion Insight Cloud service for 3 months – expires 3/15/2011!

Drop me a line or post a comment if you have any questions!  (I’ll update this post when the video is posted)
EDIT:  Webinar replay is available here

Sunday, February 6, 2011

The Full Data Profile - Azure Marketplace Data with PowerPivot and Predixion Insight Part 3

In the first post of the series I gave a screenshot of a basic Profile Data report from Predixion Insight, leaving creating the remaining reports as “an exercise for the reader.”  Of course, if you don’t currently have Excel 2010 installed on your system, you may worried that you wouldn’t be able to complete the assignment and might get a bad grade on the pop quiz that’s coming up after the next few posts.  (Just kidding).
Since I got permission to go ahead and post the data and results for my last post, I put the work-in-progress files up on SkyDrive.  To make it easier to see what Profile Data in Predixion Insight really does I embedded the results here for you to explore interactively.    If you want to feel like Gulliver in Lilliput poking through the castle windows you can click around in this interface, otherwise you can click on the “View full size workbook”image button for better access.

This workbook was created by downloading the Practice Fusion data as described here and running the Profile Data tool from Predixion Insight on each table in PowerPivot.  I ran Profile Data twice on each table so you can see the difference imagebetween the “Basic” and “Advanced” profile reports.  Then I just edited the names on  the worksheet tabs and the report titles.  The Insight Index is automatically created and updated to create a nice guide to the workbook – it also updates automatically as you change worksheet names and report titles, so it’s convenient as well.  If you delete rows in your index, or delete the whole page, or somehow do something irreversible to the worksheet (it is in an Excel worksheet, after all), you can regenerate the Insight Index from the Options dialog available under the Predixion menu button.
One more thing I had to do to host the workbook interactively in SkyDrive is to drop all the PowerPivot tables after creating the reports.  The Excel file was between 9 and 10 MB which is too large for viewing in the browser.

Advanced Data Profile

I explained a bit about the basic data profile at the start of this series, so I’ll just explain briefly the difference.  Essentially the difference is that you just get more stats with “Advanced” than you do with “Basic”, plus the report takes slightly longer to execute as it has more to compute.
For numerical columns, Advanced adds measures such as Kurtosis and Skewness which measure the “normality” of the data – that is – how well the data conforms to a bell curve, an approximate median, the mode, and the confidence interval.  For those who aren’t familiar with these terms, they are defined with links to appropriate Wikipedia articles in the column headers.
The Advanced Profile Data report also contains correlation and covariance matrices for numerical columns.  Looking at the correlation matrix for the SyncChart table, for example, shows a clear correlation between Weight and BMI, and a negative correlation between Height and BMI.
For categorical columns, the Advanced report adds the top and bottom 3 states present in the column, as well as a number representing the top 80% of the states in the data.  This number tells you how many values of the column cover 80% of the data.  Using the SyncChart example again, you can see that although there are 39 different physician specialties represented in the data, 80% of the visits are to only 3 types of physicians.
A good use of the top and bottom states is to let you know if the data is particularly useful.  For instance the PhysicianSpecialty row above shows that the bottom three states appear only once in the data.  This indicates that, for Predictive Analytics at least, you will not find any patterns regarding Naturopathy-Acupuncture, Dermatology, or Physical Medicine & Rehabilitation using this data.  A good practice to deal with data that shows up with way is to use the Clean Data/Outliers tool to rename such states to “Other”.  (Personally, with this data, I would also use Clean Data/Relabel to change blank states in this columns to “Unknown”).

How do get this stuff

Just to recap for those who haven’t read the other posts, the you can get the tools used to create these reports from the following locations: