Showing posts with label Data Exploration. Show all posts
Showing posts with label Data Exploration. Show all posts

Wednesday, April 20, 2011

Profiling all PowerPivot tables at once using Predixion Insight VBA

In a previous post, I showed how to use Predixion Insight For Excel’s Data Profile feature to get descriptive statistics on PowerPivot tables – in that case using the great Practice Fusion data from the Windows Azure Data Marketplace.  In practice, however, this feature is so useful that I always find myself profiling every table every time I load up a dataset in PowerPivot, so it’s a bit cumbersome to go through the user interface (regardless of how attractive it really is) each time.
One thing that you may not be aware of is that Predixion Insight for Excel has a VBA API that allows you to not only interact with Predixion Insight objects, but with PowerPivot objects as well.  Using this API, I can easily create a simple VBA macro that executes a Data Profile for each attached PowerPivot table.
The first thing you need to do is to enable the developer ribbon in Excel since it’s hidden by default.  You can do this by going to File/Options and selecting “Customize the Ribbon” and checking the “Developer” option in the “Main Tabs” tree control.  Then switch to the developer ribbon and click the “Visual Basic” icon, which, in a throwback to turn of the century user interfaces, launches the VBA editor.
image
The VBA editor did not get the facelift the rest of Office 2007/10 did…
Once you’re in the VBA editor – you need to add a reference to the Predixion Insight VBA library.  Select Tools/References from the VBA editor and choose the Predixion Insight VBA API for Microsoft Excel.  (Hint – when the dialog appears, press “Q” and then scroll up – you will find it much quicker!)
image
To have a place to write some code, simply select the Insert menu and choose “Module”, which will create an editor that you can write code.  The code is very straightforward, you create your PowerPivot and Data Exploration objects from the Predixion VBA library, connect to PowerPivot to get the list of tables, and then get a data profile for each table that is returned.  You can copy the code below into the VBA editor as is.
Sub ProfileAllPowerPivotTables(AdvancedProfile As Boolean)
Dim PowerPivot As New PredixionVBA.PowerPivot
Dim DataExploration As New PredixionVBA.DataExploration

PowerPivot.ConnectToPowerPivot
TableList = PowerPivot.GetPowerPivotTables()
For Each Table In TableList
Application.StatusBar = "Profiling PowerPivot Table: " + _
Table
DataExploration.GenerateDataProfileForPowerPivot Table, "", _
AdvancedProfile, ""
Next
Application.StatusBar = ""
End Sub


Sub ProfileAllPowerPivotTablesAdvanced()
ProfileAllPowerPivotTables (True)
End Sub


Sub ProfileAllPowerPivotTablesBasic() 
ProfileAllPowerPivotTables (False) 
End Sub 






As I stated, the code is fairly straightforward.  The other parameters for GenerateDataProfileForPowerPivot are an optional table filter and an optional list of columns, which you probably don’t need to get a quick profile of all the tables in a first shot.  Running either of the top level functions on the Predixion Sample Data workbook by clicking on the green “play” button generates all the profiles in about a minute with an index to boot.





image

Sunday, February 6, 2011

The Full Data Profile - Azure Marketplace Data with PowerPivot and Predixion Insight Part 3

image
In the first post of the series I gave a screenshot of a basic Profile Data report from Predixion Insight, leaving creating the remaining reports as “an exercise for the reader.”  Of course, if you don’t currently have Excel 2010 installed on your system, you may worried that you wouldn’t be able to complete the assignment and might get a bad grade on the pop quiz that’s coming up after the next few posts.  (Just kidding).
Since I got permission to go ahead and post the data and results for my last post, I put the work-in-progress files up on SkyDrive.  To make it easier to see what Profile Data in Predixion Insight really does I embedded the results here for you to explore interactively.    If you want to feel like Gulliver in Lilliput poking through the castle windows you can click around in this interface, otherwise you can click on the “View full size workbook”image button for better access.

This workbook was created by downloading the Practice Fusion data as described here and running the Profile Data tool from Predixion Insight on each table in PowerPivot.  I ran Profile Data twice on each table so you can see the difference imagebetween the “Basic” and “Advanced” profile reports.  Then I just edited the names on  the worksheet tabs and the report titles.  The Insight Index is automatically created and updated to create a nice guide to the workbook – it also updates automatically as you change worksheet names and report titles, so it’s convenient as well.  If you delete rows in your index, or delete the whole page, or somehow do something irreversible to the worksheet (it is in an Excel worksheet, after all), you can regenerate the Insight Index from the Options dialog available under the Predixion menu button.
One more thing I had to do to host the workbook interactively in SkyDrive is to drop all the PowerPivot tables after creating the reports.  The Excel file was between 9 and 10 MB which is too large for viewing in the browser.

Advanced Data Profile

I explained a bit about the basic data profile at the start of this series, so I’ll just explain briefly the difference.  Essentially the difference is that you just get more stats with “Advanced” than you do with “Basic”, plus the report takes slightly longer to execute as it has more to compute.
For numerical columns, Advanced adds measures such as Kurtosis and Skewness which measure the “normality” of the data – that is – how well the data conforms to a bell curve, an approximate median, the mode, and the confidence interval.  For those who aren’t familiar with these terms, they are defined with links to appropriate Wikipedia articles in the column headers.
The Advanced Profile Data report also contains correlation and covariance matrices for numerical columns.  Looking at the correlation matrix for the SyncChart table, for example, shows a clear correlation between Weight and BMI, and a negative correlation between Height and BMI.
image
For categorical columns, the Advanced report adds the top and bottom 3 states present in the column, as well as a number representing the top 80% of the states in the data.  This number tells you how many values of the column cover 80% of the data.  Using the SyncChart example again, you can see that although there are 39 different physician specialties represented in the data, 80% of the visits are to only 3 types of physicians.
imageimage
A good use of the top and bottom states is to let you know if the data is particularly useful.  For instance the PhysicianSpecialty row above shows that the bottom three states appear only once in the data.  This indicates that, for Predictive Analytics at least, you will not find any patterns regarding Naturopathy-Acupuncture, Dermatology, or Physical Medicine & Rehabilitation using this data.  A good practice to deal with data that shows up with way is to use the Clean Data/Outliers tool to rename such states to “Other”.  (Personally, with this data, I would also use Clean Data/Relabel to change blank states in this columns to “Unknown”).

How do get this stuff

Just to recap for those who haven’t read the other posts, the you can get the tools used to create these reports from the following locations: