In the first post of the series I gave a screenshot of a basic Profile Data report from Predixion Insight, leaving creating the remaining reports as “an exercise for the reader.” Of course, if you don’t currently have Excel 2010 installed on your system, you may worried that you wouldn’t be able to complete the assignment and might get a bad grade on the pop quiz that’s coming up after the next few posts. (Just kidding).
Since I got permission to go ahead and post the data and results for my last post, I put the work-in-progress files up on SkyDrive. To make it easier to see what Profile Data in Predixion Insight really does I embedded the results here for you to explore interactively. If you want to feel like Gulliver in Lilliput poking through the castle windows you can click around in this interface, otherwise you can click on the “View full size workbook” button for better access.
This workbook was created by downloading the Practice Fusion data as described here and running the Profile Data tool from Predixion Insight on each table in PowerPivot. I ran Profile Data twice on each table so you can see the difference between the “Basic” and “Advanced” profile reports. Then I just edited the names on the worksheet tabs and the report titles. The Insight Index is automatically created and updated to create a nice guide to the workbook – it also updates automatically as you change worksheet names and report titles, so it’s convenient as well. If you delete rows in your index, or delete the whole page, or somehow do something irreversible to the worksheet (it is in an Excel worksheet, after all), you can regenerate the Insight Index from the Options dialog available under the Predixion menu button.
One more thing I had to do to host the workbook interactively in SkyDrive is to drop all the PowerPivot tables after creating the reports. The Excel file was between 9 and 10 MB which is too large for viewing in the browser.
Advanced Data ProfileI explained a bit about the basic data profile at the start of this series, so I’ll just explain briefly the difference. Essentially the difference is that you just get more stats with “Advanced” than you do with “Basic”, plus the report takes slightly longer to execute as it has more to compute.
For numerical columns, Advanced adds measures such as Kurtosis and Skewness which measure the “normality” of the data – that is – how well the data conforms to a bell curve, an approximate median, the mode, and the confidence interval. For those who aren’t familiar with these terms, they are defined with links to appropriate Wikipedia articles in the column headers.
The Advanced Profile Data report also contains correlation and covariance matrices for numerical columns. Looking at the correlation matrix for the SyncChart table, for example, shows a clear correlation between Weight and BMI, and a negative correlation between Height and BMI.
For categorical columns, the Advanced report adds the top and bottom 3 states present in the column, as well as a number representing the top 80% of the states in the data. This number tells you how many values of the column cover 80% of the data. Using the SyncChart example again, you can see that although there are 39 different physician specialties represented in the data, 80% of the visits are to only 3 types of physicians.
A good use of the top and bottom states is to let you know if the data is particularly useful. For instance the PhysicianSpecialty row above shows that the bottom three states appear only once in the data. This indicates that, for Predictive Analytics at least, you will not find any patterns regarding Naturopathy-Acupuncture, Dermatology, or Physical Medicine & Rehabilitation using this data. A good practice to deal with data that shows up with way is to use the Clean Data/Outliers tool to rename such states to “Other”. (Personally, with this data, I would also use Clean Data/Relabel to change blank states in this columns to “Unknown”).