Jamie's Junk: PowerPivot

Showing posts with label PowerPivot. Show all posts

Tuesday, August 16, 2011

Scoring R models against PowerPivot Data–Part 3–Evaluating and Scoring

This is part 2 of a three part series about integrating R and PowerPivot.

Part 1 – Intro (includes links to necessary components)
Part 2 – Creating a classification model in R
Part 3 – Evaluating and Scoring R models in Predixion

In my previous post, I walked through the creation of a classification model in R and imported the model into Predixion Insight. The necessary files for this part are on my SkyDrive – in order to follow through with the steps in this phase you will need to import the PMML file into Predixion Insight if you don’t already have it. (Go to Insight Analytics/My Stuff – click “Import”). Also, open the Excel file “R Demo.xslx”.
In this installment I will show you how to create a profit chart using the R model against data in Excel, and then score the model against data in PowerPivot.
A profit chart allows you to determine how much profit (or loss) you will achieve by using a predictive model to select which cases – e.g. leads, customers, etc – to target. The profit chart gives you information on how to apply the model to data in order to maximize your profit based on your cost metrics.
To launch the Profit Chart wizard, click on the Profit Chart button in the “Test” section of the Insight Analytics ribbon.

Once the Profit Chart wizard appears (you will be prompted to log in to Predixion Insight if you haven’t already) skip past the intro screen and select the model you imported from R – it will be under the PMML Models node which appears after all of the Predixion datasets

Note that when you select your PMML model you will see information about the model in the description pane. Not all PMML models support retrieving the probability of output states like the models created using Predixion Insight. If a PMML model supports this functionality it will be listed in the Output section in the Description pane. Retrieving the probability of the output states is required to use Profit Charts and Accuracy charts on PMML models.

The next step is to specify the business parameters of the profit chart – that is, the real factors that are important to how the model is expected to perform. In this case, you need to choose how much the campaign is going to cost and how much you expect to receive from successful contacts. Since there is a variable cost associated with contacting leads, you will need to click on the “…” button next to the individual cost field. Specify your costs according to these screen shots and click next.

Clicking Next on the dialog will bring you to the Select Input Data page. The R Demo Excel file has some prepared test data that you can use. You can also use data from PowerPivot, or test data that is automatically separated in Predixion datasets. If you were already looking at the test data when you launched the wizard, the table would already be selected – otherwise select the test data and click next to advance.

On the final page of the wizard, you need to specify the relationships between the model columns and the columns in your Excel spreadsheet. Predixion Insight will automatically bind the columns based on the column names if possible. At this point you can click the Finish button and be done with it, however, if you want to be able to easily rerun the profit chart with different parameters or even automate the process, you can click on the little arrow on the Finish button and select “Execute and Create a Visual Macro.” This will not only start the Profit Chart, but also create a worksheet with a Profit Chart Visual Macro – that is, a worksheet with the executable description of the job you just performed. This Visual Macro can be altered and re-run however you choose, and even executed from VBA or automated via SQL Server Integration Services

When the progress dialog comes up – it’s a good idea to dismiss it by clicking on the “Minimize to Task Pane” button which allows the operation to execute in the background. When complete you can click on the “Results” button in the job chicklet which renders the Profit Chart into your workbook. You can also click on the collaboration icon to send the job results directly to your friends and colleagues or publish to shared collections.

The actual Profit Chart itself shows us the maximum profit we can expect by using the R model to optimize our campaign. Additionally it tells us the probability threshold we need to consider when selecting which leads to contact. Specifically, in this case, we should only contact leads that have a probability of conversion, as dictated by the model, of 34.4% or greater.

Now that you know how to evaluate your R model – or more appropriately – how to evaluate the predictions from your R model, it’s time for the climax of this series – applying that R model to PowerPivot data. Actually, it’s almost an anticlimax because it’s so simple, but let’s walk through the process and I’ll point out some interesting bits along the way. To get started, click on the Query button on the Insight Analytics ribbon.

As before you select the model you want to query (the R model!) and then select the input data. This time, you need to select that the Input Type is “Power Pivot Data.” If you haven’t already, Predixion Insight will launch PowerPivot and then connect to PowerPivot to collect the list of tables. One option you have with PowerPivot data that you don’t have with Excel data is that you can specify an arbitrary data filter. This allows you to, for instance, only score the leads that were received within the last month or so rather than the entire dataset – very handy for continual deployment of your models!

After selecting your data, you have to map the model columns to the data columns as you did with the Profit Chart and then you get to the results page. In the results page you choose what information you need from the model. Since we are looking for leads with a probability to convert of at least 34.4%, all we need to do is to check the Probability of Yes checkbox. Since we want to integrate the results directly into PowerPivot, we don’t need to check any of the boxes to copy source data. You generally use that option when you need to move the results to a different destination.

Finally, you tell Predixion Insight what to do with the result. In this case we are going to add the results from our R model back into PowerPivot. Specify a table name and check the box with appends the result to the original table. Predixion Insight automatically detects the best identifier column, so you don’t have to actually specify anything there.

Again, you can finish (even choose to create a Visual Macro) and either wait for the job complete or dismiss it and keep working while the job executes in the background. Once the job finishes it will automatically update PowerPivot, or if you dismissed it, you click on the results to populate the PowerPivot table. Like the ProfitChart result, you can share the results with colleagues by clicking on the collaboration button – in which case, it would probably be a good idea to add some of the additional source columns for context! Note that you can also change the name of the job chicklet by selecting and typing over the default name. This can be useful when sharing or if you want to revisit past job submissions.

Finally after you’ve fetched results you can see the results in your PowerPivot table. In this screenshot you see that Predixion Insight created a new PowerPivot table called “R Query Results” and added a new calculated column to the source table that references the predicted result for each row.

Now you can analyze your PowerPivot data the typical way (PivotTables published to SharePoint!) integrating predictive results from R. My first step would be to add a calculated column indicating which leads crossed the threshold indicated by the ProfitChart, but the possibilities are limitless!

Monday, April 18, 2011

Transactional Sampling with PowerPivot and Predixion Insight

Predixion Insight For Excel provides some excellent sampling tools that can be applied to data in Excel worksheets or in an embedded PowerPivot database. If you believe that there are any analysis opportunities outside of pivot tables, sampling PowerPivot becomes very interesting. Since very few tools outside of Predixion Insight can read PowerPivot data directly and you simply cannot move over 1 million rows from PowerPivot into an Excel worksheet, you are pretty much stuck if you want to perform varied analyses on that PowerPivot data. By sampling PowerPivot data you can leverage any techniques that you know and love in native Excel for analysis as well as using any other analytical addins such as the Microsoft Solver Foundation.
Predixion Insight offers three kinds of sampling – percentage, row count, and balanced. Percentage sampling allows you to specify a specific percentage and will produce a random sample with approximately that percentage of data rows from the source table. Row count will produce a sample containing exactly the number of rows you request (assuming there are that many in the source table to begin with). Balanced sampling is a special case since the purpose isn’t to reduce data volume, rather it is to accentuate a typically rare feature of the data for better pattern discovery – e.g. increase the balance of rows representing fraudulent vs non fraudulent cases. Independent of the sampling type, the resulting sampled set can be created as a table in an Excel worksheet or in PowerPivot (ProTip: in a pinch, using Predixion Insight sampling is a great trick to move data back and forth between the two).
However, when you have transactional data, you really need to think about sampling a little differently. In transactional data, information about a single transaction is contained across multiple rows. Sampling individual rows would actually break up a transaction and change the nature of the data itself. For example, a 20% sample of shopping cart data essential removes 8 items out of a 10 item basket – you would be very unlikely to find the mythical beer and diapers correlation that way! Instead you need a way to sample that would take an entire transaction rather than individual rows.
In order to accomplish this, you need Excel and PowerPivot (obviously) as well as Predixion Insight for Excel and the source database from which you pull your transactional data.
I’ll go through the entire sequence explicitly, but here are the high level steps:

Load transactional data into PowerPivot
Load distinct transactional id’s into PowerPivot
Create a sample of the transactional keys
Add a relationship to PowerPivot between the full data table and the sample
Add a calculated column indicating rows that are selected
Use that column as a table filter

In this example, I’m using some data from Last.FM regarding music artists and user submitted tags that I pulled from here.

You can see that the data is “transactional” with the “Artist ID” being the transaction ID with the “Tag” being the items in the transactions.
First get the data into PowerPivot by opening the PowerPivot window from Excel and selecting “From Database/From SQL Server” from the ribbon. (If your data is in a different database, find it under “from other sources”)

This launches the table import wizard where you indicate the location and database containing the data to load, and then select the transactional table.

Next go back to the Table Import Wizard, but this time, select the “Write a query” option

The query is a simple “SELECT DISTINCT” of the transaction ID’s. Note that the “Friendly Query Name” will end up as the table name in PowerPivot so it’s a good idea to set it here.

Then you go back to Excel, switch to the Insight Analytics ribbon and choose the “Sample Data” tool

Select “PowerPivot Data” as the input type and choose the result of the select distinct query as the table.

Choose the sampling method – in this case you will want to choose Random Sampling to reduce data volumes.

Then choose how many transactions you want in the output. This is entirely up to you at this point. For this example, I chose 10,000 transactions, but really it’s whatever you choose to be right for your scenario.

Finally, specify where you want the results – you need to add the results as a PowerPivot table, but you can change the name. There’s no point in creating an unselected list as you will see later in the example.

When the sampling is complete you can open PowerPivot, switch to the sampled data page and see how many transactions you selected. Since I chose 10,000 for this example, there are exactly 10,000 transaction ID’s in the result!

Now you need to create a relationship between the Selected Data table and the source table – in my example, that’s the LastFM table. First, in the PowerPivot window, switch to the Design ribbon and click “Create Relationship.”

In the Create Relationship dialog, relate the transaction ID from the Selected Data table to the transaction ID of the full transaction table.

Next you need to add a calculated column to the full transaction table indicating for each row if that row has been selected or not. There are many DAX expressions that will work to do this, I chose the straightforward “not blank related” approach.

And that’s it! You now have a sampled a transactional table – or at least labeled rows in a transactional table in such a way that you can easily select the sampled transactions. In Predixion Insight for Excel, you can always filter a PowerPivot data source for any operation using the filter control on the data selection page. Also, this is why you don’t need to keep the “unselected” samples since you can easily choose either the “selected” items or the “unselected” items in any filter.

Let me finish this off with one more ProTip. It’s unlikely that you will want these extra tables – i.e. the distinct list and sampled list of transaction ids – to show up in your pivot table analyses. While you can’t explicitly hide tables in PowerPivot, if you hide all the columns in a PowerPivot table it has the same effect. So to effectively hide a table from pivot charts, go to the Design ribbon in PowerPivot and use the Hide and Unhide tool to hide all of the columns in the extra tables.

Wednesday, February 16, 2011

Prescription Finder Demo

This really isn’t a blog post – it’s just a post so you can find the related materials for the Prescription Finder demo that will appear in video links any time now.
Without a lot of additional explanations, just in case you’ve stumbled upon this accidentally, this is what the demo looks like.

You can access the demo files from the SkyDrive folder below. To run the entire demo, you will need to import the included model into your Predixion Insight account (get a free trial if you don’t have one!)

Enjoy!

Tuesday, February 8, 2011

Data Preparation Webinar Slide Deck

Here’s the deck that I presented at today’s webinar complete with the DAX formulas I added to the PracticeFusion PowerPivot workbook. You can get the data from the Azure Marketplace or download one of my preloaded copies.
Included in the deck is the coupon code for 50% off the Predixion Insight Cloud service for 3 months – expires 3/15/2011!

Drop me a line or post a comment if you have any questions! (I’ll update this post when the video is posted)
EDIT: Webinar replay is available here

Sunday, February 6, 2011

The Full Data Profile - Azure Marketplace Data with PowerPivot and Predixion Insight Part 3

In the first post of the series I gave a screenshot of a basic Profile Data report from Predixion Insight, leaving creating the remaining reports as “an exercise for the reader.” Of course, if you don’t currently have Excel 2010 installed on your system, you may worried that you wouldn’t be able to complete the assignment and might get a bad grade on the pop quiz that’s coming up after the next few posts. (Just kidding).
Since I got permission to go ahead and post the data and results for my last post, I put the work-in-progress files up on SkyDrive. To make it easier to see what Profile Data in Predixion Insight really does I embedded the results here for you to explore interactively. If you want to feel like Gulliver in Lilliput poking through the castle windows you can click around in this interface, otherwise you can click on the “View full size workbook”

button for better access.

This workbook was created by downloading the Practice Fusion data as described here and running the Profile Data tool from Predixion Insight on each table in PowerPivot. I ran Profile Data twice on each table so you can see the difference

between the “Basic” and “Advanced” profile reports. Then I just edited the names on the worksheet tabs and the report titles. The Insight Index is automatically created and updated to create a nice guide to the workbook – it also updates automatically as you change worksheet names and report titles, so it’s convenient as well. If you delete rows in your index, or delete the whole page, or somehow do something irreversible to the worksheet (it is in an Excel worksheet, after all), you can regenerate the Insight Index from the Options dialog available under the Predixion menu button.
One more thing I had to do to host the workbook interactively in SkyDrive is to drop all the PowerPivot tables after creating the reports. The Excel file was between 9 and 10 MB which is too large for viewing in the browser.

Advanced Data Profile

I explained a bit about the basic data profile at the start of this series, so I’ll just explain briefly the difference. Essentially the difference is that you just get more stats with “Advanced” than you do with “Basic”, plus the report takes slightly longer to execute as it has more to compute.
For numerical columns, Advanced adds measures such as Kurtosis and Skewness which measure the “normality” of the data – that is – how well the data conforms to a bell curve, an approximate median, the mode, and the confidence interval. For those who aren’t familiar with these terms, they are defined with links to appropriate Wikipedia articles in the column headers.
The Advanced Profile Data report also contains correlation and covariance matrices for numerical columns. Looking at the correlation matrix for the SyncChart table, for example, shows a clear correlation between Weight and BMI, and a negative correlation between Height and BMI.

For categorical columns, the Advanced report adds the top and bottom 3 states present in the column, as well as a number representing the top 80% of the states in the data. This number tells you how many values of the column cover 80% of the data. Using the SyncChart example again, you can see that although there are 39 different physician specialties represented in the data, 80% of the visits are to only 3 types of physicians.

A good use of the top and bottom states is to let you know if the data is particularly useful. For instance the PhysicianSpecialty row above shows that the bottom three states appear only once in the data. This indicates that, for Predictive Analytics at least, you will not find any patterns regarding Naturopathy-Acupuncture, Dermatology, or Physical Medicine & Rehabilitation using this data. A good practice to deal with data that shows up with way is to use the Clean Data/Outliers tool to rename such states to “Other”. (Personally, with this data, I would also use Clean Data/Relabel to change blank states in this columns to “Unknown”).

How do get this stuff

Just to recap for those who haven’t read the other posts, the you can get the tools used to create these reports from the following locations: