Wednesday, September 5, 2012

Net Lift Modeling with SAS

Net lift modeling is best explained by Victor Lo in "The True Lift Model - A Novel Data Mining Approach to Response Modeling in Database Marketing", SIGKDD Explorations, Volume 4, Issue 2 (2002), pp. 78-86. I will not repeat the explanation here. Rather, I want to pass on some modeling notes from my experience with net lift modeling in SAS Enterprise Guide and Enterprise Miner.

1. A customer developed a net lift model template in SAS Enterprise Miner (EM) that I have been using. The model is well developed and highly generalized. However, the modeler/analyst must fully understand the developer's customized code as well as SAS EM's built-in functions. Also, each model must have startup code. A template is available for this, but it must be customized for the modeler's environment; correct library names and directory paths are essential.

2. Input data analysis consumes about 80% of the modeling time. Analysts have an "ethical modeling requirement" to fully understand and develop "intelligent" data sets before embarking on modeling the data. The customer derives its data from several internal and external sources, and a typical marketing model may draw on a dozen or more SAS data sets. Model variables can number in the thousands, and observations in the millions. Data sets vary as much in their completeness as they do in their sources. Moreover, the data set developers do not fully understand how the data will be used. A quick completeness audit of each input data set, as in the sketch below, is a sensible first step.
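A minimal audit sketch in Base SAS; the library MYLIB, data set MARKETING_INPUT, and variable CUST_SEGMENT are hypothetical names standing in for whatever the sources supply:

   /* Count, missing count, and range for every numeric variable in one pass. */
   proc means data=mylib.marketing_input n nmiss min max;
   run;

   /* Frequencies (including missing) for a key character variable. */
   proc freq data=mylib.marketing_input;
      tables cust_segment / missing;
   run;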

3. Variable names are not always chosen for ease of modeling. For instance, one of the most important variables, from an information value perspective, is numerical with a 12-character name. By the time this variable has gone through variable selection, imputation, transformation, and regression, the prefixes added at each step can push the name to 21+ characters. When interactions are formed, the name grows to 38 characters, beyond SAS's 32-character limit for variable names. The Lo regression in EM cannot handle such "long" variable names.
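A sketch for flagging names at risk before EM lengthens them further; the library name is hypothetical, and the 20-character threshold simply reflects the growth described above:

   proc sql;
      select memname, name, length(name) as namelen
      from dictionary.columns
      where libname = 'MYLIB' and length(name) > 20
      order by namelen desc;
   quit;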

4. In order for the Lo Regression path to perform properly, numerical variables must be included so that interaction terms can be formed. With the data cleansing macros producing mostly reformatted binary and categorical variables, and with a binary treatment and response, numerical variables stand little chance of surviving selection, especially if their intervals are wide. You can transform such a variable in EM, but this lengthens the variable name even more. If the regression model does not include at least one numerical interval variable, the path will fail to execute. The EM net lift model template also appears to be sensitive to too many variable imputations and transformations occurring in EM itself. A sketch of the interaction formulation follows.
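For context, a minimal sketch of the single-model interaction formulation described in Lo's paper, written here in plain PROC LOGISTIC outside the EM template; the data set and variable names are hypothetical:

   proc logistic data=mylib.modeling_sample;
      /* treatment is a 0/1 indicator; the treatment*covariate          */
      /* interactions are what let the lift vary across customers.     */
      model response(event='1') = treatment income_amt tenure_mos
                                  treatment*income_amt treatment*tenure_mos;
   run;

Scoring each customer twice, once with treatment forced to 1 and once with it forced to 0, and differencing the two predicted probabilities gives the estimated net lift.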

5. Macros are wonderful EG tools. An additional macro that is needed is one that analyzes a variable's distribution for skewness, particularly that of an exponential random variate. In logistic regression, when such a variable appears, it is appropriate to perform a log transformation, and it is best to do this before importing the data into EM. I write query programs that perform this function; a sketch follows.
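A minimal sketch, assuming a hypothetical right-skewed variable balance_amt:

   /* Measure skewness before importing to EM. */
   proc means data=mylib.marketing_input skewness;
      var balance_amt;
   run;

   /* If skewness is large and positive, log-transform up front.
      The +1 guards against log(0) for zero balances. */
   data mylib.marketing_clean;
      set mylib.marketing_input;
      if balance_amt >= 0 then log_balance = log(balance_amt + 1);
   run;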

6. The information value macro produces a lot of output that is useful in evaluating the possible impact of variables on a given treatment and response. In working with the data, some variables may provide redundant information yet still rank as important. For instance, D_XXXXXX (D for binary dummy variable) provides the same information as XXXXXX_P_C (converted to binary and collapsed), and both are binary. With these redundancies, the analyst cannot just rank-order variables 1 through 200 and blindly put them in the model; other variables may be more informative.
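For readers unfamiliar with the metric, a sketch of the information value calculation for one binned variable, with hypothetical names; IV is the sum over bins of (percent of events minus percent of non-events) times the weight of evidence:

   proc sql noprint;
      /* Population totals for events and non-events. */
      select sum(response = 1), sum(response = 0)
      into :tot_ev, :tot_ne
      from mylib.modeling_sample;

      /* Event and non-event counts per bin of the candidate variable. */
      create table iv_bins as
      select d_flag,
             sum(response = 1) as events,
             sum(response = 0) as nonevents
      from mylib.modeling_sample
      group by d_flag;
   quit;

   data iv_terms;
      set iv_bins;
      /* Bins with zero events or non-events need smoothing before this. */
      woe     = log((events / &tot_ev) / (nonevents / &tot_ne));
      iv_term = (events / &tot_ev - nonevents / &tot_ne) * woe;
   run;

   proc means data=iv_terms sum;   /* total IV for the variable */
      var iv_term;
   run;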

7. The data that goes into the sample (compiled by a macro that samples from different databases, sorts, and renames variables) is in some cases ambiguous and carries no labels explaining the variables. Data from the AXIOM company is a noteworthy exception, but the Getsample macro's renaming loses this information. Analysts must invest the time necessary to understand the data they are including in the model. Failure to do so may result in a poor model.

8. I have found and corrected minor errors in the macros and lift model template code that have huge impacts. Some of these were caused by a code syntax change between versions and are not the fault of the developer. Another part of the problem is that analysts open a macro or code node, make changes, and save them, overwriting the original files. Take care to use SAVE AS... with a different file name when working in the SAS Enterprise products; versioning is also recommended. Another problem occurs when variables are dropped from data sets or variable names change.

9. As mentioned, numerical variables have little chance of being included in a logistic regression in the presence of the numerous binary and nominal variables produced by the macros. I have had to use techniques to force an important numerical variable into the logistic regression, including log transformations, manual selection, and relaxed R-square or chi-square entry constraints. However, this may let variables that are not very explanatory into the model as well. Each variable must be analyzed for its Wald chi-square value and dealt with on a case-by-case basis. The sketch below shows one way to force a variable in.
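One concrete option in PROC LOGISTIC is the INCLUDE= option, which keeps the first n effects on the MODEL statement in every step of the selection; the variable names here are hypothetical:

   proc logistic data=mylib.modeling_sample;
      /* log_balance is listed first, so INCLUDE=1 forces it into the model. */
      model response(event='1') = log_balance d_flag1 d_flag2 seg_code_bin
            / selection=stepwise include=1 slentry=0.05 slstay=0.05;
   run;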

10. Net lift models for rare events seem to be extremely sensitive to oversampling. If oversampling is required to fit a regression model, the scores may need to be adjusted back to the population's prior probabilities. Analysts should also perform sensitivity analyses for these models, which requires additional time with the data.
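A minimal sketch of the standard prior correction, assuming both the oversampled and population event rates are known; the rates and data set names here are hypothetical:

   %let rho1 = 0.50;   /* event rate in the oversampled training data */
   %let pi1  = 0.02;   /* event rate in the population                */

   data scored_adj;
      set scored;      /* 'scored' holds p_hat from the fitted model */
      num   = p_hat * (&pi1 / &rho1);
      den   = num + (1 - p_hat) * ((1 - &pi1) / (1 - &rho1));
      p_adj = num / den;   /* probability rescaled to population priors */
   run;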

11. Whenever the data changes or additional variables are added, the EM model must be adjusted. This is another time consuming process.

12. Modeling is as much an art as it is a science. Particularly with data that does not follow the laws of physical dynamics, a marketing model may vary in appearance between different "artists", much like different painters' interpretations of the same landscape.

13. It is more important to understand the data analysis and modeling processes than it is to understand a single tool. Tools change; they come and go. There have also been coding changes between versions.

14. Food for thought: "All models are wrong, but some are useful" (George Box). We are not modeling reality, just a "sub-reality" filled in with assumptions and, in some cases, sketchy data. A good model is better than chasing the non-existent perfect model.

15. Operations research analysts are adept at documenting models and processes and can be relied upon for this task. Writing is as much a part of their toolbox as analysis is.

Tuesday, September 4, 2012

Back in the Saddle Again

I took a break from blogging while getting established in a new job. My next series of posts will be on predictive modeling using statistical modeling and analysis tools. I am currently working with SAS Enterprise Guide and Enterprise Miner performing net lift modeling, and I would like to share some lessons learned.