Tuesday, May 14, 2013

How Are 10 Most Influential People in Data Analytics Selected?

I was asked by Gregory at KDnuggets about how I came up with "the 10 Most Influential People in Data Analytics" in my last blog post. The following was my reply.

To select the 10 most influential people in data analytics, the following considerations were taken into account regarding an individual's contribution.

  1. The contribution is significant.
  2. The contribution is active/regular.
  3. A large number of people are impacted by the contribution.
  4. The focus is on the non-academic field.

I performed online research first to find qualified people. KDnuggets.com, social networking sites, data analytics conferences, consulting firms, and Amazon are just a few examples of good sources of information. I also took advantage of my own network. Having been in the industry for 15 years, I know many great data analytics professionals. They provided me with many names of qualified people. After I compiled a preliminary list, I sent it to a number of experts for their feedback. The final list was the result of several iterations.

Monday, May 13, 2013

10 Most Influential People in Data Analytics

We have identified the 10 most influential people whose significant contributions have greatly enriched the data analytics community. This is the result of months of research. The following is the list (in alphabetical order by last name).


Dean Abbott, Michael Berry, Tom Davenport, John Elder, Rayid Ghani, Anthony Goldbloom, Vincent Granville, Gregory Piatetsky-Shapiro, Karl Rexer, Eric Siegel



Dean Abbott is President of Abbott Analytics, Inc. in San Diego, California. Mr. Abbott is an internationally recognized data mining and predictive analytics expert with over two decades of experience applying advanced data mining algorithms, data preparation techniques, and data visualization methods to real-world problems, including fraud detection, risk modeling, text mining, personality assessment, response modeling, survey analysis, planned giving, and predictive toxicology. He is also Chief Scientist of SmarterRemarketer, a startup company focusing on behaviorally- and data-driven marketing attribution and web analytics.

Mr. Abbott is a highly regarded and popular speaker at Predictive Analytics and Data Mining conferences, including Predictive Analytics World, Predictive Analytics Summit, the Predictive Analytics Center of Excellence, SAS Institute, DM Radio, and INFORMS.

He has served on the program committees for the KDD Industrial Track and Data Mining Case Studies workshop and is on the Advisory Boards for the UC/Irvine Predictive Analytics Certificate and the UCSD Data Mining Certificate programs. Mr. Abbott has taught applied data mining and text mining courses using IBM SPSS Modeler, Statsoft Statistica, Salford Systems SPM, SAS Enterprise Miner, Tibco Spotfire Miner, IBM Affinium Model, Megaputer Polyanalyst, KNIME, and RapidMiner.


Michael Berry is a recognized authority on business applications of data mining. He is the author (with Gordon Linoff) of several well-regarded books in the field, including Data Mining Techniques for Marketing, Sales, and Customer Relationship Management, which is now in its third edition.

He is currently responsible for analytics and business intelligence for the business-to-business side of Tripadvisor (www.tripadvisor.com). Mr. Berry is a co-founder of Data Miners, Inc. (www.data-miners.com), a consultancy specializing in the analysis of large volumes of data for marketing and CRM purposes.


Tom Davenport is a Visiting Professor at Harvard Business School. He is also the President’s Distinguished Professor of Information Technology and Management at Babson College, the co-founder of the International Institute for Analytics, and a Senior Advisor to Deloitte Analytics. He has published on the topics of analytics in business, process management, information and knowledge management, and enterprise systems. He pioneered the concept of “competing on analytics” with his best-selling 2006 Harvard Business Review article (and his 2007 book by the same name). His most recent book is Keeping Up with the Quants: Your Guide to Understanding and Using Analytics, with Jinho Kim. He has written or edited fifteen other books and over 100 articles for Harvard Business Review, Sloan Management Review, the Financial Times, and many other publications. In 2003 he was named one of the world’s “Top 25 Consultants” by Consulting magazine. In 2005 Optimize magazine’s readers named him among the top 3 business/technology analysts in the world. In 2007 and 2008 he was named one of the 100 most influential people in the IT industry by Ziff-Davis magazines. In 2012 he was named one of the world’s top fifty business school professors by Fortune magazine.


John Elder founded and leads America’s largest and most experienced data mining consultancy. Founded in 1995, Elder Research (http://www.datamininglab.com) has offices in Charlottesville, Virginia, and Washington, DC, and has delivered projects in a huge variety of data, text, and link mining applications. Dr. Elder has co-authored three books (on practical data mining, ensembles, and text mining), two of which won PROSE awards for top book of the year in Mathematics or Computer Science. John has authored data mining tools, was one of the discoverers of ensemble methods, has chaired international conferences, and is a frequent keynote speaker. He is probably best known for explaining complex analytic concepts with clarity, humor, and enthusiasm.

Dr. Elder has degrees in Engineering (Systems PhD, UVA + Electrical Masters & BS, Rice) and is an occasional Adjunct Professor at UVA. He was honored to be named by President Bush to serve 5 years on a panel to guide technology for national security. Lastly, John is grateful to be a follower of Christ and the father of five.


Rayid Ghani is currently at the Computation Institute and the Harris School of Public Policy at the University of Chicago. Rayid is also the co-founder of Edgeflip, an analytics startup building social media analytics products that allow non-profits and social good organizations to better use social networks to raise money, recruit, engage, and mobilize volunteers, and do targeted outreach and advocacy.

Rayid Ghani was the Chief Scientist at the Obama for America 2012 campaign, focusing on analytics, technology, and data. His work focused on improving different functions of the campaign, including fundraising, volunteer mobilization, and voter mobilization, using analytics, social media, and machine learning. Before joining the campaign, Rayid was a Senior Research Scientist and Director of Analytics Research at Accenture Labs, where he led a technology research team focused on applied R&D in analytics, machine learning, and data mining for large-scale and emerging business problems in various industries, including healthcare, retail & CPG, manufacturing, intelligence, and financial services.

In addition, Rayid serves as an adviser to several analytics start-ups; is an active speaker at, organizer of, and participant in academic and industry analytics conferences; and publishes regularly in machine learning and data mining conferences and journals.


Anthony Goldbloom is the founder and CEO of Kaggle. Before founding Kaggle, Anthony worked in the macroeconomic modeling areas of the Reserve Bank of Australia and before that the Australian Treasury.

He holds a first class honours degree in economics and econometrics from the University of Melbourne and has published in The Economist magazine and the Australian Economic Review.

In 2011, Forbes Magazine cited Anthony as one of the 30 under 30 in technology and Fast Company featured him as one of the innovative thinkers who are changing the future of business.


Dr. Vincent Granville is a visionary data scientist with 15 years of experience in big data, predictive modeling, and digital and business analytics. Vincent is widely recognized as a leading expert in scoring technology, fraud detection, and web traffic optimization and growth. Over the last ten years, he has worked on real-time credit card fraud detection with Visa, advertising mix optimization with CNET, change point detection with Microsoft, online user experience with Wells Fargo, search intelligence with InfoSpace, automated bidding with eBay, and click fraud detection with major search engines, ad networks, and large advertising clients.

Most recently, Vincent launched Data Science Central, the leading social network for big data, business analytics, and data science practitioners. Vincent is a former postdoctoral researcher at Cambridge University and the National Institute of Statistical Sciences. He was among the finalists at the Wharton School Business Plan Competition and at the Belgian Mathematical Olympiads. Vincent has published 40 papers in statistical journals and is an invited speaker at international conferences. He also developed a new data mining technology known as hidden decision trees, owns multiple patents, published the first data science book, and raised $6MM in start-up funding. Vincent is among the top 20 big data influencers according to Forbes, was featured on CNN, and is #1 on Gil Press' A-List of data scientists.


Gregory Piatetsky-Shapiro, Ph.D. (@kdnuggets) is the Editor of KDnuggets.com, a leading site for Analytics, Big Data, Data Mining, and Data Science. He is also a well-known expert and an independent consultant in this field. Previously, he led data mining teams at GTE Laboratories and was Chief Scientist for two start-ups. He has extensive experience in applying analytic and data mining methods to many areas, including customer modeling, healthcare data analysis, fraud detection, bioinformatics, and Web analytics, and has worked for a number of leading banks, insurance companies, telcos, and pharmaceutical companies.

He coined the terms “KDD” and “Knowledge Discovery in Data” when he organized and chaired the first three KDD workshops. He later helped grow the workshops into ACM Conf. on Knowledge Discovery and Data Mining (kdd.org), the top research conference in the field. Dr. Piatetsky-Shapiro is also a co-founder of ACM SIGKDD, the leading professional organization for Knowledge Discovery and Data Mining and served as the Chair of SIGKDD (2005-2009). He received ACM SIGKDD and IEEE ICDM Distinguished Service Awards. He has over 60 publications with over 10,000 citations.


Karl Rexer, PhD is President of Rexer Analytics (www.RexerAnalytics.com). Founded in 2002, Rexer Analytics has delivered analytic solutions to dozens of companies. Solutions include fraud detection, customer attrition analysis and prediction, advertisement abandonment prediction, direct mail targeting, market basket analysis and survey research. Rexer Analytics also conducts and freely distributes the widely read Data Miner Survey. The survey has been written about and cited in over 12 languages. In the spring of 2013, over a thousand analytic professionals from around the world participated in the 6th Data Miner Survey.

Karl has served on the organizing and review committees of several international conferences (e.g., KDD), and is on the Board of Directors of Oracle's Business Intelligence, Warehousing, & Analytics (BIWA) Special Interest Group. He has served on IBM's Customer Advisory Board, is an Industry Advisor for Babson College's Business Analytics program, and is in the #1 position on LinkedIn's list of Top Predictive Analytics Professionals. He is frequently an invited speaker and moderator at conferences and universities. So far in 2013 he has conducted data mining training sessions in California, China, and London. Prior to founding Rexer Analytics, Karl held leadership and consulting positions at several consulting firms and two multinational banks.


Eric Siegel, PhD, founder of Predictive Analytics World and Text Analytics World, author of Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die, and Executive Editor of the Predictive Analytics Times, makes the how and why of predictive analytics understandable and captivating. Eric is a former Columbia University professor who used to sing educational songs to his students, and a renowned speaker, educator and leader in the field.

Monday, April 22, 2013

Find the Most Important Variables In Predictive Models

A commonly used method for determining the most important variables is to examine how well each variable individually predicts the target variable. However, this approach has its limits. The first is that it may exclude variables that are actually good ones.

For example, to find out whether credit score is a good indicator of account default, we calculate the default rate for each credit class as shown below (or we may perform a statistical test, such as a Chi-Square test). As we can see, the low credit score class has a default rate of 18% versus 6% for the high credit score class. Thus, credit class is considered a good variable for building a default model.

Credit score and account default.

However, this approach has its limits because it does not take the relationships among variables into consideration. This can be illustrated with an imaginary example. We want to assess whether people's height and weight are indicative of getting a disease. We can calculate the following tables for height and weight, respectively. Since short and tall people have the same percentage of sick people (2%), we may conclude that height is not relevant to predicting the disease. Similarly, we would also think weight is not important.

Height and Disease

Weight and Disease

If we examine weight and height at the same time, we can develop the following matrix. There are four groups of people: tall/heavy (normal), short/light (normal), tall/light (abnormal), and short/heavy (abnormal). 11% of the short/heavy or tall/light people are sick (orange cells), while only 0.1% of the tall/heavy and short/light groups (green cells) are sick. Thus, height and weight together are very good variables to include in a predictive model.

Height/Weight and Disease


As we have seen, this approach may exclude variables that are actually good. When we determine the most important variables for building a predictive model, we should ideally take the set of variables as a whole into consideration. More often than not, it is the relationships between variables that provide the best predictive power. How to find or generate the most useful variables for predictive models is so crucial that we will talk more about it in upcoming blog posts. I have written another post, More on How to Find the Most Important Variables for a Predictive Model, using the Oracle Attribute Importance function.
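The interaction effect above can be reproduced with a short sketch. The counts below are hypothetical, chosen so that the marginal disease rates are identical for both height (and, by symmetry, weight) levels while the joint rates diverge sharply; the exact percentages differ slightly from the tables above.

```python
# Hypothetical counts echoing the post's example: sickness depends on the
# *combination* of height and weight, not on either variable alone.
counts = {
    # (height, weight): (sick, healthy)
    ("tall",  "heavy"): (1, 899),    # "normal" build
    ("tall",  "light"): (11, 89),    # "abnormal" build
    ("short", "heavy"): (11, 89),    # "abnormal" build
    ("short", "light"): (1, 899),    # "normal" build
}

def rate(rows):
    """Disease rate over a list of (sick, healthy) count pairs."""
    sick = sum(s for s, h in rows)
    total = sum(s + h for s, h in rows)
    return sick / total

# Marginal rates: height alone shows the same rate for both levels,
# so height (and, symmetrically, weight) looks useless.
tall = rate([v for k, v in counts.items() if k[0] == "tall"])
short = rate([v for k, v in counts.items() if k[0] == "short"])

# Joint rates: the interaction is strongly predictive.
normal = rate([counts[("tall", "heavy")], counts[("short", "light")]])
abnormal = rate([counts[("tall", "light")], counts[("short", "heavy")]])
```

Examining each variable individually would discard both height and weight, yet together they separate an 11% group from a 0.1% group.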

Sunday, April 07, 2013

Logistic Regression Model Implemented in SQL

In a project, we needed to deploy a logistic regression model into a production system that only takes SQL scripts as its input. Two functions come in handy: decode() and nvl(). Decode() converts a categorical value into a weight, and nvl() conveniently replaces null with a desired value. The following SQL script is similar to what we delivered.



select transaction_id,
(1/
(1+
exp(-(
nvl(AMT1*.000019199,0)+
nvl(AMT3*(-.00002155),0)+
decode(nvl(substr((TXN_CODE1),1,18),' '),
'XX',-.070935,
'57',-.192319,
'1',-.053794,
'81',-.010813,
'NR',-.079628,
'PD',-.102987,
'P',-1.388433,
'Z6',-.106081,
'01',-1.1528,
'Z4',-.004237,
'T1',.697737,
'AK',-.490381,
'U2',.063712,
'NK',.054354,
'PR',.205336,
'51',-.286213,
'N',.075582,
' ',-.330585,
0)+
decode(nvl(substr( trim(TXN_CODE2),1,18),' '),
'U',-.11176,
0)+
decode(nvl(substr( trim(TXN_CODE3),1,18),' '),
'1',-.642605,
0)+
decode(nvl(substr( trim(TXN_CODE4),1,18),' '),
'00',-.084517,
'10',.057248,
0)
-6.8190776
)
)
)
) as score from tbl_data;
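The same scoring logic can be sketched in Python to sanity-check the SQL output. Here dict.get() plays the role of decode() (with a default of 0) and the None handling plays the role of nvl(); only a few of the script's TXN_CODE1 coefficients are shown, and the `score` function name is for illustration.

```python
import math

# A subset of the coefficients from the SQL script above.
TXN_CODE1_W = {"XX": -.070935, "57": -.192319, "P": -1.388433, " ": -.330585}
INTERCEPT = -6.8190776

def score(amt1, amt3, txn_code1):
    z = INTERCEPT
    z += (amt1 or 0) * .000019199        # nvl(AMT1*.000019199, 0)
    z += (amt3 or 0) * (-.00002155)      # nvl(AMT3*(-.00002155), 0)
    code = (txn_code1 or " ")[:18]       # nvl(substr(TXN_CODE1,1,18), ' ')
    z += TXN_CODE1_W.get(code, 0)        # decode(..., default 0)
    return 1 / (1 + math.exp(-z))        # logistic function
```

Running the two implementations over the same rows of tbl_data is a quick way to verify the SQL deployment matches the model.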

Thursday, March 07, 2013

Watch out for NULL values when comparing data

It is a very common task to compare data values. For example, I was involved in a project where we upgraded a scoring engine. We wanted to make sure the old and new scoring engines produced the same outputs given the same inputs. I use the following table to illustrate the problem. We want to make sure VALUE_OLD and VALUE_NEW are the same. (The blanks are NULL values.)

        ID  VALUE_OLD  VALUE_NEW
---------- ---------- ----------
         1        234
         2                   567
         3        789        789

If we simply use the following query to count the number of discrepancies, the result returns zero. This is not what we expect.
select count(*) from tbl_data_a where VALUE_OLD <> VALUE_NEW;

  COUNT(*)
----------
         0
This is because rows where a NULL value appears in the comparison are ignored.

A better approach is to write a query considering all of the following five situations.

In the following cases, VALUE_OLD and VALUE_NEW are the same:
1. VALUE_OLD is null, and VALUE_NEW is null.
2. VALUE_OLD is not null, VALUE_NEW is not null, and VALUE_OLD = VALUE_NEW.

In the following cases, VALUE_OLD and VALUE_NEW are different:
3. VALUE_OLD is null, and VALUE_NEW is not null.
4. VALUE_OLD is not null, and VALUE_NEW is null.
5. VALUE_OLD is not null, VALUE_NEW is not null, and VALUE_OLD <> VALUE_NEW.
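The pitfall and the null-safe fix can be demonstrated with sqlite3 from the Python standard library; the table mirrors the one above, and the corrected query spells out the cases where the two columns differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table tbl_data_a (id int, value_old int, value_new int)")
conn.executemany("insert into tbl_data_a values (?, ?, ?)",
                 [(1, 234, None), (2, None, 567), (3, 789, 789)])

# Naive comparison: rows with NULLs are silently ignored, so the count is 0.
naive = conn.execute(
    "select count(*) from tbl_data_a where value_old <> value_new"
).fetchone()[0]

# Null-safe comparison covering the "different" cases (3, 4, and 5).
safe = conn.execute("""
    select count(*) from tbl_data_a
    where (value_old is null and value_new is not null)
       or (value_old is not null and value_new is null)
       or (value_old is not null and value_new is not null
           and value_old <> value_new)
""").fetchone()[0]
```

The naive query reports zero discrepancies; the null-safe query correctly reports two (rows 1 and 2).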

Monday, March 04, 2013

Custom model score vs credit bureau score

In a previous post, we compared the performance of different types of models applied to the same data set. In reality, a more frequently encountered issue is comparing the performance of credit bureau scores and custom model scores. By custom model scores, we mean scores from a model built on the client's own historical data. I have done numerous predictive modeling projects in this area across industries. My conclusion is that custom model scores almost always beat generic credit bureau scores by large margins. The following gain charts are from a real project. For example, if we reject the worst 20% of customers based on their bureau scores, we can stop 26% of loan defaults. If we reject 20% of customers based on custom model scores, almost 40% of loan defaults can be stopped. This can easily translate into big savings for a company with a large number of customers.



The pattern is common. Thus, it is worthwhile to build a custom model, which will almost always outperform generic credit bureau scores, as long as the client has enough historical data.

Saturday, February 16, 2013

The Comparison of Different Models

In previous posts, we mentioned that a great deal of time should be spent on understanding the data and building feature variables that are truly relevant to the target variable. The next question is: which predictive model should we use? There are many types of models to choose from. For example, for classification problems, we can use CART, logistic regression, SVMs, neural nets, k-nearest neighbors, Bayesian classifiers, ensemble models, etc. In my PhD dissertation, most of the content is dedicated to the empirical comparison of different models. In the commercial world, I have sometimes applied different models to solve the same problem. The following lift charts are the actual results for models that predict direct mail response. The models I tested include gradient boosting trees, CART, logistic regression, and a simple cell (or cube) model. The cell, or cube, model divides the training data into many cubes and calculates the response rate for each cube. The predicted response rate for a new data point is that of the cube where it is located.
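The cell (cube) model is simple enough to sketch in a few lines. This is an illustrative implementation under my own assumptions (fixed bin widths per variable, and the overall training response rate as a fallback for empty cells), not the exact model used in the project.

```python
from collections import defaultdict

def make_cell_model(rows, bins):
    """rows: list of (x1, x2, response 0/1); bins: cell width per variable."""
    hits = defaultdict(int)      # responders per cell
    totals = defaultdict(int)    # observations per cell
    for x1, x2, y in rows:
        cell = (int(x1 // bins[0]), int(x2 // bins[1]))
        totals[cell] += 1
        hits[cell] += y
    overall = sum(y for _, _, y in rows) / len(rows)  # fallback rate

    def predict(x1, x2):
        cell = (int(x1 // bins[0]), int(x2 // bins[1]))
        if totals[cell] == 0:    # unseen cell: fall back to the overall rate
            return overall
        return hits[cell] / totals[cell]
    return predict
```

Despite its simplicity, a model like this can perform respectably, which is the point of the comparison below.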


As we can see from the lift charts above, gradient boosting trees perform best. Logistic regression and CART are almost the same. However, all the models, while varying greatly in structure and sophistication, perform satisfactorily on the testing data set.

However, in reality the selection of a model should not be based solely on its predictive accuracy. Other important considerations include how hard the model is to deploy into a production system, its computational efficiency, its memory usage, whether it can give a reason for its prediction, etc. It is completely acceptable to choose a simple model that performs reasonably well. I have seen too many cases where statisticians built great (and sophisticated) models in their lab environments that could not be deployed into the production system. In those cases, the benefits of predictive modeling were never realized.

Wednesday, February 13, 2013

Abraham Lincoln and Predictive Modeling Project

Abraham Lincoln said, "If I had six hours to chop down a tree, I'd spend the first four hours sharpening the axe". The same principle can be applied to a predictive modeling project. From the discussion in previous posts, we can see that most of the time is not spent on building the model itself. Thus we may say, "If I had ten days to build a predictive model, I'd spend the first seven days understanding the data and building feature variables that are truly relevant to the target variable".

Tuesday, January 15, 2013

How to build predictive models that win competitions. (continued)

To build predictive models that are accurate and robust, it is crucial to find the input variables, or so-called feature variables, that are truly predictive of the target variable. This is even more important than choosing the type of predictive model. Good feature variables can greatly reduce the complexity of model learning.

For example, suppose we want to build a model that predicts whether a pair of vectors is parallel. This is illustrated in the following chart.


We can represent the pair of vectors by all of the coordinates of their starting and ending points, i.e., (Xa1,Ya1,Xa2,Ya2,Xb1,Yb1,Xb2,Yb2).

Suppose we have a few hundred training samples (pairs of vectors) that are marked as parallel or not. The following are some examples.

Sample 1: variables (0,0,1,1,1,0,2,1) target: yes (parallel)
Sample 2: variables (0,0,2,2,2,0,4,2) target: yes (parallel)
Sample 3: variables (0,0,2,2,2,0,4,4) target: no (not parallel)
......

However, it would be extremely challenging to build predictive models using (Xa1,Ya1,Xa2,Ya2,Xb1,Yb1,Xb2,Yb2) directly as input variables. Most models would probably perform very poorly and be useless.

If we realize that we can compare the angles of the two vectors to decide whether they are parallel, we can build two new variables using the atan function as follows:
(atan((Ya2-Ya1)/(Xa2-Xa1)), atan((Yb2-Yb1)/(Xb2-Xb1)))

The atan function is the inverse tangent. Using the new variables, the training samples above become:

Sample 1: variables (0.785, 0.785) target: yes (parallel)
Sample 2: variables (0.785, 0.785) target: yes (parallel)
Sample 3: variables (0.785, 1.107) target: no (not parallel)
......

This representation greatly simplifies the problem. When the two numbers are close, the two vectors are parallel.
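The feature construction above can be sketched as follows. One assumption on my part: math.atan2 is used instead of atan(dy/dx) so that vertical vectors (dx = 0) do not cause a division by zero; the function names are for illustration.

```python
import math

def angle_features(xa1, ya1, xa2, ya2, xb1, yb1, xb2, yb2):
    """Replace the 8 raw coordinates with the two vectors' angles."""
    return (math.atan2(ya2 - ya1, xa2 - xa1),
            math.atan2(yb2 - yb1, xb2 - xb1))

def parallel(sample, tol=1e-9):
    """Two vectors are parallel when their angles (nearly) coincide."""
    a, b = angle_features(*sample)
    return abs(a - b) < tol
```

Applied to the samples above, the two angle features make the classification trivial.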

Thus, before we build models, a great deal of time should be spent on building feature variables that are truly predictive of the target variable. This is true for any application, such as fraud detection, credit risk, or click-through rate prediction.

Some models, such as neural nets and SVMs, are able to implicitly re-represent the input variables and automatically figure out the best feature variables during the learning process. For example, some researchers claim that a neural net is able to approximate the atan function in its hidden neurons during training. This will be the subject of another post. From a data mining practitioner's perspective, nothing prevents us from taking advantage of our domain knowledge and calculating the feature variables directly.

Sunday, January 06, 2013

Split data into multiple tables using multi-table insert

With Oracle multi-table insert, we can conveniently store data into multiple tables using a single select statement.

For example, the following insert statement puts records from TBL_STOCK_QUOTES into three tables, T_A2K, T_L2S, and T_T2Z, based on the first letter of the stock symbol.

insert first
when substr(symbol,1,1) between 'A' and 'K' then
into T_A2K (SYMBOL,DT,HIGH,LOW,OPEN,CLOSE,VOLUME)
values (SYMBOL,DT,HIGH,LOW,OPEN,CLOSE,VOLUME)
when substr(symbol,1,1) between 'L' and 'S' then
into T_L2S(SYMBOL,DT,HIGH,LOW,OPEN,CLOSE,VOLUME)
values(SYMBOL,DT,HIGH,LOW,OPEN,CLOSE,VOLUME)
else
into T_T2Z(SYMBOL,DT,HIGH,LOW,OPEN,CLOSE,VOLUME)
values(SYMBOL,DT,HIGH,LOW,OPEN,CLOSE,VOLUME)
select
SYMBOL,DT,HIGH,LOW,OPEN,CLOSE,VOLUME
from
TBL_STOCK_QUOTES;


We can verify the results after the insert.
SQL> select min(symbol), max(symbol) from T_A2K;

MIN(SYMBOL) MAX(SYMBOL)
-------------------------------- --------------------------------
AAIT KYO

SQL> select min(symbol), max(symbol) from T_L2S;
MIN(SYMBOL) MAX(SYMBOL)
-------------------------------- --------------------------------
LANC SWI

SQL> select min(symbol), max(symbol) from T_T2Z;
MIN(SYMBOL) MAX(SYMBOL)
-------------------------------- --------------------------------
TAX ZN
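The routing logic of the insert-first statement can be sketched in plain Python; the bucket names mirror the three target tables, and the records are hypothetical.

```python
def route(records):
    """Send each record to exactly one bucket, like Oracle's insert first."""
    tables = {"T_A2K": [], "T_L2S": [], "T_T2Z": []}
    for rec in records:
        first = rec["symbol"][0].upper()
        if "A" <= first <= "K":            # when ... between 'A' and 'K'
            tables["T_A2K"].append(rec)
        elif "L" <= first <= "S":          # when ... between 'L' and 'S'
            tables["T_L2S"].append(rec)
        else:                              # the else branch
            tables["T_T2Z"].append(rec)
    return tables
```

Because "insert first" stops at the first matching branch, each record lands in exactly one table, just as each record lands in exactly one bucket here.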

Saturday, January 05, 2013

Watch out for invisible characters.

When text files (or CSV files, where the field delimiter is a comma) are loaded into a table, sometimes the columns contain weird, invisible characters. These cause unexpected results when we compare values or convert them from strings to numbers. The Oracle function utl_raw.CAST_TO_RAW is a great way to look inside text strings and uncover those invisible characters.

In a project, we used an ETL tool to move financial transactions into a data warehouse and found that, in a very small number of cases, the data showed discrepancies when compared before and after the move. However, the data looked identical visually. Using the utl_raw.CAST_TO_RAW function, we were able to pinpoint the discrepancies caused by invisible characters and fix the problem.

For example, there is a table containing text strings about stock prices. The following query shows one record containing Symbol, Date, Open, High, Low, Close, Volume, and Adj Close.
SQL> select quotes from TBL_SYMBOL_QUOTES_CUR a where rownum <2; 

QUOTES
---------------------------------------------------------------- 
APEI,2013-01-02,36.89,37.34,36.59,36.96,198700,36.96 
The above query result looks OK. But the text string actually contains an invisible character: a new line (hex value 0A, decimal value 10). We can use utl_raw.CAST_TO_RAW to uncover it.
SQL> select utl_raw.CAST_TO_RAW(quotes) quotes_raw from TBL_SYMBOL_QUOTES_CUR a where rownum <2; 

QUOTES_RAW 
------------------------------------------------------------------------------------------------------------------------------------ 
415045492C323031332D30312D30322C33362E38392C33372E33342C33362E35392C33362E39362C3139383730302C33362E39360A 
Complete ASCII tables show the numerical representation of each character. We can remove the unwanted character using the Oracle replace function; chr(10) denotes the character with decimal value 10 (hex value 0A), the new line character.
SQL> select replace(quotes,chr(10)) from TBL_SYMBOL_QUOTES_CUR a where rownum <2;
The following query shows that the new line character has been removed.
SQL> select utl_raw.CAST_TO_RAW(replace(quotes,chr(10))) from TBL_SYMBOL_QUOTES_CUR a where rownum <2; 

UTL_RAW.CAST_TO_RAW(REPLACE(QUOTES,CHR(10))) 
------------------------------------------------------------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------------- 
415045492C323031332D30312D30322C33362E38392C33372E33342C33362E35392C33362E39362C3139383730302C33362E3936 
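A Python analogue of the utl_raw.CAST_TO_RAW trick: dumping a string's bytes in hex makes invisible characters, such as a trailing newline, obvious. The sample string is the record from the query above.

```python
quotes = "APEI,2013-01-02,36.89,37.34,36.59,36.96,198700,36.96\n"

# Hex dump of the raw bytes, like utl_raw.CAST_TO_RAW(quotes).
raw = quotes.encode("ascii").hex().upper()
# The dump ends in "0A": the hidden newline (chr(10)).

# Strip the invisible character, like replace(quotes, chr(10)) in Oracle.
cleaned = quotes.replace(chr(10), "")
```

The hex dump starts with 415045492C ("APEI,") and ends with 0A, matching the query output above byte for byte.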

Thursday, January 03, 2013

The first principle of data analytics is to avoid making mistakes (continued)

The importance of avoiding mistakes in data mining (or in any work in general) can never be overemphasized. When mistakes happen, wrong (and sometimes ridiculous) conclusions are drawn, and the credibility of the analyst is severely damaged.

One common source of mistakes is survivor bias. The following are some examples.

1. When we study average 10-year stock returns, we collect the price history of stocks that are currently on the market. However, because many companies went under during those 10 years and their data are not included in the analysis, the calculated average return will be higher than the actual return.

2. It was uncovered that, 12 months ago, a data breach affected some credit cards of a card issuer. To measure the impact of the breach, analysts take the currently active credit cards, find those affected by the breach, and measure their fraud rate. They may find that the fraud rate is surprisingly low. This is because the cards that had fraudulent activities reported by their cardholders have already been closed and purged from the active portfolio.

Stocks of companies that went under, accounts closed due to nonpayment, credit cards closed due to fraud, churned customers, etc., are data corpses. To avoid survivor bias, it is important to collect the complete data, including those data corpses. This is easier said than done, as in reality data corpses are regularly removed and hard to collect.

Tuesday, January 01, 2013

The first principle of data analytics is to avoid making mistakes (continued)

Mistakes can happen when we deploy models that work well in our "lab environment" into production. We need to anticipate unseen situations ahead of time and deal with them. For example, the following are some common challenges:

1. Unseen categorical variable values. For example, in the training data, the state variable does not contain the value RI. However, in production, RI appears in the data. One way to deal with this is to assign all unseen codes to the most frequent category in our model scripts. That way, the model will run as we expect.

2. Continuous variables out of bounds. For example, in the training data, the credit card purchase amount is within the range of $0.01 to $22,000. However, in production, the purchase amount could be $56,000. If we do not handle the purchase amount properly, the model will generate extreme scores for transactions with a purchase amount of $56,000. It is a good idea to winsorize the variable; "winsorize" is a fancy word for clipping extreme values.
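The two safeguards above can be sketched as follows. The amount boundaries ($0.01 to $22,000) come from the example; the specific state codes and the choice of default category are hypothetical.

```python
KNOWN_STATES = {"CA", "NY", "TX"}    # categories seen in training (made up)
DEFAULT_STATE = "CA"                 # most frequent training category (made up)
AMT_MIN, AMT_MAX = 0.01, 22000.0     # training range of purchase amount

def clean_state(state):
    """Map any unseen code to the most frequent training category."""
    return state if state in KNOWN_STATES else DEFAULT_STATE

def winsorize_amount(amount):
    """Clip out-of-range amounts to the training boundaries."""
    return min(max(amount, AMT_MIN), AMT_MAX)
```

Applying these transformations in the scoring script keeps production inputs inside the region the model was trained on.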

Saturday, December 22, 2012

The first principle of data analytics is to avoid making mistakes (continued)

Mistakes may happen when data or processes flow from one system to another. Sometimes, mistakes are caused by flaws in the analytic methodologies themselves.

One popular mistake is to think in terms of absolute numbers rather than ratios. For example, analysts see a few hundred fraudulent credit card transactions at a particular online retailer's website and conclude that the retailer is very risky. This may not be true. If the retailer has tens of millions of normal transactions in the same period, the ratio of fraudulent transactions is very low. We have seen on a number of occasions that people use the number of frauds, instead of the fraud rate, as the measure of risk. As a result, they create fraud detection rules that generate a lot of false alarms.

Friday, December 21, 2012

The first principle of data analytics is to avoid making mistakes

In a previous post, How to build predictive models that win competitions, we showed examples of actual mistakes made by statisticians and analysts. It is obvious that delivering something wrong causes irrecoverable damage to one's credibility. Mistakes can happen particularly when data or processes flow from one system to another:

1. When data files are provided by clients. Customers make mistakes too. Their database people may use the wrong scripts and fail to extract the data that we want.

2. When data are loaded from text files into databases or other systems. A lot of things can go wrong during data loading. Issues like missing value representation, number formats, date formats, unrecognized characters, etc. will surface. Thus, if possible, using text files to store data should be avoided.

3. When reports are generated. Many analysts perform data analytics work using tools like SAS, R, SPSS, S-PLUS, Oracle Data Mining, etc. They then summarize the results manually in an Excel spreadsheet and build the final PowerPoint presentation. Things can go wrong here: decimal points end up in the wrong places; dates are not updated.

4. When models are deployed into production systems. There are a lot of moving parts here, and it is a challenge to precisely replicate the off-line data analytics process in a different production system.

Tuesday, December 18, 2012

How to build predictive models that win competitions

The predictive models that we produced won head-to-head competitions and were chosen by clients. For example, our model predicting the risk that a new customer will not pay his/her phone bill was used by a top-3 cell phone company. Our debit card fraud detection model is being used by a top-15 bank. Our competitors included one of the three credit bureaus, which had hundreds of statisticians working for it.

We have found that if we follow a number of principles, we will be able to produce good predictive models and quality data analytics work in general.

The first principle is to avoid making mistakes. We have seen many cases where mistakes damaged the reputation and credibility of data analysts. For example, statisticians analyzed data and drew the conclusion that compromised credit cards were less likely to show fraudulent activities than normal cards. It was a mistake, of course. One of our competitors loaded a customer's data incorrectly and produced reports whose numbers made no sense at all. Mistakes like these cause immediate rejection from clients and permanent damage to analysts' reputations.

We need to realize that avoiding mistakes should be an inherent part of our processes. In a large project, from data gathering, loading, validating, and summarizing to model building, report generation, and model deployment, we may have to take many steps and produce hundreds of data sets. To avoid making mistakes, we need to double-check our results. We have found that it is actually much harder to verify that results are correct than to simply execute the steps that produce them.

Thus, we always perform two tasks: 1. produce the result; 2. verify the result. It is better to spend more time producing correct results than to quickly deliver something wrong that causes irrecoverable damage to credibility. We will talk more about avoiding mistakes in the post The first principle of data analytics is to avoid making mistakes.
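The produce-then-verify discipline can be made mechanical. A minimal sketch of the two-task pattern; the specific checks (distinct-key count, reconciled totals) are illustrative, not from the original post:

```python
# Task 1 produces a result; task 2 verifies it with independent checks
# before it is delivered.

def produce_summary(transactions):
    """Task 1: produce the result (total purchase amount per card)."""
    totals = {}
    for card, amount in transactions:
        totals[card] = totals.get(card, 0.0) + amount
    return totals

def verify_summary(transactions, totals):
    """Task 2: verify the result against independently computed facts."""
    # The summary must have one row per distinct card.
    assert len(totals) == len({card for card, _ in transactions}), "card count mismatch"
    # The summary's grand total must reconcile with the raw data.
    grand_total = sum(amount for _, amount in transactions)
    assert abs(sum(totals.values()) - grand_total) < 1e-6, "totals do not reconcile"
    return True

txns = [("c1", 10.0), ("c2", 5.5), ("c1", 2.5)]
result = produce_summary(txns)
verify_summary(txns, result)   # raises AssertionError if the tasks disagree
print(result)
```

The key design choice is that the verification computes its facts independently of the production path, so a bug in one is unlikely to hide in the other.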

Friday, December 14, 2012

Oracle NTILE function

Oracle's NTILE function divides records into buckets with the same number of rows (off by at most 1). Note that records with the same variable value (ties) may be placed into different buckets. The following query result shows that the maximum value in a bucket may overlap with the minimum value in the next bucket.

SQL> with tbl as (select ntile(5) over(order by price) nt, price from STOCK_PRICE) select nt, count(1), min(price), max(price) from tbl group by nt order by nt;

        NT   COUNT(1) MIN(PRICE) MAX(PRICE)
---------- ---------- ---------- ----------
         1      36387       9.58      13.68
         2      36387      13.68      16.72
         3      36386      16.72      20.87
         4      36386      20.87      26.87
         5      36386      26.87      89.53

We can also use the CUME_DIST function combined with CEIL to divide records into buckets of similar size. Ties will be placed into the same bucket. The following query result shows that there is no overlap of values between buckets.

SQL> with tbl as (select ceil(cume_dist() over(order by price nulls first)*5)  nt, price from MV_VIX_15S_AVG) select nt, count(1), min(price), max(price) from tbl group by nt order by nt;

        NT   COUNT(1) MIN(PRICE) MAX(PRICE)
---------- ---------- ---------- ----------
         1      36358       9.58     13.675
         2      36332      13.68      16.71
         3      36413      16.72      20.86
         4      36429      20.87     26.865
         5      36400      26.87      89.53
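The difference between the two bucketing rules can be reproduced outside the database. A Python sketch of both (the data is made up; the point is the tie handling):

```python
import math

def ntile(values, k):
    """Oracle-style NTILE: bucket sizes differ by at most 1;
    tied values may be split across adjacent buckets."""
    vals = sorted(values)
    base, extra = divmod(len(vals), k)
    buckets, i = [], 0
    for b in range(k):
        size = base + (1 if b < extra else 0)
        buckets.append(vals[i:i + size])
        i += size
    return buckets

def cume_dist_buckets(values, k):
    """ceil(cume_dist * k): tied values always share a bucket,
    so bucket sizes are only approximately equal."""
    vals = sorted(values)
    n = len(vals)
    last = {}                      # last (1-based) position of each value
    for i, v in enumerate(vals):
        last[v] = i + 1
    buckets = [[] for _ in range(k)]
    for v in vals:
        cd = last[v] / n           # cume_dist of v
        buckets[math.ceil(cd * k) - 1].append(v)
    return buckets

data = [1, 2, 2, 2, 3, 4, 5, 6, 7, 8]
print([len(b) for b in ntile(data, 5)])              # [2, 2, 2, 2, 2]
print([len(b) for b in cume_dist_buckets(data, 5)])  # [1, 3, 2, 2, 2]
```

With this data, NTILE splits the three 2s across buckets 1 and 2 (so the buckets' value ranges overlap), while the CUME_DIST approach keeps them together at the cost of unequal bucket sizes, exactly the trade-off shown in the two SQL outputs above.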

Tuesday, December 11, 2012

More on Data Preparation for Logistic Regression Models

In the earlier post Data preparation for building logistic regression models, we talked about converting data into a more compact format so that we can use memory-intensive software like R to handle a large number of cases.

Another issue that we commonly face is how to deal with "unbalanced" data. Some examples:
1. Only 7 to 12 out of 10,000 credit card transactions are fraudulent.
2. Online ads get only 1 to 3 clicks per 10,000 impressions.
3. The mobile phone account non-payment rate is 12%.
4. An estimated 6% of medical claims are fraudulent.

To reduce data volume without sacrificing model accuracy, it makes sense to reduce the cases in the majority classes (good card transactions, non-click impressions, good mobile phone accounts, normal medical claims) through random sampling. For example, for every 1,000 fraudulent bank card transactions gathered within a certain period, there may be 10 million good transactions from the same period. We can use the approach described in the earlier post More on random sampling in Oracle. However, the final model training data sets do NOT have to be perfectly balanced, i.e., fraudulent and good transactions do NOT have to be 50% and 50%. It is perfectly fine for the training data set to have 80% good transactions and 20% fraudulent ones. For example, we can build two logistic regression models:

Model A. training set: 50% good, 50% fraud
Model B. training set: 80% good, 20% fraud

Models A and B will produce different probabilities of fraud for an individual transaction (the probability produced by Model A will most likely be higher than that produced by Model B). However, the relative ranks of the transactions' fraud probabilities over a whole data set could be the same under both models; for example, the top 1% riskiest transactions identified by both models are the same. In that sense the two models are equivalent. As we can see, in many applications what matters is the relative ranking (of being fraudulent, of non-payment, etc.) produced by the predictive models. The quality/usefulness of the ranking can be depicted by gain charts.
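The rank-preservation claim can be checked directly. For logistic regression, changing the class mix in the training data mainly shifts the model's intercept, which adds a constant to every transaction's log-odds; a constant shift is monotonic, so the ranking is unchanged. A toy sketch under that assumption (the probabilities and the 4x sampling ratio are made up):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    return math.log(p / (1.0 - p))

# Suppose Model A (50/50 training mix) scores three transactions:
p_model_a = {"t1": 0.90, "t2": 0.40, "t3": 0.10}

# Keeping 4x more good transactions (as in an 80/20 mix) shifts the
# intercept by roughly ln(1/4); apply that shift to get Model B's scores.
shift = math.log(1.0 / 4.0)
p_model_b = {t: sigmoid(logit(p) + shift) for t, p in p_model_a.items()}

rank_a = sorted(p_model_a, key=p_model_a.get, reverse=True)
rank_b = sorted(p_model_b, key=p_model_b.get, reverse=True)
print(rank_a == rank_b)  # True: probabilities differ, ranking is identical
```

If calibrated probabilities (not just ranks) are needed, the same intercept correction can be applied in reverse after training on the sampled data.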

Sunday, December 02, 2012

Build a Scorecard in Oracle Using Logistic Regression Model

One of the popular methods of assessing credit risk is a scorecard. To build a scorecard manually, analysts first identify a number of relevant variables, e.g., having bankruptcies in the past, number of open accounts, etc. Each variable is divided into meaningful segments based on its value, and points are assigned to each segment. The credit score is the sum of the points across all the variables. For example, a scorecard may look like the following:

Variable | Segment | Point
Having Bankruptcy | NO | 95
Having Bankruptcy | YES | 15
Number of Open Accounts | from 1 to 3 | 30
Number of Open Accounts | from 4 to 8 | 50
Number of Open Accounts | 9 or more | 40
...................................................

Assume a person's credit history shows a past bankruptcy, 4 open accounts, and other information. His credit score will be calculated as 15 + 50 + ... (points for the remaining variables).
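Scoring against such a table is a simple lookup-and-sum. A sketch using the points from the example above (the data structure is illustrative):

```python
# The scorecard from the table above, as segment -> points lookups.
SCORECARD = {
    "bankruptcy": {"NO": 95, "YES": 15},
    "open_accounts": [((1, 3), 30), ((4, 8), 50), ((9, float("inf")), 40)],
}

def score(has_bankruptcy, n_open_accounts):
    """Sum the points of the segment each variable falls into."""
    total = SCORECARD["bankruptcy"]["YES" if has_bankruptcy else "NO"]
    for (lo, hi), pts in SCORECARD["open_accounts"]:
        if lo <= n_open_accounts <= hi:
            total += pts
            break
    return total

# The person described above: past bankruptcy, 4 open accounts.
print(score(has_bankruptcy=True, n_open_accounts=4))  # 15 + 50 = 65
```

This transparency, every point traceable to a segment, is exactly why scorecards remain popular with business users and regulators.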

Determining the points manually is subjective. A much better way is to build a logistic regression model and use the model coefficients to derive the "points". A scorecard created this way is accurate and yet still provides the transparency of a scorecard. One of the top cell phone service providers actually used our model-derived scorecard to screen new customers.

In example 2 of the post Build Predictive Models Using PL/SQL, we showed how to build a logistic regression model in Oracle. To extract the model coefficients, we simply use the following query:

select attribute_name, attribute_value, coefficient from table(select dbms_data_mining.get_model_details_glm('GLM_MODEL') from dual);
(replace 'GLM_MODEL' with the actual model name)

We may want to apply a transformation to the coefficients so that the score falls within the desired range.
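One conventional way to do this transformation (a common scorecard scaling, not necessarily the one used in the post) is to pick a base score at given odds and a "points to double the odds" (PDO) value, then map log-odds to points linearly. A sketch with illustrative parameters:

```python
import math

# Illustrative scaling: 600 points at odds of 50:1, 20 points to double the odds.
PDO = 20.0
factor = PDO / math.log(2)            # points per unit of log-odds
offset = 600 - factor * math.log(50)  # anchors the scale at 600 points

def points(coefficient):
    """Convert one coefficient (a log-odds contribution) to scorecard points."""
    return round(factor * coefficient)

def total_score(intercept, contributions):
    """contributions: coefficient * value for each active segment
    (e.g. extracted via get_model_details_glm)."""
    log_odds = intercept + sum(contributions)
    return round(offset + factor * log_odds)

# Hypothetical coefficients for one applicant:
print(total_score(intercept=2.0, contributions=[0.5, -0.3]))
```

Because the mapping is linear in log-odds, the scaled score preserves the model's ranking exactly while landing in a range that business users expect.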

Thursday, November 22, 2012

A Real World Case Study: Business Rule vs Predictive Model

The following is a true story that complements the earlier posts Comparison of Business Rules and Predictive Models and Predictive Modeling vs Intuitive Business Rules.

A few years ago, we built a new-customer acquisition model for a cell phone service provider based on its historical application and payment data. The model calculated a risk score for each cell phone service applicant using information found in his/her credit reports: the higher the score, the higher the risk that the customer would not pay his/her bill.

A few weeks after the model went live, we received an angry email from a manager at the client company. In the email, the manager gave a list of applicants who had several bankruptcies. According to the manager, they should have been high-risk customers, yet our model gave them average risk scores. He questioned the validity of the model.

We explained that the model score was based on 20 or so variables, not bankruptcies alone. We also analyzed people with bankruptcies in the data that we used to build the model and found that they paid their bills on time. It might be that people with bankruptcies are more mobile and thus depend more on cell phones for communication. They may not be good candidates for a mortgage, but from a cell phone service provider's perspective, they are good customers.

The bottom line: data-driven predictive models are more trustworthy than intuition-driven business rules.