Tuesday, May 14, 2013

How Are the 10 Most Influential People in Data Analytics Selected?

I was asked by Gregory at KDnuggets about how I came up with "the 10 Most Influential People in Data Analytics" in my last blog post. The following was my reply.

To select the 10 most influential people in data analytics, I took the following considerations into account regarding an individual's contribution.

  1. The contribution is significant.
  2. The contribution is active/regular.
  3. A large number of people are impacted by the contribution.
  4. The focus is on the non-academic field.

I performed online research first to find qualified people. KDnuggets.com, social networking websites, data analytics conferences, consulting firms, and Amazon are just a few examples of good sources of information. I also took advantage of my own network. Having been in the industry for 15 years, I know many great data analytics professionals. They provided me with many names of qualified people. After I compiled a preliminary list, I sent it to a number of experts for their feedback. The final list was the result of several iterations.

Monday, May 13, 2013

10 Most Influential People in Data Analytics

We have identified the 10 most influential people whose significant contributions have greatly enriched the data analytics community. This is the result of months of research. The following is the list (in alphabetical order by last name).


Dean Abbott Michael Berry Tom Davenport John Elder Rayid Ghani
Anthony Goldbloom Vincent Granville Gregory Piatetsky-Shapiro Karl Rexer Eric Siegel



Dean Abbott is President of Abbott Analytics, Inc. in San Diego, California. Mr. Abbott is an internationally recognized data mining and predictive analytics expert with over two decades of experience applying advanced data mining algorithms, data preparation techniques, and data visualization methods to real-world problems, including fraud detection, risk modeling, text mining, personality assessment, response modeling, survey analysis, planned giving, and predictive toxicology. He is also Chief Scientist of SmarterRemarketer, a startup company focusing on behaviorally- and data-driven marketing attribution and web analytics.

Mr. Abbott is a highly regarded and popular speaker at Predictive Analytics and Data Mining conferences, including Predictive Analytics World, Predictive Analytics Summit, the Predictive Analytics Center of Excellence, SAS Institute, DM Radio, and INFORMS.

He has served on the program committees for the KDD Industrial Track and Data Mining Case Studies workshop and is on the Advisory Boards for the UC/Irvine Predictive Analytics Certificate and the UCSD Data Mining Certificate programs. Mr. Abbott has taught applied data mining and text mining courses using IBM SPSS Modeler, Statsoft Statistica, Salford Systems SPM, SAS Enterprise Miner, Tibco Spotfire Miner, IBM Affinium Model, Megaputer Polyanalyst, KNIME, and RapidMiner.


Michael Berry is a recognized authority on business applications of data mining. He is the author (with Gordon Linoff) of several well-regarded books in the field, including Data Mining Techniques for Marketing, Sales, and Customer Relationship Management, which is now in its third edition.

He is currently responsible for analytics and business intelligence for the business-to-business side of Tripadvisor (www.tripadvisor.com). Mr. Berry is a co-founder of Data Miners, Inc. (www.data-miners.com), a consultancy specializing in the analysis of large volumes of data for marketing and CRM purposes.


Tom Davenport is a Visiting Professor at Harvard Business School. He is also the President’s Distinguished Professor of Information Technology and Management at Babson College, the co-founder of the International Institute for Analytics, and a Senior Advisor to Deloitte Analytics. He has published on the topics of analytics in business, process management, information and knowledge management, and enterprise systems. He pioneered the concept of “competing on analytics” with his best-selling 2006 Harvard Business Review article (and his 2007 book by the same name). His most recent book is Keeping Up with the Quants: Your Guide to Understanding and Using Analytics, with Jinho Kim. He wrote or edited fifteen other books, and over 100 articles for Harvard Business Review, Sloan Management Review, the Financial Times, and many other publications. In 2003 he was named one of the world’s “Top 25 Consultants” by Consulting magazine. In 2005 Optimize magazine’s readers named him among the top 3 business/technology analysts in the world. In 2007 and 2008 he was named one of the 100 most influential people in the IT industry by Ziff-Davis magazines. In 2012 he was named one of the world’s top fifty business school professors by Fortune magazine.


John Elder founded and leads America’s largest and most experienced data mining consultancy. Founded in 1995, Elder Research (http://www.datamininglab.com) has offices in Charlottesville Virginia and Washington DC and has solved projects in a huge variety of areas of mining data, text, and links. Dr. Elder co-authored 3 books (on practical data mining, ensembles, and text mining), two of which won PROSE awards for top book of the year in Mathematics or Computer Science. John has authored some data mining tools, was one of the discoverers of ensemble methods, has chaired international conferences, and is a frequent keynote speaker. He’s probably best known for explaining complex analytic concepts with clarity, humor, and enthusiasm.

Dr. Elder has degrees in Engineering (Systems PhD, UVA + Electrical Masters & BS, Rice) and is an occasional Adjunct Professor at UVA. He was honored to be named by President Bush to serve 5 years on a panel to guide technology for national security. Lastly, John is grateful to be a follower of Christ and the father of five.


Rayid Ghani is currently at the Computation Institute and the Harris School of Public Policy at the University of Chicago. Rayid is also the co-founder of Edgeflip, an analytics startup building social media analytics products that allow non-profits and social good organizations to better use social networks to raise money, recruit, engage, and mobilize volunteers, and do targeted outreach and advocacy.

Rayid Ghani was the Chief Scientist at Obama for America 2012 campaign focusing on analytics, technology, and data. His work focused on improving different functions of the campaign including fundraising, volunteer, and voter mobilization using analytics, social media, and machine learning. Before joining the campaign, Rayid was a Senior Research Scientist and Director of Analytics research at Accenture Labs where he led a technology research team focused on applied R&D in analytics, machine learning, and data mining for large-scale & emerging business problems in various industries including healthcare, retail & CPG, manufacturing, intelligence, and financial services.

In addition, Rayid serves as an adviser to several analytics start-ups, is an active speaker, organizer of, and participant in academic and industry analytics conferences, and publishes regularly in machine learning and data mining conferences and journals.


Anthony Goldbloom is the founder and CEO of Kaggle. Before founding Kaggle, Anthony worked in the macroeconomic modeling areas of the Reserve Bank of Australia and before that the Australian Treasury.

He holds a first class honours degree in economics and econometrics from the University of Melbourne and has published in The Economist magazine and the Australian Economic Review.

In 2011, Forbes Magazine cited Anthony as one of the 30 under 30 in technology and Fast Company featured him as one of the innovative thinkers who are changing the future of business.


Dr. Vincent Granville is a visionary data scientist with 15 years of big data, predictive modeling, digital and business analytics experience. Vincent is widely recognized as the leading expert in scoring technology, fraud detection and web traffic optimization and growth. Over the last ten years, he has worked in real-time credit card fraud detection with Visa, advertising mix optimization with CNET, change point detection with Microsoft, online user experience with Wells Fargo, search intelligence with InfoSpace, automated bidding with eBay, click fraud detection with major search engines, ad networks and large advertising clients.

Most recently, Vincent launched Data Science Central, the leading social network for big data, business analytics and data science practitioners. Vincent is a former post-doctorate of Cambridge University and the National Institute of Statistical Sciences. He was among the finalists at the Wharton School Business Plan Competition and at the Belgian Mathematical Olympiads. Vincent has published 40 papers in statistical journals and is an invited speaker at international conferences. He also developed a new data mining technology known as hidden decision trees, owns multiple patents, published the first data science book, and raised $6MM in start-up funding. Vincent is one of the top 20 big data influencers according to Forbes, was featured on CNN, and is #1 in Gil Press' A-List of data scientists.


Gregory Piatetsky-Shapiro, Ph.D. (@kdnuggets) is the Editor of KDnuggets.com, a leading site for Analytics, Big Data, Data Mining, and Data Science. He is also a well-known expert and an independent consultant in this field. Previously, he led data mining teams at GTE Laboratories, and was a Chief Scientist for two start-ups. He has extensive experience in applying analytic and data mining methods to many areas (including customer modeling, healthcare data analysis, fraud detection, bioinformatics and Web analytics) and has worked for a number of leading banks, insurance companies, telcos, and pharmaceutical companies.

He coined the terms “KDD” and “Knowledge Discovery in Data” when he organized and chaired the first three KDD workshops. He later helped grow the workshops into ACM Conf. on Knowledge Discovery and Data Mining (kdd.org), the top research conference in the field. Dr. Piatetsky-Shapiro is also a co-founder of ACM SIGKDD, the leading professional organization for Knowledge Discovery and Data Mining and served as the Chair of SIGKDD (2005-2009). He received ACM SIGKDD and IEEE ICDM Distinguished Service Awards. He has over 60 publications with over 10,000 citations.


Karl Rexer, PhD is President of Rexer Analytics (www.RexerAnalytics.com). Founded in 2002, Rexer Analytics has delivered analytic solutions to dozens of companies. Solutions include fraud detection, customer attrition analysis and prediction, advertisement abandonment prediction, direct mail targeting, market basket analysis and survey research. Rexer Analytics also conducts and freely distributes the widely read Data Miner Survey. The survey has been written about and cited in over 12 languages. In the spring of 2013, over a thousand analytic professionals from around the world participated in the 6th Data Miner Survey.

Karl has served on the organizing and review committees of several international conferences (e.g., KDD), and is on the Board of Directors of Oracle's Business Intelligence, Warehousing, & Analytics (BIWA) Special Interest Group. He has served on IBM's Customer Advisory Board, is an Industry Advisor for Babson College's Business Analytics program, and is in the #1 position on LinkedIn's list of Top Predictive Analytics Professionals. He is frequently an invited speaker and moderator at conferences and universities. So far in 2013 he has conducted data mining trainings in California, China and London. Prior to founding Rexer Analytics, Karl held leadership and consulting positions at several consulting firms and two multi-national banks.


Eric Siegel, PhD, founder of Predictive Analytics World and Text Analytics World, author of Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die, and Executive Editor of the Predictive Analytics Times, makes the how and why of predictive analytics understandable and captivating. Eric is a former Columbia University professor who used to sing educational songs to his students, and a renowned speaker, educator and leader in the field.

Monday, April 22, 2013

Find the Most Important Variables In Predictive Models

A commonly used method for determining the most important variables is to examine how well each variable individually predicts the target variable. However, this approach has its limits. The first limit is that it can exclude variables that are actually good ones.

For example, to find out whether credit score is a good indicator of account default, we calculate the default rate for each credit class as shown below (or we may perform a statistical test such as a Chi-Square test). As we can see, the low credit score class has a default rate of 18% vs. 6% for the high credit score class. Thus, credit class is considered a good variable for building a default model.

Credit score and account default.
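
As a rough illustration of this single-variable check, the sketch below computes the default rate per credit class and a Pearson chi-square statistic by hand. The account counts are hypothetical, chosen only so that the rates match the 18% and 6% quoted above; they are not the data behind the original table.

```python
# Hypothetical counts per credit class: (number of accounts, number of defaults).
# Chosen only to reproduce the 18% and 6% default rates mentioned in the text.
accounts = {"low": (1000, 180), "high": (1000, 60)}

default_rate = {cls: d / n for cls, (n, d) in accounts.items()}


def chi_square(table):
    """Pearson chi-square statistic for a table given as rows of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    return stat


# Rows: credit class; columns: (defaulted, did not default).
observed = [[d, n - d] for n, d in accounts.values()]
chi2 = chi_square(observed)

print(default_rate)
print(round(chi2, 2))
```

For a 2x2 table there is one degree of freedom, so a statistic above roughly 3.84 would be significant at the 5% level.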

However, this approach has its limit because it does not take the relationships among variables into consideration. This can be illustrated using an imaginary example. We want to assess whether the height and weight of people are indicative of getting a disease. We can calculate the following tables for height and weight, respectively. Since short and tall people have the same percentage of sick people (2%), we may conclude that height is not relevant to predicting the disease. Similarly, we may also think weight is not important.

Height and Disease

Weight and Disease

If we examine weight and height at the same time, we can develop the following matrix. There are four groups of people: tall/heavy (normal), short/light (normal), tall/light (abnormal), and short/heavy (abnormal). 11% of short/heavy or tall/light people are sick (orange cells), while the percentage of sick people in the tall/heavy and short/light groups (green cells) is only 0.1%. Thus, height and weight are very good variables to include in a predictive model.

Height/Weight and Disease


As we see, this approach may exclude variables that are actually good. When we determine the most important variables for building a predictive model, ideally we should take a set of variables as a whole into consideration. More often than not, it is the relationships between variables that provide the best predictive power. How to find or generate the most useful variables for predictive models is so crucial that we will talk more about it in upcoming blog posts.
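
The effect can be reproduced with a few lines of Python. This is only a sketch: the group sizes and sick counts are hypothetical, picked so that the joint rates match the 0.1% and 11% in the imaginary example. With these invented counts the single-variable rate comes out at 1.19% rather than the 2% used above, but the point is the same: each variable looks useless alone, and the pair is highly informative.

```python
# Hypothetical (people, sick) counts per (height, weight) group; not real data.
groups = {
    ("tall", "heavy"): (9000, 9),      # normal build, 0.1% sick
    ("short", "light"): (9000, 9),     # normal build, 0.1% sick
    ("tall", "light"): (1000, 110),    # abnormal build, 11% sick
    ("short", "heavy"): (1000, 110),   # abnormal build, 11% sick
}


def sick_rate(selector):
    """Sick rate over all groups whose (height, weight) key satisfies selector."""
    people = sum(n for key, (n, _) in groups.items() if selector(key))
    sick = sum(s for key, (_, s) in groups.items() if selector(key))
    return sick / people


# One variable at a time: every class has the same sick rate.
by_height = {h: sick_rate(lambda k, h=h: k[0] == h) for h in ("tall", "short")}
by_weight = {w: sick_rate(lambda k, w=w: k[1] == w) for w in ("heavy", "light")}

# Both variables together: the rates differ by two orders of magnitude.
joint = {key: s / n for key, (n, s) in groups.items()}

print(by_height)
print(by_weight)
print(joint)
```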

Sunday, April 07, 2013

Logistic Regression Model Implemented in SQL

In a project, we needed to deploy a logistic regression model into a production system that only takes SQL scripts as its input. Two functions come in handy: decode() and nvl(). decode() converts a categorical value into a weight, and nvl() conveniently replaces NULL with a desired value. The following SQL script is similar to what we delivered.



-- score = 1 / (1 + exp(-z)), where z is the linear predictor of the logistic regression
select transaction_id,
  (1/
    (1+
      exp(-(
        -- numeric inputs: a NULL amount contributes 0
        nvl(AMT1*.000019199,0)+
        nvl(AMT3*(-.00002155),0)+
        -- categorical inputs: decode() maps each code to its weight (0 if unlisted)
        decode(nvl(substr((TXN_CODE1),1,18),' '),
          'XX',-.070935,
          '57',-.192319,
          '1',-.053794,
          '81',-.010813,
          'NR',-.079628,
          'PD',-.102987,
          'P',-1.388433,
          'Z6',-.106081,
          '01',-1.1528,
          'Z4',-.004237,
          'T1',.697737,
          'AK',-.490381,
          'U2',.063712,
          'NK',.054354,
          'PR',.205336,
          '51',-.286213,
          'N',.075582,
          ' ',-.330585,
          0)+
        decode(nvl(substr( trim(TXN_CODE2),1,18),' '),
          'U',-.11176,
          0)+
        decode(nvl(substr( trim(TXN_CODE3),1,18),' '),
          '1',-.642605,
          0)+
        decode(nvl(substr( trim(TXN_CODE4),1,18),' '),
          '00',-.084517,
          '10',.057248,
          0)
        -- intercept
        -6.8190776
      ))
    )
  ) as score
from tbl_data;
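
To check a script like this before handing it over, it can help to reproduce the scoring logic outside the database. The Python sketch below mirrors the nvl() and decode() handling for a subset of the terms. The coefficients are copied from the SQL above, but the TXN_CODE1 weights are left out for brevity, so its scores will not match the full SQL. The function names `clean_code` and `score` are my own, and I am assuming Oracle-style behavior where trimming an all-blank string yields NULL.

```python
import math

# Coefficients copied from the SQL above; TXN_CODE1 is omitted for brevity.
INTERCEPT = -6.8190776
AMOUNT_COEF = {"AMT1": 0.000019199, "AMT3": -0.00002155}
CODE_WEIGHTS = {
    "TXN_CODE2": {"U": -0.11176},
    "TXN_CODE3": {"1": -0.642605},
    "TXN_CODE4": {"00": -0.084517, "10": 0.057248},
}


def clean_code(value):
    """Mimic nvl(substr(trim(x),1,18),' '): trim, keep 18 chars, empty/None -> ' '."""
    if value is None:
        return " "
    trimmed = value.strip(" ")[:18]
    return trimmed if trimmed else " "


def score(row):
    """Return 1 / (1 + exp(-z)) for a dict keyed by the column names above."""
    z = INTERCEPT
    for name, coef in AMOUNT_COEF.items():
        amount = row.get(name)
        # Equivalent of nvl(AMT * coef, 0): a missing amount adds nothing.
        z += coef * amount if amount is not None else 0.0
    for name, weights in CODE_WEIGHTS.items():
        # Equivalent of decode(..., 0): codes without a weight add nothing.
        z += weights.get(clean_code(row.get(name)), 0.0)
    return 1.0 / (1.0 + math.exp(-z))


print(score({"AMT1": 2500.0, "AMT3": None, "TXN_CODE2": " U ", "TXN_CODE4": "10"}))
```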

Thursday, March 07, 2013

Watch out for NULL values when comparing data

It is a very common task to compare data values. For example, I was involved in a project where we upgraded the scoring engine. We wanted to make sure the old and new scoring engines produced the same outputs given the same inputs. I use the following table to illustrate the problem. We want to make sure value_old and value_new are the same. (The blanks are NULL values.)

        ID  VALUE_OLD  VALUE_NEW
---------- ---------- ----------
         1        234
         2                   567
         3        789        789

If we simply use the following query to count the number of discrepancies, it returns zero. This is not what we expect.
select count(*) from tbl_data_a where VALUE_OLD <> VALUE_NEW;

  COUNT(*)
----------
         0
This is because a comparison involving NULL evaluates to unknown rather than true, so rows with a NULL on either side are filtered out by the WHERE clause.

A better approach is to write a query that considers all of the following five situations.

In the following cases, VALUE_OLD and VALUE_NEW are the same.
1. VALUE_OLD is null, and VALUE_NEW is null.
2. VALUE_OLD is not null, VALUE_NEW is not null, and VALUE_OLD = VALUE_NEW.

In the following cases, VALUE_OLD and VALUE_NEW are different.

3. VALUE_OLD is null, and VALUE_NEW is not null.
4. VALUE_OLD is not null, and VALUE_NEW is null.
5. VALUE_OLD is not null, VALUE_NEW is not null, and VALUE_OLD <> VALUE_NEW.
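
One way to turn cases 3 to 5 into a query is sketched below. To keep it runnable anywhere, it uses an in-memory SQLite database from Python as a stand-in for the real database. The table name and the three rows come from the example above; the variable names and the exact WHERE clause are my own suggestion.

```python
import sqlite3

# In-memory SQLite table holding the three example rows (None becomes NULL).
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table tbl_data_a (id integer, value_old integer, value_new integer)"
)
conn.executemany(
    "insert into tbl_data_a values (?, ?, ?)",
    [(1, 234, None), (2, None, 567), (3, 789, 789)],
)

# The naive comparison silently drops every row that has a NULL on either side.
naive_sql = "select count(*) from tbl_data_a where value_old <> value_new"

# NULL-aware comparison covering the three "different" cases.
null_safe_sql = """
select count(*) from tbl_data_a
where (value_old is null and value_new is not null)      -- case 3
   or (value_old is not null and value_new is null)      -- case 4
   or (value_old <> value_new)                           -- case 5
"""

naive_count = conn.execute(naive_sql).fetchone()[0]
null_safe_count = conn.execute(null_safe_sql).fetchone()[0]
print(naive_count, null_safe_count)
```

On the example table the naive query counts 0 discrepancies, while the NULL-aware query counts 2 (rows 1 and 2).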

Monday, March 04, 2013

Custom model score vs credit bureau score

In the previous post, we compared the performance of different types of models applied to the same data set. In reality, a more frequently encountered issue is comparing the performance of credit bureau scores and custom model scores. By custom model scores, we mean scores from a model that is built on the client's own historical data. I have done numerous predictive modeling projects in this area across industries. My conclusion is that custom model scores are almost always better than generic credit bureau scores by large margins. The following gain charts are from a real project. For example, if we reject the worst 20% of customers based on their bureau scores, we can stop 26% of the loan defaults. If we reject the worst 20% of customers based on custom model scores, almost 40% of the loan defaults can be stopped. This can easily translate into big savings for a company with a large number of customers.



The patterns are common. Thus, it is worthwhile to build a custom model that will almost always outperform generic credit bureau scores as long as the client has enough historical data.
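
For readers who want to compute a point on a gain chart themselves, here is a minimal sketch. The function name and the ten sample customers are hypothetical and are not the project data. It assumes scores are oriented so that a higher value means a riskier customer; a bureau score where higher means safer would need to be negated first.

```python
def captured_default_share(risk_scores, defaulted, reject_fraction):
    """Share of all defaults found among the riskiest `reject_fraction` of customers.

    risk_scores: higher means riskier (an assumption of this sketch).
    defaulted:   1 if the customer defaulted, 0 otherwise.
    """
    ranked = sorted(zip(risk_scores, defaulted), key=lambda p: p[0], reverse=True)
    n_reject = int(round(len(ranked) * reject_fraction))
    caught = sum(flag for _, flag in ranked[:n_reject])
    total = sum(defaulted)
    return caught / total if total else 0.0


# Hypothetical sample: ten customers, four of whom defaulted.
risk = [0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10, 0.05]
bad = [1, 1, 0, 1, 0, 0, 0, 1, 0, 0]

print(captured_default_share(risk, bad, 0.20))
```

Running the same function with the bureau score and with the custom model score on the same customers gives the two curves being compared.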

Saturday, February 16, 2013

The Comparison of Different Models

In previous posts, we mentioned that a great deal of time should be spent on understanding data and building feature variables that are truly relevant to the target variable. The next question is: which predictive models should we use? There are many types of models to choose from. For example, for classification problems, the models we can use include CART, logistic regression, SVM, neural nets, k-nearest neighbors, Bayesian classification models, ensemble models, etc. Most of my PhD dissertation is dedicated to the empirical comparison of different models. In the commercial world, I have sometimes applied different models to solve the same problem. The following lift charts are the actual results for models that predict direct mail response. The models I tested include gradient boosting trees, CART, logistic regression, and a simple cell (or cube) model. The cell (or cube) model divides the training data into many cubes and calculates the response rate for each cube. The predicted response rate for a new data point is that of the cube where it is located.
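
The cell (or cube) model is simple enough to sketch in a few lines. The version below is an illustration, not the model used in the project: the class name, the fixed bin widths, and the fallback to the overall response rate for cubes with no training data are all my own assumptions.

```python
from collections import defaultdict


class CellModel:
    """Predict the response rate of the cube (cell) a data point falls into."""

    def __init__(self, bin_widths):
        self.bin_widths = bin_widths               # one cube width per feature
        self.cells = defaultdict(lambda: [0, 0])   # cube -> [responders, total]
        self.overall = [0, 0]                      # [responders, total] overall

    def _cell(self, x):
        # Integer cube index along each feature axis.
        return tuple(int(v // w) for v, w in zip(x, self.bin_widths))

    def fit(self, X, y):
        for x, label in zip(X, y):
            stats = self.cells[self._cell(x)]
            stats[0] += label
            stats[1] += 1
            self.overall[0] += label
            self.overall[1] += 1
        return self

    def predict(self, x):
        stats = self.cells.get(self._cell(x))
        if stats is None:
            # Assumption: an unseen cube falls back to the overall response rate.
            return self.overall[0] / self.overall[1]
        return stats[0] / stats[1]


# Hypothetical training data: (age, income) and whether the person responded.
X = [(25, 30000), (27, 32000), (45, 80000), (47, 85000)]
y = [1, 0, 0, 0]
model = CellModel(bin_widths=(10, 50000)).fit(X, y)

print(model.predict((22, 10000)))
```

In practice the cube widths matter a great deal: cubes that are too small hold too few training points to give a stable response rate.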


As we can see from the above lift charts, gradient boosting trees perform the best. Logistic regression and CART are almost the same. However, all the models, while varying greatly in structure and sophistication, perform satisfactorily on the testing data set.
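
For reference, the numbers behind a lift chart can be computed as sketched below. The function name and the eight sample records are hypothetical. The sketch assumes there are at least as many records as buckets and at least one responder overall.

```python
def lift_by_bucket(scores, responded, n_buckets=10):
    """Lift per bucket: bucket response rate divided by overall response rate.

    Records are ranked by model score (highest first) and cut into n_buckets
    groups of near-equal size.
    """
    ranked = [
        flag
        for _, flag in sorted(
            zip(scores, responded), key=lambda p: p[0], reverse=True
        )
    ]
    overall_rate = sum(ranked) / len(ranked)
    size = len(ranked) / n_buckets
    lifts = []
    for b in range(n_buckets):
        bucket = ranked[int(round(b * size)):int(round((b + 1) * size))]
        lifts.append((sum(bucket) / len(bucket)) / overall_rate)
    return lifts


# Hypothetical test set: eight scored records, four of which responded.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30]
responded = [1, 1, 1, 0, 0, 1, 0, 0]

print(lift_by_bucket(scores, responded, n_buckets=4))
```

A lift of 2.0 in the first bucket means the top-scored group responds at twice the overall rate; a model with no predictive power would hover around 1.0 in every bucket.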

However, in reality the selection of models should not be based solely on the model's predictive accuracy. Other important considerations include how hard it is to deploy the model into a production system, computational efficiency, memory usage, and whether the model can give a reason for its prediction. It is completely acceptable to choose a simple model that performs reasonably well. I have seen too many cases where statisticians build great (and sophisticated) models in their lab environments that could not be deployed into the production system. In those cases, the benefits of predictive modeling are never realized.