Friday, December 21, 2012
The First Principle of Data Analytics Is to Avoid Making Mistakes
In a previous post, How to build predictive models that win competitions, we showed examples of actual mistakes made by statisticians and analysts. Delivering something wrong can cause irrecoverable damage to an analyst's credibility. Mistakes are particularly likely when data or processes flow from one system to another:

1. When data files are provided by clients. Customers make mistakes too. Their database people may use the wrong scripts and fail to extract the data that we want.

2. When data are loaded from text files into databases or other systems. A lot can go wrong during data loading. Issues with missing value representation, number formats, date formats, unrecognized characters, etc. will surface. Thus, if possible, avoid using text files to store data.

3. When reports are generated. Many analysts perform data analytics work using tools like SAS, R, SPSS, SPlus, Oracle Data Mining, etc. They then summarize the results manually in Excel spreadsheets and build the final PowerPoint presentation. Things can go wrong here: decimal points end up in the wrong places, and dates are not updated.

4. When models are deployed into production systems. There are a lot of moving parts involved here. It is a challenge to precisely replicate the offline data analytics process in a different production system.
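When text files cannot be avoided, the loading problems described in point 2 can at least be made visible with explicit checks at load time. The following is a minimal sketch in Python, not a complete loader. The column names (id, amount, order_date), the set of missing-value tokens, the comma as thousands separator, the date format, and the function names are all assumptions made for illustration. The idea is to parse every field explicitly, collect the rows that fail instead of silently dropping or coercing them, and compare the row count against the count the data provider reported.

```python
import csv
import io
from datetime import datetime

# Assumed tokens that a source system might use to mean "missing".
MISSING_TOKENS = {"", "NA", "N/A", "NULL", "null", "."}


def clean(value):
    """Return None for missing values, otherwise the stripped string."""
    if value is None:  # csv.DictReader uses None for fields absent from a short row
        return None
    value = value.strip()
    return None if value in MISSING_TOKENS else value


def parse_row(row, date_format="%Y-%m-%d"):
    """Convert one raw row to typed values; raise ValueError on bad formats."""
    raw_amount = clean(row["amount"])
    raw_date = clean(row["order_date"])
    # Assumption: commas are thousands separators, e.g. "1,234.50".
    amount = None if raw_amount is None else float(raw_amount.replace(",", ""))
    order_date = (
        None if raw_date is None
        else datetime.strptime(raw_date, date_format).date()
    )
    return {"id": clean(row["id"]), "amount": amount, "order_date": order_date}


def load_and_check(text, expected_rows):
    """Parse CSV text; return (good_rows, errors) without hiding failures."""
    rows, errors = [], []
    reader = csv.DictReader(io.StringIO(text))
    # Line 1 is the header, so data starts at line 2 (one line per record assumed).
    for line_no, row in enumerate(reader, start=2):
        try:
            rows.append(parse_row(row))
        except ValueError as exc:
            errors.append((line_no, str(exc)))
    seen = len(rows) + len(errors)
    if seen != expected_rows:
        errors.append((None, f"expected {expected_rows} rows, read {seen}"))
    return rows, errors


sample = """id,amount,order_date
1,"1,234.50",2012-12-21
2,NA,2012-12-20
3,12.5,21/12/2012
"""

rows, errors = load_and_check(sample, expected_rows=3)
for line_no, message in errors:
    print(f"line {line_no}: {message}")
```

In this sample, the third data row uses a day/month/year date, so it is reported as an error on line 4 rather than being loaded with a wrong or empty date. The same pattern applies when the target is a database: run the checks before the insert and reconcile row counts afterwards.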