Wednesday, May 11, 2016

Featured Company: ZSAnalytics

ZSAnalytics is a business partner bringing advanced process capabilities to its customers at the best possible cost resulting in the highest savings. ZSA provides customer’s access to Productized Analytics, advanced Analysis and subject matter expertise resulting in increased operational efficiency and lowest total operating cost. Their solutions and services have saved our clients hundreds of millions. They include:

  • ZSales: 70% increase in response rate
  • ZClaim: 40% reduction in insurance claim losses
  • ZCheck: 70% reduction in charge off
  • ZCredit: 50% reduction in delinquency rate 
In the following youtube video clips, ZSAnalytics' CEO Chris Sandusky is explaining how ZSA's solutions lower the overall business cost and the advantages of predictive models over intuitive rules in fraud detection.

Lower the Overall Business Cost

 


Predictive Models vs Intuitive Rules 

    Sunday, March 06, 2016

    the Danger of Predictive Model Overfitting

    In the post a Young Data Scientist- Kaggle Competition Top 5% Winner: Yuyu Zhou, Yuyu talks about the important role of feature engineering, i.e, finding good derived variables, and gradient boosting trees in their success. He also tells me a very interesting observations on Kaggle Competition ranking.

    "After participating teams finished building their predictive models, they apply their models to two data sets to generate predictions: a smaller set containing target variable and a larger data set where the target variable is removed. Each participating team's model is temporarily ranked based on the result on the smaller data set with the target variable. After the competition's deadline is due, Kaggle will calculate the final ranking of each team based on a model's prediction on the larger data set. "
    "It is interesting to see that the rankings of some top 1% models based on the smaller data set drop more than 20% on the larger data set. I figure out what might cause the huge discrepancies in their model performance. Those teams' models fit the smaller data set so well that they lose their capability to generalize. It is a typical overfitting problem."
    It is important to avoid model overfitting. After all, a predictive model is only useful if it can generalize, i.e., able to handle new data reasonably well.

    Saturday, March 05, 2016

    a Young Data Scientist- Kaggle Competition Top 5% Winner: Yuyu Zhou


    Yuyu Zhou is a graduate student in Analytics in University of New Hampshire. His team has achieved the top 3% and 5% in two Kaggle prediction competitions respectively. In an interview, I asked him how their predictive models performed so well. Yuyu said,

    "One of the keys to the success is that we spend tremendous amount of time working on building feature variables. Those variables are usually the results of combining several raw variables. For example, the ratio between the body weight and height is a better variable in predicting a patient's health than using body weight or height alone."
    "My training in computer science is extremely helpful in these projects. I am able to write Java, Python and SQL scripts to perform tasks such as data cleansing, data merge, and data transform, etc. As we know, more than 80% of time in a project is typically spent on those tasks before we start building predictive models."
    "We have tried many type of predictive models and found that gradient boosting trees have consistently perform the best."

    The following is a summary of Yuyu's contribution in those two projects.

    Kaggle Competition: Rossmann Store Sales Prediction (ranked top 5%) Oct 2015 – Dec 2015

    • Built the Predictive Model for daily sales for Rossmann Stores using Python Machine Learning library.
    • Conducted data cleaning and feature engineering for increasing data quality.
    • Designed final prediction model by combining the multiple gradient boosting trees algorithms
    • Prediction accuracy was ranked at 163 out of 3303 teams

    Kaggle Competition: Property Risk Level Prediction (ranked top 3%) July 2015 – Aug 2015

    • Developed Statistics models to predict risk level of properties which Liberty Mutual Inc is going to protect.
    • Led the team and conducted cost and benefit analysis on new ideas.
    • Implemented ideas using statistical packages from Python.
    • Prediction accuracy was ranked at 71 out of 2236 teams.
    Yuyu is currently looking for a full time job in data analytics. Please feel free to contact him if you are hiring. He can be reached by email yuyu.zhou@hotmail.com or phone (508) 933-7311. Here is his LinkedIn Profile.

    Friday, March 04, 2016

    Find Tablespaces for Oracle Tables

    We may run the following query from SYS to find the tablespace names for all tables, including IOTs (index organized table) and partitioned tables (replacing 'DMUSER' with your user name).

    select  u.name owner, o.name table_name,t.name tablspace_name 
    from
     obj$ o,  ts$ t, sys_objects s, user$ u
    where o.obj#=s.object_id and s.ts_number=t.ts#
          and o.owner#=u.user# 
          and o.type#=2
          and u.name='DMUSER'
    order by 1,2;
    OWNER     TABLE_NAME              TABLSPACE_NAME
    -------------------------------- -------------------------------- 
    DMUSER     AI_EXPLAIN_OUTPUT         TBS_1
    DMUSER     AR_SH_SAMPLE_SETTINGS     TBS_1
    DMUSER     DM$P0AR_SH_SAMPLE         TBS_1
    DMUSER     DM$P0AR_SH_SAMPLE_2COL    TBS_1
    DMUSER     DM$P0EM_SH_CLUS_SAMPLE    TBS_1
    DMUSER     DM$P0NB_SH_CLAS_SAMPLE    TBS_1
    DMUSER     DM$P0OC_SH_CLUS_SAMPLE    TBS_1
    DMUSER     DM$P1EM_SH_CLUS_SAMPLE    TBS_1
    DMUSER     DM$P1NB_SH_CLAS_SAMPLE    TBS_1
    DMUSER     DM$P1OC_SH_CLUS_SAMPLE    TBS_1
    

    Wednesday, March 02, 2016

    a Random Number Generator

    Rand Number Generator

    Wednesday, February 17, 2016

    Dr. Jiang Zhou's Presentation at UNH Analytics

    I shared my analytics experience with a group of students and professors from University of New Hampshire Analytics on February 25. I really enjoyed the interaction with professors and students. The following is the news on UNH Graduate Analytics LinkedIn page.

    UNH Graduate Analytics would like to thank Dr. Jiang Zhou for his wonderful presentation this morning, which was part of our Analytics Speaker Series. 

    Dr. Zhou highlighted the advantages and disadvantages of using statistical models vs. intuition to solve business problems throughout his 10+ years of experience. Among his many accomplishments, Dr. Zhou built a fraud detection model for large banks, which detects 70% more fraud than using business intuition rules alone.

    Free Data Sources

    Bernard Marr wrote a post Big Data: 33 Brilliant And Free Data Sources For 2016. He gives a list of free data sources provided by governments, social websites and companies including Google and Facebook. The data types include census, climate, financial, health, news, etc. I have found the list very interesting.

    Friday, December 25, 2015

    Convert Values Using Decode Function

    We use decode() function to convert a value to other values. We have the following table.

    SQL> select * from tbl_test order by id;
    
         ID MESSAGE
    ------- --------------------------------
          1 A
          2 B
          3 C
          4 D
    
    The following query transforms characters values into numbers or weights.
    SQL> select a.*, decode(message, 'A', 0.3, 'B',0.4, 0.35) weight from tbl_test a order by id;
    
         ID MESSAGE                              WEIGHT
    ------- -------------------------------- ----------
          1 A                                        .3
          2 B                                        .4
          3 C                                       .35
          4 D                                       .35
    
    In the above query, A is mapped to 0.3, B to 0.4. All other values are mapped to the default value of 0.35. If the default value is not defined, other characters where the mapping is not defined will be given a NULL value.
    SQL> select a.*, decode(message, 'A', 0.3, 'B',0.4) weight from tbl_test a order by id;
    
         ID MESSAGE                              WEIGHT
    ------- -------------------------------- ----------
          1 A                                        .3
          2 B                                        .4
          3 C
          4 D
    
    I have used decode function to convert categorical values into weights that are derived from a logistic regression model. For example, I used the following query to convert transaction code into weights. Before decode function is applied, CARD_TXN_CDE is cleaned up using trim, substr and nvl functions. The trim function removes leading and trailing blanks. The substr function extracts the first 18 characters of the string. The nvl function converts NULL values to blank.
    decode(nvl(substr( trim(CARD_TXN_CDE),1,18),' '),
    ' ',-.660841,
    '01',-.518927,
    '1',-.076546,
    '10',-.294631,
    '12',.077699,
    '14',-.709884,
    '25',-.30619,
    '40',.021855,
    '51',-.004593,
    '52',-.069521,
    '53',-.000344,
    '57',.570421,
    '59',.858444,
    '67',1.481654,
    '81',.29988,
    '91',-.004755,
    '96',.02628,
    'AD',-.001036,
    'AK',-.150162,
    'CB',-.001588,
    'D5',.364975,
    'G',-.015795,
    'H',-.274374,
    'I',-1.065177,
    'J',-.027991,
    'N',.157622,
    'NK',.763406,
    'NR',.080558,
    'P',-.961133,
    'PE',-.19558,
    'PH',.479081,
    'PR',.134741,
    'S',-.239287,
    'SV',.475934,
    'T1',.241061,
    'T2',-.277572,
    'T3',.901487,
    'T4',-.0137,
    'TS',.01362,
    'U2',-.186914,
    'X',1.301152,
    'XX',-.11462,
    'Z0',-.7141,
    'Z4',-.004642,
    'Z6',-.014541,
    0)