Tuesday, March 24, 2015

Most Frequently Used SQL Functions by Data Scientist

As a data scientist, I have done many projects for banks, insurance companies, marketing firms, etc. Most of the projects involve building predictive models. I have won three head to head competitions for building best predictive models.

In a typical project, I receive the data, usually in the format for test files, from a client. Then I load them into Oracle databases. I do most of the data manipulation in the database using SQL. As a result of years of work, I have created close to 800 SQL script files under my work directories on my computers. I figure out it would be interesting to see what are the most frequently used SQL functions from a data scientist's perspective. So I perform a simple analysis on my SQL scripts. I have found that the following 45 functions accounts for about 65% of Oracle SQL functions that I have used for various projects. The following table shows ranks of frequencies of those functions as they present in my SQL scripts. As we may see, a typical data analytics project only involves a relatively small number of frequently used SQL functions.

We may have the following observations: 1. Many of the functions are used for data cleanse and data conversion including trim, nvl, decode,to_char, to_number, trunc, substr,etc. This precisely reflects the fact that at least 80% of the work is about data manipulation. 2. Some simple statistics functions include sum, count, min, max, stddev, corr, median. It is convenient to calculate useful statistics using SQL. We may use those SQL functions to produce reports similar to that by SAS UNIVARIATE.

Rank SQL Function Frequency
1  trim 10.03%
2  sum 8.22%
3  nvl 5.93%
4  count 5.67%
5  decode 5.38%
6  to_char 4.63%
7  min 3.53%
8  max 3.33%
9  to_number 2.25%
10  trunc 2.22%
11  row_number 1.97%
12  avg 1.88%
13  round 1.33%
14  to_date 1.20%
15  substr 1.07%
16  sign 0.93%
17  upper 0.77%
19  stddev 0.45%
20  mod 0.35%
21  lower 0.30%
22  exp 0.28%
23  ratio_to_report 0.27%
24  median 0.25%
25  abs 0.23%
26  lag 0.23%
27  dense_rank 0.22%
28  length 0.22%
29  greatest 0.18%
30  sys_guid 0.15%
31  ceil 0.15%
32  chr 0.13%
33  regexp_instr 0.10%
35  percent_rank 0.10%
36  replace 0.10%
37  regexp_replace 0.10%
38  cume_dist 0.08%
39  instr 0.07%
40  rtrim 0.07%
41  width_bucket 0.07%
42  prediction 0.07%
43  corr 0.05%
44  lead 0.03%
45  lpad 0.03%

No comments: