Deep Data Mining Blog: Most Frequently Used SQL Functions by Data Scientist

Tuesday, March 24, 2015

Most Frequently Used SQL Functions by Data Scientist

As a data scientist, I have done many projects for banks, insurance companies, marketing firms, etc. Most of the projects involve building predictive models. I have won three head to head competitions for building best predictive models.

In a typical project, I receive the data, usually in the format for test files, from a client. Then I load them into Oracle databases. I do most of the data manipulation in the database using SQL. As a result of years of work, I have created close to 800 SQL script files under my work directories on my computers. I figure out it would be interesting to see what are the most frequently used SQL functions from a data scientist's perspective. So I perform a simple analysis on my SQL scripts. I have found that the following 45 functions accounts for about 65% of Oracle SQL functions that I have used for various projects. The following table shows ranks of frequencies of those functions as they present in my SQL scripts. As we may see, a typical data analytics project only involves a relatively small number of frequently used SQL functions.

We may have the following observations: 1. Many of the functions are used for data cleanse and data conversion including trim, nvl, decode,to_char, to_number, trunc, substr,etc. This precisely reflects the fact that at least 80% of the work is about data manipulation. 2. Some simple statistics functions include sum, count, min, max, stddev, corr, median. It is convenient to calculate useful statistics using SQL. We may use those SQL functions to produce reports similar to that by SAS UNIVARIATE.

Rank	SQL Function	Frequency
1	trim	10.03%
2	sum	8.22%
3	nvl	5.93%
4	count	5.67%
5	decode	5.38%
6	to_char	4.63%
7	min	3.53%
8	max	3.33%
9	to_number	2.25%
10	trunc	2.22%
11	row_number	1.97%
12	avg	1.88%
13	round	1.33%
14	to_date	1.20%
15	substr	1.07%
16	sign	0.93%
17	upper	0.77%
19	stddev	0.45%
20	mod	0.35%
21	lower	0.30%
22	exp	0.28%
23	ratio_to_report	0.27%
24	median	0.25%
25	abs	0.23%
26	lag	0.23%
27	dense_rank	0.22%
28	length	0.22%
29	greatest	0.18%
30	sys_guid	0.15%
31	ceil	0.15%
32	chr	0.13%
33	regexp_instr	0.10%
35	percent_rank	0.10%
36	replace	0.10%
37	regexp_replace	0.10%
38	cume_dist	0.08%
39	instr	0.07%
40	rtrim	0.07%
41	width_bucket	0.07%
42	prediction	0.07%
43	corr	0.05%
44	lead	0.03%
45	lpad	0.03%

10 Most Influential People	Text Files and Oracle DB	Predictive Model vs Rule	Build Predictive Model	About Predictive Model Variable	Logistic Regression
Recency Frequency Monetary Analysis	Unique Identifier in Oracle	Materialized View	Database Link	Calculate Percentage Using SQL	Handle NULL Value
Calculate Cumulative Perentage	Find Score Cutoff Value	Remove Duplicates	Calculate Correlation Coefficients	Oracle vs SQL Server	Random Sampling
Table Insert	Read Only Table	Clustering	Ranking	Find Most Frequent	Median Value
Oracle Source Code	Debug PL/SQL	Hide PL/SQL Scripts	Repair Views	Dump Schema	Move Big Files to Amazon

Popular Topics

Popular Topics

Tuesday, March 24, 2015

Most Frequently Used SQL Functions by Data Scientist

No comments: