Monday, November 18, 2013

More on Calculating Z-Score

Problem

In the earlier post Calculate Z-Score using SQL in Oracle, we show that we can normalize a variable using z-score which is defined as (variable value-mean)/Standard Deviation. How do we calculate the Z-scores for employee's salaries for the following table?

```SQL>  select EMPNO, ENAME, JOB, SAL from emp order by job, sal desc;

EMPNO ENAME      JOB              SAL
---------- ---------- --------- ----------
7902 FORD       ANALYST         3000
7788 SCOTT      ANALYST         3000
7934 MILLER     CLERK           1300
7876 ADAMS      CLERK           1100
7900 JAMES      CLERK            950
7369 SMITH      CLERK            800
7566 JONES      MANAGER         2975
7698 BLAKE      MANAGER         2850
7782 CLARK      MANAGER         2450
7839 KING       PRESIDENT       5000
7499 ALLEN      SALESMAN        1600
7844 TURNER     SALESMAN        1500
7521 WARD       SALESMAN        1250
7654 MARTIN     SALESMAN        1250

14 rows selected.
```

Solution

We can use the following query to calculate Z-score.

```SQL> with tbl_mean_std as
(
select avg(sal) m, stddev(sal) std from EMP
)
select EMPNO, ENAME, JOB, SAL, (sal-m)/std as z_score
from EMP, tbl_mean_std order by job, sal desc;

EMPNO ENAME      JOB              SAL    Z_SCORE
---------- ---------- --------- ---------- ----------
7902 FORD       ANALYST         3000 .783748996
7788 SCOTT      ANALYST         3000 .783748996
7934 MILLER     CLERK           1300 -.65387922
7876 ADAMS      CLERK           1100 -.82301195
7900 JAMES      CLERK            950  -.9498615
7369 SMITH      CLERK            800  -1.076711
7566 JONES      MANAGER         2975 .762607405
7698 BLAKE      MANAGER         2850 .656899448
7782 CLARK      MANAGER         2450 .318633985
7839 KING       PRESIDENT       5000 2.47507631
7499 ALLEN      SALESMAN        1600 -.40018012
7844 TURNER     SALESMAN        1500 -.48474649
7521 WARD       SALESMAN        1250  -.6961624
7654 MARTIN     SALESMAN        1250  -.6961624

14 rows selected.
```
Not surprisingly, the president's salary has a z-score of 2.47 which is almost two and half standard deviation above the mean. Mean always has a z-score of zero. If we want to save the mean and standard deviation derived from this data set, we may create a permanent table to store them. This table can be used to calculate the z-score for any future data points. The following queries create a permanent table containing mean and standard deviation. It is then used to create a view that calculates the z-score.
```SQL> create table tbl_mean_std_perm as select avg(sal) m, stddev(sal) std from EMP;
SQL> create view v_emp_zscore as select EMPNO,  ENAME,  JOB,   SAL,        (sal-m)/std as z_score from EMP, tbl_mean_std_perm;
```
We can take a look at the actual values of mean and standard deviation.
```SQL> select * from tbl_mean_std_perm;

M        STD
---------- ----------
2073.21429 1182.50322
```