Thursday, August 08, 2013

Calculate Statistical Mode- the Most Frequent Value

Statistical mode is the value that happens most often (as shown in the chart below). For discrete variables, mode can be calculated using the same approach describe in post Find the most frequent items using SQL. We simply select the top 1 most frequent item.




For continuous variables, we can use the function width_bucket() to divide the data into segments and calculate the frequency in each segment. We can then pick the average value of the data points from the  most frequent segment as the mode. We use the 1,000 random numbers described in the post  Generate 1000 Normally Distributed Numbers. The following query calculate the frequency and average value for each of the segment defined by width_bucket(variable, min_value, max_value, number_of_buckets). Buckets 0 and 12 are for values lower than -3 and higher than 3, respectively.

SQL> select s, count(1), avg(num) from (select width_bucket(num, -3,3, 11) s, num from TBL_1K_RND) group by s order by s;

         S   COUNT(1)   AVG(NUM)
---------- ---------- ----------
         0          2 -3.0538402
         1          5 -2.6387896
         2         29 -2.1748723
         3         58 -1.6073189
         4        115 -1.0457551
         5        185 -.53460548
         6        208 .010066623
         7        194 .525553927
         8        119 1.05966651
         9         47  1.5985678
        10         32 2.14458861
        11          4 2.67165765
        12          2 3.06640121

Again, we use row_number() function to generate rank based on frequency and calculate the mode as follows.

with tbl_1 as
(
select s, count(1) cnt, avg(num) num from (select width_bucket(num, -3,3, 11) s, num from TBL_1K_RND) group by s order by s),
tbl_2 as
(
select a.*, row_number() over(order by cnt desc) rnk from tbl_1 a
)
select num from tbl_2 where rnk=1;

       NUM
----------

.010066623

No comments: