Wednesday, August 07, 2013

Find the most frequent items using SQL (continued)

In the earlier post Find the most frequent items using SQL, we use "group by" query to calculate the frequency of items and then functions row_number() or dense_rank() to generate rank based on the frequency to pick the top N items. The difference between row_number() or dense_rank() is that row_number() generate unique rank for each row even if there are ties.

Now, we want to solve a more sophisticate problem. We want to select the top 2 most frequent items in table tbl_items_by_person (shown below) for each person, Emily and John (instead of global top 2 most frequent items).

Table tbl_items_by_person

The first step is to calculate the item frequency "group by" name and item.

SQL> select name, item, count(*) as num from TBL_ITEMS_BY_PERSON group by name, item order by name, count(*) desc;

NAME                             ITEM                                    NUM
-------------------------------- -------------------------------- ----------
Emily                            Peach                                     8
Emily                            Apple                                     4
Emily                            Banana                                    3
Emily                            Pear                                      2
John                             Orange                                    5
John                             Apple                                     3
John                             Peach                                     3
John                             Banana                                    2
John                             Pear                                      2

The second step is to use row_number() or dense_rank() to generate rank for each name separately. This is done by using the clause over(partition by name order by num desc).

SQL> select name, item, num, row_number() over(partition by name order by num desc) as rnk from (select name, item, count(*) as num from TBL_ITEMS_BY_
PERSON group by name, item ) order by name, num desc;

NAME                             ITEM                                    NUM        RNK
-------------------------------- -------------------------------- ---------- ----------
Emily                            Peach                                     8          1
Emily                            Apple                                     4          2
Emily                            Banana                                    3          3
Emily                            Pear                                      2          4
John                             Orange                                    5          1
John                             Apple                                     3          2
John                             Peach                                     3          3
John                             Banana                                    2          4
John                             Pear                                      2          5

We can pick the top 2 frequent items for Emily and John using the following query.

SQL> with tbl as (select name, item, num, row_number() over(partition by name order by num desc) as rnk from (select name, item, count(*) as num from
TBL_ITEMS_BY_PERSON group by name, item )) select * from tbl where rnk<=2 order by name, num desc;

NAME                             ITEM                                    NUM        RNK
-------------------------------- -------------------------------- ---------- ----------
Emily                            Peach                                     8          1
Emily                            Apple                                     4          2
John                             Orange                                    5          1
John                             Peach                                     3          2

Or, we can use dense_rank() to generate rank which will assign the same rank for ties.

SQL> select name, item, num, dense_rank() over(partition by name order by num desc) as rnk from (select name, item, count(*) as num from TBL_ITEMS_BY_
PERSON group by name, item ) order by name, num desc;

NAME                             ITEM                                    NUM        RNK
-------------------------------- -------------------------------- ---------- ----------
Emily                            Peach                                     8          1
Emily                            Apple                                     4          2
Emily                            Banana                                    3          3
Emily                            Pear                                      2          4
John                             Orange                                    5          1
John                             Apple                                     3          2
John                             Peach                                     3          2
John                             Banana                                    2          3
John                             Pear                                      2          3

For John, there are 3 peaches and Apples. Both of them are ranked as 2. Thus they are both selected as shown below. 

SQL> with tbl as (select name, item, num, dense_rank() over(partition by name order by num desc) as rnk from (select name, item, count(*) as num from
TBL_ITEMS_BY_PERSON group by name, item )) select * from tbl where rnk<=2 order by name, num desc;

NAME                             ITEM                                    NUM        RNK
-------------------------------- -------------------------------- ---------- ----------
Emily                            Peach                                     8          1
Emily                            Apple                                     4          2
John                             Orange                                    5          1
John                             Peach                                     3          2
John                             Apple                                     3          2

1 comment:

Unknown said...

HI!

Let's assume that I would like to see the frequency of each fruit bought. So in the table we have information that people buy fruits. My result table would show that Apple is bought 76% of the people, Peach is bought 45% of the people. Normally, ONE person can buy several fruits, not just one.

How can I calculate this?

thx,
Imre