Tuesday, November 19, 2013

Logically Merge Data From Different Sources- Combine Records

Problem

In the real world application, data are collected by different systems. For example, some of the credit card fraudulent transactions are detected by banks through their fraud operation. Other fraudulent transactions are reported by customers and collected by claim department. To get the totality of credit cards that are involved in fraudulent activities, we have to combine the records from above two data sources. We describe the problem using a simple example. There are two tables, tbl_soccer_kids and tbl_tennis_kids, that record the names of kids who join soccer and tennis clubs. We want answers to the following questions:
1. Who join either soccer or tennis clubs?
2. Who join soccer and tennis clubs?
3. Who join soccer club but not tennis club?
4. Who join tennis club but not soccer club?

SQL> select * from tbl_soccer_kids order by name;

NAME
--------------------------------
CLARK
FORD
JAMES
JONES
MARTIN
SCOTT
TURNER

7 rows selected.

SQL> select * from tbl_tennis_kids order by name;

NAME
--------------------------------
ADAMS
ALLEN
BLAKE
CLARK
MARTIN
MILLER
TURNER
WARD

8 rows selected.

Solution

1. Who join either soccer or tennis clubs?
We use the "union" to combine the names from two tables. The duplicate names will be automatically removed. The following query shows that 12 kids are joining either soccer or tennis clubs.

SQL> select name from tbl_soccer_kids union select name from tbl_tennis_kids order by name;

NAME
--------------------------------
ADAMS
ALLEN
BLAKE
CLARK
FORD
JAMES
JONES
MARTIN
MILLER
SCOTT
TURNER
WARD

12 rows selected.
2. Who join soccer and tennis clubs?
We use "intersect" to find the common names in both tables.
SQL> select name from tbl_soccer_kids intersect select name from tbl_tennis_kids order by name;

NAME
--------------------------------
CLARK
MARTIN
TURNER
There are other approaches to find the common set and I will talk about them in another blog post.
3. Who join soccer club but not tennis club?
We use "minus" to find out records that are in the first table but not in the second table.
SQL> select name from tbl_soccer_kids minus select name from tbl_tennis_kids order by name;

NAME
--------------------------------
FORD
JAMES
JONES
SCOTT
4. Who join tennis club but not soccer club?
This is similar to question 3. We use "minus" and simply switch the order of two tables.
SQL> select name from tbl_tennis_kids minus select name from tbl_soccer_kids order by name;

NAME
--------------------------------
ADAMS
ALLEN
BLAKE
MILLER
WARD

No comments: