De-duplication is the process of deleting duplicate rows from a database.
Duplicate rows is a problem that data analysts frequently deal with.
Before deleting duplicates you have to find them.
In this example, duplicates are identified using a CTE on the Order table.
WITH Cte(FirstName, LastName,
OrderDate, RowNumber) AS (
SELECT FirstName, LastName, OrderDate,
ROW_NUMBER() OVER (
PARTITION BY OrderDate, FirstName, LastName
ORDER BY OrderDate, FirstName, LastName) AS RowNumber
FROM [Order] O
JOIN Customer C ON C.Id = O.CustomerId
)
SELECT *
FROM Cte
WHERE RowNumber > 1
Note: These are not true duplicates, but it gets the point across. This query returns customers that placed more than one order on the same day.
| FirstName | LastName | OrderDate | RowNumber |
|---|---|---|---|
| Elizabeth | Lincoln | 2013-01-10 00:00:00.000 | 2 |
| Philip | Cramer | 2013-02-25 00:00:00.000 | 2 |
| Howard | Snyder | 2013-07-31 00:00:00.000 | 2 |
| Jose | Pavarotti | 2013-10-22 00:00:00.000 | 2 |
| Felipe | Izquierdo | 2014-01-19 00:00:00.000 | 2 |
| Daniel | Tonini | 2014-03-24 00:00:00.000 | 2 |
| Jose | Pavarotti | 2014-04-17 00:00:00.000 | 2 |
The ROW_NUMBER() function sequentially numbers rows that are grouped by FirstName, LastName, and OrderDate. Those with a RowNumber > 1 are duplicates of another row.
Next, the duplicate rows need deletion.
Changing SELECT to DELETE in the above query will accomplish this.
WITH Cte(Id, CustomerId, OrderDate, RowNumber) AS (
SELECT Id, CustomerId, OrderDate,
ROW_NUMBER() OVER (
PARTITION BY CustomerId, OrderDate
ORDER BY OrderDate, CustomerId) AS RowNumber
FROM [Order] O
)
DELETE
FROM Cte
WHERE RowNumber > 1
Note: This query always deletes the second row it finds which may, or may not, be desirable. A manual inspection step of the duplicates is often required.
SELECT A.Id AS OrderId, FirstName, LastName,
A.OrderNumber, A.OrderDate
FROM [Order] A
JOIN (
SELECT C.Id, FirstName, LastName, OrderDate,
COUNT(O.Id) AS Occurences
FROM [Order] O
JOIN Customer C ON C.Id = O.CustomerId
GROUP BY C.Id, OrderDate, FirstName, LastName
HAVING COUNT(O.Id) > 1) AS B
ON A.CustomerId = B.Id AND A.OrderDate = B.OrderDate
| OrderId | FirstName | LastName | OrderNumber | OrderDate |
|---|---|---|---|---|
| 163 | Elizabeth | Lincoln | 542540 | 2013-01-10 00:00:00.000 |
| 164 | Elizabeth | Lincoln | 542541 | 2013-01-10 00:00:00.000 |
| 209 | Philip | Cramer | 542586 | 2013-02-25 00:00:00.000 |
| 210 | Philip | Cramer | 542587 | 2013-02-25 00:00:00.000 |
| 369 | Howard | Snyder | 542746 | 2013-07-31 00:00:00.000 |
| 370 | Howard | Snyder | 542747 | 2013-07-31 00:00:00.000 |
| 466 | Jose | Pavarotti | 542843 | 2013-10-22 00:00:00.000 |
| 467 | Jose | Pavarotti | 542844 | 2013-10-22 00:00:00.000 |
| 591 | Felipe | Izquierdo | 542968 | 2014-01-19 00:00:00.000 |
| 592 | Felipe | Izquierdo | 542969 | 2014-01-19 00:00:00.000 |
| 725 | Daniel | Tonini | 543102 | 2014-03-24 00:00:00.000 |
| 726 | Daniel | Tonini | 543103 | 2014-03-24 00:00:00.000 |
| 783 | Jose | Pavarotti | 543160 | 2014-04-17 00:00:00.000 |
| 784 | Jose | Pavarotti | 543161 | 2014-04-17 00:00:00.000 |
Here we have a list with paired duplicates, one after the other.