How to Remove Duplicates in SQL Server?

Hello and welcome to another post.

In this post, I’m going to show you how to remove duplicate rows from a table using a very fancy way in SQL Server. Imagine that somebody managed to upload a file with duplicate records to our SQL Server and now, the reports went crazy showing numbers that don’t make sense at all. I’m pretty sure that you’ve heard this at least once during your career, right? Well first, let’s see how we can check for duplicate rows in our table.

USING GROUP BY AND HAVING

There is a very easy way of how to see if we have duplicates in our table. We need to COUNT the CustomerNames and GROUP them to see how many times each customer is repeated. After that, we have to use the HAVING clause to filter the groups that have more than 1 record. Let’s see.

SELECT

CustomerName

, COUNT(*) NameCount

FROM

dbo.DuplicateData

GROUP BY

CustomerName

HAVING

COUNT(*) > 1

SELECT CustomerName , COUNT(*) NameCount FROM dbo.DuplicateData GROUP BY CustomerName HAVING COUNT(*) > 1

SELECT 
    CustomerName
,	COUNT(*) NameCount
FROM
    dbo.DuplicateData
GROUP BY
    CustomerName
HAVING
    COUNT(*) > 1

Here we can see the list of customers and how many times the show up in the table. After we have this, we need to find a way to remove the duplicates only. The way I think about is as follows…

USING DISTINCT

As we all know, DISTINCT is a SQL clause that prevents getting duplicate rows. We already know that we have duplicates, now using DISTINCT we can copy the unique values into a new table, TRUNCATE the real table and then INSERT the unique rows into that and we’ll be done. For purposes of this blog post and because I’m lazy, I’ll be using temporary tables, so I don’t have to recreate the physical table every time.

SELECT

DISTINCT *

INTO #NonDuplicateData

FROM

dbo.DuplicateData

ORDER BY

CustomerName

SELECT * FROM #NonDuplicateData ORDER BY CustomerName

SELECT DISTINCT * INTO #NonDuplicateData FROM dbo.DuplicateData ORDER BY CustomerName SELECT * FROM #NonDuplicateData ORDER BY CustomerName

SELECT
    DISTINCT * 
INTO #NonDuplicateData
FROM
    dbo.DuplicateData
ORDER BY
    CustomerName

SELECT * FROM #NonDuplicateData ORDER BY CustomerName

You can see that I’m only getting the unique values now.

And now, we just do a TRUNCATE TABLE and INSERT INTO the actual table again.

TRUNCATE TABLE #DuplicateData

INSERT INTO #DuplicateData

SELECT * FROM #NonDuplicateData

TRUNCATE TABLE #DuplicateData INSERT INTO #DuplicateData SELECT * FROM #NonDuplicateData

TRUNCATE TABLE #DuplicateData

INSERT INTO #DuplicateData
SELECT * FROM #NonDuplicateData

And, we check for duplicates in the table, and there you go. First method done. Let’s move to the next.

USING A CTE

I had written about Common Table Expressions a bit in this blog post. That time I used a ROW_NUMBER() window function, and we’re going to do the same today. Using CTEs is my personal favorite way of how to get rid of duplicate rows because is less work and you and do it all at once. You don’t need temporary tables or be moving data around. Let’s see how it’s done. We can “split” this way in two steps. The first step would be checking for duplicates like this:

SELECT

CustomerName

, ROW_NUMBER() OVER(PARTITION BY CustomerName ORDER BY CustomerName ASC) AS rn

FROM

DuplicateData

SELECT CustomerName , ROW_NUMBER() OVER(PARTITION BY CustomerName ORDER BY CustomerName ASC) AS rn FROM DuplicateData

SELECT 
    CustomerName
,	ROW_NUMBER() OVER(PARTITION BY CustomerName ORDER BY CustomerName ASC) AS rn
FROM 
    DuplicateData

See that I created a new column for the row number, and in here we can easily identify the ones that have duplicates. Going to step two, we need to enclose that query in a Common Table Expression, and since we have the row number we can do the DELETE directly just telling SQL to remove the ones that have a number greater than one.

WITH Duplicates

(

SELECT

CustomerName

, ROW_NUMBER() OVER(PARTITION BY CustomerName ORDER BY CustomerName ASC) AS rn

FROM

DuplicateData

)

DELETE

FROM

Duplicates

WHERE

rn > 1;

WITH Duplicates AS ( SELECT CustomerName , ROW_NUMBER() OVER(PARTITION BY CustomerName ORDER BY CustomerName ASC) AS rn FROM DuplicateData ) DELETE FROM Duplicates WHERE rn > 1;

WITH Duplicates
AS
(
    SELECT 
        CustomerName
    ,	ROW_NUMBER() OVER(PARTITION BY CustomerName ORDER BY CustomerName ASC) AS rn
    FROM 
        DuplicateData
)
DELETE
FROM
    Duplicates
WHERE
    rn > 1;

And, check again for duplicates.

All the rows have a number one, it means that we got rid of the duplicates and we can back to ~~watch YouTube videos~~ work.

That’s all for today’s post. I hope you guys liked it. You can find the code for this and other posts in my github.

If you have any questions or comments, shoot me an email, leave a message or tweet me here. Don’t forget to like and share. Also, if you want to receive updates and new blog posts to your inbox, subscribe to my LearningSQL with Beno email list down here. [mc4wp_form id=”331″] See you soon. Cheers!

Discover more from Ben Rodríguez

Subscribe to get the latest posts sent to your email.