In this post, I’m going to show you how to remove duplicate rows from a table using a very fancy way in SQL Server. Imagine that somebody managed to upload a file with duplicate records to our SQL Server and now, the reports went crazy showing numbers that don’t make sense at all. I’m pretty sure that you’ve heard this at least once during your career, right? Well first, let’s see how we can check for duplicate rows in our table.
USING GROUP BY AND HAVING
There is a very easy way of how to see if we have duplicates in our table. We need to COUNT
the CustomerNames and GROUP
them to see how many times each customer is repeated. After that, we have to use the HAVING
clause to filter the groups that have more than 1 record. Let’s see.
SELECT CustomerName , COUNT(*) NameCount FROM dbo.DuplicateData GROUP BY CustomerName HAVING COUNT(*) > 1
Here we can see the list of customers and how many times the show up in the table. After we have this, we need to find a way to remove the duplicates only. The way I think about is as follows…
USING DISTINCT
As we all know, DISTINCT
is a SQL clause that prevents getting duplicate rows. We already know that we have duplicates, now using DISTINCT
we can copy the unique values into a new table, TRUNCATE
the real table and then INSERT
the unique rows into that and we’ll be done. For purposes of this blog post and because I’m lazy, I’ll be using temporary tables, so I don’t have to recreate the physical table every time.
SELECT DISTINCT * INTO #NonDuplicateData FROM dbo.DuplicateData ORDER BY CustomerName SELECT * FROM #NonDuplicateData ORDER BY CustomerNameYou can see that I’m only getting the unique values now. And now, we just do a
TRUNCATE TABLE
and INSERT INTO
the actual table again.
TRUNCATE TABLE #DuplicateData INSERT INTO #DuplicateData SELECT * FROM #NonDuplicateDataAnd, we check for duplicates in the table, and there you go. First method done. Let’s move to the next.
USING A CTE
I had written about Common Table Expressions a bit in this blog post. That time I used a ROW_NUMBER()
window function, and we’re going to do the same today. Using CTEs is my personal favorite way of how to get rid of duplicate rows because is less work and you and do it all at once. You don’t need temporary tables or be moving data around. Let’s see how it’s done. We can “split” this way in two steps. The first step would be checking for duplicates like this:
SELECT CustomerName , ROW_NUMBER() OVER(PARTITION BY CustomerName ORDER BY CustomerName ASC) AS rn FROM DuplicateData
See that I created a new column for the row number, and in here we can easily identify the ones that have duplicates. Going to step two, we need to enclose that query in a Common Table Expression, and since we have the row number we can do the DELETE
directly just telling SQL to remove the ones that have a number greater than one.
WITH Duplicates AS ( SELECT CustomerName , ROW_NUMBER() OVER(PARTITION BY CustomerName ORDER BY CustomerName ASC) AS rn FROM DuplicateData ) DELETE FROM Duplicates WHERE rn > 1;And, check again for duplicates.
All the rows have a number one, it means that we got rid of the duplicates and we can back to watch YouTube videos work.
That’s all for today’s post. I hope you guys liked it. You can find the code for this and other posts in my github.
If you have any questions or comments, shoot me an email, leave a message or tweet me here. Don’t forget to like and share. Also, if you want to receive updates and new blog posts to your inbox, subscribe to my LearningSQL with Beno email list down here. [mc4wp_form id=”331″] See you soon. Cheers!Discover more from Ben Rodríguez
Subscribe to get the latest posts sent to your email.