Back to Basics: SQL Server Aggregations and Grouping

Hello, and welcome back to another post!

In my last post, I showed you the basics of SQL Server queries, now in this post let’s complement that with more functionality. We’re going to add aggregations and groups to our queries.

SQL Server has many built-in functions that we can use to manipulate data. Scalar valued functions, table valued functions, type conversion functions, others to work with dates and times, offset functions, character functions, aggregate functions, to format and concatenate strings, also for splitting strings, system functions and more.

I will be writing of several of those in future posts, but for now we’re going to focus on aggregate functions.

WHAT IS AN AGGREGATE FUNCTION?

Basically, an aggregate function is a function that can be applied to a set of rows and get a single value as a result. For example SQL Server supports aggregate functions like: SUM(), COUNT(), MIN(), MAX(), AVG() and some others.

Something worth mentioning is that aggregate functions ignore NULL, with the exception of COUNT(*) which means, it’s going to count all the rows in our table.

Let’s see some examples of how these aggregate functions are used, and the first one is to show you what I meant about ignoring NULL values.

COUNT()

SELECT
    COUNT(*) AS CountAllRows
,   COUNT(ShipRegion) AS CountWithoutNulls
FROM
    Orders;


CountAllRows CountWithoutNulls
------------ -----------------
830          323

I included the result right below of the query. If you notice, the first column “CountAllRows” is doing a COUNT(*), which is returning the 830 rows in that table. For the second row, I chose a column with NULL values in it, and if you see, we’re getting only 323 rows back. That means the other 507 rows have NULLs.

Alright, moving on to the next one. We’re going to see how SUM works.

SUM()

SELECT
   SUM(UnitPrice) AS SumTotalUnitPrice
FROM
   Products;


SumTotalUnitPrice
---------------------
2222.71

What’s going on in here? So, I mentioned that an aggregate function returns a single value when applied to a set of data. So, here since we told it to SUM() the UnitPrice column, we got back the total sum of every value in that column, we didn’t ask for any filtering or categorization or anything else. Later, I will show you a bit more complex example. Move on.

AVG()

If you can tell, this function it’s the abbreviation of Average, so the column we pass into that aggregate will return an average.

SELECT
   AVG(UnitPrice) AS AverageUnitPrice
FROM
   Products;

AverageUnitPrice
---------------------
28.8663

As I promised, what the function returned was the average of the UnitPrice column. It means that it went grabbing all the values in the column, adding them up one by one, and divided them by the total number of elements in the column. You know, an average. Next.

MIN() and MAX()

I bet you know what they do by now, right? Let’s go to the example.

SELECT
   MIN(UnitPrice) AS CheapestProduct
,  MAX(UnitPrice) AS MostExpensiveProduct
FROM
   Products;


CheapestProduct       MostExpensiveProduct
--------------------- ---------------------
2.00                  263.50

By looking at the results, you might have a clue of what these functions do. If you haven’t, the MIN() function went to the UnitPrice column and grabbed the lowest value, and MAX() went on and grabbed the highest value it could find. Easy, right?

Aggregate functions are very useful when we are analyzing data, the thing is that we need to know what we are analyzing. In our previous examples, we didn’t show any other columns or values, we just saw a single result and we don’t know how to interpret that alone.

We need to group the results in order to be able to discover more about our data. For example, what are the cheapest and most expensive products we have? How many units of each product we have on inventory?

GROUPING

Grouping is exactly that… when we have just one column and an aggregation, it will take that column and group together the values that are equal. In our examples: ProductName, Customer, Orders and so on. When we have more than one column, and we want to apply an aggregation to those, it will produce a group for each combination of values.

I will show you that with some examples.

SELECT
    CategoryID
,   COUNT(*) AS CountOfProducts
FROM
    Products
GROUP BY
    CategoryID;

CategoryID  CountOfProducts
----------- ---------------
1           12
2           12
3           13
4           10
5           7
6           6
7           5
8           12

Let’s analyze the query above. I wanted to know how many product were in each category, so I included CategoryID and the COUNT(*). In grouped queries, the aggregates are applied to the group.

Now, because of this, we have to include the columns in the SELECT list in the GROUP BY clause. Let’s check a few more examples.

SELECT
    ShipCountry
,   COUNT(*) AS CountOfShipments
FROM
    Orders
GROUP BY
    ShipCountry
ORDER BY
    COUNT(*) DESC;


ShipCountry     CountOfShipments
--------------- ----------------
USA             122
Germany         122
Brazil          83
France          77
UK              56
Venezuela       46
Austria         40
Sweden          37
Canada          30
Mexico          28

In this query, we have the same pattern. We wanted to know which country we ship the most. If you see I included the ORDER BY COUNT(*) DESC at the end, to sort the results in descending order so we can see who has more shipments.

We know which country we ship the most, but in that country which city is the one with more shipments? We can include the ShipCity column, in here it will depend on how we want to see our data sorted. If we keep ORDER BY COUNT(*) DESC, we will see the city with most shipment, but it won’t be the US and Germany like in our last example.

SELECT
    ShipCountry
,   ShipCity
,   COUNT(*) AS CountOfShipments
FROM
    Orders
GROUP BY
    ShipCountry, ShipCity
ORDER BY
    COUNT(*) DESC;

ShipCountry     ShipCity        CountOfShipments
--------------- --------------- ----------------
Brazil          Rio de Janeiro  34
UK              London          33
Brazil          Sao Paulo       31
USA             Boise           31
Austria         Graz            30
Germany         Cunewalde       28
Mexico          México D.F.     28
Ireland         Cork            19
Sweden          Bräcke          19
USA             Albuquerque     18
Sweden          Luleå           18

I’m pasting only the first rows of the result set. If you see, the city with the highest amount of shipments is Rio de Janeiro in Brazil. If you see the US is now in a different place because of the sort we gave to the query.

FILTERING GROUPS

There would be times where we are looking for something specific within groups. For example in our last query, let’s say we want to show only Countries and Cities with more than 20 shipments. You might think that we can include a WHERE clause to the query, but remember that we have an aggregation to a group. How can we do that?

SELECT
    ShipCountry
,   ShipCity
,   COUNT(*) AS CountOfShipments
FROM
    Orders
WHERE
    COUNT(*) >= 20
GROUP BY
    ShipCountry, ShipCity;

Msg 147, Level 15, State 1, Line 80
An aggregate may not appear in the WHERE clause unless it is in a subquery contained in a HAVING clause or a select list, and the column being aggregated is an outer reference.

As you can see, if we try to filter with a WHERE clause, we get an error message saying that we can’t do that because we have an aggregate, it also says something else about HAVING.

The HAVING clause is use exactly for the task we want to achieve, to filter groups.

SELECT
    ShipCountry
,	ShipCity
,	COUNT(*) AS CountOfShipments
FROM
    Orders
GROUP BY
    ShipCountry, ShipCity
HAVING
    COUNT(*) >= 20;

ShipCountry     ShipCity        CountOfShipments
--------------- --------------- ----------------
USA             Boise           31
Germany         Cunewalde       28
Austria         Graz            30
UK              London          33
Mexico          México D.F.     28
Brazil          Rio de Janeiro  34
Brazil          Sao Paulo       31

There you have it. Cities with more than 20 shipments. Analyzing the query we can see that we just changed WHERE to HAVING and the condition is the same, but there’s something else. If we want to filter groups, we first need to define our groups with GROUP BY and then we can filter them out with HAVING clause.

Well, that’s it for now. I hope you liked the post, don’t forget that you can find the code in my GitHub if you want to follow along, also if you don’t have a copy of Northwind database, you can get it here.

Thanks for reading, see you soon. Cheers! 🍻

Discover more from Ben Rodríguez

Subscribe to get the latest posts sent to your email.