So far every query returned rows that already existed. Aggregation is different: it collapses many rows into a summary: a count, a total, an average. This is where SQL starts answering business questions like "how many?" and "how much per department?"
Aggregate functions (COUNT, SUM, AVG, MIN, MAX) reduce many rows to one number. Add GROUP BY to get one number per category instead of one for the whole table.
One number for the whole table
Without GROUP BY, an aggregate function summarises every row into a single value. Run it:
The five you'll use constantly: COUNT (how many), SUM (total), AVG (mean), MIN and MAX (smallest / largest).
GROUP BY — one number per category
Picture it first: GROUP BY department sorts the rows into coloured groups, then collapses each group into a single summary row. Two engineers become one Engineering row; two salespeople become one Sales row:
| name | department | salary |
|---|---|---|
| Ada | Engineering | 62000 |
| Bo | Sales | 48000 |
| Cy | Engineering | 71000 |
| Di | Sales | 52000 |
| Eli | Marketing | 45000 |
GROUP BY department| department | headcount | avg_salary |
|---|---|---|
| Engineering | 2 | 66500 |
| Sales | 2 | 50000 |
| Marketing | 1 | 45000 |
Try it now: produce the grouped table the visual showed, the headcount per department (department and headcount).
Your turn · Exercise
Return the headcount per department (columns department and headcount).
The data
employees
5 rows returned
| name | department | salary |
|---|---|---|
| Ada | Engineering | 62,000 |
| Bo | Sales | 48,000 |
| Cy | Engineering | 71,000 |
| Di | Sales | 52,000 |
| Eli | Marketing | 45,000 |
GROUP BY department splits the table into groups and runs the aggregate once per group. The rule of thumb: every column in your SELECT is either inside an aggregate function or listed in GROUP BY. Run it:
Watch out
If you SELECT a column that's neither aggregated nor in GROUP BY, the query is ambiguous: "which name should I show for the Engineering group, there are two?" Most databases error; some pick one arbitrarily. The fix is always: aggregate it, or add it to GROUP BY.
Your turn
Group and aggregate. Return the total salary per department, highest total first.
Your turn · Exercise
Return the total salary per department (columns department and total_salary), highest total first.
The data
employees
5 rows returned
| name | department | salary |
|---|---|---|
| Ada | Engineering | 62,000 |
| Bo | Sales | 48,000 |
| Cy | Engineering | 71,000 |
| Di | Sales | 52,000 |
| Eli | Marketing | 45,000 |
COUNT(*) vs COUNT(column) vs COUNT(DISTINCT)
These three look alike and mean different things:
| Expression | Counts | |
|---|---|---|
| COUNT(*) | Every row, including ones with NULLs | |
| COUNT(col) | Only rows where col is NOT NULL | |
| COUNT(DISTINCT col) | Distinct non-NULL values of col |
The gap between COUNT(*) and COUNT(col) is exactly your NULL count.
Run it and watch the three numbers differ on the same column:
From practice
SUM, AVG, MIN, MAX all silently skip NULLs too. That's usually what you want, but if NULL means "zero" in your data, the average will be higher than you expect because the NULL rows aren't counted in the denominator. Know your data.
Next: filtering on those aggregated numbers (which needs HAVING, not WHERE) and the order SQL actually runs your clauses in.