Counting and Summing: GROUP BY and the Aggregate Functions

So far every query returned rows that already existed. Aggregation is different: it collapses many rows into a summary: a count, a total, an average. This is where SQL starts answering business questions like "how many?" and "how much per department?"

In one line

Aggregate functions (COUNT, SUM, AVG, MIN, MAX) reduce many rows to one number. Add GROUP BY to get one number per category instead of one for the whole table.

One number for the whole table

Without GROUP BY, an aggregate function summarises every row into a single value. Run it:

Whole-table summary: how many employees, and the total + average salary

SELECT
  COUNT(*)    AS headcount,
  SUM(salary) AS total_salary,
  AVG(salary) AS avg_salary
FROM employees

Editable · runs in your browser

The five you'll use constantly: COUNT (how many), SUM (total), AVG (mean), MIN and MAX (smallest / largest).

GROUP BY — one number per category

Picture it first: GROUP BY department sorts the rows into coloured groups, then collapses each group into a single summary row. Two engineers become one Engineering row; two salespeople become one Sales row:

name	department	salary
Ada	Engineering	62000
Bo	Sales	48000
Cy	Engineering	71000
Di	Sales	52000
Eli	Marketing	45000

GROUP BY department

department	headcount	avg_salary
Engineering	2	66500
Sales	2	50000
Marketing	1	45000

Same-colour rows share a group; each group folds into one result row (COUNT and AVG computed per group).

Try it now: produce the grouped table the visual showed, the headcount per department (department and headcount).

Your turn · Exercise

Return the headcount per department (columns department and headcount).

The data

employees

5 rows returned

name	department	salary
Ada	Engineering	62,000
Bo	Sales	48,000
Cy	Engineering	71,000
Di	Sales	52,000
Eli	Marketing	45,000

SELECT department, COUNT(*) AS headcount
FROM employees
GROUP BY ...

Runs in your browser

GROUP BY department splits the table into groups and runs the aggregate once per group. The rule of thumb: every column in your SELECT is either inside an aggregate function or listed in GROUP BY. Run it:

One row per department: headcount and average salary

SELECT
  department,
  COUNT(*)    AS headcount,
  AVG(salary) AS avg_salary
FROM employees
GROUP BY department
ORDER BY headcount DESC

Editable · runs in your browser

Watch out

If you SELECT a column that's neither aggregated nor in GROUP BY, the query is ambiguous: "which name should I show for the Engineering group, there are two?" Most databases error; some pick one arbitrarily. The fix is always: aggregate it, or add it to GROUP BY.

Your turn

Group and aggregate. Return the total salary per department, highest total first.

Your turn · Exercise

Return the total salary per department (columns department and total_salary), highest total first.

The data

employees

5 rows returned

name	department	salary
Ada	Engineering	62,000
Bo	Sales	48,000
Cy	Engineering	71,000
Di	Sales	52,000
Eli	Marketing	45,000

SELECT department, SUM(salary) AS total_salary
FROM employees
GROUP BY ...
ORDER BY ...

Runs in your browser

COUNT(*) vs COUNT(column) vs COUNT(DISTINCT)

These three look alike and mean different things:

	Expression	Counts
COUNT(*)	Every row, including ones with NULLs
COUNT(col)	Only rows where col is NOT NULL
COUNT(DISTINCT col)	Distinct non-NULL values of col

The gap between COUNT(*) and COUNT(col) is exactly your NULL count.

Run it and watch the three numbers differ on the same column:

bonus has 2 NULLs and a repeated value — see how each COUNT reacts

SELECT
  COUNT(*)               AS all_rows,
  COUNT(bonus)           AS rows_with_bonus,
  COUNT(DISTINCT bonus)  AS distinct_bonuses
FROM employees

Editable · runs in your browser

From practice

SUM, AVG, MIN, MAX all silently skip NULLs too. That's usually what you want, but if NULL means "zero" in your data, the average will be higher than you expect because the NULL rows aren't counted in the denominator. Know your data.

Next: filtering on those aggregated numbers (which needs HAVING, not WHERE) and the order SQL actually runs your clauses in.