6. Sorting Results
4/4 Intermediate SQL
1. Sorting results
Congratulations on making it this far! We’ll now learn how to sort and group results to gain further insight.
2. Sorting results
Sorting results means we want to put our data in a specific order. It’s another way to make our data easier to understand by quickly seeing it in a sequence.
Let’s say we wanted to extract our three longest coats; if our closet were messy, it would take a long time to find. However, if we sorted our closet by garment type and length, we could quickly grab them!
3. ORDER BY
In SQL, the ORDER BY keyword is used to sort results of one or more fields. When used on its own, it is written after the FROM statement, as shown here.
ORDER BY will sort in ascending order by default. This can mean from smallest to biggest or from A to Z. In this case, we have one query sorting the budget from smallest to biggest and a second query sorting the titles alphabetically.
Our database contains film titles that start with symbols and numbers; these come before the letter A.
4. ASCending
We could also add the ASC keyword to our query to clarify that we are sorting in ascending order. The results are the same, and our code is more readable.
5. DESCending
We can use the DESC keyword to sort the results in descending order. This query gives us the film titles sorted by budget from biggest to smallest.
However, our data contains a lot of null values. We can add a WHERE clause before ORDER BY to filter the budget field for only non-null values and improve our results.
6. Sorting fields
Notice that we don’t have to select the field we are sorting on. For example, here’s a query where we sort by release year and only look at the title.
However, it is a good idea to include the field we are sorting on in the SELECT statement for clarity.
7. ORDER BY multiple fields
ORDER BY can also be used to sort on multiple fields. It will sort by the first field specified, then sort by the next, etc.
To specify multiple fields, we separate the field names with a comma. The second field we sort by can be thought of as a tie-breaker when the first field is not decisive in telling the order.
Here is an example. Let’s say we wanted to find the best movie. In the first query, we are only sorting the films by the number of Oscar wins and getting a tie.
We can break that tie by adding a second sorting field by seeing which film has the most wins and the highest imdb_score.
8. Different orders
We can also select a different order for each field we are sorting. For example, here, we are sorting birthdate in ascending order and name in descending order.
9. Order of execution
ORDER BY falls towards the end of the order of execution we already know, coming in just before limit.
The FROM statement will execute first,
then WHERE, followed by SELECT, ORDER BY,
and finally, LIMIT.
10. Let’s practice!
Time to practice our new sorting skills!
Sorting text
SQL provides you with the ORDER BY
keyword to sort one or more fields from your data. It can do this multi-directionally and helps make results easy to interpret.
How does ORDER BY
sort a column of text values by default?
Sorting single fields
Now that you understand how ORDER BY
works, you'll put it into practice. In this exercise, you'll work on sorting single fields only. This can be helpful to extract quick insights such as the top-grossing or top-scoring film.
The following exercises will help you gain further insights into the film database.
- Select the
name
of each person in thepeople
table, sorted alphabetically.
SELECT name
FROM people
ORDER BY name ASC;
2. Select the title
and duration
for every film, from longest duration to shortest.
-- Select the title and duration from longest to shortest film
SELECT title, duration
FROM films
ORDER BY duration DESC;
Sorting multiple fields
ORDER BY
can also be used to sort on multiple fields. It will sort by the first field specified, then sort by the next, and so on. As an example, you may want to sort the people
data by age and keep the names in alphabetical order.
Try using ORDER BY
to sort multiple columns.
- Select the
release_year
,duration
, andtitle
of films ordered by their release year and duration, in that order.
SELECT release_year, duration, title
FROM films
ORDER by release_year, duration;
2. Select the certification
, release_year
, and title
from films
ordered first by certification (alphabetically) and second by release year, starting with the most recent year.
SELECT certification, release_year, title
FROM films
ORDER BY certification, release_year DESC;
. Grouping data
We’ve learned how to sort data. Next, we’ll look at grouping our results.
2. Grouping data
In the real world, we’ll often need to summarize data for a particular group of results. For example, we might want to see the film data grouped by certification and make calculations on those groups, such as the average duration for each certification.
3. GROUP BY single fields
SQL allows us to group with the GROUP BY clause. Here it is used in a query where we have grouped by certification.
GROUP BY is commonly used with aggregate functions to provide summary statistics, particularly when only grouping a single field, certification, and selecting multiple fields, certification and title. This is because the aggregate function will reduce the non-grouped field to one record only, which will need to correspond to one group.
4. Error handling
SQL will return an error if we try to SELECT a field that is not in our GROUP BY clause. We’ll need to correct this by adding an aggregate function around title.
5. GROUP BY multiple fields
We can use GROUP BY on multiple fields similar to ORDER BY. The order in which we write the fields affects how the data is grouped. The query here selects and groups certification and language while aggregating the title.
The result shows that we have five films that have missing values for both certification and language, two films that are unrated and in Japanese, two films that are rated R and in Norwegian, and so on.
6. GROUP BY with ORDER BY
We can combine GROUP BY with ORDER BY to group our results, make a calculation, and then order our results. For example, we can clean up one of our previous queries by sorting the results by the title count in descending order. Here is that query without ORDER BY, and this is the same query with ordering added.
ORDER BY is always written after GROUP BY, and notice that we can refer back to the alias within the query. That is because of the order of execution. It looks like movies rated R are the most common in our database.
7. Order of execution
GROUP BY fits into our order after FROM and before all other clauses. Our updated queries will begin with
FROM, followed by grouping, selecting the data and creating the alias, sorting the results, and limiting them to the desired number.
8. Let’s practice!
In the following exercises, we’ll examine our film database to find out about release year, review, and budget patterns. Let’s practice!
GROUP BY single fields
GROUP BY
is a SQL keyword that allows you to group and summarize results with the additional use of aggregate functions. For example, films can be grouped by the certification and language before counting the film titles in each group. This allows you to see how many films had a particular certification and language grouping.
In the following steps, you’ll summarize other groups of films to learn more about the films in your database.
- Select the
release_year
and count of films released in each year aliased asfilm_count
.
SELECT release_year, COUNT(*) AS film_count
FROM films
GROUP BY releas_year;
2. Select the release_year
and average duration aliased as avg_duration
of all films, grouped by release_year
.
SELECT release_year, AVG(duration) AS avg_duration
FROM films
GROUP BY release_year;
GROUP BY multiple fields
GROUP BY
becomes more powerful when used across multiple fields or combined with ORDER BY
and LIMIT
.
Perhaps you’re interested in learning about budget changes throughout the years in individual countries. You’ll use grouping in this exercise to look at the maximum budget for each country in each year there is data available.
- Select the
release_year
,country
, and the maximumbudget
aliased asmax_budget
for each year and each country; sort your results byrelease_year
andcountry
.
SELECT release_year, country, MAX(budget) AS max_budget
FROM films
GROUP BY release_year, country;
Answering business questions
In the real world, every SQL query starts with a business question. Then it is up to you to decide how to write the query that answers the question. Let’s try this out.
Which release_year
had the most language diversity?
Take your time to translate this question into code. We’ll get you started then, it’s up to you to test your queries in the console.
“Most language diversity” can be interpreted as COUNT(DISTINCT ___)
. Now over to you.
SELECT release_year, COUNT(DISTINCT language)
FROM films
GROUP BY release_year
ORDER BY language;
1. Filtering grouped data
That was excellent work. We’ve combined sorting and grouping; next, we will combine filtering with grouping.
2. HAVING
In SQL, we can’t filter aggregate functions with WHERE clauses. For example, this query attempting to filter the title count is invalid.
That means that if we want to filter based on the result of an aggregate function, we need another way. Groups have their own special filtering word: HAVING.
For example, this query shows only those years in which more than ten films were released.
3. Order of execution
The reason why groups have their own keyword for filtering comes down to the order of execution. We’ve written a query using many of the keywords we have covered here.
This is their written order, starting with SELECT, FROM films, WHERE the certification is G, PG, or PG-13, GROUP BY certification, HAVING the title count be greater than 500, ORDER BY title count, and LIMIT to three. In contrast, the order of execution is:
FROM,
WHERE,
GROUP BY,
HAVING,
SELECT,
ORDER BY,
and LIMIT.
By reviewing this order, we can see WHERE is executed before GROUP BY and before any aggregation occurs. This order is also why we cannot use the alias with HAVING, but we can with ORDER BY.
4. HAVING vs WHERE
WHERE filters individual records while HAVING filters grouped records. We’ll walk through two business questions here to show how to translate them into the correct filter.
The first question is “What films were released in the year 2000?”. This question does not indicate any sort of grouping.
It asks to see only the titles from a specific year and can therefore be written as SELECT title, FROM films, WHERE release year equals 2000.
The second question is, “In what years was the average film duration over two hours?”. Straight away, we can see this question has a few more layers. Let’s break down the question and query into smaller, easier-to-understand steps.
5. HAVING vs WHERE
This question requires us to return information about years, so we select the release year from the films table. Next, it asks for the average film duration, which tells us we need to place AVG(duration) somewhere. Since we do not need to provide any additional information around the duration on its own, it is unlikely we need to perform the aggregation within the SELECT clause, so we’ll try the HAVING clause instead. The last part of the question indicates we need to filter on the duration. Since we can’t filter aggregates with WHERE, this supports our theory about using HAVING! Finally, we need to add a GROUP BY into our query since we have selected a column that has not been aggregated. Recall the aggregate function will convert the duration values into one average value. Going back to the start of our question, we’re interested in knowing the average duration per year, so we group it by release year. And there we have it!
6. Let’s practice!
Let’s improve our confidence with a final round of practice.
Filter with HAVING
Your final keyword is HAVING
. It works similarly to WHERE
in that it is a filtering clause, with the difference that HAVING
filters grouped data.
Filtering grouped data can be especially handy when working with a large dataset. When working with thousands or even millions of rows, HAVING
will allow you to filter for just the group of data you want, such as films over two hours in length!
Practice using HAVING
to find out which countries (or country) have the most varied film certifications.
- Select
country
from thefilms
table, and get the distinct count ofcertification
aliased ascertification_count
.
SELECT country, COUNT(DISTINCT certification) AS certification_count
FROM films
2. Group the results by country
.
SELECT country, COUNT(DISTINCT certification) AS certification_count
FROM films
GROUP BY country
3. Filter the unique count of certifications to only results greater than 10.
SELECT country, COUNT(DISTINCT certification) AS certification_count
FROM films
GROUP BY country
HAVING COUNT(DISTINCT certification) > 10;
HAVING and sorting
Filtering and sorting go hand in hand and gives you greater interpretability by ordering our results.
Let’s see this magic at work by writing a query showing what countries have the highest average film budgets.
- Select the
country
and the average budget asaverage_budget
, rounded to two decimal, fromfilms
.
SELECT country, ROUND(AVG(budget),2) AS average_budget
FROM films
- Group the results by
country
.
GROUP BY country
- Filter the results to countries with an average budget of more than one billion (
1000000000
).
HAVING AVG(budget) > 1000000000
- Sort by descending order of the
average_budget
.
SELECT country, ROUND(AVG(budget),2) AS average_budget
FROM films
GROUP BY country
HAVING AVG(budget) > 1000000000
SORT BY average_budget DESC;
All together now
It’s time to use much of what you’ve learned in one query! This is good preparation for using SQL in the real world where you’ll often be asked to write more complex queries since some of the basic queries can be answered by playing around in spreadsheet applications.
In this exercise, you’ll write a query that returns the average budget and gross earnings for films each year after 1990 if the average budget is greater than 60 million.
This will be a big query, but you can handle it!
- Select the
release_year
for each film in thefilms
table, filter for records released after 1990, and group byrelease_year
.
SELECT release_year
FROM films
WHERE release_year > 1990
GROUP BY release_year;
- Modify the query to include the average
budget
aliased asavg_budget
and averagegross
aliased asavg_gross
for the results we have so far.
SELECT release_year, AVG(budget) AS avg_budget, AVG(gross) AS avg_gross
FROM films
WHERE release_year > 1990
GROUP BY release_year;
- Modify the query once more so that only years with an average budget of greater than 60 million are included.
SELECT release_year, AVG(budget) AS avg_budget, AVG(gross) AS avg_gross
FROM films
WHERE release_year > 1990
GROUP BY release_year
HAVING AVG(budget) > 60000000;
- Finally, order the results from the highest average gross and limit to one.
SELECT release_year, AVG(budget) AS avg_budget, AVG(gross) AS avg_gross
FROM films
WHERE release_year > 1990
GROUP BY release_year
HAVING AVG(budget) > 60000000;
ORDER BY avg_gross DESC
LIMIT 1;
This is what we have learned so far!!
Congratulations!
Best of luck, and remember making mistakes is all part of learning!