13.3. SQL Data - SELECT

13.3.1. SELECT, Reading from a Database

The exercises and sample material here is based on the world database as illustrated in the ER diagram of the database.

Figure 13.1. The World Database, ER Diagram

Let us start with an example:

SELECT id, name FROM city WHERE id = 9;
  • SELECT, FROM, and WHERE are keywords. Case insensitive by the way.
  • name, city, and id are identifiers. In programming they would be variable names, here the first and last are column names while the middle one is a table name.
  • = is an operator.
  • 9 is a constant. Numeric in this case.
  • ; is a statement delimiter.

This is an SQL declaration, or statement. This is what it consists of, nothing but a combination of the above.

The same declaration as the one we have just seen could be written as follows:

select id, name
from city
where id = 9;

Each line is a clause. The select, and the from clauses are mandatory while the where clause is optional. Keywords are case insensitive as mentioned, and splitting the declaration into three lines makes it easier to read. The result when executing the query is

select id, name
from city
where id = 1804;
+------+----------+
| id   | name     |
+------+----------+
| 1804 | Yaoundé  |
+------+----------+
1 row in set (0.00 sec)

Two of the identifiers were columns while the remaining one was a table name. Let us for the sake of clarity see [a fragment of] the table in question:

select *
from city
where countrycode = 'CMR';
+------+------------+-------------+---------------+------------+
| id   | name       | countrycode | district      | population |
+------+------------+-------------+---------------+------------+
| 1803 | Douala     | CMR         | Littoral      |    1448300 |
| 1804 | Yaoundé    | CMR         | Centre        |    1372800 |
| 1805 | Garoua     | CMR         | Nord          |     177000 |
| 1806 | Maroua     | CMR         | Extrême-Nord  |     143000 |
| 1807 | Bamenda    | CMR         | Nord-Ouest    |     138000 |
| 1808 | Bafoussam  | CMR         | Ouest         |     131000 |
| 1809 | Nkongsamba | CMR         | Littoral      |     112454 |
+------+------------+-------------+---------------+------------+
7 rows in set (0.00 sec)

Tables consist of rows and colums. A row hold information about some item. The columns hold values for the attributes of that item.

The proper way of saying it would be to say that each column/row intersection holds a component in the sense defined by the section called “Tuple and Attribute Defined by [Date00a]”.

13.3.2. Ordnung Muss Sein!

You have no theoretical right to expect your database query to deliver results in any specific order - unless you tell it to. In a totally new database relation, table you may probably get the rows output in order they were entered. Have you done any updates, new inserts, or the like you may experience that the RDBMS software has shifted the rows around. This leaves the question of how to specify output order? It's really simple, but remember, you must do it yourself. An example

select *
from city
where countrycode = 'CMR'
order by name;
+------+------------+-------------+---------------+------------+
| id   | name       | countrycode | district      | population |
+------+------------+-------------+---------------+------------+
| 1808 | Bafoussam  | CMR         | Ouest         |     131000 |
| 1807 | Bamenda    | CMR         | Nord-Ouest    |     138000 |
| 1803 | Douala     | CMR         | Littoral      |    1448300 |
| 1805 | Garoua     | CMR         | Nord          |     177000 |
| 1806 | Maroua     | CMR         | Extrême-Nord  |     143000 |
| 1809 | Nkongsamba | CMR         | Littoral      |     112454 |
| 1804 | Yaoundé    | CMR         | Centre        |    1372800 |
+------+------------+-------------+---------------+------------+
7 rows in set (0.00 sec)

or more detailed, by

select *
from city
where countrycode = 'CMR'
order by district, name;
+------+------------+-------------+---------------+------------+
| id   | name       | countrycode | district      | population |
+------+------------+-------------+---------------+------------+
| 1804 | Yaoundé    | CMR         | Centre        |    1372800 |
| 1806 | Maroua     | CMR         | Extrême-Nord  |     143000 |
| 1803 | Douala     | CMR         | Littoral      |    1448300 |
| 1809 | Nkongsamba | CMR         | Littoral      |     112454 |
| 1805 | Garoua     | CMR         | Nord          |     177000 |
| 1807 | Bamenda    | CMR         | Nord-Ouest    |     138000 |
| 1808 | Bafoussam  | CMR         | Ouest         |     131000 |
+------+------------+-------------+---------------+------------+
7 rows in set (0.00 sec)

where district is the top level sort criterium, and the name is the subsidiary criterium.

13.3.3. Introducing Two More Relations

The exercise and sample material here is, again, based on the world database you have downloaded and installed. Let me by another two example queries introduce you to the remaning two relations from that database:

select code2, name, surfacearea, population
from country
where continent ='Oceania'
order by surfacearea desc;
+-------+--------------------------------------+-------------+------------+
| code2 | name                                 | surfacearea | population |
+-------+--------------------------------------+-------------+------------+
| AU    | Australia                            |  7741220.00 |   18886000 |
| PG    | Papua New Guinea                     |   462840.00 |    4807000 |
| NZ    | New Zealand                          |   270534.00 |    3862000 |
| SB    | Solomon Islands                      |    28896.00 |     444000 |
| NC    | New Caledonia                        |    18575.00 |     214000 |
| FJ    | Fiji Islands                         |    18274.00 |     817000 |
| VU    | Vanuatu                              |    12189.00 |     190000 |
| PF    | French Polynesia                     |     4000.00 |     235000 |
| WS    | Samoa                                |     2831.00 |     180000 |
| KI    | Kiribati                             |      726.00 |      83000 |
| FM    | Micronesia, Federated States of      |      702.00 |     119000 |
| TO    | Tonga                                |      650.00 |      99000 |
| GU    | Guam                                 |      549.00 |     168000 |
| MP    | Northern Mariana Islands             |      464.00 |      78000 |
| PW    | Palau                                |      459.00 |      19000 |
| NU    | Niue                                 |      260.00 |       2000 |
| CK    | Cook Islands                         |      236.00 |      20000 |
| WF    | Wallis and Futuna                    |      200.00 |      15000 |
| AS    | American Samoa                       |      199.00 |      68000 |
| MH    | Marshall Islands                     |      181.00 |      64000 |
| CX    | Christmas Island                     |      135.00 |       2500 |
| PN    | Pitcairn                             |       49.00 |         50 |
| NF    | Norfolk Island                       |       36.00 |       2000 |
| TV    | Tuvalu                               |       26.00 |      12000 |
| NR    | Nauru                                |       21.00 |      12000 |
| UM    | United States Minor Outlying Islands |       16.00 |          0 |
| CC    | Cocos (Keeling) Islands              |       14.00 |        600 |
| TK    | Tokelau                              |       12.00 |       2000 |
+-------+--------------------------------------+-------------+------------+
28 rows in set (0.00 sec)

Please notice the dual where clause. The last remaining table has data on languages spoken in all those countries

select countrycode, language, percentage, isofficial
from countrylanguage
where countrycode='ZAF'
order by percentage desc;
+-------------+------------+------------+------------+
| countrycode | language   | percentage | isofficial |
+-------------+------------+------------+------------+
| ZAF         | Zulu       |       22.7 | T          |
| ZAF         | Xhosa      |       17.7 | T          |
| ZAF         | Afrikaans  |       14.3 | T          |
| ZAF         | Northsotho |        9.1 | F          |
| ZAF         | English    |        8.5 | T          |
| ZAF         | Tswana     |        8.1 | F          |
| ZAF         | Southsotho |        7.6 | F          |
| ZAF         | Tsonga     |        4.3 | F          |
| ZAF         | Swazi      |        2.5 | F          |
| ZAF         | Venda      |        2.2 | F          |
| ZAF         | Ndebele    |        1.5 | F          |
+-------------+------------+------------+------------+
11 rows in set (0.00 sec)

MariaDB [world]> select * from countrylanguage where language='Zulu';
+-------------+----------+------------+------------+
| countrycode | language | isofficial | percentage |
+-------------+----------+------------+------------+
| LSO         | Zulu     | F          |       15.0 |
| SWZ         | Zulu     | F          |        2.0 |
| ZAF         | Zulu     | T          |       22.7 |
+-------------+----------+------------+------------+

Please notice that there were two queries here. Another thing to notice in the latter query is the use of * as an all wild card for attributes in the relation. The use of this wild card is strongly not recommended in the construction of queries for web pages.

13.3.4. JOIN - Reading from Multiple Tables

Reading from more than one table at a time is called joining tables. The following text is based on the zipcode and student tables in the world database I have given to you.

select zip, name 
from zipcode;
+------+-----------+
| zip  | name      |
+------+-----------+
| 8000 | Aarhus C  |
| 8200 | Aarhus N  |
| 8210 | Aarhus V  |
| 8240 | Riisskov  |
| 8260 | Viby J    |
| 8270 | Højbjerg  |
+------+-----------+
6 rows in set (0.00 sec)

And student

select sno, name, zip 
from student;
+-----+----------+------+
| sno | name     | zip  |
+-----+----------+------+
|   1 | Adelaide | 8000 |
|   2 | Abigail  | 8260 |
|   3 | Avital   | 8270 |
+-----+----------+------+
3 rows in set (0.00 sec)

Obviously we need to read from both tables in order to print a list of students with fill info on zipcode and name of zipcode. Now the naive user will do as follows, please notice that we use the keyword join.

select sno, name, zip, name 
from student join zipcode;
ERROR 1052 (23000): Column 'name' in field list is ambiguous

Oops, obviously an error. We must qualify the ambiguous column names with the table name. Let's try again

select sno, student.name, zipcode.zip, zipcode.name
from student 
join zipcode;
+-----+----------+------+-----------+
| sno | name     | zip  | name      |
+-----+----------+------+-----------+
|   1 | Adelaide | 8000 | Aarhus C  |
|   2 | Abigail  | 8000 | Aarhus C  |
|   3 | Avital   | 8000 | Aarhus C  |
|   1 | Adelaide | 8200 | Aarhus N  |
|   2 | Abigail  | 8200 | Aarhus N  |
|   3 | Avital   | 8200 | Aarhus N  |
|   1 | Adelaide | 8210 | Aarhus V  |
|   2 | Abigail  | 8210 | Aarhus V  |
|   3 | Avital   | 8210 | Aarhus V  |
|   1 | Adelaide | 8240 | Riisskov  |
|   2 | Abigail  | 8240 | Riisskov  |
|   3 | Avital   | 8240 | Riisskov  |
|   1 | Adelaide | 8260 | Viby J    |
|   2 | Abigail  | 8260 | Viby J    |
|   3 | Avital   | 8260 | Viby J    |
|   1 | Adelaide | 8270 | Højbjerg  |
|   2 | Abigail  | 8270 | Højbjerg  |
|   3 | Avital   | 8270 | Højbjerg  |
+-----+----------+------+-----------+
18 rows in set (0.00 sec)

Mathematically, and relational set manipulation is set manipulation, this is correct. It is an example of a so called cross join aka cartesian product. In such a thing every element from one set is paired with every element from the other set. This results, in our concrete case in 6x3=18 elements in the result set.

Although technically correct, we don't like it. It is not what we expected or wanted. We forgot however, to specify the so called join criteria. Simply add those to the query and get

select sno, student.name, zipcode.zip, zipcode.name
from student 
join zipcode 
on student.zip = zipcode.zip;
+-----+----------+------+-----------+
| sno | name     | zip  | name      |
+-----+----------+------+-----------+
|   1 | Adelaide | 8000 | Aarhus C  |
|   2 | Abigail  | 8260 | Viby J    |
|   3 | Avital   | 8270 | Højbjerg  |
+-----+----------+------+-----------+
3 rows in set (0.00 sec)

Nobody ever needs a Cartesian product. Everybody needs a proper subset of one. Get it by adding proper join parameters in your query. Now you have been warned.

13.3.5. Outer Joins

Joins are most frequently used as we have just seen. They are identified by the use of the keyword inner join, or join for short. The keyword inner is default.

select sno, student.name, zipcode.zip, zipcode.name
from student 
join zipcode 
on student.zip = zipcode.zip;

or

select sno, student.name, zipcode.zip, zipcode.name
from student 
inner join zipcode 
on student.zip = zipcode.zip;

You may want to verify that these two declarations would yield identical results.

Now, the headline of this section reads Outer Join indicating that not all joins are inner joins. We use the inner joins to match elements from two or more relations. Outer joins are used when we need the result set to include elements from a set such that they do not have a matching element in the other set. This is easier illustrated by an example.

select sno, student.name, zipcode.zip, zipcode.name 
from student 
right outer join zipcode 
on student.zip = zipcode.zip;
+------+----------+------+-----------+
| sno  | name     | zip  | name      |
+------+----------+------+-----------+
|    1 | Adelaide | 8000 | Aarhus C  |
|    2 | Abigail  | 8260 | Viby J    |
|    3 | Avital   | 8270 | Højbjerg  |
| NULL | NULL     | 8200 | Aarhus N  |
| NULL | NULL     | 8210 | Aarhus V  |
| NULL | NULL     | 8240 | Riisskov  |
+------+----------+------+-----------+
6 rows in set (0.00 sec)

This example was a right outer join. You may wonter what is left and what is right in this context. Here you could say that from the relations to the right of the join keyword, zipcode you should include elements unmatched by students.

In order to clarify this let me try to swap the left and the right relations in the declaration and substitute left for right.

select sno, student.name, zipcode.zip, zipcode.name 
from zipcode 
left outer join student 
on student.zip = zipcode.zip;
+------+----------+------+-----------+
| sno  | name     | zip  | name      |
+------+----------+------+-----------+
|    1 | Adelaide | 8000 | Aarhus C  |
|    2 | Abigail  | 8260 | Viby J    |
|    3 | Avital   | 8270 | Højbjerg  |
| NULL | NULL     | 8200 | Aarhus N  |
| NULL | NULL     | 8210 | Aarhus V  |
| NULL | NULL     | 8240 | Riisskov  |
+------+----------+------+-----------+
6 rows in set (0.00 sec)

13.3.6. Aliases

You have already seen a couple of declarations where some of the identifiers are use repeatedly.

select sno, student.name, zipcode.zip, zipcode.name
from student 
join zipcode 
on student.zip = zipcode.zip;

Here student and zipcode are repeated. If you want to economize with keyboard work, try this

select sno, s.name, z.zip, z.name
from student s 
join zipcode z
on s.zip = z.zip;
+-----+----------+------+-----------+
| sno | name     | zip  | name      |
+-----+----------+------+-----------+
|   1 | Adelaide | 8000 | Aarhus C  |
|   2 | Abigail  | 8260 | Viby J    |
|   3 | Avital   | 8270 | Højbjerg  |
+-----+----------+------+-----------+
3 rows in set (0.00 sec)

In the second line of the query you assign an alias s to the student relation. In the following line you similarly assign an alias z to the relation zipcode. These aliases are used throughout lines 1 and 4 of the declaration. The result is a much shorter declaration.

As declarations are frequently long, and both relation names and attribute names likewise in order to make meaningful names, aliases is an appreciated way of shortening declaration typing. You might want to take advantage of that when you test your declarations before applying them to real life through programs.

In order to avoid ambiguity in column names attribute names may also be aliased. In the above example we have two name attributes in the attribute list resulting in two name columns in the output. The following will resolve the ambiquity.

select sno, s.name studentname, z.zip, z.name cityname
from student s 
join zipcode z
on s.zip = z.zip;
+-----+-------------+------+-----------+
| sno | studentname | zip  | cityname  |
+-----+-------------+------+-----------+
|   1 | Adelaide    | 8000 | Aarhus C  |
|   2 | Abigail     | 8260 | Viby J    |
|   3 | Avital      | 8270 | Højbjerg  |
+-----+-------------+------+-----------+
3 rows in set (0.00 sec)

Aliases are optional and used for convenience. There is however one situation where you must use aliases. Attributes prefixed with table names or table aliases must be aliased if the result set is to be handled by backend programming languages.

13.3.7. Grouping, Group Functions, Ordering

SQL has grouping facilities built in. Grouping is necessary for example when you list countries and you want to total population per continent. Then continent is considered a group.

select continent, count(name) countries, sum(population) 'Total population'
from country
group by continent;
+---------------+-----------+------------------+
| continent     | countries | Total population |
+---------------+-----------+------------------+
| Asia          |        51 |       3705025700 |
| Europe        |        46 |        730074600 |
| North America |        37 |        482993000 |
| Africa        |        58 |        784475000 |
| Oceania       |        28 |         30401150 |
| Antarctica    |         5 |                0 |
| South America |        14 |        345780000 |
+---------------+-----------+------------------+
7 rows in set (0.00 sec)

When you group, it is not recommended to include columns from individual rows in the result set. Country names in the above query would make no sense because only one row per group will be in the result.

Please notice the use of column aliases in the query. They are irrelevant in any web context, but useful in order to created meaningful column headings in test output from the query.

There are 5 group functions, no more, no less. You saw two here: count(), and sum. In addition to those you have min(), max(), and avg(). Apart from count() they are obviously only useful on numeric columns. In all cases you must include a column name in the parenthesis.

It is possible to use group functions without the keyword group by in the query. In such cases the total relation will be considered one group, and therefore the result set will include only one row.

select 'All continents'
      , count(continent) 'No of Continents'
      , sum(population) 'No of People' 
from country;
+----------------+------------------+--------------+
| All continents | No of Continents | No of People |
+----------------+------------------+--------------+
| All continents |              239 |   6078749450 |
+----------------+------------------+--------------+
1 row in set (0.00 sec)

Here you will notice, apart from what was explained, the use of a fixed text as the first column, and column aliases as headers for the other columns.

It is obvious logically that grouping involves some kind of ordering the input in such a way that counting, and summation make sense. There's no ordering of the output from this grouping. As always, if you need order, you create it yourself by order by.

select continent, count(name), sum(population)  
from country  
group by continent 
order by sum(population) desc;
+---------------+-------------+-----------------+
| continent     | count(name) | sum(population) |
+---------------+-------------+-----------------+
| Asia          |          51 |      3705025700 |
| Africa        |          58 |       784475000 |
| Europe        |          46 |       730074600 |
| North America |          37 |       482993000 |
| South America |          14 |       345780000 |
| Oceania       |          28 |        30401150 |
| Antarctica    |           5 |               0 |
+---------------+-------------+-----------------+
7 rows in set (0.00 sec)

13.3.8. Filtering

Working with groups, you sometimes need selective results. Now in an ordinary select, the where clause filters the relevant rows to be included in the result set. In groups however, the individual rows contribute to the group results, and therefore we need a different clause to select from the group results. This keyword is having.

select continent, count(name), sum(population)  
from country  
group by continent 
having sum(population) > 500000000 
order by sum(population) desc;
+-----------+-------------+-----------------+
| continent | count(name) | sum(population) |
+-----------+-------------+-----------------+
| Asia      |          51 |      3705025700 |
| Africa    |          58 |       784475000 |
| Europe    |          46 |       730074600 |
+-----------+-------------+-----------------+
3 rows in set (0.00 sec)

Here's an example of filtering with aliases in the declaration.

select co.name, count(ci.name) as citycount, sum(ci.population) 
from country co 
join city ci on code=ci.countrycode 
where continent = 'South America' 
group by co.name 
having count(ci.name) > 10 
order by sum(ci.population) desc;
select co.name, co.population, count(ci.name) citycount, sum(ci.population)
from country co 
join city ci   
on code=ci.countrycode 
where continent = 'South America' 
group by co.name 
having citycount > 10
order by population desc;
+-----------+------------+-----------+--------------------+
| name      | population | citycount | sum(ci.population) |
+-----------+------------+-----------+--------------------+
| Brazil    |  170115000 |       250 |           85876862 |
| Colombia  |   42321000 |        38 |           20250990 |
| Argentina |   37032000 |        57 |           19996563 |
| Peru      |   25662000 |        22 |           12147242 |
| Venezuela |   24170000 |        41 |           12251091 |
| Chile     |   15211000 |        29 |            9717970 |
| Ecuador   |   12646000 |        15 |            5744142 |
+-----------+------------+-----------+--------------------+
7 rows in set (0.00 sec)

13.3.9. Calculations

Obviously we need to make calculations on data stored in database relations. One way is retrieving data by querying and then calculating by some host programming language. In a web situation this might be accomplished by PHP.

SQL is capable of doing calculation on the fly in queries. In some cases this is beneficial for your purposes. In the following query, the number of cities is counted grouped by country. The total of city inhabitants is calculated by sum(ci.population) citypop. This sum is calculated as a percentage of the population of the country as retrieved from the country table. Plain arithmetic, courtesy of SQL. All the regular arithmetic operators apply.

The following declaration counts cities, sums city populations per country in South America. It also calculates the city population as a percentage of country population to give a column called Urbanization.

select co.name
     , co.population
     , count(ci.name) citycount
     , sum(ci.population) citypop
     , sum(ci.population) * 100 / co.population "Urbanization" 
from country co 
join city ci    
on code=ci.countrycode 
where continent = 'South America'  
group by co.name 
order by urbanization desc;
+------------------+------------+-----------+----------+--------------+
| name             | population | citycount | citypop  | Urbanization |
+------------------+------------+-----------+----------+--------------+
| Falkland Islands |       2000 |         1 |     1636 |      81.8000 |
| Chile            |   15211000 |        29 |  9717970 |      63.8878 |
| Argentina        |   37032000 |        57 | 19996563 |      53.9981 |
| Venezuela        |   24170000 |        41 | 12251091 |      50.6872 |
| Brazil           |  170115000 |       250 | 85876862 |      50.4817 |
| Colombia         |   42321000 |        38 | 20250990 |      47.8509 |
| Peru             |   25662000 |        22 | 12147242 |      47.3355 |
| Ecuador          |   12646000 |        15 |  5744142 |      45.4226 |
| Bolivia          |    8329000 |         8 |  3378644 |      40.5648 |
| Uruguay          |    3337000 |         1 |  1236000 |      37.0393 |
| Guyana           |     861000 |         1 |   254000 |      29.5006 |
| French Guiana    |     181000 |         1 |    50699 |      28.0105 |
| Suriname         |     417000 |         1 |   112000 |      26.8585 |
| Paraguay         |    5496000 |         5 |  1020020 |      18.5593 |
+------------------+------------+-----------+----------+--------------+
14 rows in set (0.00 sec)

13.3.10. Distinct

To answer a simple question such as "What languages exist is this world" we could use

select language
from countrylanguage
order by language;

Now this is true of course, but the result has a lot of repetitions, redundancies if you will. To make it even more useful SQL has a nice keyword distinct for you. See what it does? Every distinct column value appears exactly once.

select distinct language
from countrylanguage
order by language;