The exercises and sample material here is
based on the
world
database
as illustrated in the ER diagram of the database.
Let us start with an example:
SELECT id, name FROM city WHERE id = 9;
SELECT
, FROM
, and
WHERE
are keywords.
Case insensitive by the way.
name
, city
, and id
are identifiers.
In programming they would be variable names,
here the first and last are column
names while the middle one is a
table name.
=
is an operator.
9
is a constant.
Numeric in this case.
;
is a
statement delimiter.
This is an SQL declaration, or statement. This is what it consists of, nothing but a combination of the above.
The same declaration as the one we have just seen could be written as follows:
select id, name
from city
where id = 9;
Each line is a clause. The
select
, and the from
clauses are
mandatory while the where
clause is optional.
Keywords are case insensitive as mentioned, and
splitting the declaration into three lines makes
it easier to read. The result when executing the query
is
select id, name
from city
where id = 1804;
+------+----------+
| id | name |
+------+----------+
| 1804 | Yaoundé |
+------+----------+
1 row in set (0.00 sec)
Two of the identifiers were columns while the remaining one was a table name. Let us for the sake of clarity see [a fragment of] the table in question:
select *
from city
where countrycode = 'CMR';
+------+------------+-------------+---------------+------------+
| id | name | countrycode | district | population |
+------+------------+-------------+---------------+------------+
| 1803 | Douala | CMR | Littoral | 1448300 |
| 1804 | Yaoundé | CMR | Centre | 1372800 |
| 1805 | Garoua | CMR | Nord | 177000 |
| 1806 | Maroua | CMR | Extrême-Nord | 143000 |
| 1807 | Bamenda | CMR | Nord-Ouest | 138000 |
| 1808 | Bafoussam | CMR | Ouest | 131000 |
| 1809 | Nkongsamba | CMR | Littoral | 112454 |
+------+------------+-------------+---------------+------------+
7 rows in set (0.00 sec)
Tables consist of rows and colums. A row hold information about some item. The columns hold values for the attributes of that item.
The proper way of saying it would be to say that each column/row intersection holds a component in the sense defined by the section called “Tuple and Attribute Defined by [Date00a]”.
You have no theoretical right to expect your database query to deliver results in any specific order - unless you tell it to. In a totally new database relation, table you may probably get the rows output in order they were entered. Have you done any updates, new inserts, or the like you may experience that the RDBMS software has shifted the rows around. This leaves the question of how to specify output order? It's really simple, but remember, you must do it yourself. An example
select *
from city
where countrycode = 'CMR'
order by name;
+------+------------+-------------+---------------+------------+
| id | name | countrycode | district | population |
+------+------------+-------------+---------------+------------+
| 1808 | Bafoussam | CMR | Ouest | 131000 |
| 1807 | Bamenda | CMR | Nord-Ouest | 138000 |
| 1803 | Douala | CMR | Littoral | 1448300 |
| 1805 | Garoua | CMR | Nord | 177000 |
| 1806 | Maroua | CMR | Extrême-Nord | 143000 |
| 1809 | Nkongsamba | CMR | Littoral | 112454 |
| 1804 | Yaoundé | CMR | Centre | 1372800 |
+------+------------+-------------+---------------+------------+
7 rows in set (0.00 sec)
or more detailed, by
select *
from city
where countrycode = 'CMR'
order by district, name;
+------+------------+-------------+---------------+------------+
| id | name | countrycode | district | population |
+------+------------+-------------+---------------+------------+
| 1804 | Yaoundé | CMR | Centre | 1372800 |
| 1806 | Maroua | CMR | Extrême-Nord | 143000 |
| 1803 | Douala | CMR | Littoral | 1448300 |
| 1809 | Nkongsamba | CMR | Littoral | 112454 |
| 1805 | Garoua | CMR | Nord | 177000 |
| 1807 | Bamenda | CMR | Nord-Ouest | 138000 |
| 1808 | Bafoussam | CMR | Ouest | 131000 |
+------+------------+-------------+---------------+------------+
7 rows in set (0.00 sec)
where district
is the top level
sort criterium, and the name
is the subsidiary criterium.
The exercise and sample material here is, again,
based on the
world
database you have downloaded and
installed.
Let me by another two example queries introduce you
to the remaning two relations from that database:
select code2, name, surfacearea, population
from country
where continent ='Oceania'
order by surfacearea desc;
+-------+--------------------------------------+-------------+------------+
| code2 | name | surfacearea | population |
+-------+--------------------------------------+-------------+------------+
| AU | Australia | 7741220.00 | 18886000 |
| PG | Papua New Guinea | 462840.00 | 4807000 |
| NZ | New Zealand | 270534.00 | 3862000 |
| SB | Solomon Islands | 28896.00 | 444000 |
| NC | New Caledonia | 18575.00 | 214000 |
| FJ | Fiji Islands | 18274.00 | 817000 |
| VU | Vanuatu | 12189.00 | 190000 |
| PF | French Polynesia | 4000.00 | 235000 |
| WS | Samoa | 2831.00 | 180000 |
| KI | Kiribati | 726.00 | 83000 |
| FM | Micronesia, Federated States of | 702.00 | 119000 |
| TO | Tonga | 650.00 | 99000 |
| GU | Guam | 549.00 | 168000 |
| MP | Northern Mariana Islands | 464.00 | 78000 |
| PW | Palau | 459.00 | 19000 |
| NU | Niue | 260.00 | 2000 |
| CK | Cook Islands | 236.00 | 20000 |
| WF | Wallis and Futuna | 200.00 | 15000 |
| AS | American Samoa | 199.00 | 68000 |
| MH | Marshall Islands | 181.00 | 64000 |
| CX | Christmas Island | 135.00 | 2500 |
| PN | Pitcairn | 49.00 | 50 |
| NF | Norfolk Island | 36.00 | 2000 |
| TV | Tuvalu | 26.00 | 12000 |
| NR | Nauru | 21.00 | 12000 |
| UM | United States Minor Outlying Islands | 16.00 | 0 |
| CC | Cocos (Keeling) Islands | 14.00 | 600 |
| TK | Tokelau | 12.00 | 2000 |
+-------+--------------------------------------+-------------+------------+
28 rows in set (0.00 sec)
Please notice the dual where clause. The last remaining table has data on languages spoken in all those countries
select countrycode, language, percentage, isofficial
from countrylanguage
where countrycode='ZAF'
order by percentage desc;
+-------------+------------+------------+------------+
| countrycode | language | percentage | isofficial |
+-------------+------------+------------+------------+
| ZAF | Zulu | 22.7 | T |
| ZAF | Xhosa | 17.7 | T |
| ZAF | Afrikaans | 14.3 | T |
| ZAF | Northsotho | 9.1 | F |
| ZAF | English | 8.5 | T |
| ZAF | Tswana | 8.1 | F |
| ZAF | Southsotho | 7.6 | F |
| ZAF | Tsonga | 4.3 | F |
| ZAF | Swazi | 2.5 | F |
| ZAF | Venda | 2.2 | F |
| ZAF | Ndebele | 1.5 | F |
+-------------+------------+------------+------------+
11 rows in set (0.00 sec)
MariaDB [world]> select * from countrylanguage where language='Zulu';
+-------------+----------+------------+------------+
| countrycode | language | isofficial | percentage |
+-------------+----------+------------+------------+
| LSO | Zulu | F | 15.0 |
| SWZ | Zulu | F | 2.0 |
| ZAF | Zulu | T | 22.7 |
+-------------+----------+------------+------------+
Please notice that there were two queries here.
Another thing to notice in the latter query is the use
of *
as an all wild
card for attributes in the relation.
The use of this wild card is strongly not
recommended in the construction of queries for web
pages.
Reading from more than one table at a time is called
joining tables. The following text is
based on the zipcode
and student
tables in the world
database
I have given to you.
select zip, name
from zipcode;
+------+-----------+
| zip | name |
+------+-----------+
| 8000 | Aarhus C |
| 8200 | Aarhus N |
| 8210 | Aarhus V |
| 8240 | Riisskov |
| 8260 | Viby J |
| 8270 | Højbjerg |
+------+-----------+
6 rows in set (0.00 sec)
And student
select sno, name, zip
from student;
+-----+----------+------+
| sno | name | zip |
+-----+----------+------+
| 1 | Adelaide | 8000 |
| 2 | Abigail | 8260 |
| 3 | Avital | 8270 |
+-----+----------+------+
3 rows in set (0.00 sec)
Obviously we need to read from both tables
in order to print a list of students with
fill info on zipcode and name of zipcode.
Now the naive user will do as follows, please
notice that we use the keyword join
.
select sno, name, zip, name
from student join zipcode;
ERROR 1052 (23000): Column 'name' in field list is ambiguous
Oops, obviously an error. We must qualify the ambiguous column names with the table name. Let's try again
select sno, student.name, zipcode.zip, zipcode.name
from student
join zipcode;
+-----+----------+------+-----------+
| sno | name | zip | name |
+-----+----------+------+-----------+
| 1 | Adelaide | 8000 | Aarhus C |
| 2 | Abigail | 8000 | Aarhus C |
| 3 | Avital | 8000 | Aarhus C |
| 1 | Adelaide | 8200 | Aarhus N |
| 2 | Abigail | 8200 | Aarhus N |
| 3 | Avital | 8200 | Aarhus N |
| 1 | Adelaide | 8210 | Aarhus V |
| 2 | Abigail | 8210 | Aarhus V |
| 3 | Avital | 8210 | Aarhus V |
| 1 | Adelaide | 8240 | Riisskov |
| 2 | Abigail | 8240 | Riisskov |
| 3 | Avital | 8240 | Riisskov |
| 1 | Adelaide | 8260 | Viby J |
| 2 | Abigail | 8260 | Viby J |
| 3 | Avital | 8260 | Viby J |
| 1 | Adelaide | 8270 | Højbjerg |
| 2 | Abigail | 8270 | Højbjerg |
| 3 | Avital | 8270 | Højbjerg |
+-----+----------+------+-----------+
18 rows in set (0.00 sec)
Mathematically, and relational set manipulation is set manipulation, this is correct. It is an example of a so called cross join aka cartesian product. In such a thing every element from one set is paired with every element from the other set. This results, in our concrete case in 6x3=18 elements in the result set.
Although technically correct, we don't like it. It is not what we expected or wanted. We forgot however, to specify the so called join criteria. Simply add those to the query and get
select sno, student.name, zipcode.zip, zipcode.name
from student
join zipcode
on student.zip = zipcode.zip;
+-----+----------+------+-----------+
| sno | name | zip | name |
+-----+----------+------+-----------+
| 1 | Adelaide | 8000 | Aarhus C |
| 2 | Abigail | 8260 | Viby J |
| 3 | Avital | 8270 | Højbjerg |
+-----+----------+------+-----------+
3 rows in set (0.00 sec)
Nobody ever needs a Cartesian product. Everybody needs a proper subset of one. Get it by adding proper join parameters in your query. Now you have been warned.
Joins are most frequently used as we have just seen.
They are identified by the use of the
keyword inner join
, or join
for short. The keyword
inner is default.
select sno, student.name, zipcode.zip, zipcode.name
from student
join zipcode
on student.zip = zipcode.zip;
or
select sno, student.name, zipcode.zip, zipcode.name
from student
inner join zipcode
on student.zip = zipcode.zip;
You may want to verify that these two declarations would yield identical results.
Now, the headline of this section reads Outer Join indicating that not all joins are inner joins. We use the inner joins to match elements from two or more relations. Outer joins are used when we need the result set to include elements from a set such that they do not have a matching element in the other set. This is easier illustrated by an example.
select sno, student.name, zipcode.zip, zipcode.name
from student
right outer join zipcode
on student.zip = zipcode.zip;
+------+----------+------+-----------+
| sno | name | zip | name |
+------+----------+------+-----------+
| 1 | Adelaide | 8000 | Aarhus C |
| 2 | Abigail | 8260 | Viby J |
| 3 | Avital | 8270 | Højbjerg |
| NULL | NULL | 8200 | Aarhus N |
| NULL | NULL | 8210 | Aarhus V |
| NULL | NULL | 8240 | Riisskov |
+------+----------+------+-----------+
6 rows in set (0.00 sec)
This example was a right outer join
. You may
wonter what is left and what is right in this context.
Here you could say that from the relations to the right of
the join
keyword, zipcode
you
should include elements unmatched by students.
In order to clarify this let me try to swap the left and the
right relations in the declaration and
substitute left
for right
.
select sno, student.name, zipcode.zip, zipcode.name
from zipcode
left outer join student
on student.zip = zipcode.zip;
+------+----------+------+-----------+
| sno | name | zip | name |
+------+----------+------+-----------+
| 1 | Adelaide | 8000 | Aarhus C |
| 2 | Abigail | 8260 | Viby J |
| 3 | Avital | 8270 | Højbjerg |
| NULL | NULL | 8200 | Aarhus N |
| NULL | NULL | 8210 | Aarhus V |
| NULL | NULL | 8240 | Riisskov |
+------+----------+------+-----------+
6 rows in set (0.00 sec)
You have already seen a couple of declarations where some of the identifiers are use repeatedly.
select sno, student.name, zipcode.zip, zipcode.name
from student
join zipcode
on student.zip = zipcode.zip;
Here student
and zipcode
are
repeated. If you want to economize with keyboard work,
try this
select sno, s.name, z.zip, z.name
from student s
join zipcode z
on s.zip = z.zip;
+-----+----------+------+-----------+
| sno | name | zip | name |
+-----+----------+------+-----------+
| 1 | Adelaide | 8000 | Aarhus C |
| 2 | Abigail | 8260 | Viby J |
| 3 | Avital | 8270 | Højbjerg |
+-----+----------+------+-----------+
3 rows in set (0.00 sec)
In the second line of the query you assign an
alias s
to the student relation.
In the following line you similarly assign an
alias z
to the relation zipcode.
These aliases are used throughout lines 1 and 4 of the
declaration. The result is a much shorter declaration.
As declarations are frequently long, and both relation names and attribute names likewise in order to make meaningful names, aliases is an appreciated way of shortening declaration typing. You might want to take advantage of that when you test your declarations before applying them to real life through programs.
In order to avoid ambiguity in column names attribute names may also be aliased. In the above example we have two name attributes in the attribute list resulting in two name columns in the output. The following will resolve the ambiquity.
select sno, s.name studentname, z.zip, z.name cityname
from student s
join zipcode z
on s.zip = z.zip;
+-----+-------------+------+-----------+
| sno | studentname | zip | cityname |
+-----+-------------+------+-----------+
| 1 | Adelaide | 8000 | Aarhus C |
| 2 | Abigail | 8260 | Viby J |
| 3 | Avital | 8270 | Højbjerg |
+-----+-------------+------+-----------+
3 rows in set (0.00 sec)
Aliases are optional and used for convenience. There is however one situation where you must use aliases. Attributes prefixed with table names or table aliases must be aliased if the result set is to be handled by backend programming languages.
SQL has grouping facilities built in. Grouping is necessary for example when you list countries and you want to total population per continent. Then continent is considered a group.
select continent, count(name) countries, sum(population) 'Total population'
from country
group by continent;
+---------------+-----------+------------------+
| continent | countries | Total population |
+---------------+-----------+------------------+
| Asia | 51 | 3705025700 |
| Europe | 46 | 730074600 |
| North America | 37 | 482993000 |
| Africa | 58 | 784475000 |
| Oceania | 28 | 30401150 |
| Antarctica | 5 | 0 |
| South America | 14 | 345780000 |
+---------------+-----------+------------------+
7 rows in set (0.00 sec)
When you group, it is not recommended to include columns from individual rows in the result set. Country names in the above query would make no sense because only one row per group will be in the result.
Please notice the use of column aliases in the query. They are irrelevant in any web context, but useful in order to created meaningful column headings in test output from the query.
There are 5 group functions, no more,
no less. You saw two here: count()
,
and sum
. In addition to those you
have min()
, max()
,
and avg()
. Apart from count()
they
are obviously only useful on numeric columns. In all cases
you must include a column name in the parenthesis.
It is possible to use group functions without the
keyword group by
in the query. In such cases
the total relation will be considered one group, and
therefore the result set will include only one row.
select 'All continents'
, count(continent) 'No of Continents'
, sum(population) 'No of People'
from country;
+----------------+------------------+--------------+
| All continents | No of Continents | No of People |
+----------------+------------------+--------------+
| All continents | 239 | 6078749450 |
+----------------+------------------+--------------+
1 row in set (0.00 sec)
Here you will notice, apart from what was explained, the use of a fixed text as the first column, and column aliases as headers for the other columns.
It is obvious logically that grouping involves some kind
of ordering the input in such a way that counting, and
summation make sense. There's no ordering of the output
from this grouping. As always, if you need order, you
create it yourself by order by
.
select continent, count(name), sum(population)
from country
group by continent
order by sum(population) desc;
+---------------+-------------+-----------------+
| continent | count(name) | sum(population) |
+---------------+-------------+-----------------+
| Asia | 51 | 3705025700 |
| Africa | 58 | 784475000 |
| Europe | 46 | 730074600 |
| North America | 37 | 482993000 |
| South America | 14 | 345780000 |
| Oceania | 28 | 30401150 |
| Antarctica | 5 | 0 |
+---------------+-------------+-----------------+
7 rows in set (0.00 sec)
Working with groups, you sometimes need selective results.
Now in an ordinary select
, the where
clause filters the relevant rows to be included in the
result set.
In groups however, the individual rows contribute to the group
results, and therefore we need a different clause to
select from the group results. This keyword is having
.
select continent, count(name), sum(population)
from country
group by continent
having sum(population) > 500000000
order by sum(population) desc;
+-----------+-------------+-----------------+
| continent | count(name) | sum(population) |
+-----------+-------------+-----------------+
| Asia | 51 | 3705025700 |
| Africa | 58 | 784475000 |
| Europe | 46 | 730074600 |
+-----------+-------------+-----------------+
3 rows in set (0.00 sec)
Here's an example of filtering with aliases in the declaration.
select co.name, count(ci.name) as citycount, sum(ci.population)
from country co
join city ci on code=ci.countrycode
where continent = 'South America'
group by co.name
having count(ci.name) > 10
order by sum(ci.population) desc;
select co.name, co.population, count(ci.name) citycount, sum(ci.population)
from country co
join city ci
on code=ci.countrycode
where continent = 'South America'
group by co.name
having citycount > 10
order by population desc;
+-----------+------------+-----------+--------------------+
| name | population | citycount | sum(ci.population) |
+-----------+------------+-----------+--------------------+
| Brazil | 170115000 | 250 | 85876862 |
| Colombia | 42321000 | 38 | 20250990 |
| Argentina | 37032000 | 57 | 19996563 |
| Peru | 25662000 | 22 | 12147242 |
| Venezuela | 24170000 | 41 | 12251091 |
| Chile | 15211000 | 29 | 9717970 |
| Ecuador | 12646000 | 15 | 5744142 |
+-----------+------------+-----------+--------------------+
7 rows in set (0.00 sec)
Obviously we need to make calculations on data stored in database relations. One way is retrieving data by querying and then calculating by some host programming language. In a web situation this might be accomplished by PHP.
SQL is capable of doing calculation on the fly in queries.
In some cases this is beneficial for your purposes.
In the following query, the number of cities is
counted grouped by country. The total of city
inhabitants is calculated
by sum(ci.population) citypop
.
This sum is calculated as a percentage of the
population of the country as retrieved from the
country table.
Plain arithmetic, courtesy of SQL. All the regular
arithmetic operators apply.
The following declaration counts cities, sums city populations per country in South America. It also calculates the city population as a percentage of country population to give a column called Urbanization.
select co.name
, co.population
, count(ci.name) citycount
, sum(ci.population) citypop
, sum(ci.population) * 100 / co.population "Urbanization"
from country co
join city ci
on code=ci.countrycode
where continent = 'South America'
group by co.name
order by urbanization desc;
+------------------+------------+-----------+----------+--------------+
| name | population | citycount | citypop | Urbanization |
+------------------+------------+-----------+----------+--------------+
| Falkland Islands | 2000 | 1 | 1636 | 81.8000 |
| Chile | 15211000 | 29 | 9717970 | 63.8878 |
| Argentina | 37032000 | 57 | 19996563 | 53.9981 |
| Venezuela | 24170000 | 41 | 12251091 | 50.6872 |
| Brazil | 170115000 | 250 | 85876862 | 50.4817 |
| Colombia | 42321000 | 38 | 20250990 | 47.8509 |
| Peru | 25662000 | 22 | 12147242 | 47.3355 |
| Ecuador | 12646000 | 15 | 5744142 | 45.4226 |
| Bolivia | 8329000 | 8 | 3378644 | 40.5648 |
| Uruguay | 3337000 | 1 | 1236000 | 37.0393 |
| Guyana | 861000 | 1 | 254000 | 29.5006 |
| French Guiana | 181000 | 1 | 50699 | 28.0105 |
| Suriname | 417000 | 1 | 112000 | 26.8585 |
| Paraguay | 5496000 | 5 | 1020020 | 18.5593 |
+------------------+------------+-----------+----------+--------------+
14 rows in set (0.00 sec)
To answer a simple question such as "What languages exist is this world" we could use
select language
from countrylanguage
order by language;
Now this is true of course, but the result has a lot
of repetitions, redundancies if you will. To make it
even more useful SQL has a nice
keyword distinct
for you. See what it does?
Every distinct column value appears exactly once.
select distinct language
from countrylanguage
order by language;