The Math of SQL

Video

Motivation

Few technologies have changed the way I see the world more than SQL Databases. Deeply mathematical but surprisingly versatile, the storage of information exists in a back-and-forth relationship with your understanding of the information stored.

By working from good principles - of robust design, minimizing unnecessary limitations, and a permissive outlook - what may seem impossible to change could be very easy.

Mathematics

Relational Data

Tabular data is easy to store. Lists of numbers, dates, or names are well-studied.

Real world data is almost never so simple. Data, like the people who generate it, comes with a complex web of relationships.

In the abstract context, mathematicians who study the idea of data define a few core concepts for a Dataset, which contains:

  • A (finite) set of Columns values $S_i$, indexed by some (finite) set $I$.
    • Each column set represents some Data Type (or Data Domain) : (finite) equality-comparable sets of possible Data Values.
  • A (finite) set of Relations $R$ : A "named tuple" of column IDs and Values for some subset of the columns.

Data is manipulated in terms of finite first-order predicate calculus. In those terms, the Relations are typically defined as a truth function:

\begin{align} \prod_i S_i &\overset f \mapsto \{T,F\} & f((s_1,\dots,s_n)) = T \iff (s_1,\dots, s_n) \in R \end{align}

Fundamentally, all finite data can be viewed this way, and complex statements about the data can be phrased - and thus resolved - as predicate statements about the truth function.

It can be helpful to group the Relations according to their Names and Data Domains. By analogy, we refer to these groups as Tables, the Names as Column Names, the Values as Attributes, and the Data Domains/Types as Column Domains/Types.

Note that the same real-world data can be represented in many ways, that are equivalent (up to a translation of the predicate calculus). The study of the relative costs and benefits of certain data representations has many real-world applications to the storage of data, as we will see.

Part of real world data is that some things are impossible - start dates occur after end dates, and so on - so we can define Data Spaces $D \subset \prod S_i$, which contain only "valid" data under some reasonable conditions.

Example

Let's say you had a lot of data - for example, a dataset on mathematical publications:

author title publisher journal
Wagstaff, Samuel S Divisors of Mersenne numbers Amer. Math. Soc. Mathematics of Computation
Pomerance, Carl and Selfridge, John L and Wagstaff, Samuel S The pseudoprimes to 25⋅ 10⁹ Amer. Math. Soc. Mathematics of Computation
Pomerance, Carl On the distribution of pseudoprimes Amer. Math. Soc. Mathematics of Computation
Wagstaff, Samuel S Infinite matroids Amer. Math. Soc. Transactions of the American Mathematical Society
Alford, William R and Granville, Andrew and Pomerance, Carl There are infinitely many Carmichael numbers JSTOR Annals of Mathematics


Using this table works in general, for a tabular representation of small data - this is basically how BibTex stores it, after all.

This can work for some computations; we could group articles by the same single author - providing there are no typos or alternate spellings - with a predicate operation on the authors. However, we can do better.

Normalization

Atomicity and the 1st Normal Form

Let's pick out some flaws from this that we can improve upon. First, the Author section - searching for all entries by the same author is tedious, involving substring matching and errors.

Enter the First Normal Form:

A Dataset is said to be in 1st Normal Form iff every cell is Atomic, that is, no cell represents more than one Data Element.

Example

We could attempt a simple Atomization - conversion to first normal form - by setting some reasonable bound on the number of authors, and making a separate column for each:

Author 1 Author 2 Author 3 title publisher journal
Wagstaff, Samuel S None None Divisors of Mersenne numbers Amer. Math. Soc. Mathematics of Computation
Pomerance, Carl Selfridge, John L Wagstaff, Samuel S The pseudoprimes to 25⋅ 10⁹ Amer. Math. Soc. Mathematics of Computation
Pomerance, Carl None None On the distribution of pseudoprimes Amer. Math. Soc. Mathematics of Computation
Wagstaff, Samuel S None None Infinite matroids Amer. Math. Soc. Transactions of the American Mathematical Society
Alford, William R Granville, Andrew Pomerance, Carl There are infinitely many Carmichael numbers JSTOR Annals of Mathematics

But there may be some very large collaborations out there - and this makes it hard to search by author.

A more robust Atomization might be to expand the author column into rows:

author name author number title publisher journal
Alford, William R 1 There are infinitely many Carmichael numbers JSTOR Annals of Mathematics
Granville, Andrew 2 There are infinitely many Carmichael numbers JSTOR Annals of Mathematics
Pomerance, Carl 3 There are infinitely many Carmichael numbers JSTOR Annals of Mathematics
Pomerance, Carl 1 The pseudoprimes to 25⋅ 10⁹ Amer. Math. Soc. Mathematics of Computation
Selfridge, John L 2 The pseudoprimes to 25⋅ 10⁹ Amer. Math. Soc. Mathematics of Computation
Wagstaff, Samuel S 3 The pseudoprimes to 25⋅ 10⁹ Amer. Math. Soc. Mathematics of Computation
Pomerance, Carl 1 On the distribution of pseudoprimes Amer. Math. Soc. Mathematics of Computation
Wagstaff, Samuel S 1 Infinite matroids Amer. Math. Soc. Transactions of the American Mathematical Society
Wagstaff, Samuel S 1 Divisors of Mersenne numbers Amer. Math. Soc. Mathematics of Computation


This allows us to search for all papers by a given author!

Uniqueness and the 2nd Normal Form

However, now we have even more repeated information. To fix that, we will continue Normalization.

To define the 2nd Normal Form, we have to talk a bit about keys.

Under the mathematical definition, no two rows can be identical. Often, in practice, there are smaller subsets which also should be unique in the space - such as the ISBN for books, or author-title combination for journal articles.

Definition: Suppose we have a Data Space $D \subset \prod S_i$, and a set of relations $R \subset D$.

A set of Attributes $A \subset I$ of $r \in R$ is called a Candidate Key if and only if both:

  • $\not \exists s \in D \backslash r$ such that $s|A = r|A$ - that is, $r$ restricted to the columns $A$ uniquely identifies $r$ in $D$.
  • $A$ is minimal with respect to this condition; that is, $\not \exists B \subsetneq A$ such that $B$ satisfies the first condition.

Candidate Keys will come up a lot, and they show up in the formal definition of the 2nd normal form. However, we need another quick definition:

Definition: A Functional Dependency from $A$ to $B$ in a Data Space $D \subset \prod S_i$ is a relation between disjoint attibute sets $A,B \subset I$, $A \cap B = \emptyset$ such that such that: \[ \forall r,s \in D: r|A=s|A \implies r|B=s|B \]

Or, more colloquially, the columns in B depend completely on the columns in A.

With these definitions, we can build the second normal form:

Definition: A dataset is said to be in 2nd Normal Form if there are no (nontrivial) functional dependencies from a proper subset of a candidate key, to a non-key attribute.

This is kind of technical, but basically it means - don't store conclusions based on part of your key.

Example

Our current table violates this, because the journal title - irrespective of author - determines most of the information.

This can cause extra storage, but the worse thing is inconsistencies - If one of the values is changed, such as an updated publisher, we could get the same title resolving to two different possibilities.

To Normalize further, we can take our data and split it into multiple tables. These tables can store data specific to objects; for example, we could have a table for articles without redundancy:

title publisher journal
ID
wagstaff1983divisors Divisors of Mersenne numbers Amer. Math. Soc. Mathematics of Computation
pomerance1980pseudoprimes The pseudoprimes to 25⋅ 10⁹ Amer. Math. Soc. Mathematics of Computation
pomerance1981distribution On the distribution of pseudoprimes Amer. Math. Soc. Mathematics of Computation
wagstaff1973infinite Infinite matroids Amer. Math. Soc. Transactions of the American Mathematical Society
alford1994there There are infinitely many Carmichael numbers JSTOR Annals of Mathematics

And a many-to-many join table which links them together, along with any information relevant to the join:

paper id author name author number
wagstaff1983divisors Wagstaff, Samuel S 1
pomerance1980pseudoprimes Pomerance, Carl 1
pomerance1980pseudoprimes Selfridge, John L 2
pomerance1980pseudoprimes Wagstaff, Samuel S 3
pomerance1981distribution Pomerance, Carl 1
wagstaff1973infinite Wagstaff, Samuel S 1
alford1994there Alford, William R 1
alford1994there Granville, Andrew 2
alford1994there Pomerance, Carl 3

Third Normal Form and the End of Redundancy

There's one last thing to notice. All of these articles contain both a journal and a publisher. However, (in an ideal world), the journal determines the publisher.

This doesn't violate 2nd normal form - the journal is not a key - but it is still redundant. To get rid of it, we will define Third Normal Form:

A dataset is said to be in Third Normal Form if:

  • The dataset is in Second Normal Form
  • Every non-key attribute depends on every key attribute
  • No non-key attribute implies another non-key attribute which does not imply it

The classic saying about 3NF is that every attribute should determine "The key, the whole key, and nothing but the key, so help me Codd."

Fundamentally, 3NF is a restriction - and guide - on the design of your tables. A database in 3NF has tables that represent data at the minimal scope necessary. This gives not just a normal form, but close to a standard form.

In practicality, of course, functional dependencies crop up in unexpected ways - in the real world, true 3NF compliance is more of an ideal than a guarantee.

Example

Our remaining 3NF violation is - as hinted above - the journal $\Rightarrow$ publisher dependency.

We can abstract it away by adding a Journals table:

publisher journal
Amer. Math. Soc. Mathematics of Computation
Amer. Math. Soc. Transactions of the American Mathematical Society
JSTOR Annals of Mathematics

And modifying our Articles to no longer contain redundant information:

title journal
ID
wagstaff1983divisors Divisors of Mersenne numbers Mathematics of Computation
pomerance1980pseudoprimes The pseudoprimes to 25⋅ 10⁹ Mathematics of Computation
pomerance1981distribution On the distribution of pseudoprimes Mathematics of Computation
wagstaff1973infinite Infinite matroids Transactions of the American Mathematical Society
alford1994there There are infinitely many Carmichael numbers Annals of Mathematics

Normal Forms 3.5-6 And Beyond

Database researchers have imagined countless normal forms, including some that lie outside this sequence. They all have their own advantages or disadvantages, but 3rd normal form ensures us a lack of redundancy - and thus, the ability to change non-key data in one place and have it changed in all uses.

I won't explain them here - they get more and more cumbersome - but to see the limiting case, 6th normal form ("No Nontrivial Functional Dependencies") only allows key-value stores.

For example, we would have to break our journal table further into:

title
ID
wagstaff1983divisors Divisors of Mersenne numbers
pomerance1980pseudoprimes The pseudoprimes to 25⋅ 10⁹
pomerance1981distribution On the distribution of pseudoprimes
wagstaff1973infinite Infinite matroids
alford1994there There are infinitely many Carmichael numbers
journal
ID
wagstaff1983divisors Mathematics of Computation
pomerance1980pseudoprimes Mathematics of Computation
pomerance1981distribution Mathematics of Computation
wagstaff1973infinite Transactions of the American Mathematical Society
alford1994there Annals of Mathematics

Assignment

Consider the following list of columns, that you might get out of an LMS:

Student Name Student ID Student Nickname Course Name Course ID HW Grade Quiz Grade Exam Grade

Graded Assignment: Submit to Brightspace a database "schema" - lists of column names for one or more tables, seperated by new lines - for a data model that would satisfy the third normal form. Place the key(s) first.

As an example, to submit the schema derived in class, write:

paper id, author name, author number
paper id, title, journal
journal, publisher
Department of Mathematics, Purdue University
150 N. University Street, West Lafayette, IN 47907-2067
Phone: (765) 494-1901 - FAX: (765) 494-0548
Contact the Webmaster for technical and content concerns about this webpage.
Copyright© 2018, Purdue University, all rights reserved.
West Lafayette, IN 47907 USA, 765-494-4600
An equal access/equal opportunity university
Accessibility issues? Contact the Web Editor (webeditor@math.purdue.edu).