ntoll.org

(Everything I say is false...)
home | about | articles | cv | contact

Data and Databases ~ Concise and Simple

Thursday 14th February 2013 (08:00AM)

(As with all "concise and simple" articles I assume no prior knowledge of the subject and keep the length to less than 1500 words.)

"Scientia potentia est." (Knowledge is power.)

Sir Francis Bacon.

In the digital world knowledge is stored, in its latent form, as data in databases. Everyone from your local supermarket to the government collects data about all of us. Information and knowledge is extracted about all aspects of our lives, from our shopping habits to whether we've paid our taxes (it is no accident that the examples above relate to extracting money from us). According to Google it currently takes just two days for us to generate as much information as was created by all of humanity from the dawn of time until 2003.

This article is about data and databases - the technology that makes the accumulation and analysis of this flood of information possible. It starts with a simple question: what is data?

"Data" are values that convey meaningful information about something. Meaning is implied by the context and structure of the data. Structure is formed from the way data is represented. Databases simply store and provide mechanisms for working with structured data.

The most fundamental data-related act is to name a value:

my_name = "Nicholas Tollervey"

My name is Nicholas Tollervey. This is shown by "my_name" being assigned the value "Nicholas Tollervey". Assignment happens when a token name is followed by an equals sign that is, in turn, followed by a value. Notice the three layers: meaning, structure and representation.

Another example:

nicholas = {
    "first_name": "Nicholas",
    "middle_name": "Harry",
    "surname": "Tollervey",
    "birthday": {
        "year": 1973,
        "month": 8,
        "day": 19
    }
    "height": 180,
    "weight": 76.9
}

After reading the above I'm certain you'll know when to send me a birthday card. Something called nicholas has a single value associated with it: an associative array. Associative arrays are key / value pairs - like a dictionary of words or a subject index at the back of a book. In this case it has six entries, one of which is itself a nested associative array with three entries. It's an associative array because it is enclosed by curly braces and contains a list of "key_name": value pairs separated by commas. The associative array is assigned to nicholas in exactly the same way as the first example (token name followed by equals followed by value).

Incidentally, I can retrieve my data like this:

>>> my_name
"Nicholas Tollervey"
>>> nicholas["height"]
180

(This example uses the Python programming language: I type in commands after the >>> prompt and the Python interpreter replies on the new line.)

A final example:

People
id first_name middle_name surname nationality_id
1 Nicholas Harry Tollervey 2
2 Terry Charles Oliver Jones 1
3 Esteve Fernandez 3
Nationalities
id name
1 Australian
2 British
3 Spanish

Obviously, I'm British and my buddies Terry and Esteve are Australian and Spanish (respectively). You know this because you understand how to read tabular information and spot the relationship between the values in the nationality_id column of the People table and the values in the id column of the Nationalities table. You know it's tabular information because it's a two dimensional grid with the vertical columns indicating named fields and horizontal rows representing individual records. Once again: meaning, structure, representation.

Each example is more complex than the one preceding it. The first is a single key / value pair (where my_name is the key and "Nicholas Tollervey" is the value), the second is a key (nicholas) associated with a value that is itself key / value pairs (the associative array). Finally, the third example contains rows of items that all share the same keys (a.k.a. columns) but different values. Each type of record (people, nationalities etc) has its own table.

How do such meaningful data structures / representations relate to databases?

There are four basic actions that all databases may perform on data: create, read, update and delete (often expressed as CRUD). There are many types of database that meet this minimum functionality. The three data examples illustrate the types of data structure stored in three of the most popular types of database: key / value stores (such as Redis or Memcached), document oriented databases (like MongoDB or CouchDB) and relational databases (for example, Postgres or MySQL).

Key / value databases are, at heart, associative arrays. They provide three basic actions:

Such simple databases are extremely fast and well suited for tasks where speed is of the essence. However, there isn't a simple way to search the stored values and keys must be known before being used to retrieve or delete associated values. Nevertheless, they are extremely useful as caches of arbitrary values. If you want to try such a database you should visit the try redis tutorial tutorial ~ a gentle yet comprehensive tour of a popular key / value store.

Document oriented databases store arbitrarily structured documents expressed as associative arrays. Every document has a unique ID created by the database and used as a key to reference the document.

However, documents are not opaque values as in key / value databases. Rather, it is possible to reach in to the documents and construct queries that reference the document's named attributes (keys). For example, given a people database I could create a query to retrieve all documents that have a height attribute with a value greater or equal to 180. Since the structure of the documents is arbitrary, such queries rely on conventions (all people documents have a height attribute whose value is expressed in centimetres).

Because document structure is not fixed (except by convention), new fields can be added at any time and document oriented databases specify default handling for missing fields - usually they'll evaluate to null (unknown).

To play with a document oriented database, explore the MongoDB tutorial on the project's homepage.

Relational databases developed from the work of Edgar Codd. They store data in rows (tuples) that share the same columns (attributes) defined in tables (relations) - bracketed words are the formal terminology for the colloquially named concepts. Each row contains a special field called the primary key (the id column in the example) that uniquely identifies the record. Tables, representing different sorts of things, are related to each other with special fields called foreign keys that reference primary keys of rows in different tables (the nationality_id). Many relational databases also enforce constraints such as referential integrity - where foreign keys may not refer to non-existent primary keys.

To work efficiently, data in relational databases should be normalised to avoid duplication. Data need only be stored in one table representing a specific type of thing. Therefore, modifications only happen once and are propagated via the relationships between tables. For example, the nationalities table is the only place where nationalities may be defined. Tables for things with nationalities (such as people) need only reference the single nationalities table via a foreign key rather than duplicate such information. The process of normalisation usually has three steps of increasing efficiency: first, second and third normal forms. Sometimes it is necessary to de-normalise a database for performance reasons - it's faster to retrieve a single row than it is to retrieve several related rows from different tables.

Structured Query Language (SQL) is the programming language used by relational databases to work with data. Based upon relational algebra, it provides CRUD functionality and capabilities for specifying the structure and relationships between data. It's easy to create powerful queries:

SELECT people.first_name, people.surname, nationalities.name
FROM people
JOIN nationalities ON nationalities.id = people.nationality_id
WHERE people.surname = "Jones"

The result of which is:

people.first_name people.surname nationalities.name
Terry Jones Australian

The query asks: give me the first name, surname and nationality (line 1) of people whose surname is "Jones" (line 4). We have to specify where to collect this data (line 2) and how to join the related tables together (line 3). Notice that SQL doesn't specify how to run the query but describes what the expected result should be.

Interactions with a database happen as transactions. Transactions can contain many different commands and should have several desirable properties:

A database is ACID compliant if its transactions meet these standards.

The pace of change in data-based technology is fast. Relatively recent trends, such as big data (huge data sets and the technology needed to manage them), deserve their own article. Yet, even the technology described above allows companies and governments to build frighteningly accurate pictures of our lives (how do you think Amazon is able to make such useful suggestions?).

Personally, I find the ubiquitous analysis and measurement of our behaviour inadvertently Orwellian. Furthermore, there is a tendency to trust the results obtained from databases without recognizing the meaning level of data relies on interpretation (of which there are many varieties). Nevertheless, there are great opportunities to do good and I am reminded of the Buddhist proverb,

"To every man is given the key to the gates of heaven; the same key opens the gates of hell."

I hope we choose the right gate.

1499 words (not including code examples).