C21CH

Speculative Web Space

Simple Guide To Data Structure

The most important thing when starting on a Digital Humanities project is to maintain consistent, well structured data.

Think about what types of information about your objects of study need to be recorded and presented, ideally before you begin. Don't let worries about data structure stop you from starting though. Often the structure becomes clear as soon as you start gathering the data. While it's best to avoid late changes to structure if you can, you can always add a column or an extra field if you missed something important.

How do you make information well structured? Often it's not as complicated as it seems. The simple answer is, "Just put it in a table under column headings."

This is not well structured data:

The Mona Lisa by Leonardo Da Vinci, between 1503 and 1506, maybe 1517. Most famous painting.
Last Supper, 1495 - unknown, Da Vinci. Referenced in popular culture...
Michelangelo, c. 1511–1512, Sistine Chapel. Commissioned by...

The artists, painting titles and dates are in different orders, the dates are stored in different ways, and the name of a single individual is sometimes written differently. The descriptions are just notes and you'll want to edit them later (that's OK, but save yourself some trouble by making them as finished as possible).

This is well structured data:

Painting | Artist | Start Date | End Date | Date Exactness | Description
The Mona Lisa | Leonardo Da Vinci | 1503 | 1506 | c. | The most famous painting in the world, etc.
The Last Supper | Leonardo Da Vinci | 1495 | | c. | Often referenced in popular culture, this work was...
Sistine Chapel | Michelangelo | 1511 | 1512 | c. | This ceiling decoration was commissioned by, etc.

That's not hard to understand. That's the main point, but there are a few more things worth bearing in mind:

Be consistent.

Always write the same thing in the same way. E.g. decide whether to write 'da Vinci' or 'Leonardo da Vinci', and always write it that way.
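To see why this matters, here is a minimal Python sketch that counts paintings per artist. The two lists of names are made up for illustration. It shows how one artist written two ways gets counted as two different people.

```python
from collections import Counter

# Hypothetical artist column, typed inconsistently.
inconsistent = ["Leonardo da Vinci", "da Vinci", "Michelangelo"]
# The same column with one agreed spelling.
consistent = ["Leonardo da Vinci", "Leonardo da Vinci", "Michelangelo"]

# Counter tallies how many times each exact spelling appears.
inconsistent_counts = Counter(inconsistent)
consistent_counts = Counter(consistent)

print(inconsistent_counts)
print(consistent_counts)
```

The computer compares text exactly, so it has no way of knowing that 'da Vinci' and 'Leonardo da Vinci' are the same person unless someone tells it.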

What To Gather?

You may want to break the example table up differently, specifying whether the first or second date is uncertain, or using only the finishing date if that is all that is relevant, and adding whatever other columns are pertinent. What information you put in depends on:

More Is Better

If you can gather more details, do. It's easier to take out subsets of information than to revisit every data item.

Don't use MS Word.

Avoid MS Word for recording data. Use it for writing letters and essays. Although you can make tables in MS Word, and they are better than just notes, they will ultimately need to be copied to some other format that a computer program can handle more easily. The most commonly used tool, and one that is much easier for a computer to handle, is Excel. If you make columns in Excel you are off to a good start and will save everyone, including yourself, a lot of time and headaches later. This is because Excel files can be saved as .CSV files, which are easy for computers and programmers to work with. (Note that you can still make a mess of an Excel or .CSV file. Keep all the data broken up into columns, with only one type of information in each column.)
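As a rough illustration of why .CSV files are easy to work with, here is a short Python sketch using the standard csv module. The text in csv_text stands in for a file saved from Excel; its contents come from the example table, and the variable names are only for illustration.

```python
import csv
import io

# Stand-in for a file saved from Excel as .CSV.
csv_text = """Painting,Artist,Start Date,End Date
The Mona Lisa,Leonardo Da Vinci,1503,1506
The Last Supper,Leonardo Da Vinci,1495,
Sistine Chapel,Michelangelo,1511,1512
"""

# DictReader treats the first row as the column headings.
rows = list(csv.DictReader(io.StringIO(csv_text)))

for row in rows:
    print(row["Painting"], "by", row["Artist"])
```

With a real file you would pass an opened file to csv.DictReader instead of io.StringIO. Every value can then be looked up by its column heading.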

Structure As You Go

It's easiest to gather your information in the right structure as you go, rather than transcribe it later.

Just Ask

If possible, ask someone what fields (or column headings) are required, or whether your data structure is good. If you intend your data to go into a particular system, check what requirements it has. E.g. if you want to put your data into Google Maps, even if you're not sure about the technical standards of KML and other acronyms, you can see that you should at least have a 'longitude', 'latitude', 'name' and 'description' for every point you want to plot. If you at least have that in a spreadsheet, it can be converted to the right format.
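To give a sense of what such a conversion might look like, here is a minimal Python sketch that turns rows with those four fields into a bare-bones KML document. The function name rows_to_kml, the sample row and its approximate coordinates are assumptions for illustration, and a real system may need more than this.

```python
import xml.etree.ElementTree as ET

# Hypothetical spreadsheet row; the coordinates are approximate.
points = [
    {"name": "Sistine Chapel", "description": "Ceiling by Michelangelo",
     "latitude": 41.9029, "longitude": 12.4545},
]

def rows_to_kml(rows):
    """Build a minimal KML document with one Placemark per row."""
    kml = ET.Element("kml", xmlns="http://www.opengis.net/kml/2.2")
    document = ET.SubElement(kml, "Document")
    for row in rows:
        placemark = ET.SubElement(document, "Placemark")
        ET.SubElement(placemark, "name").text = row["name"]
        ET.SubElement(placemark, "description").text = row["description"]
        point = ET.SubElement(placemark, "Point")
        # KML writes coordinates as longitude,latitude (in that order).
        ET.SubElement(point, "coordinates").text = f'{row["longitude"]},{row["latitude"]}'
    return ET.tostring(kml, encoding="unicode")

kml_text = rows_to_kml(points)
print(kml_text)
```

Because every row has the same four fields, the same few lines work whether there is one point or ten thousand.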

One Type Of Information, One Column

If types of information can be distinguished, split them into separate columns.

Numbers, Dates and Text

Software usually handles different kinds of data differently. The main distinction to keep in mind is to store numbers as numbers without adding any text to them. E.g. if there is a column for 'Quantity Of Grindstones', don't put 'About 32'. Put '32', and put anything you want to say about that number in a different column. Text can't be added and subtracted, so leaving the value as a number allows calculations to be made. You can add any caveats and explanations to the results later.
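The grindstone example can be sketched in a few lines of Python. The rows and the 'notes' column are hypothetical. The point is only that clean numbers can be added up, while a number with text mixed in cannot even be read as a number.

```python
# Hypothetical 'Quantity Of Grindstones' column stored as plain numbers,
# with any caveats kept in a separate notes column.
rows = [
    {"quantity": "32", "notes": "approximate"},
    {"quantity": "10", "notes": ""},
]

# Clean numbers convert and add up without trouble.
total = sum(int(row["quantity"]) for row in rows)
print(total)  # 42

# The same value with text mixed in cannot be converted to a number.
try:
    int("About 32")
    mixed_ok = True
except ValueError:
    mixed_ok = False
print(mixed_ok)  # False
```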

Dates and times are tricky to handle, so keep to a consistent format and don't add extra text to them either. E.g. stick to dd/mm/yyyy HH:MM:SS or some other common format.
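As a small illustration, here is how a program might read dates in the dd/mm/yyyy HH:MM:SS format using Python's standard datetime module. The sample values are made up. One follows the format and one has extra text.

```python
from datetime import datetime

# dd/mm/yyyy HH:MM:SS written as a strptime pattern.
DATE_FORMAT = "%d/%m/%Y %H:%M:%S"

# Hypothetical values from a date column.
values = ["03/02/1503 09:30:00", "around March 1503"]

parsed = []
rejected = []
for value in values:
    try:
        parsed.append(datetime.strptime(value, DATE_FORMAT))
    except ValueError:
        # Anything that does not match the agreed format is set aside.
        rejected.append(value)

print(parsed[0].year, parsed[0].month, parsed[0].day)  # 1503 2 3
print(rejected)  # ['around March 1503']
```

Once a date has been read this way it can be sorted, compared and reformatted. The rejected value would need a person to fix it by hand.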

Complex Structures

Information structure can sometimes get a bit complex. Let's say you want to have some extra information about the artists, such as when each was born and died, whether they were sculptors and/or painters, what cities they worked in, who their patrons were, etc. You don't want to add all that information to every row in your table of paintings. You need a separate table that stores the artist information just once for each artist. You can then relate this back to the painting by the artist's name. This is the structure of a 'relational database'. You can still gather this data in Excel for convenience, but make sure you are consistent in using the artists' names, so that they will match up across tables. Keeping these tables makes it possible to convert the information into a proper database, which can then be used to mix and match, filter and display the data in all manner of ways, including for the web.
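Here is a minimal sketch of the two-table idea using Python's built-in sqlite3 module. The table and column names are assumptions chosen for illustration. The artist's name is what ties each painting to its artist.

```python
import sqlite3

# A temporary in-memory database, just for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE artists (name TEXT PRIMARY KEY, born INTEGER, died INTEGER)")
conn.execute("CREATE TABLE paintings (title TEXT, artist TEXT, start_date INTEGER)")

# Each artist is stored once.
conn.executemany("INSERT INTO artists VALUES (?, ?, ?)", [
    ("Leonardo da Vinci", 1452, 1519),
    ("Michelangelo", 1475, 1564),
])
# Each painting refers to its artist by name only.
conn.executemany("INSERT INTO paintings VALUES (?, ?, ?)", [
    ("The Mona Lisa", "Leonardo da Vinci", 1503),
    ("The Last Supper", "Leonardo da Vinci", 1495),
    ("Sistine Chapel", "Michelangelo", 1511),
])

# Match paintings to artists wherever the names are identical.
results = conn.execute(
    "SELECT p.title, a.name, a.born FROM paintings p "
    "JOIN artists a ON a.name = p.artist ORDER BY p.start_date"
).fetchall()
for row in results:
    print(row)
conn.close()
```

If one painting row said 'da Vinci' instead of 'Leonardo da Vinci', the match would fail and that painting would silently drop out of the results. That is why consistent names matter so much here.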

A Paradox of Structure and Flexibility

Why structure information this way? It seems rigid and inflexible, but well structured data is what enables computers to be flexible. A computer doesn't care if there are a few or a million records: if they are structured in the same way it will process them quickly. It can filter and mix and match the information, change formats, run calculations, and pass the data to visualisations. Without a consistent format the computer can only display it the way you put it in - it can't do anything with the data. You lose the ability to manipulate it, and so badly structured data, while flexible in your terms, is inflexible for a computer.

In the badly formatted art information above, the computer has no way of knowing which text it should treat as an artist, which as a painting name and so on. If it is in columns, the computer can treat everything in the first column as an artwork, everything in the second as an artist, and so on. 'Structured data' is part of working with computers as a medium - you don't normally work clay with a paint brush, and you don't normally spin paint on a potter's wheel. To work with computers, use structured data.
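A short Python sketch of what columns make possible, using rows from the example table. The variable names are only illustrative.

```python
# Rows from the example table, one dictionary per painting.
paintings = [
    {"Painting": "The Mona Lisa", "Artist": "Leonardo Da Vinci", "Start Date": 1503},
    {"Painting": "The Last Supper", "Artist": "Leonardo Da Vinci", "Start Date": 1495},
    {"Painting": "Sistine Chapel", "Artist": "Michelangelo", "Start Date": 1511},
]

# Filter: keep rows whose Artist column matches one name.
by_leonardo = [p["Painting"] for p in paintings if p["Artist"] == "Leonardo Da Vinci"]

# Sort: order all rows by the Start Date column.
earliest_first = [p["Painting"] for p in sorted(paintings, key=lambda p: p["Start Date"])]

print(by_leonardo)
print(earliest_first)
```

Neither step is possible with the unstructured notes, because the program cannot tell which words are the artist and which numbers are the date.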

So while lots of different systems require different formats, the most important thing is to be consistent and structured. Even if you don't know what specialised formats it might have to be in later, if it is well structured it's much easier to write a small program to convert it all into the right format for any system.

If it's too late and you only have badly structured information, it is still worth converting your notes into well structured data, even if it takes hours or days. That is a small effort for the benefits of being able to query it to identify relationships, extract subsets for other purposes, generate lists for publications, run it through statistics applications to generate graphs, plot it on a map, make an online gallery, turn it into social network diagrams, display it on the web and whatever other relevant thing a computer can do.