Data Bunkers (part 1)

During a rather long meeting this morning the words “Data Bunker” came to me. I don’t really know why, but they stuck with me all day and I could not shake them. There are many traditional terms used to describe data storage instances, such as data store/silo/mart/base and probably a dozen or so more. While the phrases differ, they ultimately refer to the same idea. I found it interesting that when I started trying to puzzle out what a data bunker is, I kept trying to define it in the traditional terms. I soon realised that each of those terms brings with it a certain bias in mindset.

Data store implies a very generic and passive viewpoint: you kind of put data in boxes or in the back room. Data silo has connotations of deliberate separation and insulation. Data mart implies prepackaged data ready for consumption. Finally, data base places data as a foundation for things to be built on.

My musings above are not a criticism. Each term has its reason and they are all valid. But language is important to me, and the act of naming something shapes how you use it, so I began to question the ways in which I think about and name data management. Throughout my career in data I have used all of these terms many times. Suddenly a new term (to me) appeared and got me thinking: Data Bunkers.

Data silos, to me, are the beginning of what I mean: keeping data discretely apart so that one silo cannot spoil another if something goes wrong. However, this did not satisfy me. Data bunker sounds much more militaristic, solid and, most of all, defensible.

The tired old saying “Garbage in, garbage out” is unfortunately very true. People are best shot in the foot by their own guns. Data management, which I will define as “the art of planning, storing, transforming and retrieving data”, often struggles as much with self-introduced errors and misunderstandings as with those of the data supplier. Therefore, when designing and maintaining a data store (especially a large corporate one) that becomes the foundation for information systems and/or products, you need a certain level of reliability and protection from your own mistakes.

Large data systems, particularly those that have to satisfy more than one role (reporting, product servicing and transaction tracking), tend to have subtly different requirements for each role. Therefore, for the sake of performance, it is quite usual to end up with many data storage schemes that represent the same data in different ways: less normalised and highly indexed for reporting and product usage, or highly normalised and less indexed for transactional work. It’s the whole seekable versus storeable and maintainable line that a data engineer must tread. Due to this mix of needs, it is very common for processes to be designed to transform the same data into different data marts.

Now bringing in the garbage in/garbage out rule, it goes without saying that your databases are only as good as your data importing, processing and exporting scripts. This is approaching the crux of the matter: accuracy, reliability and scalability are not solely the domain of a good data schema. The best data storage designs might go a long way towards keeping the data relational and consistent (constraints, enforcing relational integrity, etc.) but they are fairly blunt tools (and often you can explicitly ignore them). I don’t think my point is a shocking revelation. We all know that bad or incomplete data rules and simple logic errors in loading scripts have the power to ruin the “database guy’s” day.
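To make the “blunt tools” remark a little more concrete, here is a minimal sketch using Python’s built-in sqlite3 module. The customer and payment tables, the amount_pence column and the helper names are all made up for illustration; they are not from any real system. It tries to show two things: constraints only bite when enforcement is switched on (SQLite leaves foreign key checks off unless you ask for them), and even with everything enforced, a value that is well formed but wrong sails straight through.

```python
import sqlite3


def build_store(enforce_foreign_keys: bool) -> sqlite3.Connection:
    """Create a tiny in-memory store (hypothetical schema, for illustration only)."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    setting = "ON" if enforce_foreign_keys else "OFF"
    conn.execute(f"PRAGMA foreign_keys = {setting}")
    conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE payment ("
        " id INTEGER PRIMARY KEY,"
        " customer_id INTEGER NOT NULL REFERENCES customer(id),"
        " amount_pence INTEGER NOT NULL CHECK (amount_pence >= 0))"
    )
    conn.execute("INSERT INTO customer (id, name) VALUES (1, 'Ada')")
    return conn


def try_insert(conn: sqlite3.Connection, customer_id: int, amount_pence: int) -> bool:
    """Return True if the schema accepted the row, False if a constraint rejected it."""
    try:
        conn.execute(
            "INSERT INTO payment (customer_id, amount_pence) VALUES (?, ?)",
            (customer_id, amount_pence),
        )
        return True
    except sqlite3.IntegrityError:
        return False


strict = build_store(enforce_foreign_keys=True)
lax = build_store(enforce_foreign_keys=False)

# The constraints catch the structurally broken rows...
unknown_customer_strict = try_insert(strict, 99, 500)   # no such customer
negative_amount_strict = try_insert(strict, 1, -5)      # fails the CHECK

# ...but a loader bug that writes pounds (12) into a pence column is still "valid".
wrong_unit_strict = try_insert(strict, 1, 12)

# And with enforcement switched off, the orphan row goes in as well.
unknown_customer_lax = try_insert(lax, 99, 500)
```

The schema here is doing its job. It just has no way of knowing that 12 was meant to be 1250, and that kind of mistake lives in the loading script, not in the table definition.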

Are we there yet?
Kinda. The traditional terms mentioned in my first paragraph all tend to be passive, in that they describe where you put your data. But let’s roll the data loading and manipulation scripts that maintain the data’s value into the term as well. Hence a Data Bunker is where data value is protected.
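As a very rough sketch of what that could look like in code, the storage and its loading rules can be treated as one unit, with the only way in being through the rules. Everything here (the DataBunker class, the rule names, the row layout) is my own hypothetical illustration, not an established pattern or library.

```python
from dataclasses import dataclass, field
from typing import Callable

Row = dict
Rule = Callable[[Row], bool]


@dataclass
class DataBunker:
    """Hypothetical sketch: storage plus the rules that guard its only entrance."""
    rules: dict[str, Rule]
    accepted: list[Row] = field(default_factory=list)
    quarantined: list[tuple[Row, list[str]]] = field(default_factory=list)

    def load(self, rows: list[Row]) -> tuple[int, int]:
        """Admit rows that pass every rule; set the rest aside with the reasons."""
        for row in rows:
            failures = [name for name, rule in self.rules.items() if not rule(row)]
            if failures:
                self.quarantined.append((row, failures))
            else:
                self.accepted.append(row)
        return len(self.accepted), len(self.quarantined)


# Illustrative rules only; real ones would come from the business definitions.
bunker = DataBunker(
    rules={
        "has_customer": lambda r: r.get("customer_id") is not None,
        "amount_is_whole_pence": lambda r: isinstance(r.get("amount_pence"), int)
        and r["amount_pence"] >= 0,
    }
)

counts = bunker.load(
    [
        {"customer_id": 1, "amount_pence": 1250},
        {"customer_id": None, "amount_pence": 300},      # missing customer
        {"customer_id": 2, "amount_pence": "12.50"},     # wrong type from the supplier
    ]
)
```

Rejected rows are kept alongside the names of the rules they failed rather than being dropped, so a mistake in a rule can be spotted and the rows reloaded later.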

More to come…