So much data, so little time: How to not flip your wig processing it
What if I told you it's possible to store both raw data and summaries?
Working with data can be a pain in the butt. You do it because you need to, and because there's value in it – data-driven enterprises thrive on eking as much concrete information as possible out of it in order to maximise efficiency and attack the market share of the competition.
But data is complicated and expensive to generate, store and process. Worse, data processing is one of the most extreme examples of IT being one big compromise. Raw data is rich and complete, but it tends to be high in volume, and big means slow and cumbersome to process. Working on summarised data is much faster, but you lose the detail, and with it potentially some of the value.
So can you do both? Can you retain the detail of the raw data while gaining the speed of the summary? Do you have to process everything all the time?
A new concept in capacity management
Any IT or system manager worth their salt will have a proactive regime of capacity management, and the same applies to a database administrator (in which definition I include data warehouse administrators). So they will ensure that the systems are not overloaded, that they're not going to run out of storage, and that the in-built features of the database app – partitioning, indexing, caching, and so on – are exercised to the greatest possible extent. There's only so much you can do to tune the underlying systems and apps, though – and there's also only so much you can do with regard to throwing hardware at the processing function.
True, you can chuck RAM at the problem in vast quantities but the same limitation still exists now as existed in the 1990s when I was playing in academia with Transputers, and when companies like ALR were producing Symmetric MultiProcessing (SMP) servers: adding CPUs and cores isn't a linear thing, and there comes a point where the effort of coordinating work and shuffling data between cores and CPUs outweighs the benefit of having the additional oomph.
There comes a time when you have to pause your efforts in making the processing go faster and look instead at doing less processing.
Time is the clincher
I was at a security conference recently, and a presenter told a story about a company that fell victim to data theft thanks to some backup media being nicked. Usual old story: the data on the media wasn't encrypted. Why not? Because at some point, a year or two earlier, the overnight backup had outgrown the backup window, so they turned off encryption – finishing on time was deemed more important than protecting the backup.
And the same applies to being data driven – you can probably afford the hardware to make the data processing happen, but you're bound by the constraints of time.
Raw data or summary?
So, there's a choice to make between storing raw data and summarising it for fast processing. But why not do both? Most Management Information Systems (MIS) have some kind of daily and/or weekly process in which raw data is extracted from the core systems then processed into some kind of summary form into which the users can dip. It works very well, and in some cases there's a tool that lets people dip into raw data if they need the detail (more about the hazards of that later, though).
But why would you want that?
I've been fortunate enough to design and build a CRM system pretty much from the ground up. The bonus with doing such a thing is that you get to include the reporting requirements from day one, and so you can architect the system to allow the reporting and data analysis functions to be as flexible, fast and effective as the core system itself – because they are an integral part of the core system.
So who needs overnight data extracts when you can make the system pour the data into the reporting engine in real time?
The book is sometimes wrong
I have a degree in Computing Science, which means I've read a great many books and done a fair few lumps of coursework based on what academics have taught me. But just as with many other aspects of real life, doing it by the book might be "correct" but it's not necessarily relevant.
For example, anyone fresh out of university who's tasked with designing a new relational database will go with what they were taught and normalise everything to death, eliminating any trace of duplicate data. After a few years in the industry, you'll find yourself deliberately denormalising in places because it lets you do a simple three-table SQL "JOIN" instead of having to hang a dozen tables together just to get the end-to-end linkage you need.
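To make that concrete, here's a minimal sketch using SQLite via Python's sqlite3 module. The tables and columns are hypothetical, invented for illustration: the region name is deliberately duplicated onto the order line, so the report query reads it directly instead of chaining through four tables.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Fully normalised, a report would hop order_line -> orders -> customer
# -> region. The deliberately denormalised region_name column below
# trades a little duplication for a much simpler query.
cur.executescript("""
CREATE TABLE region   (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, region_id INTEGER);
CREATE TABLE orders   (id INTEGER PRIMARY KEY, customer_id INTEGER);
-- Denormalisation: region_name copied onto each order line.
CREATE TABLE order_line (id INTEGER PRIMARY KEY, order_id INTEGER,
                         amount REAL, region_name TEXT);
INSERT INTO region     VALUES (1, 'North');
INSERT INTO customer   VALUES (10, 'Acme', 1);
INSERT INTO orders     VALUES (100, 10);
INSERT INTO order_line VALUES (1000, 100, 49.99, 'North');
""")

# The report reads the duplicated column directly -- no four-table chain.
total = cur.execute(
    "SELECT SUM(amount) FROM order_line WHERE region_name = 'North'"
).fetchone()[0]
print(total)  # 49.99
```

The cost, of course, is that if a customer moves region you must update the duplicated column everywhere – which is exactly the trade-off the textbooks warn about, and exactly the one experience tells you is often worth making.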
Remember: IT is always a compromise. Doing something "right" often means doing it clunkily and making it run slowly.
The way to go...
Let's wrap up with five quick points to consider if you want to be data driven and do so effectively.
- Don't do overnight jobs: there are 24 hours in the day, so why the hell are you running data extracts and summary reports in an artificial window between midnight and 6am? Answer: because you don't want to impact live systems or the backups – so fix the problem, not the symptom.
- Fix the above problem by having a read-only, real-time replica of each of the data sources. You're probably running clustered or mirrored databases anyway, so why not add an extra replica dedicated to reporting? Feeding another server with the transaction logs you're mirroring anyway has minimal impact.
- Don't scrimp on the hardware for the reporting server: not only do you need it to be fast enough to keep up with the transaction logs it's being fed by the master server, but as you're going to be hammering it with report queries you need it to be able to cope.
- Generate as much reporting data as you can in real time: even if you have to retro-fit it using, say, database triggers and associated stored procedures, it's better to add a few dozen milliseconds to each individual task than to add a pointless many-hour extract each day.
- Have a data retention and disposal policy, and run a daily job to throw away the old data: if your policy says you discard data after 24 months, an annual purge (which is what lots of people do) simply means that you're hanging onto some records for up to 364 days longer than you ought, which slows you down and has data protection implications too.
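The trigger approach from the fourth point can be sketched in a few lines of SQLite, driven here from Python's sqlite3 module. The sale and daily_total tables are hypothetical; the point is that the summary row is maintained as each transaction lands, so there's nothing left for an overnight job to do.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sale        (id INTEGER PRIMARY KEY, day TEXT, amount REAL);
CREATE TABLE daily_total (day TEXT PRIMARY KEY, total REAL);

-- Maintain the summary as each sale is written, instead of in a
-- nightly extract. Costs a few milliseconds per insert.
CREATE TRIGGER sale_summary AFTER INSERT ON sale
BEGIN
  INSERT OR IGNORE INTO daily_total (day, total) VALUES (NEW.day, 0);
  UPDATE daily_total SET total = total + NEW.amount WHERE day = NEW.day;
END;
""")

con.execute("INSERT INTO sale (day, amount) VALUES ('2024-06-01', 10.0)")
con.execute("INSERT INTO sale (day, amount) VALUES ('2024-06-01', 5.0)")

total = con.execute(
    "SELECT total FROM daily_total WHERE day = '2024-06-01'"
).fetchone()[0]
print(total)  # 15.0
```

A real system would use your database's own trigger and stored-procedure syntax, but the shape is the same: a small, bounded cost on every write in exchange for summaries that are always current.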
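And the daily purge in the final point is barely more than one DELETE statement. A minimal sketch, again using SQLite from Python with an invented event table and a roughly-24-month retention horizon:

```python
import sqlite3
from datetime import date, timedelta

RETENTION_DAYS = 730  # roughly 24 months

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE event (id INTEGER PRIMARY KEY, occurred TEXT, detail TEXT)"
)
# One record well past the horizon, one recent.
old_day = (date.today() - timedelta(days=800)).isoformat()
new_day = date.today().isoformat()
con.execute("INSERT INTO event (occurred, detail) VALUES (?, ?)", (old_day, "stale"))
con.execute("INSERT INTO event (occurred, detail) VALUES (?, ?)", (new_day, "fresh"))

def daily_purge(con):
    """Run once a day: delete everything past the retention horizon."""
    cutoff = (date.today() - timedelta(days=RETENTION_DAYS)).isoformat()
    cur = con.execute("DELETE FROM event WHERE occurred < ?", (cutoff,))
    con.commit()
    return cur.rowcount

removed = daily_purge(con)
print(removed)  # 1
```

Run daily, each purge deletes at most one day's worth of expired rows – a trivial job, instead of the annual twelve-month clear-out that locks tables and keeps the lawyers awake.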