Monday, October 4, 2010

Entity Attribute Value model for flexible normalised data capture


I am reproducing a long-winded thesis I wrote on StackOverflow about the appropriate data schemata to facilitate OLAP. I wrote it because OLAP vendors are plain too lazy to provide hooks that allow flexible and fast cube generation. OLAP in its current state is useless for engineers and good only for mesmerizing managers and their secretaries with imprecise and out-of-phase information. OLAP software writers seem so excited about their multi-dimensional entity-relationship schema-designing skills, yet never thought it necessary to optimise the flexibility and responsiveness of their cube-building processes. Their response has always been, "Duh, it's OLAP, it takes time to build a cube. OLAP is not meant to be real-time."

Here goes ...
As a generalisation, for multi-dimensional analysis to be effective you need two levels of data modelling.
  • A substrate detail capture
  • An abstract cube
The answer is indeed EAV as the substrate detail capture. People cannot afford to be lazy and skip the detail-substrate design. Every data-mining tool I have been forced (by management) to evaluate has been inadequate because of its inability to perform time-profiling.
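To make that concrete, here is a minimal sketch of what such a substrate could look like, assuming a generic entity/attribute/observation layout in SQLite; the table and column names are my illustrative choices, not a prescribed schema.

```python
import sqlite3

# A minimal EAV substrate: every measurement becomes one (entity, attribute,
# timestamp, value) row. Names here are illustrative assumptions only.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE entity (
    entity_id   INTEGER PRIMARY KEY,
    entity_type TEXT NOT NULL,           -- e.g. 'equipment', 'process', 'patient'
    label       TEXT NOT NULL UNIQUE
);
CREATE TABLE attribute (
    attribute_id INTEGER PRIMARY KEY,
    name         TEXT NOT NULL UNIQUE,   -- e.g. 'chamber_temp', 'yield_pct'
    unit         TEXT
);
CREATE TABLE observation (
    entity_id    INTEGER NOT NULL REFERENCES entity(entity_id),
    attribute_id INTEGER NOT NULL REFERENCES attribute(attribute_id),
    observed_at  TEXT NOT NULL,          -- ISO-8601 timestamp, for time-profiling
    value        TEXT NOT NULL           -- stored as text, cast on the way out
);
-- the index that makes time-windowed slicing cheap
CREATE INDEX idx_obs_time ON observation (observed_at, entity_id, attribute_id);
""")
```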
Ultimately, it required us to write our own data-mining application, because the amount of kludge needed to make the vendors' software work, albeit ineffectively, equals the amount of effort it takes to create one that works effectively. Ours even performed spectacularly when we fed slices of the abstract cube into SAS/Insight. While the vendors' software took an hour, and sometimes 24 hours, to build a cube, we needed ours built within 5 minutes and frequently within 15 seconds - thanks to effective substrate data modelling that enables efficient slicing of data to build the cube.
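Roughly, the slicing step amounts to selecting a time window of EAV rows and pivoting them into wide records for the cube builder. This is only a sketch against the hypothetical tables above, not the actual application:

```python
from collections import defaultdict

def cube_slice(con, start, end):
    """Pivot the EAV observations inside a time window into wide rows:
    one dict per (entity, timestamp), with attribute names as keys."""
    rows = con.execute("""
        SELECT o.entity_id, o.observed_at, a.name, o.value
        FROM observation o
        JOIN attribute a ON a.attribute_id = o.attribute_id
        WHERE o.observed_at BETWEEN ? AND ?
    """, (start, end))
    wide = defaultdict(dict)
    for entity_id, observed_at, attr_name, value in rows:
        wide[(entity_id, observed_at)][attr_name] = value
    return wide  # hand this slice to the cube builder or a stats package

# Shifting the analysis window back six months is just a parameter change:
# cube_slice(con, "2010-01-01T00:00:00", "2010-06-30T23:59:59")
```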
In industrial practice, we might need to flexibly move and resize the time window of an analysis to match conditions that occurred 6 months ago. Or we might need to review the performance of processes over the last financial year.
We need to have correct time-phase correlation. Let's say a church keeps per-night statistics for a particular evangelistic rally: number of people "saved", number of "rededications", number of contribution$, total contribution$, and attendance - over the 10 nights of the rally.
Then the church has statistics on the number of tracts distributed by each volunteer and the location each volunteer was assigned to, plus the radio and TV advertisements and Google ads placed. So the church statistician would need to collect information from each convert and $contributor: location of residence, when and where a tract was received, when and where an advertisement was encountered. Ideally they would collect the same information from everyone who attended - who told them about the rally, and when and where those people in turn received a tract or encountered an advertisement. With such information at hand, the statistician would then be able to build a more truthfully matched multi-dimensional cube to let church leaders decide how to organise their future pre-rally outreach effectively.
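A sketch of what phase-correlated capture might look like for that scenario; the channel and field names are my own illustrative guesses:

```python
from dataclasses import dataclass
from datetime import datetime

# Every contact a person has with the outreach (tract, radio, TV, Google ad)
# is captured as one event row; outcomes are captured the same way.
@dataclass
class Touchpoint:
    person_id: int
    channel: str            # 'tract', 'radio', 'tv', 'google_ad'
    location: str
    occurred_at: datetime

@dataclass
class Outcome:
    person_id: int
    kind: str               # 'saved', 'rededication', 'contribution'
    amount: float           # 0.0 unless a contribution
    occurred_at: datetime

def attribute_outcomes(touchpoints, outcomes):
    """For each outcome, keep only the touchpoints that happened before it,
    so the cube correlates causes and effects in the correct time phase."""
    by_person = {}
    for t in touchpoints:
        by_person.setdefault(t.person_id, []).append(t)
    return {
        (o.person_id, o.kind, o.occurred_at): [
            t for t in by_person.get(o.person_id, [])
            if t.occurred_at <= o.occurred_at
        ]
        for o in outcomes
    }
```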
Data-model optimization may not be possible for churches, but it is very possible for a factory full of equipment, robots, and trained operators. Phase-correlated data is also extraordinarily helpful when treating an epidemic.
The EAV model is necessary because the number of collectible parameters and attributes varies from equipment to equipment, from process to process, and from disease to disease. We cannot afford a separate table for each set of parameters. Frequently a process or piece of equipment collects different sets of parameters depending on the product being processed.
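Continuing the hypothetical three-table substrate sketched earlier, recording a varying parameter set needs no schema change; something along these lines:

```python
from datetime import datetime, timezone

def record(con, equipment_label, params):
    """Store one observation per parameter; previously unseen attributes are
    registered on the fly, so no schema change is needed per equipment/product."""
    now = datetime.now(timezone.utc).isoformat()
    cur = con.cursor()
    cur.execute("INSERT OR IGNORE INTO entity (entity_type, label) VALUES ('equipment', ?)",
                (equipment_label,))
    entity_id = cur.execute("SELECT entity_id FROM entity WHERE label = ?",
                            (equipment_label,)).fetchone()[0]
    for name, value in params.items():
        cur.execute("INSERT OR IGNORE INTO attribute (name) VALUES (?)", (name,))
        attr_id = cur.execute("SELECT attribute_id FROM attribute WHERE name = ?",
                              (name,)).fetchone()[0]
        cur.execute("INSERT INTO observation VALUES (?, ?, ?, ?)",
                    (entity_id, attr_id, now, str(value)))
    con.commit()

# Different equipment, different parameter sets, same three tables:
record(con, "etcher-01", {"chamber_temp": 217.4, "rf_power": 310.0, "gas_flow": 42.1})
record(con, "bonder-07", {"bond_force": 9.81, "capillary_wear": 0.03})
```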
Sometimes the parameters collected can number a thousand. We cannot have a table with a thousand columns, can we? We even violate data-normalization principles and store data in a single row or in a blob because of inefficient data access in a highly normalized table.
Further, we also need to version our data set. Let us say we designed an experiment in 2004. In 2006, we discovered that we needed to include new dimensions and discard some useless ones, so we created a new version of the experiment. When we then analyse the performance of the experiment between 2002 and 2008, we need to treat the change in the experiment's dimension sets properly. In biological experiments and social-behaviour surveys, such version changes would be even more frequent.
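One way to sketch such versioning; the version dates and attribute names below are made up for illustration:

```python
# Hypothetical versioning of an experiment's dimension set: each version
# records when it took effect and which attributes it collects.
EXPERIMENT_VERSIONS = [
    # (version, effective_from, attributes collected)
    (1, "2004-01-01", {"dose", "temperature", "response_time"}),
    (2, "2006-01-01", {"dose", "temperature", "humidity", "genotype"}),
]

def dimensions_for_range(versions, start, end):
    """Report which dimension sets are in play between start and end, and which
    attributes are common to all of them (safe to trend across the whole range)."""
    in_play = []
    for i, (ver, effective_from, attrs) in enumerate(versions):
        next_from = versions[i + 1][1] if i + 1 < len(versions) else None
        if effective_from <= end and (next_from is None or next_from > start):
            in_play.append((ver, attrs))
    common = set.intersection(*(attrs for _, attrs in in_play)) if in_play else set()
    return common, in_play

common, per_version = dimensions_for_range(EXPERIMENT_VERSIONS, "2002-01-01", "2008-12-31")
# 'common' can be trended across the whole window; version-specific attributes
# must be reported per version (or their gaps handled explicitly).
```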
I have a relational model of variable dimensionality for attributes and parameters here: http://appdesign.blessedgeek.com/discrete-flow-resource-management. Not exactly EAV, but it gives an idea of what industrial multi-dimensional data modelling entails.