The Biggest Philosophical Debate in Data Management


Winston Chen

In a recent webcast I did with Jim Harris, he talked about two views of data quality: provider centric and consumer centric. According to the provider centric view, data are just digital representations of real world things. If the representations are accurate, then you can use them for anything. In other words, data is good as long as data providers do their job right. The consumer centric view says, data is good only if it’s fit for use, i.e., if it meets the declared needs of consumers.

In fact, beyond data quality, many of the heated arguments in data management are rooted in this philosophical debate. And it is not just a theoretical argument: the implications for data management practices are huge.

Let’s take the debate between Inmon and Kimball. The Inmon camp takes the provider centric view. Data can be stored in a use-case-neutral way. Its state is absolute, like that of objects in classical, Newtonian physics. So, if you cleanse your data properly and organize it based on its intrinsic properties (3NF) in a big data warehouse, you can meet the needs of any consumer.

The Kimball camp, on the other hand, takes the consumer centric view: Data’s value is in the eye of the beholder and therefore relative. So data should be managed based on its specific use. This gives rise to the star schema, which organizes data to answer a bounded set of questions for a bounded set of consumers. Within those boundaries, navigation is easy and intuitive, and queries come back fast.
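
To make "a bounded set of questions" concrete, here is a minimal sketch of a star schema, using Python's built-in sqlite3 module. The table names (dim_store, dim_product, fact_sales) and the sample rows are made up for illustration; they are not taken from Kimball's work or from any real system.

```python
import sqlite3

# In-memory database; all table names and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables describe the "by what" of each question.
    CREATE TABLE dim_store (store_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);

    -- The fact table holds the measures, keyed by the dimensions.
    CREATE TABLE fact_sales (
        store_id INTEGER REFERENCES dim_store(store_id),
        product_id INTEGER REFERENCES dim_product(product_id),
        amount REAL
    );

    INSERT INTO dim_store VALUES (1, 'East'), (2, 'West');
    INSERT INTO dim_product VALUES (10, 'Grocery'), (11, 'Apparel');
    INSERT INTO fact_sales VALUES (1, 10, 20.0), (1, 11, 35.0), (2, 10, 15.0);
""")

# A question the schema was designed for: sales by region, one join away.
sales_by_region = conn.execute("""
    SELECT s.region, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_store AS s ON s.store_id = f.store_id
    GROUP BY s.region
    ORDER BY s.region
""").fetchall()

print(sales_by_region)
```

Questions that fit the dimensions (by region, by category) stay one join away from the facts. A question that falls outside them, say sales by weather, would need a new dimension, which is the sense in which the design is bounded.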

Provider centric people believe there is a single version of the truth. Consumer centric people are skeptical. Provider centric people think of data as bullion that you can store in a vault; consumer centric people think of data more as employees, who are valuable only when they’re put to work on jobs that match their competencies.

This debate really comes down to a simple question on the nature of data: can data be consumer independent? In other words, is it possible for there to be a single set of data definitions and rules, which, when instantiated in a repository like an MDM system or a data warehouse, can serve any consumption need? If the answer is yes, we should go with the provider view. If the answer is no, we should go with the consumer view.

As much as I wish for a simple world, I think the answer is, unfortunately, yes and no and maybe. It depends on what kind of data.

Some data are plain, immutable facts:

  • A point-of-sale transaction.
  • A customer’s legal name.
  • A click on a web site.

In general, these types of data represent real world events and physical objects, so they are indeed consumer neutral. We should manage them in a provider centric way and try to establish a single version of the truth.

But other types of data are not so absolute:

  • Hierarchies. They are typically designed with a purpose in mind. A single hierarchy will never meet the needs of all consumers.
  • Customer classification. Unlike legal name, it has no basis in the real world.
  • Web sessions. If the user navigates away for 10 minutes and comes back, is it the same session? How about 2 hours?

These types of data are typically invented concepts, which, when created for one purpose, may be utterly useless for another. We need to manage them in a consumer centric way.
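
The web session example can be sketched in a few lines of Python. The clicks are the immutable facts; the session count depends entirely on the inactivity timeout a consumer chooses. The function name count_sessions, the timestamps, and the three timeout values are all assumptions made up for illustration.

```python
from datetime import datetime, timedelta


def count_sessions(click_times, timeout):
    """Count sessions, starting a new one whenever the gap between
    consecutive clicks is longer than `timeout`."""
    ordered = sorted(click_times)
    if not ordered:
        return 0
    sessions = 1
    for previous, current in zip(ordered, ordered[1:]):
        if current - previous > timeout:
            sessions += 1
    return sessions


# Hypothetical click stream for one user: the raw, consumer-neutral facts.
start = datetime(2011, 3, 25, 9, 0)
clicks = [
    start,
    start + timedelta(minutes=2),
    start + timedelta(minutes=12),           # back after a 10-minute gap
    start + timedelta(minutes=15),
    start + timedelta(hours=2, minutes=15),  # back after a 2-hour gap
]

# Three hypothetical consumers, three session definitions.
short_timeout = count_sessions(clicks, timedelta(minutes=5))
medium_timeout = count_sessions(clicks, timedelta(minutes=30))
long_timeout = count_sessions(clicks, timedelta(hours=3))

print(short_timeout, medium_timeout, long_timeout)
```

The same five clicks yield three sessions, two sessions, or one, depending on the timeout. None of the answers is wrong; each is fit for a different use.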

Between these extremes there are shades of gray. Data governance is critical for making these decisions collaboratively and expressing them as policies.

7 Responses to “The Biggest Philosophical Debate in Data Management”

  1. Paul Fulton March 25, 2011 at 4:48 am #

    Fascinating article – I tend to agree and it is useful to be reminded of the value in the consumer perspective. I think the tendency for data professionals is to order, control, and objectively provide the data in the classic “if we build it they will come” approach. Data Governance approaches often reinforce this in their goal of an enterprise-wide single version of the truth.

    The truth is subtler and more complicated, and speaks to the federalism post you made, where we need to be relaxed about different perspectives. I think your examples are spot on: merchant hierarchies, segmentation, profile groups, etc.

    One I have struggled with in the past, though, is the definition of a “customer” vs. a “prospect”. Surely this is just a consumer-driven view (like session)? Yet I’ve seen it wreak havoc across an enterprise when people try to communicate numbers if it is centrally defined (provider centric).

    • Winston Chen March 25, 2011 at 12:56 pm #

      Thanks Paul for your comment. Yes, we do have a tendency to standardize and control. It’s not a bad thing, but we need to understand the limits. An effort to standardize on a single merchant hierarchy enterprise-wide, as you pointed out, is probably futile and a waste of time.

      Defining “customer” and “prospect” is tough, because each business function has a different view of what a customer is. And they’re all correct. A data modeler would say you have to be more precise than just “customer”: use terms like “revenue-generating customer”, “ship-to”, or “prospect”. But how successful can we be in changing entrenched business vocabulary? Maybe in this data domain we should accept that we need to “translate”: translate what marketing means by “customer” to what finance means by “customer.”

  2. Jim Harris March 25, 2011 at 12:32 pm #

    Excellent blog post, Winston.

    Obviously, I agree that this is the biggest philosophical debate in data management. And I also agree, as with many complex challenges, although this can be viewed as a binary problem, where at first it appears we must choose between two polar opposites, either a data provider perspective or a data consumer perspective, reality is never that simple, and it definitely depends on what kind of data we are working with.

    It would be great if the data provider perspective could truly be applicable to all kinds of data, since without the challenges of relativity and subjectivity, an absolute and objective single version of the truth could serve all data consumers. This doesn’t mean that there is no value in the data provider perspective, since as you noted, it does work well for certain kinds of data, for which it should be used.

    The most common data governance failure I have witnessed is attempting to force a data provider strategy onto data that requires a data consumer strategy. Perhaps a good starting point for establishing a data governance program is to intentionally focus on data that can actually be managed using a data provider strategy. Create, manage, and provide a central, shared repository of those kinds of data, and eliminate any data and organizational silos related to them. This would only be the starting point for the overall data governance efforts, but at least it would allow building upon a successful foundation.

    Best Regards,

    Jim

    • Winston Chen March 25, 2011 at 1:12 pm #

      Thanks Jim for your comment! And thanks for giving me the idea for this blog.
      I want to highlight your statement: “The most common data governance failure I have witnessed is attempting to force a data provider strategy onto data that requires a data consumer strategy.” That is a very astute observation.
      I think consumer centric data also need to be governed. But the strategy, or the policies, are different. They would be more about making sure that the consumer’s needs are recognized and declared, and, when appropriate, that the providers are on the hook for meeting the consumer’s demands. For example, to do physician spend reporting for pharma (a consumer need for regulatory compliance), the sales reps (the providers) need to enter data differently in the CRM system. This is a consumer-driven policy, and a high-priority one!

  3. Leonard Anderson March 30, 2011 at 1:14 pm #

    Is there possibly another category that is typical of the UK public sector? I will call it the data ‘joiner’ or ‘broker’ until I can find the correct term.

    We need to share data about citizens using multi-agency services. Some will be central government services, such as welfare benefits; some are regional, such as healthcare; others are local, such as transportation or education.

    Governance of interoperability is highly complex and crosses many technical and cultural barriers: multiple data providers servicing one or more data consumers. It requires governance agreed by many independent, self-governing partners. We have been researching this for years, and I think that there are some practical frameworks. But I can’t get any traction for modeling methods that should help, e.g., those that comply with ISO 18876.

    Have you guys got any bright ideas to move people forward?

    • Winston Chen March 30, 2011 at 1:28 pm #

      Thanks for your comment, Leonard. I wrote a blog post recently that sets up a framework for thinking about the problem you’re describing: Managing Master Data Using Federalist Principles. Most big organizations, like one as big as the “UK Public Sector”, are federations. It’s helpful to think about data management that way, too.
      Secondly, Kalido’s policy centric way of doing data governance can help. You want to set up a process for providers and consumers to come to some agreements on how to deal with data, express them as data policies, and get them signed off. This framework is here.
