Article

Data Quality Assurance in Enterprise Data Warehousing from a Risk Management Standpoint

By Beng Lim, Data Quality Practice Lead, Broadstreet Data Solutions


Data Quality is the state of completeness, validity, consistency, timeliness and accuracy that makes data appropriate for use.

There is no nightmare worse than lost or disappearing profits due to poor decisions made based on incorrect data. If not rapidly addressed, the long term viability of the corporate entity itself may even be threatened.
How can reliable information be obtained then… without necessarily costing an arm and a leg? This is where an integrated approach will help the most. By combining probability theory and risk management we can ensure Data Quality pragmatically; especially with the limited resources characteristic of today’s business climate.

As a matter of fact, certain Data Quality problems are more likely to occur than others. We also know that success is usually incremental and that the riskier the chunk the bigger its success becomes. Integrate these realities into our practice and the solution stares us right in the face - deploy what limited resources we have to solve the higher risk problems, rather than attempt to mitigate each and every data issue!

The strategy is to use risk as a criterion for prioritizing quality improvement tasks. Risk is defined as "the possibility of incurring loss”-- of reliability in the data in our case. There are three key elements in the risk-based approach to improving Data Quality:

  • Define user expectations for the data using metadata and Data Quality metrics that measure the accuracy, validity, completeness, consistency and timeliness of the data as appropriate for each use case.

An Important Caveat:

While Data Quality process improvements should be made as close as feasible to where source data is captured, (i.e., starting in the operational transaction system), do note that since alternative uses of the data impose different quality requirements, transaction system and data warehouse users may not share the same motivation.

If transaction system users are also responsible for data entry, their interests will dictate where greater attention is paid to Data Quality. Take the closing of customer bank accounts for instance. Although transaction system users may consistently set the status of the accounts to signify closure, they may not always enter the actual closing dates. In this case, data warehouse users who rely on closing date values may miss a good number of customers when providing leads to the customer retention team.

Hence it is advisable to leverage any project to scrub data in the transactional systems by incorporating the data warehouse user's data expectations into the project requirements as well so that quality standards can be met at the outset!

  1. Express risk in terms of what can cause the Data Quality not to meet expectations for specific uses, and initiate actions or projects to minimize the risk.

    Risks may arise out of unforeseen events or known situations that will prevent the data from meeting the expectation. For example, consider the risk to data timeliness:
    • Expectation - The data will be available three business days after month end.
    • Event - Server downtime in the source system prevents completion of month end processing prior to the extraction of data for loading into the data warehouse.
    • Risk - The data may be up to two days late if the source system server is down during the last few days of the month. In this instance, an event beyond our control (source system server downtime) is the threat, and the reliability of the source system server is in question.

      A second example depicts a known situation risk:
    • Expectation - The weight and dimension data for delivered products will be accurately captured.
    • Situation - There are no edit checks in the source system interface requiring this data to be entered. And even if the entries were mandatory, they may be "best guesses" instead of actual measurements.
    • Risk - The data may not be present for a significant number of records.
  1. Incorporate Verification, Validation and Certification (VV&C) as an integral practice for proactive detection of potential Data Quality breaches. The goal is to integrate Data Quality assurance into the ETL process rather than treating it as a separate activity. VV&C results would then be analyzed for process improvement opportunities.

Evaluating Risks

To evaluate Data Quality risks undertake the following two tasks:

  • Source Data Verification and Validation (V&V): To generate metrics to compare actual Data Quality against the Enterprise Data Warehouse (EDW) user expectations; and
  • Downstream Validation and Certification (V&C): To evaluate the appropriateness of the data used by each EDW application.

Verification checks the data's format, type, conformity of values to ranges or itemized lists and consistency with other related data, or checks for conformance to metadata describing the database design. Validation checks for data accuracy in comparison to the real world objects described by the data (i.e., observations in the real world or a standard accepted as correct). Certification is an official confirmation that the data has undergone V&V - it provides a measure of confidence to downstream consumers contemplating various uses for the data, thereby reducing the risk of inappropriate utilization.

Source Data V&V can mostly be accomplished using data profiling software to run certain edit checks and measure the quality of the data presenting the greatest risk for losses due to inappropriate use. Metadata describing Data Quality requirements at the database, data entity, data element and data value levels of detail, i.e. data definitions (its type and size, its allowable set of values, if it can be blank or null and the relationship it has with other fields) and business rules, can be readily utilized by the software to conduct the exercise, saving a rather tedious manual effort otherwise. For instance, date data elements can be tested for real dates (measuring occurrences of false dates such as February 30th., 2006) and numeric fields for the occurrence of non-numeric data.

Metrics defined to describe expectations for Data Quality can be translated into edit checks that detect and report on data defects such as duplicate records, missing values, violations to sets of allowable values, and inconsistencies across related sets of data. Let’s say there is a business rule stating that if AGE is 15 or less and the PROVINCE is "ON" then the DRIVERS LICENSE field must be blank. In that case, an edit check for the presence of driver license numbers in the customer records of Ontario youths who maintain bank accounts can be put into place. Edit checks are to be run against data prior to its extraction into the data warehouse.

Customer data integrity pertains to those checks that are performed on data elements that identify and describe people, households or businesses, particularly customers (name and address fields, phone numbers, e-mail addresses, etc.). Data Quality assessment is especially critical when consolidating multiple sources of customer data, i.e. when we are not sure if and how many customer records have been duplicated. For all we know, the following customers may in fact be the same person:

Customer No. Name Address
00361 William Johnson 439 King Street, Toronto
27889 B. Johnson 439 King Street, Toronto
77854 Bill Johnson 439 King Street, Toronto

Ensuring customer data integrity usually demands a specialized tool such as Trillium to help clean and enhance the data. Activities include data standardization, data enrichment and address verification. Passing the above example through the tool, depending on its configuration, it is completely possible to end up with just a single customer record with a set of standardized and enriched data elements and their corresponding values:

Customer ID
First Name
Last Name
Street No
Street Name Prefix
Street Name
Street Type
Street Type suffix
City
Postal Code
00000010
William
Johnson
439
-
King
St.
W
Toronto
M5V 1K4

Edit run results should be released to users as summary statistics to provide them with insight into the data's condition, or distributed to source system representatives as detailed defect listings so they can work on improving the data for future imports into the data warehouse. Ultimately, the same edit checks used to detect Data Quality problems can be subsequently utilized to monitor the impact of projects initiated to increase the quality of problem data.

Downstream V&C, on the other hand, must be performed by subject matter experts (SMEs), i.e., user representatives who can effectively sample the data against its source to measure its accuracy. Documented outcomes of VV&C efforts can be authenticated and subsequently developed into a usage profile of the EDW that describes appropriate and inappropriate uses for the data to other downstream consumers. Doing so will establish a valuable “Certified Guide to using The EDW”. The certification can be implemented with an easily understood grading system that quickly explains how accurate the data is. Letter-grades (i.e., A, B, C, D, E and F) may be adequate. A percentage of rows that pass the respective evaluation are associated with a corresponding grade. For instance, an A means greater than 90 percent accuracy; a C may mean that the data is over 70 and up to 80 percent accurate, and an F denotes highly risky data as it represents below 50 percent accuracy.

When the quality of the data is at risk of or actually failing to meet expectations, it is time to re-evaluate the risk and plan for mitigation.

Evaluating Risks

Data Quality risk has two aspects to it--probability and severity. In cases where VV&C indicates the probability of the Data Quality failing to meet expectations to be high, we must weigh the severity of losses against the cost to avoid the loss. As such, alternative actions should be systematically identified, evaluated and refined as project initiatives to improve the Data Quality based on the cost-effective avoidance of likely losses.

Since Data Quality problems can surface at any point in a data warehouse development effort we can contend them with Data Quality procedures specific to the stages of the life cycle, i.e. during requirement definition, design formulation, acquisition, maintenance and archival as described below:

  • Requirement Definition and Design Formulation Phases in which Data Quality procedures are activated to offer significant early bird benefits - reduced likelihood of problems occurring and lessened severity of their impact. Where possible, the event or situation should be dealt with directly in the design of the data warehouse itself, or through enhancement of the source system. Risks from user misunderstanding would be greatly curtailed if ambiguity in the meaning of data and redundant data elements can be initially removed during formulation.
  • Acquisition and Maintenance Phases can be employed to decrease uncertainty about the risk itself. By incorporating edit checks and quality metrics into the data extraction process, we can demonstrate the risk to be either negligible or considerable. In the latter scenario, appropriate actions should then be planned and built into the data transformation and/or loading processes.
  • Archival Phase can include safety procedures to prevent obsolete data from misuse. But what happens if the expectations are neither technically nor politically reasonable and nor consistent? When existing sources cannot satisfy the expected Data Quality or an expectation is unreasonable, we can seek an alternative source or adjust the expectation of the downstream application by either filtering the data or developing an intermediate process to augment the quality of the existing source system data.

Data Quality problems with cultural or political roots may be better resolved via policy and organizational refinements instead of through technical considerations. For instance, motivating transaction system users to enter quality data for data warehouse use may require fundamental changes to how their work performance is reviewed and/or in the funding of the source transaction system operations.

Conclusion

This risk-based approach to Data Quality assurance in the data warehouse resolves Data Quality problems by incorporating activities for improving the quality of data into the ETL process and then keeping a disciplined vigilance over them. The key to successful risk management, advocates the explicit definition of user Data Quality expectations. It provides a cost effective framework for identifying and evaluating risks--rather than doing it indiscriminately for all data imported into the data warehouse. Managers of data warehouses need only confront the risks by focusing on the major problem data elements likely to cause the most significant losses. Consequently, they have a strategy for mitigating the risks -- by reducing uncertainty, seeking alternatives and adjusting user expectations in addition to addressing the risk factors directly.