A saying often attributed to Mark Twain goes, “A man with one watch knows what time it is; a man with two watches is never quite sure.”
As the saying suggests, and as most of us learn, having more than one of something isn’t always a good thing. In the context of data, multiple copies of the same record are a common source of headaches. For this reason, DQIT provides a way to handle duplicates and restore clean groupings within your data.
Issues involving duplicate data are called Match and Merge issues. This issue type covers data sets that contain potential duplicates of the same record, all of which correspond to a single real-world entity. Typical usage scenarios are:
- de-duplication (or duplicate elimination)
- establishing a relationship between connected records
- combining data from multiple sources to form a single representative record (the “golden record”)
- removal of inconsistencies across systems
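To make the idea of "potential duplicates of the same record" concrete, here is a minimal Python sketch that groups records sharing a normalized match key. The article does not describe how DQIT finds its candidates, so the `match_key` rule, the field names, and the sample records are all assumptions made for illustration only.

```python
from collections import defaultdict

def match_key(record):
    """Hypothetical match key: case- and whitespace-insensitive name plus postal code."""
    name = " ".join(record["name"].lower().split())
    return (name, record["zip"].strip())

def group_candidates(records):
    """Group record ids by match key and keep only groups with two or more members."""
    groups = defaultdict(list)
    for record in records:
        groups[match_key(record)].append(record["id"])
    return [ids for ids in groups.values() if len(ids) > 1]

# Illustrative data, not taken from DQIT.
records = [
    {"id": 1, "name": "John  Smith", "zip": "10001"},
    {"id": 2, "name": "john smith", "zip": "10001 "},
    {"id": 3, "name": "Jane Doe", "zip": "94105"},
]
print(group_candidates(records))  # [[1, 2]]
```

An exact key like this only catches formatting differences. Real matching usually tolerates typos and missing values as well, which is one reason a data steward reviews the proposed groups by hand.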
Resolving an Issue
In the main panel of the Issue List Screen, Match and Merge issues are listed, simply enough, as “Manual merge.” Clicking one of them opens the Issue Resolution Screen.
Once inside the issue, data stewards are presented with a list of candidates (potential duplicates), each shown in its own column. Initially, the tool chooses the most likely key record, the one to be used going forward or for matching, and assigns it the role of merge survivor in the panel row labeled “Resolution.”
To resolve an issue of this sort, the data steward must decide whether each listed record belongs to the group and verify the record labeled as merge survivor. Each record column has a drop-down list of decision codes. By selecting a decision code, the steward confirms the record's inclusion in the group, separates the record (preventing future matching to the group), or sends the record to a supervisor for review. The issue is not considered resolved until a proper decision code has been selected for each record. Possible individual decisions include the following:
- Merge Survivor. The record from the group that serves as the future reference record. Exactly one record always has this resolution value.
- Merge. The record belongs to the same group as the merge survivor and will serve as a “slave” record.
- Reject. The record does not belong to the same group as the merge survivor and will be removed from the group.
- Supervisor. The record requires review by a supervisor before the issue is resolved.
- (Blank). The record still requires a decision from the user (default setting).
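The rules above can be sketched as a small validation routine. This is not DQIT code: the `Decision` enum and the `resolution_problems` function are hypothetical names, and the sketch assumes that a record sent to a supervisor keeps the issue open until it is reviewed.

```python
from enum import Enum

class Decision(Enum):
    """Decision codes from the list above (enum names are illustrative)."""
    MERGE_SURVIVOR = "Merge Survivor"
    MERGE = "Merge"
    REJECT = "Reject"
    SUPERVISOR = "Supervisor"
    BLANK = ""

def resolution_problems(decisions):
    """Return the reasons an issue is not yet resolved; an empty list means resolved."""
    problems = []
    undecided = [rid for rid, d in decisions.items() if d is Decision.BLANK]
    if undecided:
        problems.append(f"records without a decision: {undecided}")
    # Assumption: a record awaiting supervisor review blocks resolution.
    pending = [rid for rid, d in decisions.items() if d is Decision.SUPERVISOR]
    if pending:
        problems.append(f"records awaiting supervisor review: {pending}")
    survivors = [rid for rid, d in decisions.items() if d is Decision.MERGE_SURVIVOR]
    if len(survivors) != 1:
        problems.append(f"expected exactly one merge survivor, found {len(survivors)}")
    return problems

# Illustrative decisions keyed by record number.
decisions = {
    1: Decision.MERGE_SURVIVOR,
    2: Decision.REJECT,
    3: Decision.MERGE,
    4: Decision.REJECT,
    5: Decision.BLANK,
}
print(resolution_problems(decisions))  # ['records without a decision: [5]']
decisions[5] = Decision.MERGE
print(resolution_problems(decisions))  # []
```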
To further assist the data steward with the initial analysis of the group, color coding shows how each field relates to its counterpart in the merge survivor record.
Here is a sample solution for the preceding example.
Of the first five records, records 2 and 4 are clearly not part of the group and are therefore assigned “Reject.” Records 3 and 5 are clearly the same real-life entity, despite one inconsistency. Be sure to also check any additional records not shown on the screen. To view them, click the arrows located directly above the record columns.
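The field-by-field comparison that the color coding conveys can be approximated in a few lines of Python. The article does not say which colors DQIT uses for which relationship, so this sketch returns plain status labels instead. The function name, the labels, and the case-insensitive comparison rule are assumptions.

```python
def compare_to_survivor(survivor, candidate):
    """Label each survivor field as 'same', 'different', or 'missing' in the candidate."""
    result = {}
    for field, survivor_value in survivor.items():
        value = candidate.get(field)
        if value is None or str(value).strip() == "":
            result[field] = "missing"
        elif str(value).strip().lower() == str(survivor_value).strip().lower():
            result[field] = "same"
        else:
            result[field] = "different"
    return result

# Illustrative records, not taken from DQIT.
survivor = {"name": "John Smith", "city": "Boston", "phone": "555-0100"}
candidate = {"name": "john smith", "city": "Austin", "phone": ""}
print(compare_to_survivor(survivor, candidate))
# {'name': 'same', 'city': 'different', 'phone': 'missing'}
```

A candidate with mostly "different" fields is a likely "Reject", while one with a single inconsistency may still describe the same real-life entity.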
Building a Golden Record
Even after a group of records and a merge survivor have been established, the survivor record might still contain blank, invalid, or inaccurate data fields. DQIT allows users to manually select individual attribute values from different records. The data in the newly combined record (the golden record) can subsequently be used to update the records in the group or the merge survivor, or be saved to the master data repository.
Clicking a field in any of the merge duplicate records pulls that value into the survivor record. When you do so, two things occur:
- The clicked record field changes to green, indicating the survivor record is using the contained data.
- The preview column reflects the addition of the retrieved field and displays a complete merge survivor record.
Repeat the above process for as many fields as necessary. Always review all records within the group to ensure that the correct data is being used.
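The selection process above amounts to starting from the merge survivor and overriding individual fields with values picked from other records in the group. Here is a minimal sketch of that idea. The function `build_golden_record`, the `picks` mapping, and the sample data are hypothetical and do not reflect DQIT internals.

```python
def build_golden_record(survivor, group, picks):
    """Copy the survivor, then take each picked field from the chosen record.

    picks maps a field name to the id of the record whose value should be used.
    """
    golden = dict(survivor)  # copy, so the survivor itself is left untouched
    for field, record_id in picks.items():
        golden[field] = group[record_id][field]
    return golden

# Illustrative group: record 1 is the merge survivor, record 3 a merged duplicate.
group = {
    1: {"name": "John Smith", "phone": "", "email": "jsmith@example.com"},
    3: {"name": "John Smith", "phone": "555-0100", "email": ""},
}
golden = build_golden_record(group[1], group, {"phone": 3})
print(golden)
# {'name': 'John Smith', 'phone': '555-0100', 'email': 'jsmith@example.com'}
```

Keeping the golden record separate from the survivor mirrors the preview column: the combined values can be reviewed before they are written back to any record.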
It’s generally said, “Two are better than one.” Is that always true? Perhaps…if you are referring to things like bank notes, training wheels on kids’ bikes, shoes, or cold beverages on a hot summer day. Unfortunately, duplicate data records do not fall into that category. Fortunately, DQIT Match and Merge issue resolution prepares us for this overabundance. No longer must we fear having one too many of something.
Part 4 of the DQIT Series: Multi-Record Inconsistencies is due out next, so be sure to revisit us soon.