“He shoots…he scoooooores!”
Anyone who follows hockey all-star Jaromir Jagr, either in the NHL or on the Czech national team, has become accustomed to hearing those words. Since his career began in 1988, he has won league scoring titles and MVP awards, and has helped lead NHL teams to Stanley Cup titles and his homeland to medals at the Olympics and world championships. Quite the resumé.
In hockey, as in most sports, it’s a good thing to score. Common sense has it that the player or team with the higher score at the end of the game or match is declared the winner. A high score in the context of data quality, however, won’t make one enthusiastic.
In fact, the goal is for data to have the lowest score possible, ideally "0". Why is this true? In this article, we will examine the principle behind data scoring and how you can configure steps in DQA or DQC to perform this function.
So what is the logic behind preferring your data to have a low score? Well, the main thing to understand is that data quality processing measures and analyzes the level of data disruption. In other words, we track how bad data is rather than how good it is. Record by record, a numeric score is assigned, indicating the level of data inaccuracies within the respective record. The more inaccuracies, the higher the score. No, it's not about being pessimistic. Since the amount of good data usually outweighs the amount of bad data (hopefully), it's a bit easier to keep track of the negative results.
We explained that data possessing a low score (better yet, no score at all) is a positive thing. Let’s now look at a practical example from the DQC application that reinforces that statement. In the plan configuration shown below, a file with U.S. postal codes is being analyzed. The step that first scores the input source records (Simple Scoring) checks for two things:
- that the length of src_zip is 5, 9, or 10 characters (if not, 10 points are assigned)
- that src_zip contains only digits, except for possible spaces or hyphens (if not, 20 points are assigned)
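The Simple Scoring step itself is configured in the DQC interface, but its logic can be illustrated in plain code. Here is a minimal Python sketch of the two rules above; the function name and implementation details are illustrative, not taken from the product:

```python
def score_zip(src_zip: str) -> int:
    """Assign penalty points to a ZIP value; 0 means it passed every check."""
    score = 0
    # Rule 1: length must be 5, 9, or 10 characters
    # (10 allows for a space or hyphen separator in ZIP+4).
    if len(src_zip) not in (5, 9, 10):
        score += 10
    # Rule 2: only digits, aside from an optional space or hyphen.
    if not src_zip.replace(" ", "").replace("-", "").isdigit():
        score += 20
    return score
```

Note that the penalties accumulate: a value that fails both checks receives 30 points, which later makes it possible to tell from the score alone which combination of rules a record violated.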
Defining those rules is shown in the next two figures.
For invalid formats, a matches string function uses regular expressions to search for 5-digit ZIP codes and ZIP+4 formats written with no separator, a space, or a hyphen. More patterns can be added as needed, based on the incoming data.
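The exact expression in the plan is shown in the figures, but one possible pattern covering all three variants (5 digits, 9 digits with no separator, and ZIP+4 with a space or hyphen) looks like this, assuming Python regex syntax:

```python
import re

# Matches "12345", "123456789", "12345-6789", or "12345 6789".
# This pattern is an illustrative assumption, not the one from the plan.
ZIP_PATTERN = re.compile(r"\d{5}(?:[ -]?\d{4})?")

def is_valid_format(src_zip: str) -> bool:
    """Return True if the whole value matches an accepted ZIP format."""
    return ZIP_PATTERN.fullmatch(src_zip) is not None
```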
The file is scored at this phase because only ZIPs with no score (score = 0) will be matched against the lookup file in the step that immediately follows. In the ZIP Lookup step, we use its built-in scoring capability to score source records once more, penalizing ZIP codes that do not match any ZIP code in the lookup file.
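The two-phase flow can be sketched as follows. The lookup set, the penalty value, and the comparison on the 5-digit base ZIP are all assumptions made for illustration; in the real plan the reference ZIPs come from the lookup file:

```python
# Stand-in for the lookup file of known-good ZIP codes.
VALID_ZIPS = {"15213", "90210", "10001"}
LOOKUP_PENALTY = 40  # assumed point value, not from the plan

def lookup_score(src_zip: str, format_score: int) -> int:
    """Only records that passed the format checks are compared to the lookup."""
    if format_score != 0:
        return format_score          # malformed records skip the lookup
    zip5 = src_zip[:5]               # compare on the 5-digit base ZIP
    return 0 if zip5 in VALID_ZIPS else LOOKUP_PENALTY
```

The design point is the same as in the plan: there is no reason to look up a value that is already known to be malformed, so the format score acts as a gate in front of the more expensive lookup.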
All records have been assigned a score, together with an explanation for the score received. Now it is possible to run a profile on the ZIP Lookup output file for basic reports. The first report shows the scores, the second represents the explanation column.
For in-depth numbers and expanded commentary describing a field, the Data Quality Indicator (DQI) further enhances reporting, allowing for leveraging business rules to extract conclusions about a data field’s quality.
In this case, the DQI step is employed to gather all the scores assigned so far and summarize the values in a format that lists the reasons behind the data inaccuracies and the success rates of matching. The step populates a set of built-in column fields as it processes the dataset.
Four of the columns were configured in our example plan. Of those four, Name, Code, and Expression are required settings.
Percentage is not one of the DQI step's built-in column fields, and was therefore created in the Text File Writer step that follows. The round function was used to fill this field, along with two columns that do exist in the DQI step: Success Count and Total.
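The derived Percentage column amounts to a simple ratio. A Python equivalent of that expression might look like this; the rounding precision and the zero-total guard are assumptions, since the plan's exact settings are only visible in the figures:

```python
def percentage(success_count: int, total: int) -> float:
    """Success rate as a rounded percentage; 0.0 when there are no records."""
    if total == 0:
        return 0.0
    return round(success_count / total * 100, 2)
```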
The following figure shows us the fruits of our labor.
This project showed one way a score can be used to reflect the quality of our data. Numerous types of Boolean conditions can be used to assign a score to a record. Some examples include functions or steps that:
- measure string length
- test for numeric or non-numeric characters
- test for a certain string within a field
- test if a field is populated
- determine whether values are equal to, less than, or greater than the source value
- determine whether existing source data matches to a standardized look-up file
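Several of these condition types can be combined into a single scoring pass. The sketch below is hypothetical (the checks, point values, and names are illustrative, not from any plan), but it shows the general pattern: each failed Boolean condition contributes its own penalty, and the penalties sum into the record's score:

```python
def score_record(value: str, reference: set) -> int:
    """Sum penalty points for each failed check; 0 means a clean record."""
    checks = [
        (len(value) == 5,       10),  # string length
        (value.isdigit(),       20),  # numeric characters only
        (bool(value.strip()),   40),  # field is populated
        (value in reference,    30),  # matches a standardized lookup
    ]
    return sum(points for passed, points in checks if not passed)
```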
So remember, if someone tells you that your data received a high score, do not raise your hands into the air and begin a celebration dance, because the loud dance music and arena foghorn will not be sounding off in your favor. Instead, lower your shoulders and keep your stick on the ice—it’s time to find out where your data is having problems.
photo by s. yume via Flickr (Creative Commons license)