Scalability through Parallelism
Ever find yourself commuting to work each day down the same road, wondering why there is so much traffic? You ask yourself, “Did the city planners not predict that a large number of commuters would use this road to get into the city when they built it?” Then comes the time for road construction and the traffic comes to a complete standstill. While it’s true that a population spurt in a particular area or city may be the cause of a rapid roadway overload, it’s clear that roads which frequently bottleneck within a few years, even in non-peak hours, were not made to handle the amount of traffic that has befallen them.
When cars cannot get through, it shows that scalability was not given enough priority. Consider that ultra-fast exotic sports cars sit still just like any other car waiting to pass the bottleneck. Without space to race, you are just as well off driving an old grocery-getter station wagon. Having a nicer-sounding engine might give the sports car a slight edge for finding dates, but where time is concerned, the width of the traveled route is the key.
Scalability matters just as much in data processing. In any data processing scenario where the volume of data is growing at a furious rate, scalability is extremely desirable, if not mandatory. Scalability allows our processes to grow, and with the arrival of Big Data, being able to “tweak” an application for higher flow certainly chalks up a point in the positive column. One such way of tweaking involves adjusting the level of parallelism, a feature offered by Ataccama’s Data Quality Center (DQC). As in our road traffic scenario, parallelism helps reduce the bottleneck by adding more “lanes” for the traffic to pass. The magic of the “how” lies in making processors do more work. We’ll come to that in just a bit.
Filter vs. Complex Steps
Before looking at how to configure the parallelism level and which level is best, an important point needs to be made about the plan steps being used. DQC classifies steps into two types, Filter and Complex, which differ in how they behave when they process records.
Upon launching the runtime, each Filter Step instance runs in a separate thread. It obtains a batch of records from the previous step, performs its job on them, and sends them on to the next step. Because records are passed along as soon as they are processed, the step does not need to accumulate data in memory or on the hard drive. Examples of Filter Steps are Guess Name Surname, Column Assigner, or Strip Titles.
In contrast, a Complex Step instance requires all the data before it even starts working. And as the step doesn’t know how many records to expect from the input, it usually stores and manipulates the data on a hard drive. Examples of Complex Steps are Unification, Representative Creator, or Data Sampler.
The most important point to take from this comparison is that Filter Steps do not commonly use memory or hard drive space during their runtime, whereas Complex Steps do, and that storage overhead slows processing down. This is what makes Filter Steps suitable for parallelism.
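To make the difference concrete, here is a minimal Python sketch of the two behaviors. It is not DQC code: the function names, the record layout (a dict with `name` and `group` keys), and the toy logic are all assumptions chosen for illustration, loosely inspired by the Strip Titles and Representative Creator steps mentioned above.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List

Record = Dict[str, object]  # hypothetical record layout: {"name": ..., "group": ...}


def batches(records: Iterable[Record], size: int) -> Iterator[List[Record]]:
    """Yield successive batches of at most `size` records."""
    it = iter(records)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def strip_titles(batch_stream: Iterable[List[Record]]) -> Iterator[List[Record]]:
    """Filter-style step: take one batch, process it, pass it on, keep nothing."""
    titles = ("Dr. ", "Mr. ", "Ms. ")  # illustrative list only
    for batch in batch_stream:
        out = []
        for rec in batch:
            name = str(rec["name"])
            for title in titles:
                if name.startswith(title):
                    name = name[len(title):]
                    break
            out.append({**rec, "name": name})
        yield out  # the batch leaves the step before the next one is read


def pick_representatives(batch_stream: Iterable[List[Record]]) -> List[Record]:
    """Complex-style step: needs the whole input before it can emit anything."""
    # The entire input is buffered here; a real tool might spill it to disk.
    all_records = [rec for batch in batch_stream for rec in batch]
    best: Dict[object, Record] = {}
    for rec in all_records:
        key = rec["group"]
        # Toy rule: keep the first record with the longest name in each group.
        if key not in best or len(str(rec["name"])) > len(str(best[key]["name"])):
            best[key] = rec
    return [best[key] for key in sorted(best)]
```

Chaining them as `pick_representatives(strip_titles(batches(records, 2)))` shows the contrast: the filter-style step only ever holds one batch at a time, while the complex-style step cannot produce a single result until it has consumed every record.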
What number should be used?
If you are wondering which parallelism level works best, start from what the number means: it is the number of times each Filter Step is duplicated, so that records which would otherwise pass through a single thread are split into separate batches. That number should in turn match the number of processor cores you wish to engage during the runtime. For small runs (i.e., tens of thousands of records), the difference between one core and several is not likely to be noticeable, so the usual default setting of “1” is optimal. However, for large runs (records in the millions), utilizing all available cores (2 or more) in the processor is where things start to happen. Whichever route is taken, remember that total processing time also depends on the overall plan file size and on the step types used within the plan (Complex Steps typically require more time to process records). As an example of what is achievable, DQC is capable of matching rates of anywhere from 5 to 15 million matches per hour.
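The idea of duplicating a step across cores can be sketched in a few lines of Python. Again, this is not how DQC is implemented: `choose_level`, `run_filter_step`, the batch size, and the one-million-record threshold are hypothetical names and values picked only to mirror the guidance above.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def choose_level(record_count: int, large_run_threshold: int = 1_000_000) -> int:
    """Hypothetical heuristic: level 1 for small runs, all cores for large runs.

    The threshold is an assumption; the article only distinguishes
    "tens of thousands" from "millions" of records.
    """
    if record_count < large_run_threshold:
        return 1
    return os.cpu_count() or 1


def run_filter_step(
    step: Callable[[T], R],
    records: Sequence[T],
    level: int,
    batch_size: int = 1000,
) -> List[R]:
    """Run `level` duplicates of a filter-style step over batches of records."""
    chunks = [records[i:i + batch_size] for i in range(0, len(records), batch_size)]

    def work(chunk: Sequence[T]) -> List[R]:
        return [step(rec) for rec in chunk]

    # Each worker thread plays the role of one duplicate of the step.
    with ThreadPoolExecutor(max_workers=level) as pool:
        # Executor.map yields results in the order the batches were submitted.
        return [out for chunk_out in pool.map(work, chunks) for out in chunk_out]
```

A call such as `run_filter_step(str.strip, names, level=choose_level(len(names)))` would spread the batches over as many workers as the heuristic suggests. Note that in standard CPython, threads do not speed up CPU-bound pure-Python work, so the sketch shows the structure of the approach rather than its performance.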
Methods of Level Definition
There are three possible methods of defining the parallelism level for a plan file runtime:
- UI
- Configuration File
- Performance File
For the purposes of this article, we’ll place the spotlight on the User Interface, the most straightforward of the three. For information on utilizing the other, more advanced options, please write to us at support@ataccama.com, and we’ll be happy to assist you.
The User Interface method utilizes the Run Icon (for launching plan files) in the toolbar.
After choosing the Run Configuration option, the Runtime Configuration dialog appears.
Under the Ataccama DQC launch tab, the third text box contains the level of parallelism parameter. Any Filter Steps within the plan will run according to the defined level. That’s it. You have effectively “widened” the roadway. Now, keep in mind that there are other runtime configurations that can be made to further enhance runtime performance, all of which can be found documented in the DQC Help section.
In a time when traffic can become quite overbearing, methods of relieving it must keep pace in order to meet time requirements. Where data processing is concerned, there is no doubt that volumes will continue to increase and strain the components (both physical and virtual) that transport them. We said that width is our ally, so leave the car engines alone and employ the performance features that help widen those roadways. Otherwise, you are simply turning your sports cars into grocery-getters.

