View RSS Feed

hsheng

The Challenge of Removing Duplicate Data

Rate this Entry
by , 07-21-2011 at 11:24 AM (377 Views)
Data duplication is not a problem that is specific to any particular industry. Data often gets duplicated for a variety of reasons: manual entry errors, redundant entries in different source systems, missing data constraints, etc. However, before data can be successfully de-duplicated, you must first define the criteria which identify which 2 records are considered duplicates. For example, do you see any duplicate records in the following table?
Click image for larger version. 

Name:	dup_input1.jpg 
Views:	347 
Size:	56.7 KB 
ID:	80

Some may say yes, while others will say no. It all depends on how the criteria for identifying duplicate records are defined.

If the requirement is that every field value must match, then there are no duplicate records in the table. However, if the requirements are that the fields item, size, and color must match, and we want to keep the most recent entry, then the resulting table would look like this:
Click image for larger version. 

Name:	dedup_out.jpg 
Views:	91 
Size:	44.5 KB 
ID:	81

In this example, we removed 3 records because they contained duplicate entries for item, size, and color:
Click image for larger version. 

Name:	dups.jpg 
Views:	87 
Size:	19.7 KB 
ID:	82

Of course this is a very simple exercise with only 11 records. What happens when you are dealing with 11,000 or even 11 million records? Naturally not something you would want to do in Excel or by hand! This is where expressor shines and makes data de-duplication an easy task.

To accomplish this logic using expressor, all you need to do is use the Unique operator in your dataflow:
Click image for larger version. 

Name:	dedup_flow.jpg 
Views:	90 
Size:	13.7 KB 
ID:	83


Click image for larger version. 

Name:	unique_op_pty.jpg 
Views:	90 
Size:	33.1 KB 
ID:	84
The unique operator can be easily configured by
specifying the fields which would constitute a duplicate record. In this example, we would check item, size, and color as show to the right.

The mode setting identifies which record should be kept when duplicates are encountered. The method setting allows you to specify how the processing should occur. If there’s enough memory available and your data volumes are not too large, the In Memory setting will yield the best performance.

Once configured properly, the Unique operator will identify and eliminate duplicate records at runtime.
This simple approach works regardless of the total number of records being processed. So the next
time you encounter duplicate data that shouldn’t be there, just remember how easy it is to use
expressor to eliminate those duplicates.

Comments

Gravatar as Default Avatar by 1e2.it

SEO by vBSEO 3.6.0