The Discard Duplicates type in a transform process lets you select one or more fields to compare, and discards rows whose values in those fields are duplicated. Sort the data on those fields before you run this transform, because sorting minimizes memory use. Comparing each record with many other unsorted records requires a large amount of memory when the data volume is huge. If each record only has to be compared with the previous record, the transform can run through massive data sets without needing large amounts of memory.
The following table shows an example of the input:
Field 1 | Field 2 |
---|---|
A | 5 |
A | 9 |
A | 9 |
D | 1 |
D | 3 |
If you select Field 1 to compare, only the first row of each run of identical Field 1 values is kept. The following table shows the output:
Field 1 | Field 2 |
---|---|
A | 5 |
D | 1 |
If you select Field 2 to compare, only the second of the two consecutive rows with the value 9 is discarded. The following table shows the output:
Field 1 | Field 2 |
---|---|
A | 5 |
A | 9 |
D | 1 |
D | 3 |
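
The behavior in the example tables can be approximated with a short Python sketch. It is not the product's implementation: the function name `discard_duplicates` and the `compare` parameter are hypothetical, and the sketch assumes that each row is compared only with the row immediately before it, as described above.

```python
from typing import Iterable, Iterator, Sequence, Tuple

Row = Tuple[str, int]

# Sentinel meaning "no previous row yet"; it never equals a real key.
_NO_PREVIOUS = object()


def discard_duplicates(rows: Iterable[Row], compare: Sequence[int]) -> Iterator[Row]:
    """Yield a row only if its compared fields differ from the previous row's.

    Hypothetical helper: `compare` holds the zero-based positions of the
    fields to compare. Only one key is kept in memory at a time.
    """
    previous_key = _NO_PREVIOUS
    for row in rows:
        key = tuple(row[i] for i in compare)
        if key != previous_key:
            yield row
        previous_key = key


# Input rows from the example table (Field 1, Field 2).
rows = [("A", 5), ("A", 9), ("A", 9), ("D", 1), ("D", 3)]

by_field_1 = list(discard_duplicates(rows, compare=[0]))
by_field_2 = list(discard_duplicates(rows, compare=[1]))

print(by_field_1)  # [('A', 5), ('D', 1)]
print(by_field_2)  # [('A', 5), ('A', 9), ('D', 1), ('D', 3)]
```

Under this assumption, two rows with equal values that are not adjacent would both be kept, which is one more reason to sort on the compared fields first.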