Skip to content

gristle_differ

ken farmer edited this page May 15, 2021 · 6 revisions

Usage Scenarios:

  • Changed data capture - rather than rely on triggers, change dates, or messaging this utility allows one to compare the before and after file or exported-table records and identify exactly what is new, deleted, changed and the same. And there's no risk of any changes being missed due to incorrect date updating logic, non-logged transactions, etc.
  • Accurate CSV diffing - rather than use the inaccurate diff utility, one can very accurately identify exactly what is different between two csv files for a variety of reasons: for finding missing data, regression testing, etc.
  • Versioning data getting loaded into a data warehouse - this tool can perform the changed data capture and split all change types (insert, delete, change, same) into separate files for different load options. It can ignore columns (such as ids and last-loaded timestamps), and it can perform post-delta transformations (such as increment an id column, add a delete flag and add from and to timestamps).

Alternatives:

  • One could use the unix diff utility - if you can tolerate inaccuracies caused by inability to parse csvs, inability to distinguish a new record from a changed record, inability to ignore columns, and lack of post-comparisons transformations.
  • Custom SQL - can accurately and quickly do the job, but it requires all data to be loaded into a database, and a lot of custom and error-prone SQL written for each set of tables to be compared.
  • ETL tool - large ETL tools can also perform this kind of delta comparison. However, they tend to be costly, bulky, and a lot to learn for just one simple function.

Functionality

Comparisons, sorting and deduping:

The program sorts the two input files based on unique key columns, then removes any duplicates based on this key (leaving 1 of any duplicate set behind).

Comparisons of matching records can be limited to specific columns in two ways: - using --compare-cols to assume no cols will be compared other than these specifically identified. - using --ignore-cols to assume all cols will be compared, and explicitly identify those not to compare.

Transformations:

After the comparison post-delta transformations can be performed in order to ready the file for subsequent processing. Examples of these transformations include: - incrementing a key column in the .insert & .chgnew files - populating a delete flag in the .delete file - populating a row version starting timestamp in the .insert and .chgnew files - populating a row version ending timestamp in the .delete and .chgold files - populating a batch_id in any or all files

Config File

The config file is intended to help with complex operations by allowing the user to easily move common options and args into a file.

Format:

  • The config file uses a yaml format, with names of config items that match their command line argument names (except they use underscores rather than dashes).

Name and Location:

Examples

Example #1 - Simplest comparison

$ gristle_differ -i file0.dat file1.dat --key-cols 0 2 --ignore-cols 19 22 33

Produces the following files:
   - file1.dat.insert
   - file1.dat.delete
   - file1.dat.same
   - file1.dat.chgnew
   - file1.dat.chgold

Example #2 - Complex Operation

$ gristle_differ -i file0.dat file1.dat --config-fn ./foo.yml  \
    --variables batchid:919 --variables pkid:82304

Produces the same output file names as example 1.

But in this case it gets the majority of its configuration items from
the config file ('foo.yml').  This could include key columns, comparison
columns, ignore columns, post-delta transformations, and other information.

The two variables options are used to pass in user-defined variables that
can be referenced by the post-delta transformations.  The batchid will get
copied into a batch_id column for every file, and the pkid is a sequence
that will get incremented and used for new rows in the insert, delete and
chgnew files.

Further References: