What's your opinion, man?
While conducting CReW macro training at Inspire18 with Joe Miller, we covered several common use cases where the "Expect Equal (Beta)" macro provided a solution. In fact, Joe Miller has begun incorporating the macro into his weekly challenges as a way of confirming the challenge results. We got a question from the class about this macro that I'd like to share: "Can you set which fields to check between the two files (instead of checking all fields)?"
What are your thoughts? My response was to "Select" the same fields from both sets of data prior to the macro for comparison. Adam was in the room; I had asked him before the conference why the "CReW Test" macros were still in Beta, and his response was that he hadn't received any feedback on them from the Alteryx Community.
Did you know?
- Expect Equal requires every Field to be in the same order.
- Expect Equal is case sensitive for Field Names.
- Expect Equal is NOT case sensitive for data.
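Those three behaviors can be sketched in a few lines of Python. This is purely an illustration of what we observed, not the macro's actual logic:

```python
# Minimal sketch of the observed Expect Equal behaviors (an illustration,
# not the macro's implementation).

def expect_equal(left, right):
    """Each table is a list of (field_name, value) pairs, in column order."""
    # Field names must match exactly and in the same order (case sensitive).
    if [name for name, _ in left] != [name for name, _ in right]:
        return False
    # Data values are compared case-insensitively.
    return all(str(lv).lower() == str(rv).lower()
               for (_, lv), (_, rv) in zip(left, right))

# Data differing only by case still counts as equal:
print(expect_equal([("Name", "ALTERYX")], [("Name", "alteryx")]))  # True
# A field name differing in case does not:
print(expect_equal([("Name", "x")], [("name", "x")]))              # False
```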
Are these bugs or are they features? I don't know, but I did have fun exploring. I wanted to create a CReW Delta macro. It would function like the Expect Equal macro in that nothing happens when everything is equal. Beyond that, it would output every difference found for every record compared, and it would provide an error message explaining where the differences were found.
Error: CReW Delta (WIP) (3): There were 1 changed records found. 2 fields had differences: {Field4, Field5}. 1 unmatched left records and 0 unmatched right records.
In this macro, we compare records and output the Sequence number of each input file along with the field names where differences are found. The left and right value pairs are easily reviewed from the "Delta" output anchor.
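As a rough Python sketch of that logic (the real macro is assembled from Alteryx tools; `crew_delta` is a hypothetical stand-in):

```python
def crew_delta(left_rows, right_rows):
    """Pair records by sequence number, collect per-field differences, and
    build the summary message (a sketch, not the macro's implementation)."""
    diffs = []                      # (sequence, field, left value, right value)
    matched = min(len(left_rows), len(right_rows))
    for seq in range(matched):
        left, right = left_rows[seq], right_rows[seq]
        for field in left:
            if left[field] != right.get(field):
                diffs.append((seq + 1, field, left[field], right.get(field)))
    changed = {seq for seq, *_ in diffs}
    fields = sorted({field for _, field, *_ in diffs})
    message = (f"There were {len(changed)} changed records found. "
               f"{len(fields)} fields had differences: {{{', '.join(fields)}}}. "
               f"{len(left_rows) - matched} unmatched left records and "
               f"{len(right_rows) - matched} unmatched right records.")
    return diffs, message

left = [{"Field4": "a", "Field5": "b"}, {"Field4": "x", "Field5": "y"}]
right = [{"Field4": "A!", "Field5": "B!"}]
print(crew_delta(left, right)[1])   # reproduces the error text shown above
```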
This works wonderfully with 2 records, but we wanted to explore performance at millions of records. The number of records, multiplied by the number of fields compared, creates challenges for computing resources. We experimented with 2 approaches for comparing the data, and the results may surprise you.
Approach #1: Do it like the CReW Expect Equal Macro
Transpose the data into Name + Value pairs and JOIN the results. When they match, do nothing; when they don't, they are errors.
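In rough Python terms (a sketch of the same idea, not the macro's implementation):

```python
def transpose(rows):
    """Turn rows into (sequence, field name) -> value pairs: one output
    record per field per row, which is where the record explosion comes from."""
    return {(seq, field): value
            for seq, row in enumerate(rows, start=1)
            for field, value in row.items()}

def compare_transposed(left_rows, right_rows):
    left, right = transpose(left_rows), transpose(right_rows)
    # Join on (sequence, field); keys whose values differ are the errors.
    return sorted((seq, field, left[(seq, field)], right[(seq, field)])
                  for (seq, field) in left.keys() & right.keys()
                  if left[(seq, field)] != right[(seq, field)])

print(compare_transposed([{"a": 1, "b": 2}], [{"a": 1, "b": 3}]))  # [(1, 'b', 2, 3)]
```

For 2 million rows with 20 fields, that is 80 million transposed records across both sides.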
Pros & Cons:
- You will be creating one (1) record for every field in the incoming data, times the number of incoming records, times two (2). Comparing 2 million records with 20 fields means creating and joining 80 million transposed records. This seems to be a Con.
- It is very easy to configure this macro. This seems to be a Pro.
- We don't have an interface tool to allow you to select which fields to compare. This was a con for our CReW student.
Approach #2: Use MD5 Hash on each record and compare the hash results
HASH the "selected" fields of each incoming record and compare the hash values in a join that requires NO sort. Where the hash values differ, they are errors.
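A rough Python sketch of this approach (the `selected_fields` list stands in for the interface tool's field selection):

```python
import hashlib

def row_hash(row, selected_fields):
    """MD5 of the selected fields. Every value must first be converted to a
    string, which is part of what makes this approach expensive per record."""
    joined = "|".join(str(row[field]) for field in selected_fields)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

def compare_hashed(left_rows, right_rows, selected_fields):
    # One hash comparison per record pair: no transpose, no sort.
    return [seq for seq, (left, right)
            in enumerate(zip(left_rows, right_rows), start=1)
            if row_hash(left, selected_fields) != row_hash(right, selected_fields)]

left = [{"a": 1, "b": 2}, {"a": 3, "b": 4}]
right = [{"a": 1, "b": 2}, {"a": 3, "b": 9}]
print(compare_hashed(left, right, ["a", "b"]))  # [2]
print(compare_hashed(left, right, ["a"]))       # [] -- field "b" is excluded
```

Because each record collapses to a single hash, the join only ever compares one value per record, no matter how many fields were selected.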
Pros & Cons:
- You will be comparing one field for every record. This seems to be a Pro.
- It is complex to build the dynamic formula that creates the HASH value. This seems to be a Con (for our maintenance).
- We included an interface tool to allow you to select which fields to compare. This was a pro.
Results:
With 2 million input records (4 fields), approach #1 took roughly 10 seconds to complete. With the same input, approach #2 took roughly a minute. Why did it take so long? Because converting every field to a string and computing the MD5 hash is an expensive operation for each record. Sorting and handling many small records was FASTER than building the dynamic formula and hashing.
Please take a look at the macro and let us know your thoughts. I appreciate your feedback and want to know which features or functions you would want in this macro if we decide to make it a future release.
Macro Location
Cheers,
Mark