Well, the answer is that module B is faster. That is to say, adding an extra sort actually speeds things up. Using my sample data, whose counts and sizes you can see in the images, module A runs in about 45 seconds while module B runs in less than 30 seconds.
That seems counterintuitive: how does adding more processing make things run faster overall?
Well, a great place to start looking for optimisations in an Alteryx module is wherever you are sorting data. A sort is a pretty resource-intensive task, so the fewer sorts you do, the quicker your module will run.
In module A, behind the scenes Alteryx is sorting in two places:
- The summarise needs to sort the data in order to do the group by on the grouping field.
- The batch macro also needs to sort the data to be able to batch it into chunks.
So module A has two sorts.
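To see why a group by tends to involve a sort, here is a minimal Python sketch of sort-based grouping. It is only an illustration of the general technique, not Alteryx's actual implementation; the sample rows and the `summarise` function are made up for the example.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical sample data: (GroupingField, Value) pairs in arbitrary order.
rows = [("b", 2), ("a", 1), ("b", 5), ("a", 3)]

def summarise(rows):
    """Sum the values per group.

    itertools.groupby only merges *adjacent* equal keys, so the rows
    are sorted by the grouping key first. That sort is the hidden cost.
    """
    ordered = sorted(rows, key=itemgetter(0))
    return {
        key: sum(value for _, value in group)
        for key, group in groupby(ordered, key=itemgetter(0))
    }

totals = summarise(rows)
```

Without the `sorted` call, `groupby` would see the two "b" rows as separate groups, which is why the grouping step cannot skip the sort unless it knows the data already arrives in order.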
But if sorts are bad for run time, then how can adding an extra sort speed things up, I hear you ask?
Well, that's the clever bit... When a tool in Alteryx sorts some data, it tells the tools downstream of it that the data is sorted and what it is sorted by.
So, back to our example. We noted above that module A was performing two sorts, but actually it was doing the same sort (by GroupingField) twice. By adding a sort where we did, the data gets sorted once there. When the data reaches the summarise tool and the batch macro tool, it is already sorted by GroupingField and does not need to be sorted again. Rather than adding a sort, we have actually reduced the number of sorts the module needs to do by one, and that is where the saving in run time comes from.
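The sort counting can be sketched as a toy model in Python. This is a rough illustration under my own assumptions, not how the Alteryx engine is really built: `Stream`, `SortCounter`, `module_a` and `module_b` are hypothetical names, and I am assuming the summarise and the batch macro both read from the same input.

```python
from operator import itemgetter

class Stream:
    """Rows plus a note of which field (if any) they are already sorted by."""
    def __init__(self, rows, sorted_by=None):
        self.rows = rows
        self.sorted_by = sorted_by

class SortCounter:
    """Hypothetical engine that only sorts when the stream is not already sorted."""
    def __init__(self):
        self.sorts = 0

    def ensure_sorted(self, stream, field):
        if stream.sorted_by == field:
            return stream  # upstream already did the work, skip the sort
        self.sorts += 1
        ordered = sorted(stream.rows, key=itemgetter(field))
        return Stream(ordered, sorted_by=field)

def module_a(rows):
    """No explicit sort: each consuming tool sorts the unsorted input itself."""
    engine = SortCounter()
    source = Stream(rows)
    engine.ensure_sorted(source, "GroupingField")  # summarise
    engine.ensure_sorted(source, "GroupingField")  # batch macro
    return engine.sorts

def module_b(rows):
    """One explicit sort up front: both consuming tools can reuse it."""
    engine = SortCounter()
    source = engine.ensure_sorted(Stream(rows), "GroupingField")  # added sort
    engine.ensure_sorted(source, "GroupingField")  # summarise
    engine.ensure_sorted(source, "GroupingField")  # batch macro
    return engine.sorts

rows = [{"GroupingField": g, "Value": v}
        for g, v in [("b", 2), ("a", 1), ("b", 5), ("a", 3)]]
sorts_a = module_a(rows)
sorts_b = module_b(rows)
```

In this model module A performs two sorts and module B performs one, which matches the reasoning above: the "extra" sort replaces two hidden ones.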