The Christmas I turned 12 I got a TI-99/4A and I’ve been programming on and off ever since. Pressing a button and watching a computer translate lines of code into graphics, music, or calculation results is still magic for me. Yet for all the code I’ve written, I still feel limited by needing exactly the right number of spaces for an indent, or commas in exactly the right locations, for the code to work.
Therefore in my world Alteryx is a delightful tool for visual programming with data: instead of variables I’m working with streams of data; instead of typing out function names I’m dragging and dropping tools; instead of paging down from a function call at line M to see the function at line M+N I’m following a connector line on the screen; and instead of creating functions I’m building macros.
Only Alteryx doesn’t always work quite the way I’m used to programming, which is mostly procedural, so this post is composed of my notes about the differences between my procedural programming mental model and the Alteryx mental model for building macros. Hopefully it can help others, and thanks to Ken Black for encouraging me to keep my notes and do this!
The Three Kinds of Workflows
What I think of as the “working unit” in Alteryx is a workflow (like a document in Word, a worksheet in Tableau, etc.). There are three kinds of workflows in Alteryx that we use to manipulate data and get stuff done:
- Standard workflows take in data, transform it, and write data out.
- Analytic Apps are self-contained workflows that a user can configure, run, and get output from without requiring Alteryx Designer.
- Macros are workflows that can be dropped as a tool inside other workflows, in other words re-usable workflows.
Sometimes we build macros when what we’d really like is just another function to use in the Expression editor; we can do that too using Alteryx’s SDK (and check out James Dunkerley’s amazing Beyond Alteryx Macros series).
Relating Macros to Programming
I really don’t want to do the same task over and over again, even if it’s as easy as drag-and-drop. So this post is mostly about the third kind of workflow from above: re-usable macros.
In a programming mental model this is creating a subroutine aka function.
In SQL this is creating a user-defined function, stored procedure, or even a database view.
In Alteryx we can duplicate what we do in programming or in SQL, or even just wrap our pre-existing scripting language code or SQL in an Alteryx macro. (This latter piece makes for a nice way to start incrementally using Alteryx to add functionality to existing processes without having to replace the entire data pipeline all at once).
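To make the macro-as-function analogy concrete, here is a minimal Python sketch of the kind of re-use a macro provides. The `clean_names` helper is hypothetical, invented for illustration, and has nothing to do with any Alteryx API:

```python
# A reusable unit of logic, roughly what a macro is in Alteryx.
# clean_names is a hypothetical helper, not an Alteryx function.
def clean_names(rows):
    """Trim whitespace and title-case the 'name' field in each row."""
    return [{**row, "name": row["name"].strip().title()} for row in rows]

# Once defined, we can reuse it across "workflows" (scripts):
customers = [{"name": "  ada lovelace "}, {"name": "grace HOPPER"}]
print(clean_names(customers))
```

Just as with a macro, the payoff is that the logic lives in one place and every caller benefits when it is fixed or improved.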
What are the Objects We are Manipulating When We Build Workflows?
This is where Alteryx starts to diverge from other tools I’ve used:
Programming: We use variables that might be single values, vectors, arrays, objects, etc. In many programming languages it doesn’t really matter whether the values are coming from configuration files, run-time arguments, or data loaded by the application, we manipulate them with the same logic.
SQL: We use variables, manipulate columns of data, and use cursors and calculations to work with specific rows or sets of rows.
Alteryx: There are three things that we are manipulating in Alteryx:
- Constants: These are fixed variables that are set in a workflow by the workflow’s developer. We can copy them, but can’t change the values.
- Data streams: These are the data flows that we connect from one tool to another in a workflow. Data streams have rows and columns; we can use the Transpose and Cross Tab tools to rearrange them, the Dynamic tools to manipulate row and column names and values based on input data, and the Preparation tools to manipulate the data, write expressions, etc. This is Alteryx’s bread and butter. Note that when we’re editing expressions there’s a “Variable” option in the dialogs that refers to columns in the data stream.
- Configuration settings: The Alteryx tools have user interfaces for setting options in the Configuration window, for example setting the file name and whether the first row has headers on a Text Input tool. These are similar to arguments in a programming mental model. If we want to dynamically change the configuration settings or create our own interfaces for entering configuration settings then we need to add Interface tools to our workflow.
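As a rough analogy (my own mapping, not official Alteryx terminology), the three kinds of objects line up with familiar Python concepts: a module-level constant, rows of data, and a function parameter:

```python
# "Constant": fixed by the developer, not changed while the workflow runs.
TAX_RATE = 0.08

# "Data stream": rows and columns flowing from one step to the next.
orders = [
    {"item": "widget", "price": 10.0},
    {"item": "gadget", "price": 25.0},
]

# "Configuration setting": a parameter that controls how the step behaves,
# separate from the data itself (like options in a tool's Configuration window).
def apply_tax(rows, rate, price_field="price"):
    return [{**r, "total": round(r[price_field] * (1 + rate), 2)} for r in rows]

print(apply_tax(orders, TAX_RATE))
```

The names here are illustrative; the point is only that Alteryx splits into three categories what a general-purpose language treats uniformly as variables.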
Now, as soon as we use the Interface tools, the workflow we are building is automatically going to be a macro or an Analytic App; Alteryx won’t let us save it out as a standard workflow. This leads to the next point:
Data Streams vs. Configuration Settings
When building functions in programming we specify dynamic points of control over the function as variables passed to it. If I’m writing an exponentiation function to return X to the Y power, the function might have two arguments, as in POWER(X, Y), that are expressed as variables.
Using the above example of exponentiation we could certainly have X and Y be columns in a data stream and then write an expression to get the desired results. We could even make Y something that gets entered at runtime in Alteryx Designer by using a Text Input tool and an Append tool. But that runs into all the problems I just described in the prior section.
So in Alteryx we need to think about whether the dynamic point of control that we’re wanting to use is coming from:
- a data stream going into a tool (i.e. based on value(s) of a given column in the data)
- a configuration setting for a tool.
Using the exponentiation example, it could be that both X and Y are coming from the data, that X and Y are configuration settings for the tool, or anything in between, because Alteryx is really flexible that way. We just need to make some decisions in advance to help it do the right thing. This is different from programming, where I’m “just” manipulating variables; in Alteryx we have these two different ways to store our dynamic values. The next section covers that in more detail.
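The two styles can be sketched in Python (illustrative only, not Alteryx code): in the first, both X and Y ride along in the data; in the second, Y is fixed up front like a tool’s configuration setting:

```python
# Style 1: both X and Y come from the "data stream" (each row carries both).
rows = [{"x": 2, "y": 3}, {"x": 5, "y": 2}]
from_data = [r["x"] ** r["y"] for r in rows]

# Style 2: Y is a "configuration setting", fixed when the tool is configured
# and applied uniformly to every row in the stream.
def power_tool(rows, y):
    return [r["x"] ** y for r in rows]

from_config = power_tool(rows, 2)

print(from_data)    # [8, 25]
print(from_config)  # [4, 25]
```

Either style gets us an answer; the decision is about where the point of control lives, which is exactly the decision Alteryx asks us to make.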
On a slight tangent: We can take configuration settings and turn them into data streams, and we can take values from data streams and turn them into configuration settings. Meta!
Macros Split Logic into Different Files
In a general-purpose language like Python we can have a single file of code that accepts run-time changes to its behavior as arguments or based on the provided data, and that file can contain both the subroutines we’ve built and the main logic.
In Alteryx, as soon as we add Interface tools we have a macro or Analytic App, and unless it’s totally self-contained as an Analytic App, we’re looking at building two workflows: one for the macro and one for the main logic.
This means that as soon as we want to use Interface tools to control a workflow we need to be ready to split the tasks of the job we’re trying to do into (at least) two different files. This adds complexity to the logic and the testing process that personally I wasn’t ready for when I first started building macros because of how Alteryx distinguishes between data streams and configuration settings.
Can We Avoid the Overhead of a Macro and Stuff All Our Logic into One Workflow?
Yes we can! I alluded to this above in the exponentiation example. Instead of using Interface tools and having to find good separation points in our logic, sometimes we can throw a Text Input tool onto a standard workflow and build a table of configuration options, then use other tools like the Append tool to bring those options into the data stream and have that information affect downstream tools to get the desired results. There are several downsides to working this way:
- We lose the opportunities for code re-use that macros provide.
- Macros have more ability to force the user to make choices because the configuration window of a macro automatically appears when it’s brought into a workflow.
- Macros have a user interface vs. just a table of values.
- We couldn’t use this method for an Analytic App because the app wouldn’t expose the Text Input for the user.
These days I pretty much only use this method when I’m prototyping a macro, which leads to…
Gotta Have Some Data
When programming or writing SQL we don’t actually need data to write our code; in other words, we can write code independent of the data.
Alteryx works differently. We must have at least a tiny bit of data (even a single column name with no rows) because most tools in our Analytic Apps and macros won’t even let us configure them without data: They’ve got to have something connected to the input anchor in order to have a schema to work with.
We can provide this data in a couple of ways:
- For an Analytic App or a macro we can provide a Text Input or Input Data tool inside the app or macro.
- For macros that are meant to be used inside an Alteryx workflow, we can also add the Macro Input tool to take a data stream from the workflow; however, that Macro Input tool requires a Template Input, either a Text Input or a file, before it will work.
A pro-tip from Adam Riley’s incredible CReW macros: if you’re planning to replace fields in the starter input data stream with fields from the “real” data stream (such as by using a Drop Down tool), then prefix the starter field names with something. Adam uses __ (two underscores) in his macros; I tend to use tmp.
Types of Operations in Macros (and Types of Macros)
Alteryx currently supports four types of macros, though the documentation at Macro Workflows only mentions three (confusing)! The macro type (or Analytic App) is set on the Workflow tab of the Configuration window.
Standard Macros. These macros typically have one or more input and output data streams and process data as any other tool would. A number of the Alteryx tools are technically written as macros. In a procedural programming paradigm these can generally be thought of as subroutines.
Batch Macros. These are a form of FOR/NEXT loop. A batch macro generally has two input data streams: one is the Control Parameter that defines the “for each”, and the (optional) second stream is the data set that is processed. At a maximum the batch macro will run once for each record in the control input. For example, the control stream might identify a list of files to process, while the other input data stream is a set of data to write out with certain changes for each file.
- When we add a Control Parameter tool to a workflow it automatically becomes a batch macro. If you do this by accident after you’ve already placed the macro into a workflow, this can cause the workflow to fail because it will be looking for configuration as a batch macro. I wouldn’t know anyone who has done this.
- When a batch macro is brought into a workflow, the Control Group By setting determines whether the data stream for the control input will be grouped into possibly fewer records (and thereby fewer batches). There’s also an Input Group By option that groups the input data stream and effectively acts as a join key on the Control Group By. A couple of examples: If I have 100 control records and no grouping, the batch macro will run 100 times, once for each control record. If I use the Control Group By to cut that down to 5 groups, then that’s 5 batches.
- The data structure of each batch for the input data stream can be different. I haven’t had to use this option yet, it’s a “Nice to know it’s there because when I do use it I’m going to save hours of time.” feature.
- Here’s some pseudocode:

```
// Given i as the particular control record (after potential GROUP BY)
// and n as the number of control records
for i = 1 to n
    ...do stuff on all the input data stream...
next i

// This is the situation where we add j as an input grouping
for i = 1 to n
    if i = j then
        ...do stuff on all the matching records...
    end if
next i
```
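To make the batch semantics concrete, here is a rough Python analogy (not Alteryx code; the region/sales fields are invented for illustration): one pass over the input per control record, with the group-by acting as a join key:

```python
# Rough Python analogy of a batch macro (illustrative, not Alteryx code).
control = [{"region": "east"}, {"region": "west"}]   # control records
data = [
    {"region": "east", "sales": 10},
    {"region": "west", "sales": 20},
    {"region": "east", "sales": 5},
]

results = []
for ctrl in control:  # one batch per control record
    # The Input Group By acts like a join key against the control record:
    batch = [row for row in data if row["region"] == ctrl["region"]]
    total = sum(row["sales"] for row in batch)
    results.append({"region": ctrl["region"], "total": total})

print(results)  # [{'region': 'east', 'total': 15}, {'region': 'west', 'total': 20}]
```

Two control records means two batches, regardless of how many input rows each batch happens to match.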
Iterative Macros. These are a form of WHILE loops. This kind of macro has one or more input data streams and two output data streams. One output is the iterator that feeds back into the macro until done, the second output is the final stream. Each iteration runs on the whole input data stream. The iterations could end based on a user-entered condition or the number of iterations.
- Here’s some pseudocode:

```
while [iterator stream] != [the condition]
    ...do stuff on the records fed in from the previous iteration,
    then output the iterator stream back into the macro...
end while
output final data stream
```
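A rough Python analogy of the loop (illustrative only, not Alteryx code): keep feeding the iterator output back in until the condition is met, then emit the final stream. The halving rule and threshold here are invented for the sketch:

```python
# Rough Python analogy of an iterative macro (not Alteryx code):
# halve every value until all values drop below a threshold.
stream = [100.0, 37.0, 8.0]
iterations = 0
max_iterations = 50   # iterative macros also have a maximum-iterations setting

while any(v >= 10 for v in stream) and iterations < max_iterations:
    # Each iteration runs on the whole stream, and its output
    # feeds back in as the next iteration's input.
    stream = [v / 2 for v in stream]
    iterations += 1

print(stream, iterations)
```

Unlike the batch macro, the number of passes isn’t known in advance; it depends on when the condition is satisfied (or the iteration cap is hit).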
Location Optimizer. This is an iterative macro that helps determine the best locations to add to or remove from a network.
So that’s an initial summary of my notes on the differences between a procedural programming mental model and Alteryx. If I got anything wrong, let me know! In my next post I’ll cover more on using the Interface tools to update configuration settings, what all those Qs and lightning bolts mean, and why some connections work on some anchors and not on others.