Interested in trying Alteryx? Click here for a free trial!
When you're working with messy or complex data sets, frequently there is a need to join two different sources on a shared column, which is not necessarily identical in each of the sources. For instance, you might want to determine how many of the Fortune 500 (F500) companies are customers of your business, but the naming convention of the published F500 list doesn't match your internal naming scheme.
In these cases, one way to resolve the problem for small data sets is to simply do the matching manually. Beyond a few hundred records though, this quickly becomes untenable. Enter the realm of fuzzy matching, an automated way to resolve these problems - if you can get it working.
Alteryx has a Fuzzy Match tool, but out of the box it isn't appropriate for a "fuzzy join" of the type described above. It is built to look through a single source of information for duplicated records – such as when your records have accounts for ABC Inc. and ABC Co. when in fact they are the same organization. However, our more complicated problem requires us to take data from two different sources - one internal and one external - and join on the shared column.With this in mind, let's dive into using Altreyx to create a fuzzy join workflow.
We'll start by bringing in our two data sets using "Input Data" tools. We'll use the situation described above, where we're trying to join a company data set to the Fortune 500. For those who just want to get things working right away, here is the finished workflow – and here it is as a macro you can use in other workflows.
*Note: Sample files are available at the end of this post.
First, we need to union the two data sets, so that we have a single input for the Fuzzy Match tool. This brings us to our first difficulty, for a union to make sense, the data must have exactly the same columns. However, our internal data set has completely different columns from the F500 list.
We'll start with the simplest solution. By using the Select tool, we can get rid of all the columns which aren't shared, leaving only the company name in each data set. We'll configure the union tool to "Auto-Config by Position" so that we don't have to worry about column headers. Then we can take the output of the union as the input for the Fuzzy Match. Your data flow should look like this:
Configure the Fuzzy Match tool by selecting the match field - "Name" - and the match style - "Company Name". So far so good, however switching to "Merge Only" - the mode we need for a fuzzy join - you'll notice that it requires a "Source ID." This is simple enough to add using the Append Fields tool. My method of choice is to add two Text Inputs each with a single column and row "Source", "A" and "Source", "B" for the two tools respectively.
Then using our source data in the "T" input of the Append Fields tool, and the Text Inputs we just made in the appropriate "S" inputs, we've successfully added a Source ID column. We'll add it to the select tool so that it flows into the join (make sure the columns are in the same order in each select!) and select "Source" as the Source ID in our fuzzy match tool. The data flow should look like this:
We're now ready for a trial run of the workflow. Add a "Browse" to the output of the Fuzzy Match, and choose an appropriate match threshold, then run the workflow. Looking at the browse, you'll notice that we've only got a few columns - we have the two names that have been matched, and a match score if you enabled that, but we're missing all the other information – i.e. the F500 rank and the revenue information from the internal data set.
From Fuzzy Match to Output
Remember, we removed this data to pass the sources into the union - but now we need it back! Now we run into two difficulties. First is a quirk of the Alteryx Fuzzy Match tool. You might think that we could join to the first source on the Name1 column, and the second source on the Name2 column, but sometimes Alteryx switches the orders! Joining on names also brings up another issue - what if we have two accounts with identical names? This could happen if they were “Ministry of Education” accounts in different regions, for instance.
Luckily, we can solve both issues at once. It will take three new tools. First, we'll use the Record ID tool to give each row its own unique ID. Second we need to keep the records from each source distinct, so we'll need to append an "A" or "B" respectively to each record ID. Use the formula tool and this expression:
Make sure our Select passes the unique ID's through to the join, then finally configure the Fuzzy match tool to use these ID's as Record ID's. The workflow should look like this:
Now that we've got record ID's, joining the output of the match back to our source data is almost within our grasp. We still need to resolve the issue of swapped orders - sometimes the "A" ID will appear first, sometimes it will be the "B" ID. To solve this, we'll use another formula tool, configured to create two new columns "A-ID" and "B-ID". Each will use a simple formula - check the first character of the first ID. If it's an A, then it's the A-ID. If it's a B, then the second ID is the "A-ID". The formula for "A-ID" is:
IF LEFT([ID],1)="A" THEN [ID] ELSE [ID2] ENDIF
Where for the "B-ID" we simply replace the "A" with "B". Now comes the trickiest part of the process to understand in one go – joining back to the original source data.
We'll start by joining the final output of the Fuzzy Match - Formula flow to the original data from our first input - A. Note that we have to join to the data /after/ having added the record ID's and appended the letters - make sure you join on the ID's with letters, or you'll get an error. In the end, we simply join ID to A-ID, and we get all of the A-Data added to our records. Now if the fuzzy match output is the left input, and the source data is the right, let's walk through what each of the outputs of the join represents. First, the left output must be empty. Why is this?
We are joining on record ID. Every record coming out of the fuzzy match corresponds to exactly one record from source A, so the A-ID must correspond to the record ID of one of the original records. There can't be any unjoined records on the left!
The joined records are the simplest - these are the records from A which found a match according to our threshold. This is not our final output, but we will carry it forward. Finally, the right output is records which came from source "A" but which have no corresponding record in the output of the fuzzy match. That is to say, they are records for which no match was found in source B. Going back to the idea that this represents a "join" this is exactly the "left" output of a usual join!
Now as mentioned above, we carry the joined records forward. We have the "A" columns, but need to add the "B" columns. For this, we repeat the process - join the ID from the "B" source to "B-ID" and we'll walk through the outputs again. Again, the left output will be /empty/ for the same reasons mentioned above. The joined output is our final output! This is the "J" output of a usual join, containing the relevant columns from both sources.
Finally, the right output is all records from the "B" source that don't have a corresponding entry in the fuzzy match output - in the language of joins, the unjoined records on the right! Throughout our transformations, we've now added quite a few columns to our data, extra columns we don't care about in the end result. So, before outputting, we can use a select to get rid of the extra pieces
Your final workflow should look like this:
I've included the workflow below for reference. Look to this blog in the near future for a guide on how to take our workflow and convert it into a macro! If you made it all the way through, you have my congratulations and thanks, and I hope you got something useful from the explanation!
Posted by admin