In this article I will provide a simple walk through of how to compare two different data sources that (should) contain the same data.
Skill Level: Beginner
Target Audience: Business Analysts; Data Professionals; Auditors
Background and Context:
A quick note about myself, I spent the first five years of my career in a Finance organization of a large corporation. My first two years I worked as an Accountant and the next three years as a Business Intelligence Consultant. I was more interested in the data, processes and overall quality that lead me to switch roles. The purpose of our BI/Reporting team was to provide the details of the ledger balances for internal and external reporting purposes (details from the sub-ledgers).
Every day during our nightly load, the last step in our process was to summarize our sub-ledger totals for the period and compare them with the ledger balance. For any differences that existed we created a record in our data warehouse for the amount of the difference, with all of the shared attributes between the sub-ledgers and ledger, along with a special identifier that distinctly noted the record as a difference. We called this difference a “wedge”. If there ever was a wedge, it was our responsibility to investigate the source, determine the root cause and work with the data owners to resolve the issue.
For the non-Accounting majors, a financial ledger is critical, but does not contain any details. It is a high level summarization that is used for reporting. Considered the only source of truth for financial data as it is reported to the SEC, Investors, External Auditors and other agencies. So if it doesn’t contain the details, where does the ledger get its data? There are multiple sub-ledgers (detailed sources of financial data) that feed into the ledger. Although not all sub-leger entries go to the ledger as some are used for internal reporting, adjustments and international reporting (which use different standards and calculations).
So why is this important? It is the operational processes that move this data from each of the sub-ledgers to the ledger. I believe we all know that processes can fail and human intervention can have unexpected consequences when our actions affect data or processes that we were not aware of. For data professionals to prove their value to their clients (and auditors) it is important to us to fulfill the requirements of existence and completion. Here is a link to a short document assembled by the AICPA explaining the importance and objectives of internal control which include those requirements.
Comparison Methodology Options
- An automated comparison can be performed using code, such as SQL, to compare two data sources by joining the data together and looking for differences
- Requires systematic access to both systems for the same user
- Requires SQL coding expertise
- Manual comparison can be performed by downloading the data from two disparate systems
- Data can be downloaded manually using different access credentials
- SQL is not required to compare the data in Excel
Walk-though: How to compare values from two different data sources in Excel
The following instructions describe how to compare a value (aka. measure) across two systems using an Excel pivot table.
In our example below “NumUnits” is the measure we will compare. The attributes that we will compare are “Attribute1” and “Date”.
1. Copy the data from your first source into Excel
a. Extract only the common attributes of the two datasets. (Keep your focus narrow. Don’t try to boil the ocean in your first analysis.)
b. Add two empty columns:
2. Enter a name in the Source column to track the source of these rows.
3. Append data from your second source data, without headers, in excel just below the data from your first source
• Reuse the same columns
• Provide a meaningful name in the Source column that can identify where this data came from (i.e. “Database 2”)
4. Fill the SourceValue column by entering a formula that will enable your pivot table comparison
This conditional IF statement will:
• Check the value of the Source column and compares it against a constant
• If they are the same return the value from your measure
• If they are different return the value from your measure time negative one
The data will look like this after entering the formula for every cell in the column.
5. From the Insert tab, create a PivotTable
6. Setup the PivotTable as follows:
Your PivotTable will now look like this:
7. Lastly, filter the pivot table to find differences:
a. On the Row Labels, click the down arrow and choose Value Filters, then choose Does Not Equal
b. Type 0 and click Ok
• This filter will operate on the Grand Total of each attribute in your Rows
• If the value is equal to zero, then the data is consistent across the datasets
That’s it! Your remaining rows are now just the instances where the two datasets are different.
Using this pivot you can now run additional queries to investigate differences.
Posted by admin