Why Data Remediation Projects Fail
Whenever someone writes an analysis of why a product data project failed, the primary reason they give is that the tool failed to deliver on expectations. Having been in the product data space for two decades, I can tell you that this analysis is often flawed. I’m not going to defend every platform and every solution implementor by blaming the data, because some platforms do not understand data management well. However, blaming these failures solely on tools and platforms misses a huge red flag in your project management: a lack of data preparedness is usually at the heart of the failure.
More often than not, product data projects fail because companies, solution implementors, and platform vendors don’t understand the basics of data remediation. It’s the data preparation that often fails, which leads to poorly designed data models, delays in project timelines due to initial load concerns, and a lack of understanding of the goals of data remediation within the project. In short, the platform or tool takes the blame for a lack of understanding of how to prepare the data for the tool.
This post takes a look at methodologies to counter product data issues during project design to avoid platform failures during implementation.
#1 Know Your Data
The first mistake we see companies making when starting a product data project is skipping the step of understanding what condition their data is in at the start of the project. The assumption is that the data is obviously bad, so the project will fix it. But how bad is bad, and what defines bad? Is your data incomplete, inaccurate, or both? If it’s incomplete, how do you define complete? If it’s inaccurate, how do you quantify the level of inaccuracy?
Skipping the step of understanding your data is like putting air in the tires of your car because the check engine light came on. Unless your plan is to re-collect every single data point within the project, you cannot build a loading strategy without first comprehending the scale of the problems in that data. When you attempt an initial data load and it fails, it will take days or weeks of work to understand why, delaying your project timeline.
Most of the time, companies see this as a cost issue, and it does take effort to understand your data quality. However, skipping that assessment does not avoid the cost; it simply defers it to later in the project, where it tends to be larger. Understanding what state your data is in should be part of your plan before you start designing data models, process flows, and your data remediation plan.
Many years ago, I worked on a project with an extremely aggressive timeline and a very small budget. To save costs, I was told I didn’t have to worry about data quality, as all the data was in pre-defined templates stored as PDFs. As long as I could harvest the data from the PDFs, everything was “guaranteed” to be there. Nothing could have been further from the truth. Every PDF had its own template, and therefore its own set of attributes, which meant the PDFs couldn’t be read programmatically against a single data schema. Predictably, there also weren’t PDFs for every product. The data was incomplete, not normalized, and in some cases missing entirely.
Had we been allowed to assess the data up front, it would have been easier to see these issues and deal with them before the initial data load. We weren’t given that opportunity, and instead spent two months tracking down alternate data sources to fill in the missing data. We had no methodology to measure accuracy, as the budget only allowed us to attempt completion, not validation.
Therefore, the first thing you must do is understand the condition of your data today to have any chance of loading it in the future. TrailBreakers starts every engagement by scorecarding the customer’s product data, regardless of any indications the customer may give about its quality. Don’t wait for the project to start: understand what your data looks like up front, instead of discovering mid-project how poor your data quality really is.
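To make that concrete, here is a minimal sketch of what a completeness scorecard might look like. It assumes product records sit in a flat pandas DataFrame and uses a hypothetical, hard-coded map of required attributes per category; it is not TrailBreakers’ actual scorecard, just an illustration of the idea.

```python
import pandas as pd

# Hypothetical required attributes per product category. In practice this map
# comes from your downstream channel requirements, not a hard-coded dict.
REQUIRED_ATTRIBUTES = {
    "power_tools": ["title", "description", "brand", "voltage", "image_url"],
    "fasteners":   ["title", "description", "brand", "material", "thread_size"],
}

def completeness_scorecard(products: pd.DataFrame) -> pd.DataFrame:
    """Score each product on the share of required attributes that are populated."""
    def score_row(row: pd.Series) -> float:
        required = REQUIRED_ATTRIBUTES.get(row["category"], [])
        if not required:
            return float("nan")  # unknown category: flag for review, don't guess
        filled = sum(
            1 for attr in required
            if attr in row and pd.notna(row[attr]) and str(row[attr]).strip() != ""
        )
        return filled / len(required)

    scored = products.copy()
    scored["completeness"] = scored.apply(score_row, axis=1)
    return scored

if __name__ == "__main__":
    products = pd.DataFrame([
        {"sku": "A1", "category": "power_tools", "title": "Drill", "description": "",
         "brand": "Acme", "voltage": "18V", "image_url": None},
        {"sku": "B2", "category": "fasteners", "title": "Hex Bolt", "description": "M8 bolt",
         "brand": "Acme", "material": "steel", "thread_size": "M8"},
    ])
    scored = completeness_scorecard(products)
    print(scored[["sku", "completeness"]])
    print("Records below 80% complete:", (scored["completeness"] < 0.8).sum())
```

Even a rough scorecard like this turns “our data is bad” into a number you can plan a load around.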
#2 Know Your Remediation Goals
Once you know the condition of your data, the next logical step is to determine what you want the data to look like in the future. Saying you want every attribute complete on every record is not only impossible at scale, it’s also expensive and foolish. Not every attribute applies to every product, and some products just aren’t worth remediating. As a consultant, I recommend the following guidelines (a rough sketch of how they might be applied in code follows the list):
- The top products that you sell are either over-performing or performing to expectations. Remediating those items has little value, as every dollar invested has limited opportunity to increase the sales volume of that product. Unless a mandatory attribute is missing, load this data as-is to avoid changes that might disrupt that performance.
- The bottom products that you sell may never sell at a rate that makes remediating them cost-effective. They may be part of a long-tail strategy, there only to drive customers to the site to buy other products. They may be part of a one-time quantity purchase that will never be repeated. Maybe they are deliberately positioned below your top products. Regardless, a percentage of your worst-performing products probably don’t need remediation.
- Don’t remediate items that are discontinued or soon to be discontinued. A product with a near-term discontinuation date leaves little time for the remediation investment to have a net positive effect. Unless the item will not load without some sort of intervention, leave these products alone.
- You don’t need to remediate every attribute. Some attributes are nice-to-have, but have very little impact on sales. Concentrate on remediating the attributes that have the highest visibility (facets, descriptions and features, images, etc.) over the attributes that have low visibility (some specifications, extra documents). If an attribute doesn’t have a direct impact on the buying decision, it doesn’t need to be remediated at the outset of the project.
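To illustrate the guidelines above, here is a minimal sketch of a prioritization pass. The thresholds (top 10% and bottom 20% of the sales ranking), the six-month discontinuation window, the attribute sets, and the field names are illustrative assumptions, not fixed rules.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import date

# Illustrative assumptions: which attributes are "high visibility" and which are
# mandatory for the platform to accept the record at all.
HIGH_VISIBILITY = {"title", "description", "features", "primary_image", "facets"}
MANDATORY = {"title", "primary_image"}

@dataclass
class Product:
    sku: str
    sales_rank_pct: float                        # 0.0 = best seller, 1.0 = worst seller
    discontinue_date: date | None = None         # None = active product
    missing_attributes: set[str] = field(default_factory=set)

def needs_remediation(p: Product, today: date | None = None) -> bool:
    """Apply the guidelines above; the thresholds are illustrative, not prescriptive."""
    today = today or date.today()
    missing_mandatory = bool(p.missing_attributes & MANDATORY)

    # Discontinued or soon-to-be-discontinued: leave alone unless it won't load.
    if p.discontinue_date and (p.discontinue_date - today).days < 180:
        return missing_mandatory

    # Top performers: load as-is unless a mandatory attribute is missing.
    if p.sales_rank_pct <= 0.10:
        return missing_mandatory

    # Bottom performers: remediation is unlikely to pay back.
    if p.sales_rank_pct >= 0.80:
        return missing_mandatory

    # Everything else: remediate only when high-visibility attributes are missing.
    return bool(p.missing_attributes & HIGH_VISIBILITY)
```

Running every SKU through a filter like this, with thresholds tuned to your own catalog, keeps the remediation effort pointed at the products and attributes that can actually move sales.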
Data remediation is all about sales lift, but some activities will not produce any additional sales. Understanding which products have the greatest potential for sales improvement is key to limiting your remediation spend while increasing the return on investment from that spend. Although which products and attributes require remediation may change from company to company, the guidelines for making that determination stay the same.
#3 Understand What Complete Means
One of the biggest gaps I find with customers is that they have no idea what a complete product record looks like. They know they need data for their website, their print applications, their syndication channels, and/or their data warehouse, but they have no view into what the summation of those downstream system requirements looks like. If you can’t define a holistic view of every attribute required for every output, you can’t possibly understand what to remediate.
That isn’t to say that requirements don’t change over time. It is a known problem in the industry that, due to the Google Effect and the Amazon Effect (I’ll deal with the definitions of these in another post), data requirements for syndication channels change daily. However, for data remediation, a company must pick a point in time to remediate to, and remediate to the requirements at that moment. Although the definition of complete may change over time, remediation is a discrete timeframe within that greater lifecycle.
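As an illustration, here is a minimal sketch that treats the remediation target as the union of every downstream channel’s required attributes, pinned to a snapshot date. The channel names, attribute sets, and date are assumptions for the example, not real requirements.

```python
from datetime import date

# Hypothetical requirement sets per downstream system. In practice these come
# from the channels themselves and drift over time, so pin them to a date.
CHANNEL_REQUIREMENTS = {
    "website":     {"title", "description", "primary_image", "facets"},
    "print":       {"title", "short_description", "primary_image"},
    "syndication": {"title", "description", "brand", "gtin", "primary_image"},
    "warehouse":   {"sku", "weight", "dimensions"},
}

# Illustrative: the date this definition of "complete" was frozen for the project.
REMEDIATION_SNAPSHOT_DATE = date(2024, 1, 1)

# "Complete" for remediation purposes = the union of every downstream requirement.
COMPLETE_DEFINITION = set().union(*CHANNEL_REQUIREMENTS.values())

def is_complete(record: dict) -> bool:
    """A record is complete when every attribute in the snapshot definition is populated."""
    return all(str(record.get(attr, "") or "").strip() for attr in COMPLETE_DEFINITION)

def missing_attributes(record: dict) -> set[str]:
    """Which attributes still need remediation for this record."""
    return {a for a in COMPLETE_DEFINITION if not str(record.get(a, "") or "").strip()}
```

Pinning the definition to a snapshot date keeps the remediation scope stable even as channel requirements continue to drift.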
This doesn’t even begin to quantify what accurate means in terms of data quality. Most companies struggle to define what complete means, so accuracy is treated as a byproduct of completeness. There are plenty of ways to use AI Agents (TrailBreakers has AI Agents built specifically for this task) to improve accuracy after the data is remediated, so a sensible remediation plan may defer accuracy work until the data is complete enough to load. Handling accuracy and completeness in the same step is not only bound to fail, it is also bound to make your remediation project many times larger and longer.
In Summary: Data Governance
I personally believe that completeness and accuracy work hand-in-hand. You can be 100% complete and still be 1% accurate, or 100% accurate but 1% complete. Both of those scenarios are data quality nightmares. However, aspiring to be 100% complete and 100% accurate for every product after a data remediation project is expensive, time-consuming, and kills project success metrics.
The overall solution to this set of problems is a well-defined data governance framework. Defining Complete, Accurate, Timely, Verifiable, and Contextual is vital to improving data quality, and doing so outside of data governance lets data entropy quickly set in and quietly erode those data quality metrics. The cure for data remediation woes is to have a data governance framework in place before the project begins, so you understand what your data quality goals are, what your current state is, and how you are going to reach those goals.