How to Address Data Quality Issues for AI Implementation

Explore top LinkedIn content from expert professionals.

  • Aishwarya Srinivasan, Influencer
    579,513 followers

    Over the last few years, you have seen me posting about Data-Centric AI, why it is important, and how to implement it in your ML pipeline. I shared resources on a key step: building a Data Validation module, for which there are several libraries. Two drawbacks I observed in many of them are: (i) the data validation/quality checks need to be manually developed, and (ii) the quality checks do not support different data modalities. While investigating, I discovered a standard open-source library for Data-Centric AI called Cleanlab. Curious to learn more, I got on a call where one of their scientists, Jonas Mueller, shared research on Confident Learning, an algorithm for *automated data validation* that works in a general-purpose way for all data modalities (including tabular, text, image, audio, etc). This blew my mind! The library has since been updated with all sorts of automated data improvement capabilities, and I am excited to share what I tried it out on.

    Let me first explain Confident Learning (CL). CL is a novel probabilistic approach that uses an ML model to estimate which data/labels are not trustworthy in noisy real-world datasets (see the blogpost linked below for more theory). In essence, CL uses probabilistic predictions from any ML model you trained to perform the following steps:
    📊 Estimate the joint distribution of the given, noisy labels and the latent (unknown) true labels to fully characterize class-conditional label noise.
    ✂️ Find and prune noisy examples with label issues.
    📉 Train a more reliable ML model on the filtered dataset, re-weighting the data by the estimated latent prior.
    This data-centric approach helps you turn unreliable data into reliable models, regardless of what type of ML model you are using.

    What you can do with Cleanlab:
    📌 Detect common data issues (outliers, near duplicates, label errors, drift, etc) with a single line of code
    📌 Train robust models by integrating Cleanlab into your MLOps/DataOps pipeline
    📌 Infer consensus + annotator quality for data labeled by multiple annotators
    📌 Suggest which data to (re)label next via ActiveLab - a practical Active Learning algorithm that collects a dataset with the fewest total annotations needed to train an accurate model. To reduce annotation costs, ActiveLab automatically estimates when it is more informative to re-label existing examples vs. label entirely new ones.

    Try improving your own dataset with this open-source library via the 5-minute tutorials linked on their GitHub: https://lnkd.in/gWtgPUXw (⭐ it to support free open-source software!)

    More resources:
    👩🏻💻 Cleanlab website: https://cleanlab.ai/
    👩🏻💻 Confident Learning blogpost: https://lnkd.in/gDKccShh
    👩🏻💻 ActiveLab blogpost: https://lnkd.in/giXHaPBF

    PS: Did you know Google also uses Cleanlab to find and fix errors in their big speech dataset in a scalable manner?

    #ml #datascience #ai #data #datacentricai
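    To make that "single line of code" concrete, here is a minimal sketch (my own illustration, not from the post) of the Confident Learning workflow with Cleanlab's find_label_issues: train any classifier, get out-of-sample predicted probabilities, and let the library rank the examples whose given label looks wrong. The toy dataset, the simulated label noise, and the logistic regression model are assumptions for the example.

    ```python
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_predict
    from cleanlab.filter import find_label_issues

    # Toy 3-class dataset with a handful of labels flipped to simulate annotation noise
    X, labels = make_classification(n_samples=1000, n_classes=3, n_informative=5, random_state=0)
    rng = np.random.default_rng(0)
    flipped = rng.choice(len(labels), size=30, replace=False)
    labels[flipped] = (labels[flipped] + 1) % 3

    # Confident Learning works on out-of-sample predicted probabilities from any model
    pred_probs = cross_val_predict(
        LogisticRegression(max_iter=1000), X, labels, cv=5, method="predict_proba"
    )

    # One call flags the examples whose given label is likely untrustworthy
    issue_idx = find_label_issues(
        labels=labels, pred_probs=pred_probs, return_indices_ranked_by="self_confidence"
    )
    print(f"{len(issue_idx)} suspected label issues, e.g. indices {issue_idx[:10]}")
    ```

    From here you could prune or re-label the flagged rows and retrain, which is the filter-and-retrain loop described above.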

  • Bill Shube
    Gaining better supply chain visibility with low-code/no-code analytics and process automation. Note: views are my own and not necessarily shared by my employer.
    2,592 followers

    Want a simple way to earn trust from your stakeholders, analysts? Send them data quality alerts when things go wrong. This is data 101 for engineers, but my team and I are citizen developers. We don't have the same kind of training - things like this simply aren't immediately obvious to us.

    Here's an example of why you should do this, from just this week: an analysis that we run depends on A LOT of inputs, including some manually uploaded files, so there is plenty of opportunity for things to go wrong. On Monday, I heard from one of the file providers that her upload had been failing for almost 2 weeks. One of my end users spotted the problem at about the same time that I heard from my file provider. Not great being the last one to find out about a data quality problem in an analysis that you're responsible for. I had been working on some data quality alerts, and sure enough, they would have spotted the problem right away. So I'm eager to finalize them and get them into production.

    Here are some easy things I'm implementing (sketched in code below):
    1. Record count checks: do today's inputs have roughly the same number of records as yesterday's? This doesn't catch every problem, but it's very easy to implement - and it's all I needed to spot the problem I just described.
    2. Consistency checks: make sure your inputs "look" the way you expect them to. In this case, the upload was failing because one of the columns in the file changed from numerical to text, and our SQL database didn't like that.
    3. Null checks: you might get the right number of records and the right data types, but the data could all be null.
    4. Automated alerts: you don't want to hear about data quality issues from your stakeholders the way that I did. Put in some basic alerts like these with automatic emails when they're triggered, and copy all your stakeholders.

    This will sound remedial to data engineers, but these are habits that we citizen developers don't always have. There's a lot that we citizen developers can learn from our friends in IT, and simple things like this can go a long way toward earning our stakeholders' trust. #citizendevelopment #lowcode #nocode #analytics #supplychainanalytics
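    Here is a minimal sketch of those four checks, assuming the inputs land in pandas DataFrames. The threshold, the expected columns and dtypes, the file names, the email addresses, and the SMTP host are hypothetical placeholders, not details from the post.

    ```python
    import smtplib
    from email.message import EmailMessage

    import pandas as pd

    def run_quality_checks(today: pd.DataFrame, yesterday: pd.DataFrame) -> list[str]:
        problems = []

        # 1. Record count check: today's row count within +/-20% of yesterday's
        if abs(len(today) - len(yesterday)) > 0.2 * max(len(yesterday), 1):
            problems.append(f"Row count jumped from {len(yesterday)} to {len(today)}")

        # 2. Consistency check: columns and dtypes still look the way we expect
        expected_dtypes = {"order_id": "int64", "quantity": "float64"}  # hypothetical schema
        for col, dtype in expected_dtypes.items():
            if col not in today.columns:
                problems.append(f"Missing column: {col}")
            elif str(today[col].dtype) != dtype:
                problems.append(f"Column {col} is {today[col].dtype}, expected {dtype}")

        # 3. Null check: flag columns that are mostly or entirely null
        for col, share in today.isna().mean().items():
            if share > 0.5:
                problems.append(f"Column {col} is {share:.0%} null")

        return problems

    def send_alert(problems: list[str]) -> None:
        # 4. Automated alert: email every stakeholder whenever any check fails
        msg = EmailMessage()
        msg["Subject"] = "Data quality alert: daily upload"
        msg["From"] = "alerts@example.com"            # hypothetical sender
        msg["To"] = "stakeholders@example.com"        # hypothetical distribution list
        msg.set_content("\n".join(problems))
        with smtplib.SMTP("smtp.example.com") as smtp:  # hypothetical SMTP host
            smtp.send_message(msg)

    problems = run_quality_checks(pd.read_csv("today.csv"), pd.read_csv("yesterday.csv"))
    if problems:
        send_alert(problems)
    ```

    The same pattern translates directly to a low-code tool: one step per check, and an email action that fires whenever any check returns a problem.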

  • Chad Sanderson
    CEO @ Gable.ai (Shift Left Data Platform)
    88,861 followers

    I spent years trying to build automatic mechanisms for detecting and resolving data quality issues before I realized that getting people to talk to each other before a change was shipped solved 75+% of the problems. As interesting as Artificial Intelligence and ChatGPT might be, some of the biggest, hairiest challenges in all of data can probably be fixed by making sure humans are in the loop at the right time.

    Here's what I've found to be unbelievably valuable:
    - Business users have identified which data assets at the company are critical and their corresponding tier of importance
    - Data producers (software engineers) understand how their data is being used downstream, for what, and by whom
    - Data producers have visibility into the impact their changes will have on business-critical use cases downstream
    - Data producers can see who is still using data from fields or tables they plan to deprecate
    - Data consumers are informed any time a change is proposed to a data asset that would break their important use cases (ideally in their Looker or Tableau dashboards)
    - Data consumers are informed any time new data is available from an upstream producer

    If producers have a solid understanding of where their data is flowing, and consumers understand when changes are coming and why, these two teams will actively collaborate as needed without a data engineer constantly in the loop. Good luck! #dataengineering
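    As a rough illustration of the visibility described above, here is a hypothetical sketch of a pre-merge check: compare a producer's proposed schema change against the fields each downstream consumer has registered, and list who must be looped in before the change ships. The registry structure, team names, field names, and tiers are all invented for the example.

    ```python
    from dataclasses import dataclass

    @dataclass
    class Consumer:
        team: str
        fields: set[str]   # columns this team's dashboards/models depend on
        tier: str          # business-criticality tier assigned by business users

    # Registry a platform might build from Looker/Tableau lineage, dbt exposures, etc. (illustrative)
    CONSUMERS = [
        Consumer("finance-reporting", {"orders.order_id", "orders.net_revenue"}, "tier-1"),
        Consumer("ops-dashboards",    {"orders.ship_date"},                      "tier-2"),
    ]

    def impacted_consumers(changed_fields: set[str]) -> list[Consumer]:
        """Return every consumer whose registered fields overlap the proposed change."""
        return [c for c in CONSUMERS if c.fields & changed_fields]

    # A producer proposes dropping or renaming a column in a pull request:
    proposed_change = {"orders.net_revenue"}
    for consumer in impacted_consumers(proposed_change):
        overlap = consumer.fields & proposed_change
        print(f"Notify {consumer.team} ({consumer.tier}): change touches {overlap}")
    ```

    Wired into CI on the producer's repository, a check like this turns "talk to each other before the change ships" into a concrete, automatic notification step.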