Over the last few years, you have seen me posting about Data-Centric AI, why it is important, and how to implement it in your ML pipeline. I shared resources on a key step: building a Data Validation module, for which there are several libraries. Two drawbacks I observed in many libraries are: (i) the data validation/quality checks need to be manually developed, and (ii) the quality checks do not support different data modalities.

While investigating, I discovered a standard open-source library for Data-Centric AI called Cleanlab. Curious to learn more, I got on a call where one of their scientists, Jonas Mueller, shared research on Confident Learning, an algorithm for *automated data validation* in a general-purpose way that works for all data modalities (including tabular, text, image, audio, etc.). This blew my mind! The library has been updated with all sorts of automated data improvement capabilities, and I am excited to share what I tried it out for.

Let me first explain Confident Learning (CL). CL is a novel probabilistic approach that uses an ML model to estimate which data/labels are not trustworthy in noisy real-world datasets (see the blogpost linked below for more theory). In essence, CL uses probabilistic predictions from any ML model you trained to perform the following steps (a minimal code sketch is included at the end of this post):
📊 Estimate the joint distribution of the given, noisy labels and the latent (unknown) true labels to fully characterize class-conditional label noise.
✂️ Find and prune noisy examples with label issues.
📉 Train a more reliable ML model on the filtered dataset, re-weighting the data by the estimated latent prior.
This data-centric approach helps you turn unreliable data into reliable models, regardless of what type of ML model you are using.

What can you do with Cleanlab:
📌 Detect common data issues (outliers, near duplicates, label errors, drift, etc.) with a single line of code
📌 Train robust models by integrating Cleanlab in your MLOps/DataOps pipeline
📌 Infer consensus + annotator quality for data labeled by multiple annotators
📌 Suggest which data to (re)label next via ActiveLab - a practical Active Learning algorithm to collect a dataset with the fewest total annotations needed to train an accurate model. To reduce data annotation costs, ActiveLab automatically estimates when it is more informative to re-label examples vs. labeling entirely new ones.

Try improving your own dataset with this open-source library via the 5-minute tutorials linked on their GitHub: https://lnkd.in/gWtgPUXw (⭐ it to support free open-source software!)

More resources:
👩🏻💻 Cleanlab website: https://cleanlab.ai/
👩🏻💻 Confident Learning blogpost: https://lnkd.in/gDKccShh
👩🏻💻 ActiveLab blogpost: https://lnkd.in/giXHaPBF

PS: Did you know Google also uses Cleanlab to find and fix errors in their big speech dataset in a scalable manner? #ml #datascience #ai #data #datacentricai
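For anyone who wants to try this, here is a minimal sketch of the workflow described above (in the spirit of Cleanlab's 5-minute tutorials rather than copied from them). The toy data is a placeholder and exact arguments may vary by library version, so check the docs:

```python
# Minimal sketch: flag likely label errors with Confident Learning (Cleanlab),
# using out-of-sample predicted probabilities from any classifier you like.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from cleanlab.filter import find_label_issues
from cleanlab.classification import CleanLearning

# Placeholder features and (noisy) integer class labels.
X = np.random.rand(200, 5)
labels = np.random.randint(0, 3, 200)

# 1. Out-of-sample predicted probabilities from any model.
pred_probs = cross_val_predict(
    LogisticRegression(max_iter=1000), X, labels,
    cv=5, method="predict_proba",
)

# 2. Confident Learning flags the examples whose given label looks least trustworthy.
issue_indices = find_label_issues(
    labels=labels,
    pred_probs=pred_probs,
    return_indices_ranked_by="self_confidence",
)
print(f"{len(issue_indices)} examples flagged as likely label errors")

# 3. Optionally train a more robust model on the cleaned data in one step:
# CleanLearning prunes flagged examples and reweights the rest internally.
clean_model = CleanLearning(LogisticRegression(max_iter=1000))
clean_model.fit(X, labels)
```

The pred_probs-in, issues-out pattern is what makes this modality-agnostic: as long as your model outputs class probabilities, it does not matter whether the inputs were tabular rows, text, images, or audio.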
Data Quality for AI
Explore top LinkedIn content from expert professionals.
-
Here are a few simple truths about Data Quality:
1. Data without quality isn't trustworthy
2. Data that isn't trustworthy isn't useful
3. Data that isn't useful is low ROI

Investing in AI while the underlying data is low ROI will never yield high-value outcomes. Businesses must put an equal amount of time and effort into the quality of their data as into the development of the models themselves.

Many people see data debt as just another form of technical debt - it's worth it to move fast and break things, after all. This couldn't be more wrong. Data debt is orders of magnitude WORSE than tech debt. Tech debt results in scalability issues, though the core function of the application is preserved. Data debt results in trust issues, where the underlying data no longer means what its users believe it means. Tech debt is a wall, but data debt is an infection. Once distrust drips into your data lake, everything it touches will be poisoned. The poison will work slowly at first, and data teams might be able to manually keep up with hotfixes and filters layered on top of hastily written SQL. But over time, the spread of the poison will be so great and deep that it will be nearly impossible to trust any dataset at all. A single low-quality dataset is enough to corrupt thousands of data models and tables downstream. The impact is exponential.

My advice? Don't treat Data Quality as a nice-to-have, or something that you can afford to 'get around to' later. By the time you start thinking about governance, ownership, and scale, it will already be too late and there won't be much you can do besides burning the system down and starting over. What seems manageable now becomes a disaster later on. The earlier you can get a handle on data quality, the better.

If you even have a guess that the business may want to use the data for AI (or some other operational purpose), then you should begin thinking about the following:
1. What will the data be used for?
2. What are all the sources for the dataset?
3. Which sources can we control versus which can we not?
4. What are the expectations of the data? (see the sketch below for one way to encode these as automated checks)
5. How sure are we that those expectations will remain the same?
6. Who should be the owner of the data?
7. What does the data mean semantically?
8. If something about the data changes, how is that handled?
9. How do we preserve the history of changes to the data?
10. How do we revert to a previous version of the data/metadata?

If you can affirmatively answer all 10 of those questions, you have a solid foundation of data quality for any dataset and a playbook for managing scale as the use case or intermediary data changes over time. Good luck! #dataengineering
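One practical way to act on question 4 is to write the expectations down as code rather than tribal knowledge. A tiny, hypothetical sketch in plain pandas, with made-up column names, could look like this:

```python
# Hypothetical sketch: encode "expectations of the data" as automated checks.
# Column names and rules are illustrative, not from any specific system.
import pandas as pd

def validate_orders(df: pd.DataFrame) -> list[str]:
    """Return a list of expectation violations for an orders dataset."""
    problems = []
    if df["order_id"].duplicated().any():
        problems.append("order_id is not unique")
    if df["amount"].lt(0).any():
        problems.append("amount contains negative values")
    if df["created_at"].isna().any():
        problems.append("created_at has missing timestamps")
    return problems

# Toy data with two deliberate violations.
df = pd.DataFrame({
    "order_id": [1, 2, 2],
    "amount": [10.0, -5.0, 7.5],
    "created_at": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"]),
})
print(validate_orders(df))  # ['order_id is not unique', 'amount contains negative values']
```

The specific checks matter less than the habit: expectations that live in code can be run on every load, versioned, and owned.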
-
The saying "more data beats clever algorithms" is not always so. In new research from Amazon, we show that using AI can turn this apparent truism on its head. Anomaly detection and localization is a crucial technology in identifying and pinpointing irregularities within datasets or images, serving as a cornerstone for ensuring quality and safety in various sectors, including manufacturing and healthcare. Finding them quickly, reliably, at scale matters, so automation is key. The challenge is that anomalies - by definition! - are usually rare and hard to detect - making it hard to gather enough data to train a model to find them automatically. Using AI, Amazon has developed a new method to significantly enhance anomaly detection and localization in images, which not only addresses the challenges of data scarcity and diversity but also sets a new benchmark in utilizing generative AI for augmenting datasets. Here's how it works... 1️⃣ Data Collection: The process starts by gathering existing images of products to serve as a base for learning. 2️⃣ Image Generation: Using diffusion models, the AI creates new images that include potential defects or variations not present in the original dataset. 3️⃣ Training: The AI is trained on both the original and generated images, learning to identify what constitutes a "normal" versus an anomalous one. 4️⃣ Anomaly Detection: Once trained, the AI can analyze new images, detecting and localizing anomalies with enhanced accuracy, thanks to the diverse examples it learned from. The results are encouraging, and show that 'big' quantities of data can be less important than high quality, diverse data when building autonomous systems. Nice work from the Amazon science team. The full paper is linked below. #genai #ai #amazon
-
How do we ensure that the future of AI is safe for everyone? Listen to women. Specifically, the brilliant women of color researchers - like Timnit Gebru, Dr. Rumman Chowdhury, Safiya Noble, Ph.D., Seeta Pena Gangadharan, and Dr. Joy Buolamwini - who have been sounding the alarm about the societal discrimination and biases that AI can magnify.

An analysis of the data sources that feed GPT-2 revealed that less than 15% of Wikipedia contributors were women or girls, only 34% of Twitter users were women, and 67% of Redditors were men. These sources are where large language models (LLMs) get their training data (aka the data you use to train a machine learning algorithm or model). Even more disheartening, Gebru's research shows that white supremacist and misogynistic views are prevalent in the training data.

Buolamwini's project also revealed that darker-skinned women were misclassified at rates as high as 34.7%, compared with just 0.8% for white men. This resulted from datasets that were simply not diverse enough: the systems were not given enough Black and brown faces to learn what they look like.

We must be aware of the consequences of bias in the automated systems used by 99% of Fortune 500 companies for hiring practices. AI-powered discrimination is a pressing issue affecting real lives. As artificial intelligence continues gaining traction, it's time for us to take responsibility for our decisions about how these technologies are trained and where the data is coming from. By including different perspectives, we can uncover blind spots, mitigate biases, and ensure that AI benefits everyone.
-
“Garbage in, garbage out” is the reason that a lot of AI-generated text reads like boring, SEO-spam marketing copy. 😴😴😴

If you’re training your organization's self-hosted AI model, it’s probably because you want better, more reliable output for specific tasks. (Or it’s because you want more confidentiality than the general-use models offer. 🥸 But you’ll take advantage of the additional training capabilities, right?) So don’t let your in-house model fall into the same problem! Cull the garbage data and only feed it the good stuff.

Consider these three practices (sketched in code below) to ensure only high-quality data ends up in your organization’s LLM.
1️⃣ Establish Data Quality Standards: Define what “good” data looks like. Clear standards are a good defense against junk info.
2️⃣ Review Data Thoroughly: Your standard is meaningless if nobody uses it. Check that data meets your standards before using it for training.
3️⃣ Set a Cut-off Date: Your sales contracts from 3 years ago might not look anything like the ones you use today. If you’re training an LLM to generate proposals, don’t give it examples that don’t match your current practices!

With better data, your LLM will provide more reliable results with less revision needed. #AI #machinelearning #fciso
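Here is a toy sketch of what those three practices can look like as a filter in front of your fine-tuning data; the field names, cut-off date, and length threshold are all made-up assumptions:

```python
# Toy sketch: apply a quality standard, a review gate, and a cut-off date
# before any example reaches LLM fine-tuning. All names/values are illustrative.
from datetime import datetime

CUTOFF = datetime(2023, 1, 1)   # practice 3: drop stale examples
MIN_LENGTH = 200                # practice 1: one rule in a "good data" standard

def passes_standard(example: dict) -> bool:
    """Practice 2: review each example against the standard before training."""
    return (
        example["created_at"] >= CUTOFF
        and len(example["text"]) >= MIN_LENGTH
        and example.get("approved_by_reviewer", False)
    )

raw_examples = [
    {"text": "x" * 500, "created_at": datetime(2024, 3, 1), "approved_by_reviewer": True},
    {"text": "too short", "created_at": datetime(2024, 3, 1), "approved_by_reviewer": True},
    {"text": "x" * 500, "created_at": datetime(2021, 6, 1), "approved_by_reviewer": True},
]
training_set = [ex for ex in raw_examples if passes_standard(ex)]
print(f"kept {len(training_set)} of {len(raw_examples)} examples")
```

The specific thresholds don't matter; what matters is that the standard lives in code and runs before every training pass, not in someone's head.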
-
If you’re drowning in bad data, I’ve got some bad news… data monitoring isn’t your lifeboat.

At some point in your data journey, data quality will become a bottleneck to delivering new value for your stakeholders. Whether we’re talking about dashboards or AI models, you can’t deliver useful data products until you can deliver usable and trustworthy data to power them. And real data quality is more than data monitoring. A lot more.

To really tackle data quality in a meaningful way, teams also need:
- Robust testing and CI/CD (see the sketch below for a minimal example)
- Change management (data contracts, SLAs, SLIs, SLOs, etc.)
- Buy-in from leaders and the broader team
- End-to-end coverage across the data feeding your most important products
- Comprehensive root cause analysis workflows
- Some type of knowledge graph or lineage to map dependencies
- Investment in data platforms as a first-class citizen

Anything I missed? Check out the full article via the link in the comments! #data #dataquality #dataengineering #datamonitoring
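For the testing bullet, here is a minimal sketch of what a CI data test can look like; the table, columns, and loader are illustrative assumptions, not a specific tool's API:

```python
# Minimal sketch: pytest-style data tests that run in CI against a fresh extract
# before anything ships downstream. Names are illustrative.
import pandas as pd

def load_latest_orders() -> pd.DataFrame:
    """Stand-in for however your pipeline pulls the latest extract."""
    return pd.DataFrame({
        "order_id": [1, 2, 3],
        "amount": [10.0, 20.0, 30.0],
        "customer_id": [101, 102, 103],
    })

def test_orders_have_no_nulls_in_keys():
    df = load_latest_orders()
    assert df["order_id"].notna().all()
    assert df["customer_id"].notna().all()

def test_orders_amounts_are_positive():
    df = load_latest_orders()
    assert (df["amount"] > 0).all()
```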
-
Sanjeev Mohan dives into why the success of AI in enterprise applications hinges on the quality of data and the robustness of data modeling.

Accuracy Matters: Accurate, clean data ensures AI algorithms make correct predictions and decisions.
Consistency is Key: Consistent data formats allow for smoother integration and processing, enhancing AI efficiency.
Timeliness: Current, up-to-date data keeps AI-driven insights relevant, supporting timely business decisions.

Just as a building needs a blueprint, AI systems require robust data models to guide their learning and output. Data modeling is crucial because it:
Structures Data for Understanding: It organizes data in a way that machines can interpret and learn from efficiently.
Tailors AI to Business Needs: Customized data models align AI outputs with specific enterprise objectives.
Enables Scalability: Well-designed models adapt to increasing data volumes and evolving business requirements.

As businesses continue to invest in AI, integrating high standards for data quality and strategic data modeling is non-negotiable.
-
Healthcare AI faces an existential crisis. With no accountability mechanism for population-scale clinical data, models trained at one location will not generalize to another. This means lower efficacy (value capture), more reinventing the wheel (increased costs), and worse patient care (the whole point).

Why is data quality so hard in healthcare? While it is always difficult to influence upstream data producers + suppliers, now imagine that those producers are able to dodge the data quality issue by appealing to the private, proprietary, and complex nature of the data (that "you'd probably need an MD/PhD to understand").

Thank you to AE Lewis, Nicole Weiskopf, Zachary B Abrams, Randi Foraker, Albert Lai, Philip Payne, and Aditi Gupta for your thorough review highlighting this problem space.

In their words: "Conclusion: Guidelines are needed for EHR data quality assessment to improve the efficiency, transparency, comparability, and interoperability of data quality assessment. These guidelines must be both scalable and flexible. Automation could be helpful in generalizing this process."
Electronic health record data quality assessment and tools: a systematic review: https://lnkd.in/gh9iyFh9

In my words: There are many tailwinds at our back, from HL7 FHIR to the 21st Century Cures Act, but we need to plug this data quality gap. The time to act is now. Who's with me?
-
67% of senior leaders are prioritizing generative AI (GenAI) for their business within the next 18 months — and it’s introducing huge potential risks to their organizations.

Since ChatGPT launched in November 2022, execs have become increasingly fixated on GenAI. Whether they’re driven by competitive pressures, a desire to boost efficiency, or plain old hype, the race is on to implement GenAI for internal and external use cases. And instead of aiming for a strategic journey towards trustworthy AI, the goal is often to just get it up and running as fast as possible. So they sideline the most important part of any AI-powered system: data quality and the data team that manages it.

This leads to a vicious cycle. Bad data, with enough nods of approval, becomes “good enough” data. And when this “good enough”-but-not-actually-good data goes into the AI models over the data team’s objections, garbage comes out. Trust is lost. We've seen this mess unfold over and over, especially through last decade’s data science wave. Yet somehow, we still haven’t put the spotlight on our data quality.

But now, with execs full-speed-ahead on AI, it’s up to data teams to throw up the “yield” sign and make some changes, starting with:
• Implementing robust data validation processes to ensure accuracy and reliability from the get-go.
• Fostering a culture of data literacy, where questioning and verifying data sources becomes second nature.
• Establishing clear guidelines for data usage and model training to prevent the normalization of low-quality data inputs.

We need to fix our data — and now’s a better time than ever. Because if we can't trust our data, how are we supposed to trust AI? #dataengineering #dataquality #genai #ai
-
In the age of AI, maintaining data sovereignty is paramount for CIOs and CTOs navigating the complex interplay between innovation and strict data protection regulations. Our role is critical in aligning robust protection measures with rapid AI advancements, ensuring forward-thinking, globally compliant data governance. This requires us to focus on several key considerations.

Strengthening AI Infrastructure for Future Challenges:
🔄 Develop Dynamic Governance: Evolve data governance frameworks to keep pace with technological advancements.
🔧 Continuous Capability Enhancement: Regularly update and improve AI capabilities to ensure data protection and compliance with global standards.
🌐 Proactive Approach: Leverage AI innovations proactively to maintain the integrity and security of data ecosystems.

Securing Data in the Generative AI Landscape:
🔒 Implement Stringent Security Protocols: Establish and maintain high-security measures tailored for Generative AI interactions.
🛡️ Secure Access and Environments: Control access to generative models and secure training environments to prevent unauthorized use.
📜 Develop Robust Policies and Agreements: Craft comprehensive use policies and solid supplier agreements to protect proprietary data and ensure ongoing compliance.

Tell me, where do you think organizations most urgently need support in maintaining data sovereignty (amidst rapid AI advancements)? Share your thoughts and insights below. #DataSovereignty #AI #Speed #Safety #DigitalTransformation