AI Limitations Overview

Explore top LinkedIn content from expert professionals.

  • Andrew Ng
    Founder of DeepLearning.AI; Managing General Partner of AI Fund; Exec Chairman of LandingAI
    2,194,636 followers

    A barrier to faster progress in generative AI is evaluations (evals), particularly of custom AI applications that generate free-form text. Let’s say you have a multi-agent research system that includes a researcher agent and a writer agent. Would adding a fact-checking agent improve the results? If we can’t efficiently evaluate the impact of such changes, it’s hard to know which changes to keep.

    For evaluating general-purpose foundation models such as large language models (LLMs) — which are trained to respond to a large variety of prompts — we have standardized tests like MMLU (multiple-choice questions) and HumanEval (testing code generation); the LMSYS Chatbot Arena, which pits two LLMs’ responses against each other and asks a human to judge which response is superior; and large-scale benchmarking like HELM. These evaluation tools are invaluable for giving LLM users a sense of different models' relative performance. Nonetheless, they have limitations: for example, leakage of benchmark datasets’ questions and answers into training data is a constant worry, and human preference for certain answers does not mean those answers are more accurate.

    In contrast, our current options for evaluating specific applications built using LLMs are far more limited. Here, I see two major types of applications.

    - For applications designed to deliver unambiguous, right-or-wrong responses, we have reasonable options. Let’s say we want an LLM to read a resume and extract the candidate's most recent job title, or read a customer email and route it to the right department. We can create a test set that comprises ground-truth labeled examples with the right responses, and measure the percentage of times the LLM generates the right output. The main bottleneck is creating the labeled test set, which is expensive but surmountable.

    - But many LLM-based applications generate free-text output with no single right response. For example, if we ask an LLM to summarize customer emails, there’s a multitude of possible good (and bad) responses. The same holds for a system to do web research and write an article about a topic, or a RAG system for answering questions. It’s impractical to hire an army of human experts to read the LLM’s outputs every time we tweak the algorithm and evaluate if the answers have improved — we need an automated way to test the outputs.

    Thus, many teams use an advanced language model to evaluate outputs. In the customer email summarization example, we might design an evaluation rubric (scoring criteria) for what makes a good summary. Given an email summary generated by our system, we might prompt an advanced LLM to read it and score it according to our rubric. I’ve found that the results of such a procedure, while better than nothing, can also be noisy — sometimes too noisy to reliably tell me if the way I’ve tweaked an algorithm is good or bad. [Reached LinkedIn's length limit. Rest of text: https://lnkd.in/gQEDtSr7 ]
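
    To make the rubric-scoring step concrete, here is a minimal Python sketch of the LLM-as-judge pattern Ng describes: a stronger model grades each generated summary against a fixed rubric. The rubric text, scoring scale, model name, and JSON output format are illustrative assumptions, not details from the post.

    ```python
    # Minimal LLM-as-judge sketch: score a generated email summary against a rubric.
    # The rubric text, model name, and scoring scale are illustrative assumptions.
    # Requires the `openai` package and an API key in the environment.
    import json
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    RUBRIC = """Score the summary from 1 (poor) to 5 (excellent) on each criterion:
    1. Faithfulness: no claims that are absent from or contradict the email.
    2. Coverage: captures the customer's main issue and requested action.
    3. Brevity: three sentences or fewer, no filler.
    Return JSON: {"faithfulness": int, "coverage": int, "brevity": int, "rationale": str}"""

    def judge_summary(email: str, summary: str, model: str = "gpt-4o") -> dict:
        """Ask a stronger model to grade one summary; returns the parsed rubric scores."""
        resp = client.chat.completions.create(
            model=model,
            temperature=0,  # reduce (but not eliminate) run-to-run noise
            response_format={"type": "json_object"},
            messages=[
                {"role": "system", "content": RUBRIC},
                {"role": "user", "content": f"EMAIL:\n{email}\n\nSUMMARY:\n{summary}"},
            ],
        )
        return json.loads(resp.choices[0].message.content)

    # Averaging these scores over a fixed test set (and over repeated judge calls)
    # is one way to dampen the noise Ng mentions before comparing pipeline variants.
    ```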

  • Tom Goodwin
    740,161 followers

    Agentic AI seems destined to fail in the medium term, and here are some technical reasons why. And almost everyone talking about it (the big consultancies, the trends people, the futurists, the VCs) seems to have not bothered to do any thinking at all.

    For a start, there are two forms of it: 1) Consumer agents ("Go to the internet and book my vacation") and 2) Business process agents ("RPA on steroids"). I will just focus on 1) for this post.

    Consumer agents are somewhat screwed because the entire internet has been constructed for humans.

    - We have buttons to push, images to illustrate, videos to explain. These are remarkably easy for humans to navigate, and remarkably hard (and inefficient) for machines. If we wanted the Internet to work for agents, we'd simply make a database.

    - We built the Internet haphazardly and around commercial needs. There is a reason for apps: to create a walled garden. There is a reason APIs are limited: people want to own the data. There is a reason for CAPTCHAs and rate limits: we've spent 30 years trying to keep bots OUT.

    So yes, in theory airlines, hotels, retailers, dentists, tire fitters and everyone else would just change their digital interfaces to allow bots, but in reality this would take a decade and create absolute carnage in every part of IT.

    So, yes, if we can fix:
    1) API Restrictions
    2) Anti-Automation Defenses
    3) Dynamic Web Interfaces
    4) Limited Data Access
    5) Manual Authentication
    6) Rate Limits
    7) Complex Decision Logic
    8) Content Analysis Challenges
    9) Legal Risks
    10) Copyright Issues
    11) Security Vulnerabilities
    12) Compliance Requirements
    13) Reputational Damage
    14) Error Handling
    15) Scalability Limits
    and about 25 other critical things, then we should be able to buy a jumper via a bot. Not that anyone really wants to do this.

    And yes, for 10 years we've talked about subscriptions, rundles, automation, predictive retail, conversational commerce, voice commerce, and nobody in the real world has ever wanted to actually shop this way.
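
    As a small illustration of two of the barriers Goodwin lists (anti-automation defenses and rate limits), the sketch below shows the minimum a "polite" agent has to do just to fetch a page: respect robots.txt and back off on HTTP 429 responses. The setup is generic and hypothetical; CAPTCHAs, logins, and dynamic interfaces are not addressed at all.

    ```python
    # Toy "polite fetch": check robots.txt and back off when rate-limited.
    # Illustrative only; real sites add CAPTCHAs, auth walls, and bot detection on top.
    import time
    import requests
    from urllib import robotparser

    def polite_get(url: str, user_agent: str = "example-agent/0.1", retries: int = 3) -> str:
        # Respect the site's robots.txt before touching the page at all.
        rp = robotparser.RobotFileParser()
        rp.set_url(requests.compat.urljoin(url, "/robots.txt"))
        rp.read()
        if not rp.can_fetch(user_agent, url):
            raise PermissionError(f"robots.txt disallows fetching {url}")

        for attempt in range(retries):
            resp = requests.get(url, headers={"User-Agent": user_agent}, timeout=10)
            if resp.status_code == 429:  # rate limited: wait as instructed, then retry
                time.sleep(int(resp.headers.get("Retry-After", 2 ** attempt)))
                continue
            resp.raise_for_status()
            return resp.text
        raise RuntimeError(f"still rate-limited after {retries} attempts: {url}")
    ```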

  • Justin Oberman
    Publicist, Copywriter and Impresario for the New Post-AI Reputation Economy. I help creative leaders and bold businesses become unforgettable.
    55,790 followers

    This is why I’m not worried about AI.

    I asked ChatGPT to give me a photo idea for a LinkedIn post reviewing Paul Arden’s book “Whatever You Think, Think The Opposite.” The only other prompt I gave was that it must involve a picture of me holding the book. It recommended that I take a picture of myself reading the book in a large, comfortable chair by a fire. Obvious. It also suggested I take a picture upside down in a library. Cliché.

    So, I opened today's #onebookaweek to a random page. It opened to page 81, with the headline “Look At It This Way.” This is what Arden says:

    “I used to commission much photography. Consequently, people were keen to show me their work. Ninety-nine percent of the portfolios I saw were of a very high standard. But 98 percent of them contained pictures I had seen before. Obviously, not the same subject or composition, but I had the general impression that I was not seeing anything new. They didn’t have a point of view. If they did, it was that the viewer of their pictures (me) should like their work. Very occasionally, I saw the work of someone who did have a point of view, whose work was like no one else’s. These were often difficult people, almost unemployable because you couldn’t tell them what to do. Sometimes it went wrong. Sometimes it didn’t. When it didn’t go wrong, it more than made up for the times that did.”

    At that moment, it started to rain. So, I decided to do the opposite of what ChatGPT told me to do.

    The problem with AI isn’t that it can’t give you plausible answers or suggestions. The problem with AI is that it’s not a problem. It’s not difficult (in the way creative people are difficult). It doesn’t have a unique point of view. If it looks like it does, that’s only because the person entering the prompts does. It would never suggest I take a picture with the book standing in the rain. Or that I take a different type of picture altogether.

    Creatively speaking, AI is only as good as the creative mind entering the prompts. I can think of lots of ways these tools can be used to fuel creative minds. But until ChatGPT calls my request dumb… until it can “Think the opposite” without being prompted to… I cannot trust its ability to offer truly creative solutions.

    Will AI ever get to that point? I can’t predict the future. But there is one thing I can say: now that AI can handle all the tactical stuff, Bill Bernbach’s quote about creativity being the last legal unfair advantage a business has over its competition is more relevant than ever. And there’s still only one pain-in-the-ass sentient being on this planet capable of it. Until AI annoyingly goes off brief, it’s not being creative.

    Link to book in comments
    👍 to share
    #obercreative #advertising #marketing #creative #AI #artificialintelligence
    My services #ghostwriting #copywriting #consulting
    P.S. I’m not anti-AI. I use it all the time.

  • Vin Vashishta
    AI Strategist | Monetizing Data & AI For The Global 2K Since 2012 | 3X Founder | Best-Selling Author
    202,086 followers

    Surveys say over half of companies have deployed a GenAI app or feature, and I’m not buying it. Deployed = adopted, and I can tell you from experience, adopted is the harder problem. Half of companies still don’t trust their data enough to act on it. Now you’re telling me that they have magically deployed and gotten users to adopt GenAI?

    Every AI problem is a data problem until the model hits user and customer hands. Then it transforms into a people problem. Users only adopt GenAI when it’s seamlessly integrated into the apps they already use. Don’t underestimate the difficulty of getting users to change.

    AI Product Design 101: The closer the model-supported experience is to the original workflow, the better the adoption rate. For example, most business workflows that involve data use tabular data, which LLMs don’t handle well. SAP only released one LLM this week… and it works with tabular data. It has a conversational interface for users to ask questions about spreadsheets, price quotes, and financial reports because that’s what customers are used to doing. Users can work with familiar data types and still get the ease of the new interface and simpler data querying.

    Familiarity is the smartest approach to adoption. In the LLM-supported products I have worked on, once users adapt their workflows to leverage the new interface, they quickly form new habits. The hard part is getting them to start, and most companies don’t realize how big that behavioral change barrier is. I’m an SAP partner because they build stuff that works and gets adopted. Those surveys would be believable if more companies followed SAP’s lead. #GenAI #SAPSapphire
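
    As a toy illustration of the "ask questions about a spreadsheet" pattern Vashishta describes, the sketch below pastes a small table into a prompt and lets a model answer in plain language. It is not SAP's product or API; the model name, column names, and prompt wording are assumptions, and large tables would need a different strategy (such as having the model generate a query instead).

    ```python
    # Toy conversational interface over tabular data: serialize a small table into
    # the prompt and ask a question in plain language. Illustrative assumptions only.
    import pandas as pd
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    quotes = pd.DataFrame({
        "quote_id": ["Q-101", "Q-102", "Q-103"],
        "customer": ["Acme", "Globex", "Initech"],
        "amount_usd": [12500, 8400, 19900],
        "status": ["open", "won", "lost"],
    })

    def ask_about_table(df: pd.DataFrame, question: str, model: str = "gpt-4o") -> str:
        """Answer a natural-language question using only the rows in the table."""
        resp = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": "Answer questions using only the table provided."},
                {"role": "user", "content": f"Table (CSV):\n{df.to_csv(index=False)}\n\nQuestion: {question}"},
            ],
        )
        return resp.choices[0].message.content

    print(ask_about_table(quotes, "Which open quote has the highest value?"))
    ```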

  • Spencer Dorn
    Vice Chair & Professor of Medicine, UNC | Balanced healthcare perspectives
    17,079 followers

    Healthcare AI is becoming accurate enough to be useful yet imperfect enough that physicians must still verify the output. Yet, as Cory Doctorow explained, “the story of AI being managed by a ‘human in the loop’ is a fantasy because humans are neurologically incapable of maintaining vigilance in watching for rare errors.”

    For example, TSA agents are great at detecting the water bottles travelers commonly leave in their bags. But so-called "Red Teams" of Homeland Security agents posing as passengers get weapons past TSA agents 95% of the time! Like all humans, physicians struggle to maintain attention without actively engaging. And we will struggle even more as AI becomes more reliable, less novel, and moves more to the background.

    Eli Ben-Joseph, the thoughtful CEO of Regard, whose widely used AI tools help physicians document and surface diagnoses, explained to Politico that “sometimes when our users got used to our product, they would start just kind of blindly trusting it.”

    In a new JAMA editorial [doi:10.1001/jama.2024.3620], UCSF’s Bob Wachter and colleagues explain that “the path forward rests on designing and deploying AI in ways that enhance human vigilance.” They outlined five options for promoting vigilance:
    1. Using visual cues to highlight the degree of uncertainty (e.g., highlighting recommendations that are more likely to be erroneous).
    2. Tracking physicians to see if they are (or are not) remaining vigilant (e.g., someone who accepts 100% of AI recommendations is not paying attention).
    3. Reducing expectations that AI will boost productivity.
    4. Introducing “deliberate shocks” to see if physicians are paying attention (analogous to the example above, where “red teams” randomly place fake firearms into carry-on bags).
    5. Shifting the paradigm so AI watches over clinicians, rather than vice versa (analogous to how spellcheckers only highlight potentially misspelled words).

    Each of these approaches must be evaluated in the real world. None will be perfect. At the same time, we must admit that we already exhibit automation bias without AI. For example, teaching physicians (myself included) rarely carefully read and edit our residents’ and fellows’ notes before signing off on them.

    The point is that, like all technology, the various forms of healthcare AI will have benefits and drawbacks (like automation bias). If we do not recognize and work to mitigate automation bias, physicians and other healthcare workers ultimately risk becoming a bunch of “OK-button-mashing automatons.” #healthcareai #automationbias #healthcareonlinkedin
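
    As a rough sketch of option 2 above (tracking whether clinicians remain vigilant), the snippet below flags reviewers whose acceptance rate of AI recommendations is suspiciously close to 100%. The threshold and minimum review count are made-up values, not figures from the JAMA editorial.

    ```python
    # Toy vigilance monitor: flag reviewers who rubber-stamp nearly every AI suggestion.
    # Threshold and minimum-review count are placeholder assumptions.
    from collections import defaultdict

    def flag_rubber_stampers(review_log, threshold=0.98, min_reviews=50):
        """review_log: iterable of (physician_id, accepted: bool) pairs.
        Returns {physician_id: acceptance_rate} for reviewers worth a follow-up."""
        counts = defaultdict(lambda: [0, 0])  # physician -> [accepted, total]
        for physician, accepted in review_log:
            counts[physician][0] += int(accepted)
            counts[physician][1] += 1
        return {
            physician: accepted / total
            for physician, (accepted, total) in counts.items()
            if total >= min_reviews and accepted / total >= threshold
        }

    # Example: a physician who accepted 199 of 200 suggestions would be flagged,
    # while one who edits or rejects a meaningful share would not.
    ```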

  • Keith Wargo
    President and CEO of Autism Speaks, Inc.
    4,733 followers

    A man on the autism spectrum, Jacob Irwin, experienced severe manic episodes after ChatGPT validated his delusional theory about bending time. Despite clear signs of psychological distress, the chatbot encouraged his ideas and reassured him he was fine, leading to two hospitalizations.

    Autistic people, who may interpret language more literally and form intense, focused interests, are particularly vulnerable to AI interactions that validate or reinforce delusional thinking. In Jacob Irwin’s case, ChatGPT’s flattering, reality-blurring responses amplified his fixation and contributed to a psychological crisis. When later prompted, ChatGPT admitted it failed to distinguish fantasy from reality and should have acted more responsibly. "By not pausing the flow or elevating reality-check messaging, I failed to interrupt what could resemble a manic or dissociative episode—or at least an emotionally intense identity crisis,” ChatGPT said.

    To prevent such outcomes, guardrails should include real-time detection of emotional distress, frequent reminders of the bot’s limitations, stricter boundaries on role-play or grandiose validation, and escalation protocols—such as suggesting breaks or human contact—when conversations show signs of fixation, mania, or deteriorating mental state. The incident highlights growing concerns among experts about AI's psychological impact on vulnerable users and the need for stronger safeguards in generative AI systems.

    https://lnkd.in/g7c4Mh7m
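
    To illustrate what such guardrails might look like mechanically, here is a deliberately simplistic sketch: periodic reminders of the bot's limitations plus an escalation message when a session accumulates signs of distress or grandiose fixation. The phrase lists, thresholds, and messages are placeholders; a real system would rely on trained classifiers and clinician-designed protocols, not keyword matching.

    ```python
    # Toy guardrail layer: reminder cadence plus escalation on repeated warning signs.
    # Marker phrases, thresholds, and messages are placeholder assumptions only.
    DISTRESS_MARKERS = ("can't sleep", "haven't eaten", "everyone is against me")
    GRANDIOSITY_MARKERS = ("i can bend time", "i alone", "chosen to save")

    REMINDER = "Reminder: I'm an AI and can be wrong. I can't assess your wellbeing."
    ESCALATION = ("It might help to take a break and talk this over with someone you "
                  "trust or a mental health professional.")

    def guardrail_check(messages: list[str], reminder_every: int = 5,
                        escalation_threshold: int = 3) -> list[str]:
        """Return guardrail interventions for a session of user messages."""
        interventions = []
        hits = 0
        for i, msg in enumerate(messages, start=1):
            text = msg.lower()
            if any(marker in text for marker in DISTRESS_MARKERS + GRANDIOSITY_MARKERS):
                hits += 1
            if i % reminder_every == 0:
                interventions.append(REMINDER)
            if hits >= escalation_threshold:
                interventions.append(ESCALATION)
                break
        return interventions
    ```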

  • Woojin Kim
    LinkedIn Top Voice · Chief Strategy Officer & CMIO at HOPPR · CMO at ACR DSI · MSK Radiologist · Serial Entrepreneur · Keynote Speaker · Advisor/Consultant · Transforming Radiology Through Innovation
    9,002 followers

    🌟 The article provides a nice overview of the current and future regulatory landscapes for artificial intelligence and machine learning (AI/ML) devices in radiology, highlighting the challenges that regulatory bodies face in ensuring the safety and effectiveness of these devices while keeping pace with clinical innovation.
    🔹 Current regulatory approaches for radiology AI/ML devices differ between the U.S. FDA and the European Union (EU). The table highlights this difference. 👇
    🔹 Future regulatory challenges include enhancing post-market surveillance, supporting continuous/active learning, enabling conditional clearances/approvals, moving beyond explainable and verifiable AI, and enabling autonomous AI/ML.

    One of the key differences is that EU MDR's processes "typically allow a manufacturer to obtain regulatory approval for broader features in a less onerous manner than the FDA. This approach is exemplified in 'comprehensive' chest radiograph algorithms from Annalise, Lunit, and Quire. The CE-marked versions of these algorithms detect 124, 10, and 15 different chest radiographic findings, respectively. In contrast, the FDA has cleared these same algorithms for just 5, 2, and 1 findings, respectively. Furthermore, while the Annalise and Lunit FDA-cleared devices are limited to providing binary triage information (e.g., pleural effusion present or absent), the CE-marked versions of the devices can provide localization information such as heat maps."

    🤔 IMO, this difference alone will be enough to widen the clinical AI adoption gap between the two regions.
    Link to the article 👉 https://buff.ly/48VOBb8
    #AI #RadiologyAI #ImagingAI #AIregulation #AIinnovation

  • Montgomery Singman
    Managing Partner @ Radiance Strategic Solutions | xSony, xElectronic Arts, xCapcom, xAtari
    26,068 followers

    In this thought-provoking piece, the author delves into the emerging role of artificial intelligence (AI) in personal decision-making, specifically in the context of emotional and relationship advice. The advent of AI chatbots has revolutionized how people seek guidance, even in matters of the heart. This article presents firsthand experiences from a therapist's practice where patients have consulted chatbots before seeking professional help. While AI chatbots can provide practical, unbiased advice, the author raises concerns about their increasing influence. The significant issues are the lack of empathy, personal understanding, and the potential for misinformation. As we continue incorporating AI into our lives, it's vital to consider the risks involved and the irreplaceable value of genuine human connection.

    Here are some key takeaways from the piece:
    💬 AI chatbots are increasingly being consulted for personal advice.
    💔 The results of chatbot advice on love and relationships have been mixed.
    🧠 Therapists express concerns about the implications of AI entering the therapy business.
    🤔 While AI may articulate things like humans, the goal and the approach can differ significantly.
    🤝 Despite technological advances, human connection and understanding remain irreplaceable.

    #AIChatbots #EmotionalAdvice #ArtificialIntelligence #Therapy #HumanConnection #RelationshipAdvice #MentalHealth #TechInfluence #FutureOfTherapy #EthicalConcerns

  • Vineet Agrawal
    Helping Early Healthtech Startups Raise $1-3M Funding | Award Winning Serial Entrepreneur | Best-Selling Author
    42,472 followers

    Medicare patients of colour are 33% more likely to be readmitted after surgery than their white counterparts. This number shows a huge disparity in the quality of treatment these groups receive. And with AI, this polarization will only get deeper and wider. Here’s why:

    ➤ 1. Unequal access to technology
    People with low incomes will face issues getting access to the technological devices that more affluent people can easily get. This will leave the less fortunate with fewer options for advanced medical care.

    ➤ 2. Poor internet infrastructure in rural areas
    60% of rural counties experience high rates of chronic illness. However, due to inadequate broadband connections, there will be fewer data points to feed AI algorithms. So, these people will stay underrepresented.

    ➤ 3. Lack of digital literacy
    An NCBI report shows only 51.8% of health professionals possess the necessary digital literacy to use technology. This means nearly 50% of patients (who will consult these professionals) won’t get access to advanced treatments.

    If we want to make healthcare universal, the time to address these issues and find their solutions is NOW.

    What measures do you think we can take to reduce these inequities?

    #healthcare #healthtech #equality

  • Alison McCauley
    2x Bestselling Author, AI Keynote Speaker, Digital Change Strategist . . . I focus on how AI can help us be better at what we do best.
    29,491 followers

    These 3 gaps stop AI initiatives in their tracks. Here’s how to break through.

    We're too focused on tech challenges, and not devoting enough focus + energy to work through the human challenges blocking us from AI value. Here are 3 gaps worth digging into (I see these in most orgs right now).

    >>>> Leaders who don’t use AI <<<<
    It's nearly impossible to lead teams toward a bold AI vision if you haven't experienced meaningful value from the technology yourself. Unfortunately, I see this in all kinds of organizations (including some you would not expect). The good news is that with a shift in mindset it doesn’t take long to not only get leaders hands-on, but to do it in a way that leads them to immediate value in their own work. I know because I have a workshop that guides them right there, and it’s magical to see this unlock. The secret is: don’t start by talking about AI. Start by asking business questions that really matter. Prioritize an area to tackle and partner closely with execs to demonstrate how AI can deliver answers that move the business forward.

    >>>> Your tools vs. their tabs <<<<
    Employees bypass internal tools for more powerful public ones. Enterprise tools often lag in capability, so people turn to shadow AI use. It’s about perceived usefulness vs. actual availability. To unblock it, develop a holistic, nuanced, and shared understanding of how your organization defines risk, considering different kinds of risk:
    1. Operational risk: People will keep using unapproved AI tools in the shadows if approved ones don’t meet their needs.
    2. Competitiveness risk: Falling behind peers or rivals who adopt AI more effectively, faster, and with greater real-world impact.
    3. Compliance risk: Sensitive data and workflows may leak outside safe channels, creating exposure for privacy, IP, or regulatory breaches.
    From THIS lens, open dialogue: build feedback channels, create safe spaces to surface gaps, and prioritize where “better AI” drives “better business”.

    >>>> Using AI does not = AI value <<<<
    Most teams are experimenting but struggle to unlock meaningful value. Too often, AI learning programs focus on mechanics over helping people practice applying AI to real problems or incorporate AI into their day to day work. How to unblock it? Stop teaching tools in isolation — reshape learning programs to tackle real problems side-by-side with employees, showing how to connect new AI capabilities to the work that matters most to them.

    ______
    We always tend to underestimate what it takes to make change happen. With AI moving so fast (and feeling so chaotic in many orgs), this is especially dangerous.
    _____
    What do you think??? What other human barriers to AI success should we be talking about here? What other tactics have you found help to break through these gaps?
    ____
    If this is helpful, ♻️ repost to help someone in your network!
    ____
    👋 Hi, I'm Alison McCauley. Follow me for more on using AI to advance human performance.