A breakthrough in video super-resolution. Upscale-A-Video is a Temporal-Consistent Diffusion Model that leverages text prompts to upscale low-resolution videos. To overcome the challenges of fidelity and temporal consistency, the model integrates temporal layers, a recurrent latent propagation module, and a fine-tuned VAE-Decoder. You also get flexibility with adjustable noise levels and text-guided texture creation, striking a balance between restoration and generation. Extensive experiments demonstrate superior performance on both synthetic and real-world benchmarks, showcasing impressive visual realism and temporal consistency. Check out more details about the project: https://lnkd.in/dmmnG46m Research paper: https://lnkd.in/dPJCf277 #AI #Video #Innovation
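Upscale-A-Video's own code isn't reproduced here, but the "adjustable noise level" idea maps onto familiar diffusion-upscaler APIs. Below is a minimal, hedged sketch using diffusers' Stable Diffusion x4 upscaler as a stand-in (not the paper's model): a low noise_level stays close to the input frame (restoration), while higher values let the text prompt synthesize more texture (generation). The checkpoint name and parameter values are illustrative.

```python
# Hedged sketch: an analogous image upscaler, NOT the Upscale-A-Video model itself.
# Assumes the diffusers library and the public x4-upscaler checkpoint.
import torch
from diffusers import StableDiffusionUpscalePipeline
from PIL import Image

pipe = StableDiffusionUpscalePipeline.from_pretrained(
    "stabilityai/stable-diffusion-x4-upscaler", torch_dtype=torch.float16
).to("cuda")

low_res = Image.open("frame_0001.png").convert("RGB")  # one low-res video frame

# noise_level trades restoration (low) against text-guided generation (high).
upscaled = pipe(
    prompt="a sharp, detailed city street, film grain",
    image=low_res,
    noise_level=20,  # ~20 for faithful restoration; 100+ for more synthesized detail
).images[0]
upscaled.save("frame_0001_x4.png")
```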
Advanced Computer Vision Techniques
Explore top LinkedIn content from expert professionals.
-
𝐀𝐫𝐞 𝐲𝐨𝐮𝐫 𝐜𝐨𝐦𝐩𝐮𝐭𝐞𝐫 𝐯𝐢𝐬𝐢𝐨𝐧 𝐦𝐨𝐝𝐞𝐥𝐬 𝐟𝐚𝐥𝐥𝐢𝐧𝐠 𝐬𝐡𝐨𝐫𝐭 𝐝𝐞𝐬𝐩𝐢𝐭𝐞 𝐡𝐢𝐠𝐡 𝐚𝐜𝐜𝐮𝐫𝐚𝐜𝐲? 𝐃𝐢𝐬𝐜𝐨𝐯𝐞𝐫 𝐭𝐡𝐞 𝐡𝐢𝐝𝐝𝐞𝐧 𝐩𝐢𝐭𝐟𝐚𝐥𝐥𝐬 𝐚𝐧𝐝 𝐞𝐟𝐟𝐞𝐜𝐭𝐢𝐯𝐞 𝐬𝐭𝐫𝐚𝐭𝐞𝐠𝐢𝐞𝐬 𝐭𝐨 𝐨𝐯𝐞𝐫𝐜𝐨𝐦𝐞 𝐭𝐡𝐞𝐦. 𝐋𝐞𝐚𝐫𝐧 𝐡𝐨𝐰 𝐭𝐨 𝐭𝐚𝐜𝐤𝐥𝐞 𝐢𝐦𝐛𝐚𝐥𝐚𝐧𝐜𝐞𝐝 𝐝𝐚𝐭𝐚, 𝐦𝐢𝐬𝐥𝐞𝐚𝐝𝐢𝐧𝐠 𝐚𝐜𝐜𝐮𝐫𝐚𝐜𝐲 𝐦𝐞𝐭𝐫𝐢𝐜𝐬, 𝐚𝐧𝐝 𝐞𝐧𝐡𝐚𝐧𝐜𝐞 𝐦𝐨𝐝𝐞𝐥 𝐩𝐞𝐫𝐟𝐨𝐫𝐦𝐚𝐧𝐜𝐞 𝐰𝐢𝐭𝐡 𝐚𝐝𝐯𝐚𝐧𝐜𝐞𝐝 𝐭𝐞𝐜𝐡𝐧𝐢𝐪𝐮𝐞𝐬.

𝐈𝐦𝐛𝐚𝐥𝐚𝐧𝐜𝐞𝐝 𝐓𝐫𝐚𝐢𝐧𝐢𝐧𝐠 𝐃𝐚𝐭𝐚
→ Underrepresented classes compared to others.
→ Leads to biased models favoring the majority class.
→ Common in medical diagnosis, fraud detection, object recognition.
→ Requires resampling, data augmentation, class weight adjustment.
→ Metrics like Precision, Recall, and F1-Score are needed for evaluation.

𝐀𝐜𝐜𝐮𝐫𝐚𝐜𝐲 𝐃𝐨𝐞𝐬𝐧'𝐭 𝐀𝐥𝐰𝐚𝐲𝐬 𝐆𝐢𝐯𝐞 𝐭𝐡𝐞 𝐂𝐨𝐫𝐫𝐞𝐜𝐭 𝐈𝐧𝐬𝐢𝐠𝐡𝐭𝐬 𝐀𝐛𝐨𝐮𝐭 𝐘𝐨𝐮𝐫 𝐓𝐫𝐚𝐢𝐧𝐞𝐝 𝐌𝐨𝐝𝐞𝐥
→ Misleading with imbalanced datasets.
→ High accuracy may hide poor minority-class performance.
→ Use Precision, Recall, and F1-Score instead.
→ Confusion matrices provide a detailed performance breakdown.
→ Comprehensive evaluation ensures effectiveness across classes.

𝐏𝐞𝐫𝐟𝐨𝐫𝐦𝐚𝐧𝐜𝐞 𝐌𝐞𝐭𝐫𝐢𝐜𝐬 𝐀𝐬𝐬𝐨𝐜𝐢𝐚𝐭𝐞𝐝 𝐰𝐢𝐭𝐡 𝐋𝐚𝐛𝐞𝐥 1
→ Precision: True positives out of all positive predictions.
→ Recall: True positives out of all actual positives.
→ F1-Score: Harmonic mean of Precision and Recall.
→ Specificity: True negatives out of all actual negatives.
→ Balanced Accuracy: Average Recall across all classes.

𝐑𝐞𝐜𝐞𝐢𝐯𝐞𝐫 𝐎𝐩𝐞𝐫𝐚𝐭𝐢𝐧𝐠 𝐂𝐡𝐚𝐫𝐚𝐜𝐭𝐞𝐫𝐢𝐬𝐭𝐢𝐜 𝐄𝐱𝐩𝐥𝐚𝐢𝐧𝐞𝐝
→ ROC Curve: True Positive Rate vs. False Positive Rate.
→ AUC-ROC: Area summarizing the model's discriminative ability.
→ Threshold Selection: Impacts True Positive and False Positive Rates.
→ Interpreting the Curve: The closer to the top-left, the better the model.
→ Comparing Models: AUC-ROC allows straightforward performance comparison.

𝐌𝐮𝐥𝐭𝐢-𝐜𝐥𝐚𝐬𝐬 𝐄𝐱𝐚𝐦𝐩𝐥𝐞
→ One-vs-All Approach: Binary classification for each class.
→ Macro-Averaging: Average metrics treating all classes equally.
→ Micro-Averaging: Aggregate metrics; often favors majority classes.
→ Confusion Matrix: Visualize multi-class misclassifications.
→ Per-Class Metrics: Precision, Recall, F1-Score for each class.

𝐏𝐨𝐬𝐬𝐢𝐛𝐥𝐞 𝐒𝐨𝐥𝐮𝐭𝐢𝐨𝐧𝐬 (see the code sketch below)
→ Data Augmentation: Increase minority-class samples through transformations.
→ Resampling Techniques: Balance the dataset by oversampling or undersampling.
→ Class Weights Adjustment: Give higher importance to the minority class.
→ Advanced Algorithms: Models designed for imbalanced data, like Balanced Random Forest.
→ Ensemble Methods: Combine multiple models to improve performance.

♻️ Repost it to your network and follow Timothy Goebel for more.

#computervision #machinelearning #datascience #modelperformance #aitechniques
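To make the metric and class-weight advice concrete, here is a minimal sketch in scikit-learn: it trains a weighted classifier on a synthetic imbalanced dataset and reports precision, recall, F1, balanced accuracy, and the confusion matrix instead of plain accuracy. The dataset shape and model choice are illustrative, not a prescription.

```python
# Minimal sketch: evaluating an imbalanced classifier beyond plain accuracy.
# The synthetic dataset and logistic-regression model are illustrative choices.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (balanced_accuracy_score, classification_report,
                             confusion_matrix)
from sklearn.model_selection import train_test_split

# 95% majority class vs. 5% minority class.
X, y = make_classification(n_samples=5000, n_classes=2, weights=[0.95, 0.05],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" up-weights the minority class during training.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)
pred = clf.predict(X_te)

print(classification_report(y_te, pred, digits=3))   # precision / recall / F1 per class
print("Balanced accuracy:", balanced_accuracy_score(y_te, pred))
print(confusion_matrix(y_te, pred))                  # detailed error breakdown
```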
-
Lessons from a full day with SAM 2 on satellite imagery.

First off, what is SAM 2? It’s a zero‑shot, promptable segmentation model, meaning it can segment unseen objects out of the box, without any training on those classes, using only simple prompts like clicks, boxes, or text descriptions (what I used) to guide the process.

Why apply it to satellite imagery? SAM 2 excels at segmenting environmental features (e.g., roads, buildings, orchards) without retraining.

My top tips?
🛰️ Use high‑res imagery (30 cm–1 m/pixel) for crisp segmentation, especially for small objects.
🍃 Adjust prompts for the overhead view (e.g., "green leaves" or "shrubs" instead of "trees" - I even used "grey boxes" to find air conditioning units on top of buildings).
🚗 Small objects are detectable with careful prompting - even counting cars works.

At Wherobots we embed SAM 2 into our raster inference engine. Users write simple SQL/Python prompts with text, inference runs in parallel on tiles, and results are stored as Iceberg tables in S3. From there, you can use the vector objects that are returned just like regular geospatial data, with no special modeling needed.

SAM 2 brings zero‑shot segmentation to geospatial data, and when you combine it with prompt tuning, high‑res imagery, and distributed inference, you can pull out earth-scale insights in a day.

Would love to hear your experiences with vision models on remote sensing! 🌎

I'm Matt and I talk about modern GIS, geospatial data engineering, and how AI and geospatial are changing.

📬 Want more like this? Join 7k+ others learning from my newsletter → forrest.nyc
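For readers who want to try this on a single tile, here is a hedged sketch of box-prompted segmentation with Meta's reference sam2 package. The config/checkpoint names, tile file, and prompt coordinates are placeholders, and the parallel SQL-driven tiling described above is specific to Wherobots and not reproduced here.

```python
# Hedged sketch: prompt SAM 2 with a bounding box on one high-res satellite tile.
# Assumes Meta's reference sam2 package; config/checkpoint names are placeholders.
import numpy as np
from PIL import Image
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

predictor = SAM2ImagePredictor(build_sam2("sam2_hiera_l.yaml", "sam2_hiera_large.pt"))

tile = np.array(Image.open("tile_30cm.png").convert("RGB"))  # one high-res tile
predictor.set_image(tile)

# Rough box around a rooftop AC unit; pixel coordinates are purely illustrative.
masks, scores, _ = predictor.predict(
    box=np.array([412, 388, 468, 440]),   # x_min, y_min, x_max, y_max
    multimask_output=False,
)
print(masks.shape, scores)  # (1, H, W) boolean mask plus a confidence score
```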
-
It's W10 / Q4 / 2023 and this week in our MGMT Boston startup series we've got SportsVisio, Inc., a seed-stage startup building AI video software to help you see your (hoop) game like never before. With just a smartphone pointed at a court, this app can help players track all their stats, from points to rebounds to assists to steals and more.

Do you believe that everything happens for a reason? Founded in 2021 by veteran New England entrepreneur Jason Syversen, SportsVisio is a follow-up act. Jason founded Siege Technologies, a cybersecurity firm, bootstrapping the company and selling the business 9 years later to a private equity firm for an eight-figure exit. Jason has what the people call "relevant startup experience".

During his whirlwind of a career, life changed a lot. Jason & his wife were raising 6 kids at home and donated the majority of their post-exit assets to a foundation they created. Then, he ran for NH State Senate in 2020. He lost to the incumbent by just 1,179 votes, or 0.3% of the vote. Jason was at a crossroads. He'd already accomplished so much in his career. Maybe it was time to become a full-time investor & philanthropist through his foundation. Or, he could start another company to grow the impact he wanted to have on the world.

Life (and sports) are games of inches, and basketball players everywhere are the beneficiaries of that extremely slim margin of defeat. Jason grew up playing basketball, coached his kids throughout their childhoods (his oldest is now 22), and saw an opportunity. There isn't much technology on the court. And basketball turns out to be a venture-scale market. SportsVisio is building a technology and service to help disadvantaged schools & players get better insights out of their game.

Today, there are a couple dozen high schools & colleges (Oxford University, in fact) using the product, with players as far away as Idaho & Spain. The team is in discussions with a large gym franchise and close to signing on an NBA player to help out as an advisor. This year the company raised an additional $3M seed round with Sapphire Ventures, hoping to get the onboarding process fully automated by the end of Q1 2024. Next, they'll be gearing up for their Series A by securing their IP and growing the research team. They hope to launch a second sport in 2024 and scale to 1000s of users and dozens of customers.

Operators to Know:
- Scott Byers, Principal Software Architect
- Charlotte Corbitt, Principal Business Analyst
- Mark Scott Lichty, Director of Growth
- Miguel Gutierrez, Head of Vendor Operations
- Michael Mathieu, Senior Mobile Engineer
- Dan Oblinger, CTO
- Jack Ryan Potvin, Product Management Lead
- James Sullivan III, Senior Business Analyst

If you're interested in learning more, including the key roles SportsVisio is hiring for, the full post is linked in the comments! And sign up for the newsletter to see who we bring you next week, will you??

#startup #bostontech #sports
-
Deci's YOLO-NAS architecture provides today's state of the art in Machine Vision, specifically in the key task of Object Detection. Harpreet Sahota joins us from Deci today to detail YOLO-NAS as well as where Computer Vision is going next.

Harpreet:
• Leads the deep learning developer community at Deci AI, an Israeli startup that has raised over $55m in venture capital and that recently open-sourced the YOLO-NAS deep learning model architecture.
• Through prolific data science content creation, including The Artists of Data Science podcast and his LinkedIn live streams, Harpreet has amassed a social-media following in excess of 70,000.
• Previously worked as a lead data scientist and as a biostatistician.
• Holds a master’s in mathematics and statistics from Illinois State University.

Today’s episode will likely appeal most to technical practitioners like data scientists, but we did our best to break down technical concepts so that anyone who’d like to understand the latest in machine vision can follow along.

In the episode, Harpreet details:
• What exactly object detection is.
• How object detection models are evaluated.
• How machine vision models have evolved to excel at object detection, with an emphasis on the modern deep learning approaches.
• How a “neural architecture search” algorithm enabled Deci to develop YOLO-NAS, an optimal object detection model architecture.
• The technical approaches that will enable large architectures like YOLO-NAS to be compute-efficient enough to run on edge devices.
• His “top-down” approach to learning deep learning, including his recommended learning path.

Many thanks to Amazon Web Services (AWS), WithFeeling.AI and Modelbit for supporting this episode of SuperDataScience, enabling the show to be freely available on all major podcasting platforms and on YouTube (see comments for details).

#superdatascience #deeplearning #machinevision #machinelearning #ai
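For practitioners who want a taste of YOLO-NAS before the episode, here is a hedged sketch following the public super-gradients quickstart pattern (the model variant, confidence threshold, and image file name are illustrative; check Deci's repo for the current API).

```python
# Hedged sketch: running a pretrained YOLO-NAS model via the super-gradients package.
# Model variant and arguments follow Deci's published examples; verify against the repo.
from super_gradients.training import models

model = models.get("yolo_nas_l", pretrained_weights="coco")  # large variant, COCO weights

# Run inference on a single image and visualize the detections.
predictions = model.predict("street_scene.jpg", conf=0.5)  # placeholder image path
predictions.show()  # draws boxes, class labels, and confidences
```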
-
How do #AR and #VR headsets track hand movement? Following the release of #AppleVisionPro, let's talk about how hand tracking works. Apple is known for making things simple. But is it best to control your headset just with your hands and voice?

Two primary methods of tracking are:
- OUTSIDE-IN TRACKING (external sensors / cameras around a person)
- INSIDE-OUT TRACKING (sensors built into the headset or controllers)

Outside-in solutions, like on the #htcvive headset, give really high accuracy and whole-body pose tracking, but initial setup complexity prevents widespread application.

Inside-out solutions rely on cameras built into the headset to track infrared markers on the hand controllers. Computer vision algorithms analyze the marker patterns to find each controller's position and orientation. But cameras are SLOW. Even if you shoot at 60-120 fps, stitch together images from several cameras, and run position analysis code, you can only get updated position data a few times per second. That is not fast enough: the human eye would see lag on fast movement.

That's where #IMU sensors (accelerometers and gyroscopes) come into play. Accelerometers built into the hand controllers, like on #oculusquest, capture high-speed movement and provide motion data hundreds of times per second. IMUs are great - inexpensive and fast - but they only give information about the movement, trajectory, and speed of your hands. They don't tell the controller where exactly in 3D space your hands are.

Thus, #quest2 and many other controllers use sensor fusion: cameras capture precise hand position a few dozen times a second, and accelerometer data augments that with more data points hundreds of times a second (see the toy sketch below). That's how many headsets like #Oculus, #valve, and others work.

Apple #visionpro relies solely on cameras. Computer vision algorithms run continuously, analyzing images from 3D cameras and #Lidar to detect hands and their orientation in 3D space. An amazing piece of technology running on Apple-designed chips! It works well for general applications (like clicking icons on the screen or moving pieces in a chess game). But it would fall short in some scenarios:
- Situations where you need to track fast movements, for example playing virtual sports like baseball or tennis in virtual reality.
- Working in low-light situations.
- Detecting hands when other objects might obstruct the view.

----

Interested to know how to improve hand tracking? We've been working on a few solutions that might be used in #virtualreality and #augmentedreality applications! Excited to see more applications for AR and VR
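To make the sensor-fusion idea concrete, here is a toy 1-D complementary filter: fast but drifting IMU integration is periodically pulled toward slow but drift-free camera fixes. This is a didactic simplification, not any vendor's actual tracking pipeline; the rates, blend gain, and simulated motion are illustrative.

```python
# Toy 1-D complementary filter: fuse slow camera fixes with fast IMU integration.
# Purely illustrative -- real headsets run Kalman-style filters over full 6-DoF poses.
import numpy as np

IMU_HZ, CAM_HZ, ALPHA = 500, 30, 0.1   # illustrative update rates and blend gain
dt = 1.0 / IMU_HZ
rng = np.random.default_rng(0)

def true_pos(t):                        # pretend hand motion (1-D, metres)
    return np.sin(2 * np.pi * t)

def true_acc(t):
    return -(2 * np.pi) ** 2 * np.sin(2 * np.pi * t)

pos_est = 0.0
vel_est = 2 * np.pi                     # assume velocity bootstrapped from two camera fixes

for i in range(IMU_HZ * 2):             # simulate 2 seconds
    t = i * dt
    # Fast path: integrate noisy IMU acceleration -- accurate short-term, drifts long-term.
    acc = true_acc(t) + rng.normal(0, 0.5)
    vel_est += acc * dt
    pos_est += vel_est * dt
    # Slow path: occasional camera fix -- noisier per-sample but drift-free.
    if i % (IMU_HZ // CAM_HZ) == 0:
        cam_pos = true_pos(t) + rng.normal(0, 0.01)
        pos_est = (1 - ALPHA) * pos_est + ALPHA * cam_pos  # blend toward the camera fix

print(f"estimated {pos_est:.3f} m vs true {true_pos(t):.3f} m")
```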
-
Your AI Will See You Now: Unveiling the Visual Capabilities of Large Language Models

The frontier of AI is expanding with major advancements in vision capabilities across Large Language Models (LLMs) such as OpenAI’s ChatGPT, Google’s Gemini, and Anthropic’s Claude. These developments are transforming how AI interacts with the world, combining the power of language with the nuance of vision.

Key Highlights:
• #ChatGPTVision: OpenAI’s GPT-4V introduces image processing, expanding AI’s utility from textual to visual understanding.
• #GeminiAI: Google’s Gemini leverages multimodal integration, enhancing conversational abilities with visual data.
• #ClaudeAI: Anthropic’s Claude incorporates advanced visual processing to deliver context-rich interactions.

Why It Matters:
Integrating visual capabilities allows #AI to perform more complex tasks, revolutionizing interactions across various sectors:
• #Robots and Automation: Robots will utilize the vision part of multimodality to navigate and interact more effectively in environments from manufacturing floors to household settings.
• #Security and Identification: At airports, AI-enhanced systems can scan your face as an ID, matching your image against government databases for enhanced security and streamlined processing.
• #Healthcare Applications: In healthcare, visual AI can analyze medical imagery more accurately, aiding in early diagnosis and tailored treatment plans.

These advancements signify a monumental leap towards more intuitive, secure, and efficient AI applications, making everyday tasks easier and safer.

Engage with Us: As we continue to push AI boundaries, your insights and contributions are invaluable. Join us in shaping the future of multimodal AI. #AIRevolution #VisualAI #TechInnovation #FutureOfAI #DrGPT

🔗 Connect with me for more insights and updates on the latest trends in AI and healthcare.
🔄 Feel free to share this post and help spread the word about the transformative power of visual AI!
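As a concrete example of the vision-plus-language pattern, here is a minimal sketch of sending an image alongside a text question through OpenAI's chat API. The model name, image URL, and question are illustrative, and the payload shape follows OpenAI's published multimodal examples; treat the exact details as assumptions and check the current docs.

```python
# Hedged sketch: asking a multimodal LLM about an image via OpenAI's Python SDK.
# Model name and message format follow OpenAI's published vision examples; verify before use.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What trend or anomaly do you see in this chart?"},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/sample_chart.png"}},  # placeholder
        ],
    }],
    max_tokens=300,
)
print(response.choices[0].message.content)
```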
-
Anthropic claims that Claude 3, the company’s most recent AI release, has achieved “near-human” capabilities in various cognitive tasks. It’s a bold claim. Let’s put it in perspective. #Anthropic’s claims for #Claude3 center around its performance across a range of cognitive tasks, including reasoning, expert knowledge, mathematics, and language fluency. The company suggests that the Opus model (in particular) exhibits near-human levels of comprehension and fluency on complex tasks. This claim is supported by Claude 3 Opus outperforming #OpenAI’s GPT-4 (the underlying model that powers #ChatGPT) on 10 #AI benchmarks, including MMLU (undergraduate level knowledge), GSM8K (grade school math), HumanEval (coding), and HellaSwag (common knowledge). Despite these achievements, it’s important to note that achieving “near-human” capabilities on specific benchmarks does not equate to Claude 3 possessing general intelligence akin to human cognition. The AI research community often uses terms like “know” or “reason” to describe large language models’ capabilities, but use of these words does not imply that these models have consciousness or understanding in the human sense. This new iteration of the Claude AI model series includes three versions: Claude 3 Haiku, Claude 3 Sonnet, and Claude 3 Opus, each offering different levels of complexity and performance. The most powerful among them, Claude 3 Opus, is available through a subscription service, while Sonnet powers the Claude.ai chatbot accessible for free with an email sign-in. Claude 3’s advancements are not limited to cognitive tasks. The models demonstrate improved performance in areas like coding, understanding non-English languages, and adhering to brand voice guidelines. They also feature advanced vision capabilities, enabling them to process a wide range of visual formats, including photos, charts, graphs, and technical diagrams. This makes Claude 3 models particularly useful for applications that involve PDFs, flowcharts, or presentation slides. Anthropic says that it trained Claude 3 on both nonpublic internal and public-facing data, utilizing hardware from Amazon Web Services (AWS) and Google Cloud. They also claim the model is more accurate and less likely to hallucinate. That said, you should keep Anthropic’s claims about Claude 3’s “near-human” capabilities in perspective. Outperforming its competitors on AI benchmarks does not equate to human-like consciousness or understanding. When artificial general intelligence (AGI) is achieved, you won’t need to read my daily newsletter to get the news.
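For developers, the vision capability described above is exposed through Anthropic's Messages API. Here is a hedged sketch of sending a chart image plus a question to Claude 3 Opus; the model ID, file name, and message structure follow Anthropic's published examples, so double-check them against the current docs before relying on this.

```python
# Hedged sketch: sending an image + question to Claude 3 via Anthropic's Python SDK.
# Model ID and payload shape follow Anthropic's published vision examples; verify before use.
import base64
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

with open("quarterly_chart.png", "rb") as f:   # placeholder image file
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

message = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=500,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image",
             "source": {"type": "base64", "media_type": "image/png", "data": image_b64}},
            {"type": "text", "text": "Summarize the key trend in this chart."},
        ],
    }],
)
print(message.content[0].text)
```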
-
🚨 New Preprint Alert! Are Vision Foundation Models Ready for Out-of-the-Box Medical Image Registration? 🧠🖼️

Foundation models like SAM and DINO-v2 are making waves across computer vision, with growing interest in their zero-shot registration performance. Recent efforts like Dino-Reg show how foundation model encoders can generate domain-invariant features for tasks like abdominal CT-MRI registration, but can they handle more complex, deformable anatomy like the breast? 🤔

In our latest study, we put five popular vision foundation models to the test on one of the hardest challenges in medical image registration: breast MRI. This setting is especially tough due to:
🔹 Large variations across patients, timepoints, and lesion status
🔹 Complex, deformable internal structures like fibroglandular tissue
🔹 Domain shifts across sequences and modalities

To rigorously evaluate performance, we designed four diverse registration tasks covering key clinical and technical variations:
📌 Register longitudinal scans with the same MRI sequence
📌 Register scans taken across different dates and sequences
📌 Register lesion-present cases to baseline lesion-absent cases
📌 Register PET-CT to MRI

🧪 What did we find?
✔️ SAM performed better than traditional methods for global breast alignment, even under major domain shifts (e.g., CT → MRI)
❌ But none of the models (even those fine-tuned on medical data!) captured fine tissue deformation well
⚠️ Interestingly, in some cases, extra pretraining on MRI actually hurt performance

These results suggest that while foundation models are promising for zero-shot registration, current versions still fall short when it comes to high-precision alignment of deformable anatomy. There's still a lot to explore, especially around how domain-specific pretraining affects performance.

We hope this study offers useful insights into how foundation models can be adapted or extended for deformable medical registration in future research. As new models keep coming out, we're excited to keep testing and including them!

🧑💻 We've made the code publicly available; feel free to try it on your tasks! Always open to feedback, discussion, and collaboration 💬✨

📄 Paper: https://lnkd.in/eSZarkwu
🔗 Code: https://lnkd.in/efHnupET

Thanks to the co-authors for this work: Yaqian Chen, Nick Konz, Qihang Li and Maciej Mazurowski!

#FoundationModels #MedicalImaging #BreastMRI #ImageRegistration #ZeroShot #SAM #DINOv2 #MedAI #ComputerVision #MedicalAI
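To give a flavor of the frozen-feature idea behind zero-shot registration, here is a toy sketch: extract DINOv2 patch features from a fixed and a moving image and pick the integer patch shift that maximizes feature similarity. This is a didactic simplification of the general approach discussed above, not the study's pipeline; the torch.hub entry point is the public DINOv2 release, the inputs are placeholders, and the key name returned by forward_features is an assumption to check against the repo.

```python
# Toy sketch: estimate a coarse global shift between two images by matching
# frozen DINOv2 patch features. Didactic only -- real deformable registration
# needs dense correspondences plus a regularized deformation model.
import torch
import torch.nn.functional as F

model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")  # public DINOv2 release
model.eval()

def patch_grid(img):
    # img: (1, 3, H, W) with H == W and divisible by the 14-px patch size.
    with torch.no_grad():
        tokens = model.forward_features(img)["x_norm_patchtokens"]  # assumed output key
    side = img.shape[-1] // 14
    return F.normalize(tokens, dim=-1).reshape(1, side, side, -1)

# Placeholders standing in for preprocessed fixed/moving slices (3-channel, 224x224).
fixed, moving = torch.randn(1, 3, 224, 224), torch.randn(1, 3, 224, 224)
f_fix, f_mov = patch_grid(fixed), patch_grid(moving)

def score(dy, dx):
    # Cosine similarity between the fixed grid and a circularly shifted moving grid.
    return (f_fix * torch.roll(f_mov, shifts=(dy, dx), dims=(1, 2))).sum().item()

best = max(((dy, dx) for dy in range(-3, 4) for dx in range(-3, 4)), key=lambda s: score(*s))
print("estimated global shift (in 14-px patches):", best)
```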
-
With Tong Ding, Ming Yang (Max) Lu, Drew Williamson, Faisal Mahmood, and the rest of the Mahmood Lab at Harvard Medical School and Mass General Brigham, we are excited to present our latest work on UNI, a general-purpose self-supervised visual model for computational pathology pretrained on 100M+ images across 100K+ WSIs – the largest histology slide dataset used for visual self-supervised learning to date! 🤗

We evaluate UNI, a potential visual-centric foundation model for CPath, on 33 clinical tasks across anatomic pathology that range in diagnostic difficulty, including a challenging 108-class pan-cancer subtyping task based on the OncoTree cancer classification system.

Overall, we are excited about the future of building foundation models for CPath, and can't wait to see how these models can be adapted to build AI-SaMDs for underrepresented and rare diseases, assist in biomarker discovery, and be plugged into other research / clinical workflows! :^)

Check out our latest preprint at: https://lnkd.in/e28FPxTg

#machinelearning #computervision #ai #ml #deeplearning #pathology #healthcare #foundationmodels #cancerresearch #precisionmedicine #medicalimaging
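One common way a pathology foundation model like this gets evaluated and reused is linear probing: freeze the encoder, extract tile embeddings, and fit a simple classifier on top. The sketch below shows that pattern with a generic frozen torchvision backbone standing in for the released model; the tensors, task, and backbone are placeholders, not the UNI release specifics.

```python
# Linear-probe pattern: frozen visual encoder -> tile embeddings -> simple classifier.
# A generic torchvision backbone stands in for a pathology foundation model here.
import torch
import torchvision.models as tvm
from sklearn.linear_model import LogisticRegression

encoder = tvm.resnet50(weights=tvm.ResNet50_Weights.DEFAULT)
encoder.fc = torch.nn.Identity()           # drop the classification head, keep features
encoder.eval()

@torch.no_grad()
def embed(batch):                          # batch: (N, 3, 224, 224) normalized tiles
    return encoder(batch).cpu().numpy()

# Random tensors stand in for preprocessed histology tiles and binary labels.
tiles_train, y_train = torch.randn(64, 3, 224, 224), torch.randint(0, 2, (64,)).numpy()
tiles_val, y_val = torch.randn(16, 3, 224, 224), torch.randint(0, 2, (16,)).numpy()

probe = LogisticRegression(max_iter=1000).fit(embed(tiles_train), y_train)
print("linear-probe accuracy:", probe.score(embed(tiles_val), y_val))
```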