Object Detection and Segmentation in Computer Vision

Explore top LinkedIn content from expert professionals.

  • View profile for Sreenivas B.

    Director / Head of Digital Solutions at Zeiss

    8,024 followers

    Excited to share my in-depth #YouTube tutorial on object-level segmentation using #Detectron2 and #YOLO v8! We explore a public dataset of Nuclei from human and mouse organs, covering every step of the project: 1. Data download from Kaggle 2. Data cleanup 3. Conversion of masks to COCO JSON and YOLOv8 annotations 4. Visualization of annotations 5. Training Detectron2 (Mask R-CNN) for object detection 6. Training YOLOv8 for object detection 7. Image segmentation, object parameter calculation, and result plotting. Each task comes with downloadable code. Check out the tutorial https://lnkd.in/gEqxQCtp #bioimageanalysis #microscopy #digitalpathology #segmentation #deeplearning #computervision
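For anyone wanting to follow along with the YOLOv8 part of the workflow, a minimal sketch using the Ultralytics API might look like this (this is not the tutorial's downloadable code; the dataset YAML name and file paths are placeholders):

```python
# Minimal sketch: training an Ultralytics YOLOv8 segmentation model on a
# nuclei dataset already converted to YOLO-format annotations.
# "nuclei.yaml" and the image paths inside it are hypothetical placeholders.
from ultralytics import YOLO

# Start from a pretrained segmentation checkpoint
model = YOLO("yolov8n-seg.pt")

# Train on the converted annotations (the YAML points at train/val folders)
model.train(data="nuclei.yaml", epochs=100, imgsz=640)

# Run inference and pull per-object masks for parameter calculation
results = model("sample_nucleus_image.png")
masks = results[0].masks  # one binary mask per detected nucleus
```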

  • View profile for Santiago Valdarrama

    Computer scientist and writer. I teach hard-core Machine Learning at ml.school.

    118,421 followers

    Here is the best example you'll see all week: 3 individual models that work together to solve a problem that was near impossible months ago. The example uses the following: • A Zero-Shot Object Detection model • A Segmentation model • A Diffusion model Let me show you: We want to take an image and ask our application to replace an object in that image with something else. For example: • We show it an image of a fox. • And we tell it to replace the fox with a dog. The output should be the same image with a dog instead of a fox. To solve this problem, we can create a pipeline using the following models: • Grounding DINO for object detection • Segment Anything for segmentation • Stable Diffusion to do the painting (The full tutorial and source code is at the end of this post.) Grounding DINO is a zero-shot object detector. It's pretty nuts! You give the model a prompt and an image, and it will return bounding boxes around that object on the image. We call it "zero-shot" because it can identify objects even if they are not part of the training data. The second model in the pipeline will be @MetaAI's Segment Anything Model (SAM). It's also zero-shot and capable of segmenting objects on an image even when those objects weren't part of the training data. SAM was released earlier this year. Here is how the pipeline will work: 1. The user will write the name of an object 2. Grounding DINO will find the object and return a bounding box around it. 3. We'll input the bounding box to SAM and use it to create a mask around the object. Only one more step after this: We'll use Stable Diffusion (SD) to generate a new image within the mask coming from SAM. This is called "inpainting." But we can also do "outpainting" and generate a new background around the mask. This completes the architecture of the pipeline: Grounding DINO + SAM + SD. For the source code and a full explanation, take a look at this tutorial, courtesy of @Cometml and @anmorgan2414: https://lnkd.in/e3rdchzQ You will learn: 1. How to use 3 state-of-the-art models 2. How to combine every piece together 3. How to track everything using @Cometml
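For readers who want to experiment before reading the tutorial, here is a rough sketch of the three stages using the Hugging Face transformers and diffusers integrations. Model IDs, thresholds, and post-processing calls are illustrative and version-dependent; the linked tutorial is the authoritative version.

```python
# Rough sketch of the Grounding DINO -> SAM -> Stable Diffusion pipeline
# described above. Model IDs and post-processing arguments are illustrative
# and may differ between library versions.
import torch
from PIL import Image
from transformers import (AutoProcessor, AutoModelForZeroShotObjectDetection,
                          SamModel, SamProcessor)
from diffusers import StableDiffusionInpaintPipeline

image = Image.open("fox.jpg").convert("RGB")

# 1) Grounding DINO: text prompt -> bounding box
det_id = "IDEA-Research/grounding-dino-tiny"
det_processor = AutoProcessor.from_pretrained(det_id)
detector = AutoModelForZeroShotObjectDetection.from_pretrained(det_id)
det_inputs = det_processor(images=image, text="a fox.", return_tensors="pt")
with torch.no_grad():
    det_outputs = detector(**det_inputs)
det_results = det_processor.post_process_grounded_object_detection(
    det_outputs, det_inputs.input_ids, target_sizes=[image.size[::-1]])[0]
box = det_results["boxes"][0].tolist()  # [x0, y0, x1, y1] around the fox

# 2) SAM: bounding box -> pixel mask
sam_processor = SamProcessor.from_pretrained("facebook/sam-vit-base")
sam = SamModel.from_pretrained("facebook/sam-vit-base")
sam_inputs = sam_processor(image, input_boxes=[[box]], return_tensors="pt")
with torch.no_grad():
    sam_outputs = sam(**sam_inputs)
masks = sam_processor.image_processor.post_process_masks(
    sam_outputs.pred_masks, sam_inputs["original_sizes"],
    sam_inputs["reshaped_input_sizes"])
mask = Image.fromarray(masks[0][0, 0].numpy().astype("uint8") * 255)

# 3) Stable Diffusion inpainting: repaint the masked region as a dog
inpaint = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting")
result = inpaint(prompt="a dog", image=image, mask_image=mask).images[0]
result.save("dog.png")
```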

  • View profile for Ahsen Khaliq

    ML @ Hugging Face

    35,609 followers

    Segment Anything Meets Point Tracking paper page: https://lnkd.in/gZyGVxXe The Segment Anything Model (SAM) has established itself as a powerful zero-shot image segmentation model, employing interactive prompts such as points to generate masks. This paper presents SAM-PT, a method extending SAM's capability to tracking and segmenting anything in dynamic videos. SAM-PT leverages robust and sparse point selection and propagation techniques for mask generation, demonstrating that a SAM-based segmentation tracker can yield strong zero-shot performance across popular video object segmentation benchmarks, including DAVIS, YouTube-VOS, and MOSE. Compared to traditional object-centric mask propagation strategies, we uniquely use point propagation to exploit local structure information that is agnostic to object semantics. We highlight the merits of point-based tracking through direct evaluation on the zero-shot open-world Unidentified Video Objects (UVO) benchmark. To further enhance our approach, we utilize K-Medoids clustering for point initialization and track both positive and negative points to clearly distinguish the target object. We also employ multiple mask decoding passes for mask refinement and devise a point re-initialization strategy to improve tracking accuracy.
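To make the point-initialization idea concrete, here is a rough illustration (not the authors' SAM-PT code) of picking positive query points from a first-frame mask with K-Medoids and negative points from the background, assuming the scikit-learn-extra package for `KMedoids`:

```python
# Illustration of the initialization step described in the abstract: K-Medoids
# centers of the foreground pixels become positive query points, random
# background pixels become negative points. Not the authors' implementation.
import numpy as np
from sklearn_extra.cluster import KMedoids

def init_query_points(mask: np.ndarray, n_pos: int = 8, n_neg: int = 4,
                      seed: int = 0):
    """mask: (H, W) boolean first-frame segmentation of the target object."""
    rng = np.random.default_rng(seed)

    # Positive points: K-Medoids centers land on actual object pixels and
    # spread across the object's local structure.
    fg = np.argwhere(mask)  # (N, 2) array of (y, x) coordinates
    if len(fg) > 5000:      # subsample so K-Medoids stays tractable
        fg = fg[rng.choice(len(fg), 5000, replace=False)]
    pos = KMedoids(n_clusters=n_pos, random_state=seed).fit(fg).cluster_centers_

    # Negative points: background pixels used to tell SAM what the target
    # object is *not* when prompting each frame.
    bg = np.argwhere(~mask)
    neg = bg[rng.choice(len(bg), size=n_neg, replace=False)]

    return pos, neg  # both sets are then propagated by a point tracker
```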

  • View profile for Sai Shreyas Bhavanasi

    ML Engineer | 3x 1st author ML publications

    3,884 followers

If data annotation is holding you back, you should check out GroundingDINO. GroundingDINO is a zero-shot object detection model that allows you to search for an object in an image using natural language queries. In my experience, it has worked much better than even a fine-tuned Detectron2. To illustrate its power, in the image below, I searched “black dog” and it was able to annotate just that while ignoring the other dogs, all without fine-tuning. How it works: It produces an embedding for the input text using a language model such as BERT and embeddings for the image using a vision transformer backbone. It uses the language embedding to identify the relevant parts of the image, then combines the two embeddings and uses a decoder to make the final object detection prediction. Some interesting applications using this model: - Combine with segmentation models and diffusion models to edit images using natural language - Assist with labeling objects by prepopulating bounding boxes - Enhanced image search: e.g., find all images with a person wearing a brown hat - Better object counting with more intuitive labels: e.g., how many red cars are in this image? #machinelearning #computervision #objectdetection
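As a concrete starting point for the labeling-assist use case, a minimal sketch using the Hugging Face Grounding DINO integration might look like the following; the model ID, thresholds, and file name are illustrative:

```python
# Minimal sketch: pre-populate bounding boxes for labeling with Grounding DINO
# via Hugging Face transformers. Grounding DINO expects lower-cased phrases
# ending in a period, e.g. "black dog."
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

model_id = "IDEA-Research/grounding-dino-base"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id)

image = Image.open("dogs.jpg").convert("RGB")
inputs = processor(images=image, text="black dog.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

results = processor.post_process_grounded_object_detection(
    outputs, inputs.input_ids, target_sizes=[image.size[::-1]])[0]

# Emit candidate annotations for a human to review instead of drawing
# every box from scratch.
for box, score in zip(results["boxes"], results["scores"]):
    print({"bbox": [round(v, 1) for v in box.tolist()], "score": float(score)})
```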

  • View profile for Matt Forrest
Matt Forrest is an Influencer

    🌎 Helping geospatial professionals grow using technology · Scaling geospatial at Wherobots

    63,945 followers

Lessons from a full day with SAM 2 on satellite imagery. First off, what is SAM 2? It's a zero-shot, promptable segmentation model, meaning it can segment unseen objects out of the box, without any training on those classes, using only simple prompts like clicks, boxes, or text descriptions (what I used) to guide the process. Why apply it to satellite imagery? SAM 2 excels at segmenting environmental features (e.g., roads, buildings, orchards) without retraining. My top tips? 🛰️ Use high-res imagery (30 cm–1 m/pixel) for crisp segmentation, especially for small objects. 🍃 Adjust prompts for the overhead view (e.g., "green leaves" or "shrubs" instead of "trees" - I even used "grey boxes" to find air conditioning units on top of buildings) 🚗 Small objects are detectable with careful prompting; even counting cars works. At Wherobots we embed SAM 2 into our raster inference engine. Users write simple SQL/Python prompts with text, inference runs in parallel on tiles, and results are stored as Iceberg tables in S3. From there, you can use the vector objects that are returned just like regular geospatial data, with no special modeling needed. SAM 2 brings zero-shot segmentation to geospatial data, and when you combine it with prompt tuning, high-res imagery, and distributed inference, you can pull out earth-scale insights in a day. Would love to hear your experiences with vision models on remote sensing! 🌎 I'm Matt and I talk about modern GIS, geospatial data engineering, and how AI and geospatial are changing. 📬 Want more like this? Join 7k+ others learning from my newsletter → forrest.nyc
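For experimenting locally, here is a rough sketch of prompting SAM 2 on a single imagery tile with a box prompt via Meta's open-source sam2 package. The checkpoint/config names, file names, and coordinates are placeholders, exact entry points may differ by release, and this is not Wherobots' raster inference engine:

```python
# Rough sketch: prompt SAM 2 on one high-res satellite tile with a box prompt.
# Config/checkpoint paths and the box coordinates are illustrative only.
import numpy as np
from PIL import Image
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

predictor = SAM2ImagePredictor(
    build_sam2("sam2_hiera_l.yaml", "checkpoints/sam2_hiera_large.pt"))

tile = np.array(Image.open("tile_30cm.tif").convert("RGB"))  # one tile
predictor.set_image(tile)

# Prompt with a rough box around a candidate rooftop (pixel coordinates);
# in practice the boxes come from an upstream detector or user clicks.
masks, scores, _ = predictor.predict(
    box=np.array([120, 80, 260, 210]), multimask_output=False)
print(masks.shape, scores)  # (1, H, W) binary mask plus a confidence score
```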

  • View profile for Zhicheng Yan

    Senior Staff Research Scientist, Meta Reality Labs

    6,007 followers

Before heading to Paris, I'd like to introduce another ICCV'23 work **ZeroSeg** from my team. Open-vocabulary semantic segmentation usually requires a good amount of annotated data to bootstrap the development of such ML models. On the other hand, recent progress on large vision-language models, such as #CLIP and #Flamingo, has been very exciting. We demonstrate that it is feasible to build an open-vocabulary semantic segmentation model without human annotation, by distilling the rich visual knowledge in a pre-trained CLIP vision encoder. **Title** Exploring Open-Vocabulary Semantic Segmentation from CLIP Vision Encoder Distillation Only **Arxiv** https://lnkd.in/gVrq82Wx **Abstract** Semantic segmentation is a crucial task in computer vision that involves segmenting images into semantically meaningful regions at the pixel level. However, existing approaches often rely on expensive human annotations as supervision for model training, limiting their scalability to large, unlabeled datasets. To address this challenge, we present ZeroSeg, a novel method that leverages an existing pretrained vision-language (VL) model (e.g. CLIP) to train open-vocabulary zero-shot semantic segmentation models. Although these VL models have acquired extensive knowledge of visual concepts, it is non-trivial to exploit that knowledge for the task of semantic segmentation, as they are usually trained at the image level. ZeroSeg overcomes this by distilling the visual concepts learned by VL models into a set of segment tokens, each summarizing a localized region of the target image. We evaluate ZeroSeg on multiple popular segmentation benchmarks, including PASCAL VOC 2012, PASCAL Context, and COCO, in a zero-shot manner (i.e., no training or adaptation on target segmentation datasets). Our approach achieves state-of-the-art performance when compared to other zero-shot segmentation methods under the same training data, while also performing competitively compared to strongly supervised methods. Finally, we also demonstrate the effectiveness of ZeroSeg on open-vocabulary segmentation, through both human studies and qualitative visualizations.
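To make the image-level limitation concrete, here is a deliberately naive baseline (not ZeroSeg's distillation approach) that scores class-agnostic region crops against free-form class names with an off-the-shelf CLIP model; the model ID and class list are illustrative:

```python
# Naive open-vocabulary region labeling with image-level CLIP: given a crop
# from any class-agnostic region proposal method, pick the best-matching
# free-form class name. This illustrates the knowledge ZeroSeg distills,
# not the ZeroSeg method itself.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
class_names = ["a photo of a dog", "a photo of a tree", "a photo of a car"]

def label_region(crop: Image.Image) -> str:
    inputs = processor(text=class_names, images=crop,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # (1, num_classes)
    return class_names[int(logits.argmax())]
```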

  • View profile for Ashish Bhatia

    AI Product Leader | GenAI Agent Platforms | Evaluation Frameworks | Responsible AI Adoption | Ex-Microsoft, Nokia

    15,443 followers

Last week Microsoft's Azure AI team dropped the paper for Florence-2: the new version of their foundation computer vision model. This is a significant advancement in computer vision and a clear step up from the original Florence model. 📥 Dataset: Florence-2 has the ability to interpret and understand images comprehensively. Where the original Florence excelled at specific tasks, Florence-2 is adept at multitasking. It's been trained on the extensive FLD-5B dataset, encompassing a total of 5.4B comprehensive annotations across 126M images, enhancing its ability to handle a diverse range of visual tasks such as object detection, image captioning, and semantic segmentation with increased depth and versatility. 📊 Multi-Task Capability: Florence-2's multitasking efficiency is powered by a unified, prompt-based representation. This means it can perform various vision tasks using simple text prompts, a shift from the original Florence model's more task-specific approach. 🤖 Vision and Language Integration: Similar to GPT-4's Vision model, Florence-2 integrates vision and language processing. This integration is facilitated by its sequence-to-sequence architecture, similar to models used in natural language processing but adapted for visual content. 👁️ Practical Applications: Florence-2's capabilities can enhance autonomous vehicle systems' environmental understanding, aid in medical imaging for more accurate diagnoses, support surveillance, and more. Its ability to process and understand visual data on a granular level opens up new avenues in AI-driven analysis and automation. Florence-2 offers a glimpse into the future of visual data processing. Its approach to handling diverse visual tasks and its integration of large-scale datasets for training set it apart as a significant development in computer vision. Paper: https://lnkd.in/deUQf9NG Researchers: Ce Liu, Bin Xiao, Haiping Wu, Weijian Xu, Xiyang Dai, Houdong Hu, Yumao Lu, Michael Zeng, Lu Yuan #Microsoft #AzureAI #Florence #computervision #foundationmodels
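Following the public Hugging Face model card, a short sketch of the prompt-based multitasking might look like this; the task tokens and generation settings are taken from the model card and may change between releases:

```python
# Sketch: the same Florence-2 weights handle detection and captioning purely
# by switching the task token in the text prompt. Model ID and task tokens
# follow the public Hugging Face model card.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Florence-2-base"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("street.jpg").convert("RGB")

def run_task(task_prompt: str):
    inputs = processor(text=task_prompt, images=image, return_tensors="pt")
    generated = model.generate(input_ids=inputs["input_ids"],
                               pixel_values=inputs["pixel_values"],
                               max_new_tokens=512)
    text = processor.batch_decode(generated, skip_special_tokens=False)[0]
    return processor.post_process_generation(
        text, task=task_prompt, image_size=image.size)

print(run_task("<OD>"))       # object detection: boxes + labels
print(run_task("<CAPTION>"))  # image captioning from the same weights
```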

  • View profile for Rob Sloan

    Creative Technologist & CEO | ICVFX × Radiance Fields × Digital Twins • Husband, Father, & Grad School Professor

    22,143 followers

    🖲️ The evolution of Meta's Segment Anything Model (SAM) continues... Here it is applied to object tracking for clean segmentation of elements within a scene. Once this combines with some additional tracking tools and gets embedded into compositing software... you're going to see so many compositors freed up to do more creative work! "The Segment Anything Model (SAM) has established itself as a powerful zero-shot image segmentation model, employing interactive prompts such as points to generate masks. This paper presents SAM-PT, a method extending SAM’s capability to tracking and segmenting anything in dynamic videos. SAM-PT leverages robust and sparse point selection and propagation techniques for mask generation, demonstrating that a SAM-based segmentation tracker can yield strong zero-shot performance across popular video object segmentation benchmarks, including DAVIS, YouTube-VOS, and MOSE. Compared to traditional object-centric mask propagation strategies, we uniquely use point propagation to exploit local structure information that is agnostic to object semantics." - via ETH Zürich Project Page: https://lnkd.in/ezwiJsUr arXiv: https://lnkd.in/ebQZMizM GitHub: https://lnkd.in/eJGEkdQK (code release later this month; license unknown) For more like this ⤵️ 👉 Follow Orbis Tabula // Digital Twins • Reality Capture • Generative AI #segmentanythingmodel #segmentation #tracking

  • View profile for Jon Krohn
Jon Krohn is an Influencer

    Co-Founder of Y Carrot 🥕 Fellow at Lightning A.I. ⚡️ SuperDataScience Host 🎙️

    41,658 followers

    Deci's YOLO-NAS architecture provides today's state of the art in Machine Vision, specifically the key task of Object Detection. Harpreet Sahota joins us from Deci today to detail YOLO-NAS as well as where Computer Vision is going next. Harpreet: • Leads the deep learning developer community at Deci AI, an Israeli startup that has raised over $55m in venture capital and that recently open-sourced the YOLO-NAS deep learning model architecture. • Through prolific data science content creation, including The Artists of Data Science podcast and his LinkedIn live streams, Harpreet has amassed a social-media following in excess of 70,000 followers. • Previously worked as a lead data scientist and as a biostatistician. • Holds a master’s in mathematics and statistics from Illinois State University. Today’s episode will likely appeal most to technical practitioners like data scientists, but we did our best to break down technical concepts so that anyone who’d like to understand the latest in machine vision can follow along. In the episode, Harpreet details: • What exactly object detection is. • How object detection models are evaluated. • How machine vision models have evolved to excel at object detection, with an emphasis on the modern deep learning approaches. • How a “neural architecture search” algorithm enabled Deci to develop YOLO-NAS, an optimal object detection model architecture. • The technical approaches that will enable large architectures like YOLO-NAS to be compute-efficient enough to run on edge devices. • His “top-down” approach to learning deep learning, including his recommended learning path. Many thanks to Amazon Web Services (AWS), WithFeeling.AI and Modelbit for supporting this episode of SuperDataScience, enabling the show to be freely available on all major podcasting platforms and on YouTube (see comments for details). #superdatascience #deeplearning #machinevision #machinelearning #ai
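Since the episode covers how object detection models are evaluated, here is a tiny sketch of intersection-over-union (IoU), the overlap measure that mAP is built on; the 0.5 threshold mentioned in the comment is the conventional choice, not anything specific to YOLO-NAS:

```python
# Intersection-over-union between two axis-aligned boxes. A prediction
# typically counts as a true positive when its IoU with a ground-truth box
# exceeds a threshold such as 0.5; mAP aggregates these matches over
# confidence thresholds and classes.
def iou(box_a, box_b):
    """Boxes are (x0, y0, x1, y1) in pixels."""
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 100, 100), (50, 50, 150, 150)))  # ~0.143
```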

  • View profile for Paras Varshney

    Sr. Data Scientist, Fidelity | Kaggle Master | Global Data Science Ambassador HP

    11,843 followers

    🎥 Lights, Camera, Segmentation! SAM-PT: Unraveling the Video Segmentation! 📹🎩 Lately, I was working on a video segmentation project but was struggling with inconsistent video segmentation performance on unseen data with traditional open-source libraries like OpenCV and Mask R-CNN. Recently I came across SAM-PT, which extends the powerful SAM model to track and segment anything in dynamic videos. 📹💡 No need for extensive training data - SAM-PT achieves impressive zero-shot results by leveraging robust point tracking with state-of-the-art point trackers like PIPS and sparse point selection techniques. It intelligently prompts SAM with these points to produce per-frame segmentation masks, allowing for precise tracking of diverse objects in various video environments. 🎯💡 Check out the SAM-PT code on GitHub: https://buff.ly/3pYSxY6 Link to the paper: https://buff.ly/3K7oAfh #datascience #ai #computervision #videosegmentation