diff --git a/CogVLM/.deepspeed_env b/CogVLM/.deepspeed_env
deleted file mode 100644
index 9078efcc..00000000
--- a/CogVLM/.deepspeed_env
+++ /dev/null
@@ -1,2 +0,0 @@
-SAT_HOME=~/.sat_models
-LOCAL_WORLD_SIZE=8
\ No newline at end of file
diff --git a/CogVLM/.gitignore b/CogVLM/.gitignore
deleted file mode 100644
index 768ee34d..00000000
--- a/CogVLM/.gitignore
+++ /dev/null
@@ -1,12 +0,0 @@
-.hypothesis/
-__pycache__
-output.png
-fewshot-data/
-checkpoints/
-records.db
-server.py
-examples/*grounding.png
-archive*
-hostfile
-runs/
-*.idea/
\ No newline at end of file
diff --git a/CogVLM/CogAgent_README.md b/CogVLM/CogAgent_README.md
deleted file mode 100644
index 3d16006d..00000000
--- a/CogVLM/CogAgent_README.md
+++ /dev/null
@@ -1,673 +0,0 @@

# Design2Code Finetuning

We keep a snapshot of the CogAgent repo here, which we use for finetuning and inference.

The codebase is based on SwissArmyTransformer version 0.4.10.

The finetuning script is [finetune_cogagent_lora_design2code.sh](finetune_demo/finetune_cogagent_lora_design2code.sh).

Note that the LoRA modules are added only to the language decoder.

We provide [an example inference script](finetune_demo/inference_design2code.py) and the [model weights](https://huggingface.co/SALT-NLP/Design2Code).

# CogVLM & CogAgent

📗 [中文版README](./README_zh.md)

🔥🔥🔥 🆕 ```2023/12/26```: We have released the [CogVLM-SFT-311K](dataset.md) dataset, which contains over 150,000 pieces of data that we used for **CogVLM v1.0 only** training. You are welcome to follow and use it.

🌟 **Jump to detailed introduction: [Introduction to CogVLM](#introduction-to-cogvlm), 🆕 [Introduction to CogAgent](#introduction-to-cogagent)**

📔 For more detailed usage information, please refer to: [CogVLM & CogAgent's technical documentation (in Chinese)](https://zhipu-ai.feishu.cn/wiki/LXQIwqo1OiIVTykMh9Lc3w1Fn7g)
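To make the note about LoRA modules concrete, here is a minimal NumPy sketch of low-rank adaptation applied to a single linear layer. This is illustrative only: the dimensions are made up, and it is not the repo's actual implementation (which lives in SwissArmyTransformer mixins); it just shows what LoRA finetuning trains.

```python
import numpy as np

# Low-rank adaptation (LoRA) sketch: a frozen weight W is augmented
# with a trainable low-rank update B @ A, so the adapted forward pass
# is y = x @ (W + B @ A).T. Only A and B are updated during finetuning.
rng = np.random.default_rng(0)
d_in, d_out, rank = 16, 16, 4  # hypothetical dimensions

W = rng.standard_normal((d_out, d_in))        # frozen base weight (256 params)
A = rng.standard_normal((rank, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, rank))                   # trainable up-projection, zero-init

x = rng.standard_normal((2, d_in))            # dummy input batch

# With B zero-initialized, the LoRA branch contributes nothing, so the
# adapted layer starts out identical to the frozen base layer.
base = x @ W.T
adapted = x @ (W + B @ A).T
assert np.allclose(base, adapted)

# LoRA trains only d_out*rank + rank*d_in = 128 params here,
# half the size of the 16x16 base weight.
```

The same idea, restricted to the language decoder's linear layers, is what the finetuning script above configures.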
**CogVLM** 📖 Paper: CogVLM: Visual Expert for Pretrained Language Models

CogVLM is a powerful open-source visual language model (VLM). CogVLM-17B has 10 billion visual parameters and 7 billion language parameters, and supports image understanding and multi-turn dialogue at a resolution of 490×490.

CogVLM-17B achieves state-of-the-art performance on 10 classic cross-modal benchmarks, including NoCaps, Flickr30k captioning, RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GQA, ScienceQA, VizWiz VQA, and TDIUC.
**CogAgent** 📖 Paper: CogAgent: A Visual Language Model for GUI Agents

CogAgent is an open-source visual language model improved upon CogVLM. CogAgent-18B has 11 billion visual parameters and 7 billion language parameters, and supports image understanding at a resolution of 1120×1120. On top of CogVLM's capabilities, it further possesses GUI-agent capabilities.

CogAgent-18B achieves state-of-the-art generalist performance on 9 classic cross-modal benchmarks, including VQAv2, OK-VQA, TextVQA, ST-VQA, ChartQA, InfoVQA, DocVQA, MM-Vet, and POPE. It significantly surpasses existing models on GUI operation datasets, including AITW and Mind2Web.
🌐 Web Demo for both CogVLM and CogAgent: this link
| Method | LLM | MM-VET | POPE (adversarial) | TouchStone |
| --- | --- | --- | --- | --- |
| BLIP-2 | Vicuna-13B | 22.4 | - | - |
| Otter | MPT-7B | 24.7 | - | - |
| MiniGPT4 | Vicuna-13B | 24.4 | 70.4 | 531.7 |
| InstructBLIP | Vicuna-13B | 25.6 | 77.3 | 552.4 |
| LLaMA-Adapter v2 | LLaMA-7B | 31.4 | - | 590.1 |
| LLaVA | LLaMA2-7B | 28.1 | 66.3 | 602.7 |
| mPLUG-Owl | LLaMA-7B | - | 66.8 | 605.4 |
| LLaVA-1.5 | Vicuna-13B | 36.3 | 84.5 | - |
| Emu | LLaMA-13B | 36.3 | - | - |
| Qwen-VL-Chat | - | - | - | 645.2 |
| DreamLLM | Vicuna-7B | 35.9 | 76.5 | - |
| CogVLM | Vicuna-7B | 52.8 | 87.6 | 742.0 |
| Model | RefCOCO val | RefCOCO testA | RefCOCO testB | RefCOCO+ val | RefCOCO+ testA | RefCOCO+ testB | RefCOCOg val | RefCOCOg test | Visual7W test |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| cogvlm-grounding-generalist | 92.51 | 93.95 | 88.73 | 87.52 | 91.81 | 81.43 | 89.46 | 90.09 | 90.96 |
| cogvlm-grounding-generalist-v1.1 | **92.76** | **94.75** | **88.99** | **88.68** | **92.91** | **83.39** | **89.75** | **90.79** | **91.05** |
Scan the QR code to follow the official account and join the "ChatGLM Discussion Group".