Let's explore the potential of world modeling on MindSpore ;)
Note: The interactive UI below is better rendered in VS Code.
The MindSpore implementation of Janus-Pro training/inference is now released, supporting both multimodal understanding and visual generation on Ascend NPUs. Decoupling visual encoding for generation-specific and understanding-specific tasks brings omni-capability. Details can be found here.
MVDream is a diffusion model that is able to generate consistent multiview images from a given text prompt. It shows that, by learning from both 2D and 3D data, a multiview diffusion model can achieve the generalizability of 2D diffusion models and the consistency of 3D renderings. Details can be found here.
| Input Prompt | Rendered MView Video | 3D Mesh Generation in Color |
|---|---|---|
| an astronaut riding a horse | ast.mp4 | <iframe title="an astronaut riding a horse_ms" frameborder="0" allowfullscreen mozallowfullscreen="true" webkitallowfullscreen="true" allow="autoplay; fullscreen; xr-spatial-tracking" xr-spatial-tracking execution-while-out-of-viewport execution-while-not-rendered web-share src="https://sketchfab.com/models/2191db5b61834839aac5238f60d70e59/embed"> </iframe> |
| Michelangelo style statue of dog reading news on a cellphone | mich.mp4 | <iframe title="Michelangelo style statue of dog reading news_ms" frameborder="0" allowfullscreen mozallowfullscreen="true" webkitallowfullscreen="true" allow="autoplay; fullscreen; xr-spatial-tracking" xr-spatial-tracking execution-while-out-of-viewport execution-while-not-rendered web-share src="https://sketchfab.com/models/c21773f276884a5db7d47e41926645e4/embed"> </iframe> |
These videos are rendered from the trained 3D implicit field in our MVDream model. Color meshes are extracted with the script MVDream-threestudio/extract_color_mesh.py.
We support InstantMesh for 3D mesh generation from multiview images produced by the SV3D pipeline. Using these multiview images as input, we extracted the 3D meshes shown below. The corresponding inputs are illustrated in the SV3D pipeline linked below.
| akun | anya |
|---|---|
| <iframe title="akun_ms" frameborder="0" allowfullscreen mozallowfullscreen="true" webkitallowfullscreen="true" allow="autoplay; fullscreen; xr-spatial-tracking" xr-spatial-tracking execution-while-out-of-viewport execution-while-not-rendered web-share src="https://sketchfab.com/models/c8b5b475529d48589b85746aab638d2b/embed"></iframe> | <iframe title="anya_ms" frameborder="0" allowfullscreen mozallowfullscreen="true" webkitallowfullscreen="true" allow="autoplay; fullscreen; xr-spatial-tracking" xr-spatial-tracking execution-while-out-of-viewport execution-while-not-rendered web-share src="https://sketchfab.com/models/180fd247ba2f4437ac665114a4cd4dca/embed"></iframe> |
The illustrations here are better viewed in viewers with HTML support (e.g., the VS Code built-in viewer).
Output Multiview Images (21x576x576)
A camera-guided diffusion model that can generate a multiview snippet from a given image! Details can be found here.
More Inference Demos
| Input | Output |
|---|---|
To install MindONE v0.3.0, please install MindSpore 2.5.0 and run `pip install mindone`.
Alternatively, to install the latest version from the master branch, please run:
```shell
git clone https://github.com/mindspore-lab/mindone.git
cd mindone
pip install -e .
```
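After installation, you can verify that MindSpore itself is set up correctly before trying any MindONE example. The snippet below is a minimal sketch using MindSpore's built-in `run_check()` utility; it only checks the MindSpore backend and does not exercise MindONE.

```python
import mindspore

# Print the installed MindSpore version (2.5.0 is expected for MindONE v0.3.0).
print(mindspore.__version__)

# MindSpore's built-in installation check: compiles and executes a tiny
# computation on the available backend and reports success or failure.
mindspore.run_check()
```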
We support state-of-the-art diffusion models for generating images, audio, and video. Let's get started using Stable Diffusion 3 as an example.
Hello MindSpore from Stable Diffusion 3!
```python
import mindspore
from mindone.diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    mindspore_dtype=mindspore.float16,
)
prompt = "A cat holding a sign that says 'Hello MindSpore'"
image = pipe(prompt)[0][0]
image.save("sd3.png")
```

- mindone diffusers is under active development; most tasks were tested with MindSpore 2.5.0 on Ascend Atlas 800T A2 machines.
- compatible with hf diffusers 0.32.2
| component | features |
|---|---|
| pipeline | 160+ pipelines supporting text-to-image, text-to-video, and text-to-audio tasks |
| models | 50+ base models (autoencoders & transformers), same as hf diffusers |
| schedulers | 35+ diffusion schedulers (e.g., DDPM and DPM-Solver), same as hf diffusers |
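Since the pipeline and scheduler classes mirror hf diffusers, swapping schedulers follows the same pattern as upstream. The sketch below is illustrative: it assumes `DiffusionPipeline` and `DPMSolverMultistepScheduler` are exported from `mindone.diffusers` under their hf diffusers names, and the SDXL checkpoint is just an example model.

```python
import mindspore
from mindone.diffusers import DiffusionPipeline, DPMSolverMultistepScheduler

# Load an SDXL text-to-image pipeline in half precision.
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    mindspore_dtype=mindspore.float16,
)

# Replace the default scheduler with DPM-Solver, reusing the existing config.
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

# Outputs follow the same tuple convention as the Stable Diffusion 3 example above.
image = pipe("a watercolor lighthouse at dawn", num_inference_steps=25)[0][0]
image.save("sdxl_dpm_solver.png")
```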
| task | model | inference | finetune | pretrain | institute |
|---|---|---|---|---|---|
| Image-to-Video | hunyuanvideo-i2v 🔥🔥 | ✅ | ✖️ | ✖️ | Tencent |
| Text/Image-to-Video | wan2.1 🔥🔥🔥 | ✅ | ✖️ | ✖️ | Alibaba |
| Text-to-Image | cogview4 🔥🔥🔥 | ✅ | ✖️ | ✖️ | Zhipuai |
| Text-to-Video | step_video_t2v 🔥🔥 | ✅ | ✖️ | ✖️ | StepFun |
| Image-Text-to-Text | qwen2_vl 🔥🔥🔥 | ✅ | ✖️ | ✖️ | Alibaba |
| Any-to-Any | janus 🔥🔥🔥 | ✅ | ✅ | ✅ | DeepSeek |
| Any-to-Any | emu3 🔥🔥 | ✅ | ✅ | ✅ | BAAI |
| Class-to-Image | var 🔥🔥 | ✅ | ✅ | ✅ | ByteDance |
| Text/Image-to-Video | hpcai open sora 1.2/2.0 🔥🔥 | ✅ | ✅ | ✅ | HPC-AI Tech |
| Text/Image-to-Video | cogvideox 1.5 5B~30B 🔥🔥 | ✅ | ✅ | ✅ | Zhipu |
| Text-to-Video | open sora plan 1.3 🔥🔥 | ✅ | ✅ | ✅ | PKU |
| Text-to-Video | hunyuanvideo 🔥🔥 | ✅ | ✅ | ✅ | Tencent |
| Text-to-Video | movie gen 30B 🔥🔥 | ✅ | ✅ | ✅ | Meta |
| Video-Encode-Decode | magvit | ✅ | ✅ | ✅ | |
| Text-to-Image | story_diffusion | ✅ | ✖️ | ✖️ | ByteDance |
| Image-to-Video | dynamicrafter | ✅ | ✖️ | ✖️ | Tencent |
| Video-to-Video | venhancer | ✅ | ✖️ | ✖️ | Shanghai AI Lab |
| Text-to-Video | t2v_turbo | ✅ | ✅ | ✅ | |
| Image-to-Video | svd | ✅ | ✅ | ✅ | Stability AI |
| Text-to-Video | animate diff | ✅ | ✅ | ✅ | CUHK |
| Text/Image-to-Video | video composer | ✅ | ✅ | ✅ | Alibaba |
| Text-to-Image | flux 🔥 | ✅ | ✅ | ✖️ | Black Forest Lab |
| Text-to-Image | stable diffusion 3 🔥 | ✅ | ✅ | ✖️ | Stability AI |
| Text-to-Image | kohya_sd_scripts | ✅ | ✅ | ✖️ | kohya |
| Text-to-Image | stable diffusion xl | ✅ | ✅ | ✅ | Stability AI |
| Text-to-Image | stable diffusion | ✅ | ✅ | ✅ | Stability AI |
| Text-to-Image | hunyuan_dit | ✅ | ✅ | ✅ | Tencent |
| Text-to-Image | pixart_sigma | ✅ | ✅ | ✅ | Huawei |
| Text-to-Image | fit | ✅ | ✅ | ✅ | Shanghai AI Lab |
| Class-to-Video | latte | ✅ | ✅ | ✅ | Shanghai AI Lab |
| Class-to-Image | dit | ✅ | ✅ | ✅ | Meta |
| Text-to-Image | t2i-adapter | ✅ | ✅ | ✅ | Shanghai AI Lab |
| Text-to-Image | ip adapter | ✅ | ✅ | ✅ | Tencent |
| Text-to-3D | mvdream | ✅ | ✅ | ✅ | ByteDance |
| Image-to-3D | instantmesh | ✅ | ✅ | ✅ | Tencent |
| Image-to-3D | sv3d | ✅ | ✅ | ✅ | Stability AI |
| Text/Image-to-3D | hunyuan3d-1.0 | ✅ | ✅ | ✅ | Tencent |
| task | model | inference | finetune | pretrain | features |
|---|---|---|---|---|---|
| Image-Text-to-Text | pllava 🔥 | ✅ | ✖️ | ✖️ | supports video and image captioning |













