Testing Text2World and Video2World, the models provided by NVIDIA Cosmos

Cosmos

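A platform from NVIDIA that supports AI model training through synthetic data generation.
World for AI training: generates a virtual world in which the laws of physics apply.
Provides pre-trained, diffusion-based Text2World and Video2World models.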
Git: NVIDIA Cosmos

Runtime Environment and Installation

Runtime Environment

CPU: Intel(R) Xeon(R) w5-3423, 12 cores
RAM: 256 GB
OS: Ubuntu 22.04 LTS
GPU: NVIDIA RTX A6000 48 GB x 2
CUDA Toolkit: CUDA 11.8, cuDNN 8.9.7

Installation

Reference for installing and running the diffusion-based models: Cosmos/cosmos1/models/diffusion/README.md at main · NVIDIA/Cosmos · GitHub

Set Up Docker Environment

Install the NVIDIA Container Toolkit.
Clone the repository.

git clone git@github.com:NVIDIA/Cosmos.git
cd Cosmos

Build the Docker image, then start and attach to a container.

docker build -t cosmos .
docker run -d --name cosmos_container --gpus all --ipc=host -it -v $(pwd):/workspace cosmos
docker attach cosmos_container

Download Checkpoints

All of the following steps are run inside the Docker container created above.

Create a Hugging Face access token with 'Read' permission.
Log in to Hugging Face with the generated token.

huggingface-cli login

Request access to Mistral AI's Pixtral-12B model and accept its terms (Pixtral’s Hugging Face model page). Pixtral-12B is used for prompt upsampling in Video2World.
Download the Cosmos model weights from Hugging Face.

PYTHONPATH=$(pwd) python cosmos1/scripts/download_diffusion.py --model_sizes 7B 14B --model_types Text2World Video2World

The downloaded weights follow the structure below.

checkpoints/
├── Cosmos-1.0-Diffusion-7B-Text2World
│   ├── model.pt
│   └── config.json
├── Cosmos-1.0-Diffusion-14B-Text2World
│   ├── model.pt
│   └── config.json
├── Cosmos-1.0-Diffusion-7B-Video2World
│   ├── model.pt
│   └── config.json
├── Cosmos-1.0-Diffusion-14B-Video2World
│   ├── model.pt
│   └── config.json
├── Cosmos-1.0-Tokenizer-CV8x8x8
│   ├── decoder.jit
│   ├── encoder.jit
│   └── mean_std.pt
├── Cosmos-1.0-Prompt-Upsampler-12B-Text2World
│   ├── model.pt
│   └── config.json
├── Pixtral-12B
│   ├── model.pt
│   └── config.json
└── Cosmos-1.0-Guardrail
    ├── aegis/
    ├── blocklist/
    ├── face_blur_filter/
    └── video_content_safety_filter/
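
Since a partial download only shows up later as an inference failure, it can help to check the layout first. The following is a minimal sketch, not part of the Cosmos repository: the function name missing_checkpoint_entries and the EXPECTED_LAYOUT table are hypothetical, the entries are copied from the tree above, and the default root assumes the script is run from the repository root. It only checks that each entry exists, not that the files are complete or valid.

```python
from pathlib import Path

# Expected layout, copied from the directory tree above.
# The names in this table are the only thing the check relies on.
_MODEL_FILES = ["model.pt", "config.json"]
EXPECTED_LAYOUT = {
    "Cosmos-1.0-Diffusion-7B-Text2World": _MODEL_FILES,
    "Cosmos-1.0-Diffusion-14B-Text2World": _MODEL_FILES,
    "Cosmos-1.0-Diffusion-7B-Video2World": _MODEL_FILES,
    "Cosmos-1.0-Diffusion-14B-Video2World": _MODEL_FILES,
    "Cosmos-1.0-Tokenizer-CV8x8x8": ["decoder.jit", "encoder.jit", "mean_std.pt"],
    "Cosmos-1.0-Prompt-Upsampler-12B-Text2World": _MODEL_FILES,
    "Pixtral-12B": _MODEL_FILES,
    "Cosmos-1.0-Guardrail": [
        "aegis",
        "blocklist",
        "face_blur_filter",
        "video_content_safety_filter",
    ],
}


def missing_checkpoint_entries(root="checkpoints"):
    """Return the relative paths from EXPECTED_LAYOUT that do not exist under root."""
    root = Path(root)
    missing = []
    for folder, entries in EXPECTED_LAYOUT.items():
        for entry in entries:
            # exists() is true for both files and directories.
            if not (root / folder / entry).exists():
                missing.append(f"{folder}/{entry}")
    return missing


if __name__ == "__main__":
    missing = missing_checkpoint_entries()
    if missing:
        print("Missing checkpoint entries:")
        for path in missing:
            print("  " + path)
    else:
        print("All expected checkpoint entries are present.")
```

If only some model sizes or types were downloaded, the corresponding folders will be reported as missing, which is expected.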

Usage

Model Types

Diffusion-based World generation comes in two model types.

Text2World: generates a World from input text.

Models: Cosmos-1.0-Diffusion-7B-Text2World and Cosmos-1.0-Diffusion-14B-Text2World
Inference script: text2world.py

Video2World: generates a World from an input image or video.

Models: Cosmos-1.0-Diffusion-7B-Video2World and Cosmos-1.0-Diffusion-14B-Video2World
Inference script: video2world.py

Text2World

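Interprets a natural-language prompt and generates a video.
Two models are provided: Cosmos-1.0-Diffusion-7B-Text2World and Cosmos-1.0-Diffusion-14B-Text2World.
The GPU memory required for inference depends on the offloading options and the model, as shown below.
For sample commands, see: Cosmos/cosmos1/models/diffusion/README.md at main · NVIDIA/Cosmos · GitHub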

Offloading Strategy	7B Text2World	14B Text2World
Offload prompt upsampler	74.0 GB	> 80.0 GB
Offload prompt upsampler & guardrails	57.1 GB	70.5 GB
Offload prompt upsampler & guardrails & T5 encoder	38.5 GB	51.9 GB
Offload prompt upsampler & guardrails & T5 encoder & tokenizer	38.3 GB	51.7 GB
Offload prompt upsampler & guardrails & T5 encoder & tokenizer & diffusion model	24.4 GB	39.0 GB
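
To read the table against a specific card, such as the 48 GB RTX A6000 in the environment above, the lookup can be sketched as follows. This is only an illustration: the function name lightest_offload is hypothetical, the numbers are copied from the table, None stands for the "> 80.0 GB" cell, and the sketch assumes inference on a single GPU with no extra headroom beyond the listed figure.

```python
# GPU memory in GB for Text2World, copied from the table above.
# Rows are ordered from least to most offloading.
# Columns: (offloaded components, 7B, 14B); None marks the "> 80.0 GB" cell.
TEXT2WORLD_MEMORY_GB = [
    ("prompt upsampler", 74.0, None),
    ("prompt upsampler & guardrails", 57.1, 70.5),
    ("prompt upsampler & guardrails & T5 encoder", 38.5, 51.9),
    ("prompt upsampler & guardrails & T5 encoder & tokenizer", 38.3, 51.7),
    ("prompt upsampler & guardrails & T5 encoder & tokenizer & diffusion model", 24.4, 39.0),
]


def lightest_offload(gpu_memory_gb, model_size="7B"):
    """Return (strategy, required GB) for the first row that fits, or None if none fits."""
    column = {"7B": 1, "14B": 2}[model_size]
    for row in TEXT2WORLD_MEMORY_GB:
        required = row[column]
        if required is not None and required <= gpu_memory_gb:
            return row[0], required
    return None


if __name__ == "__main__":
    for size in ("7B", "14B"):
        print(size, lightest_offload(48.0, size))
```

By these figures, a 48 GB card fits the 7B model once the prompt upsampler, guardrails, and T5 encoder are offloaded (38.5 GB), while the 14B model fits only with every component offloaded (39.0 GB).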

Generation time on an H100 80 GB GPU is as follows.

7B Text2World (offload prompt upsampler)	14B Text2World (offload prompt upsampler, guardrails)
~380 seconds	~590 seconds

Results

Test results using the Cosmos-1.0-Diffusion-7B-Text2World model.

Prompt 1

(Demo) A sleek, humanoid robot stands in a vast warehouse filled with neatly stacked cardboard boxes on industrial shelves. The robot’s metallic body gleams under the bright, even lighting, highlighting its futuristic design and intricate joints. A glowing blue light emanates from its chest, adding a touch of advanced technology. The background is dominated by rows of boxes, suggesting a highly organized storage system. The floor is lined with wooden pallets, enhancing the industrial setting. The camera remains static, capturing the robot’s poised stance amidst the orderly environment, with a shallow depth of field that keeps the focus on the robot while subtly blurring the background for a cinematic effect.

Prompt 2

A red tractor is in the middle of a field after the harvest on a clear day. Our scene seems to have been shot from a height of about 3 meters. Around us are several other fields of the same size, all of which have rough soil surfaces and are separated from each other by berms. The fields look like a checkerboard. There are about 5 white bales of silage on the fields, and they are in the shape of cylinders with the same width and height.

Video2World

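Takes a video or an image as input and generates a video.
An optional text prompt can be added to steer the generated video.
The GPU memory required for inference depends on the offloading options and the model, as shown below.
For sample commands, see: Cosmos/cosmos1/models/diffusion/README.md at main · NVIDIA/Cosmos · GitHub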

Offloading Strategy	7B Video2World	14B Video2World
Offload prompt upsampler	76.5 GB	> 80.0 GB
Offload prompt upsampler & guardrails	59.9 GB	73.3 GB
Offload prompt upsampler & guardrails & T5 encoder	41.3 GB	54.8 GB
Offload prompt upsampler & guardrails & T5 encoder & tokenizer	41.1 GB	54.5 GB
Offload prompt upsampler & guardrails & T5 encoder & tokenizer & diffusion model	27.3 GB	39.0 GB
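
Each row of the table offloads one more component than the row before it, so the difference between consecutive rows gives a rough idea of how much memory each offload frees. A small sketch of that arithmetic for the 7B Video2World column (the function name savings_per_step is hypothetical, and the figures are copied from the table):

```python
# 7B Video2World memory in GB, copied from the table above.
# Each entry names the component newly offloaded in that row.
VIDEO2WORLD_7B_GB = [
    ("prompt upsampler", 76.5),
    ("guardrails", 59.9),
    ("T5 encoder", 41.3),
    ("tokenizer", 41.1),
    ("diffusion model", 27.3),
]


def savings_per_step(rows):
    """Return (component, GB saved) for every row after the first."""
    savings = []
    for (_, before), (component, after) in zip(rows, rows[1:]):
        # Round to one decimal to match the precision of the table.
        savings.append((component, round(before - after, 1)))
    return savings


if __name__ == "__main__":
    for component, saved in savings_per_step(VIDEO2WORLD_7B_GB):
        print(f"offloading {component}: {saved} GB saved")
```

By this arithmetic, offloading the guardrails and the T5 encoder saves 16.6 GB and 18.6 GB respectively, the diffusion model 13.8 GB, and the tokenizer only 0.2 GB. The table does not say what each offload costs in generation time, so these differences alone do not determine the best setting.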

Generation time on an H100 80 GB GPU is as follows.

7B Video2World (offload prompt upsampler)	14B Video2World (offload prompt upsampler, guardrails)
~383 seconds	~593 seconds

Results

(Demo) Test result for the sample image.
