(LLM) Polyglot_12.8b Setup

Polyglot-ko-12.8b

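Model description

Trained on Korean text collected by TuNiB AI: 1.2 TB of raw data, of which 863 GB remained after preprocessing

Available in sizes from 1.3B to 12.8B parameters

Goal: load the Polyglot-ko-12.8b model for inference, or quantize and fine-tune it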

Git: GitHub - EleutherAI/polyglot: Polyglot: Large Language Models of Well-balanced Competence in Multi-languages

HF: EleutherAI/polyglot-ko-12.8b · Hugging Face

Environment setup

Installation environment

OS: Windows 10 64bit
GPU: RTX 3080 10GB
CUDA: CUDA Toolkit 11.8, cuDNN 8.9
Python: Python 3.8 in an Anaconda virtual environment
Date: 2023.08.21

Anaconda setup

# Create the virtual environment and install the libraries

conda create --name llm python=3.8
conda activate llm

# Install PyTorch built for CUDA 11.8 (this resolved to PyTorch 2.0.1 at the time of writing)
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# Install the LLM-related libraries
pip install -q -U bitsandbytes
pip install -q -U git+https://github.com/huggingface/transformers.git
pip install -q -U git+https://github.com/huggingface/peft.git
pip install -q -U git+https://github.com/huggingface/accelerate.git
pip install -q datasets
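
Before moving on, it can help to confirm that the packages above actually landed in the `llm` environment and that PyTorch sees the GPU. The following is a minimal sketch, not part of the original setup: the helper names `installed_versions` and `cuda_status` are made up for illustration, and it only uses `importlib.metadata` plus `torch.cuda.is_available()` and `torch.version.cuda`.

```python
from importlib import metadata

# Packages installed in the steps above.
PACKAGES = ["torch", "bitsandbytes", "transformers", "peft", "accelerate", "datasets"]


def installed_versions(packages=PACKAGES):
    """Return {package: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def cuda_status():
    """Return (cuda_available, cuda_version_torch_was_built_for)."""
    try:
        import torch
    except ImportError:
        # torch is not installed in this environment
        return False, None
    return torch.cuda.is_available(), torch.version.cuda


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name:>14}: {version or 'NOT INSTALLED'}")
    available, cuda_version = cuda_status()
    print(f"CUDA available: {available} (torch built for CUDA {cuda_version})")
```

If `CUDA available` prints `False`, the most likely cause is that a CPU-only PyTorch wheel was installed instead of the cu118 build.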

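Patching the bitsandbytes library

The stock bitsandbytes library is Linux-only, so loading CUDA fails with an error.
To load CUDA on Windows, reinstall the library from a Windows build and patch a few files.

Windows-compatible build: bitsandbytes-0.39.1-py3-none-win_amd64.whl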

# Remove the existing bitsandbytes library
pip uninstall bitsandbytes

# Reinstall from the folder where the .whl file was downloaded
pip install bitsandbytes-0.39.1-py3-none-win_amd64.whl

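Run test

Run python -m bitsandbytes in the Anaconda Prompt and check that it completes without problems.
If there are no errors, you are done. If an error occurs, continue from step 3 (the __main__.py patch that follows).

Go to the bitsandbytes install directory and edit part of __main__.py (with this Anaconda setup: C:\ProgramData\Anaconda3\envs\llm\Lib\site-packages\bitsandbytes)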

# Starting at around line 36, change the function as follows
def find_file_recursive(folder, filename):
    cmd = f'find {folder} -name {filename}' if not IS_WINDOWS_PLATFORM else f'where /R "{folder}" "{filename}"'
    # out, err = execute_and_return(cmd)
    # if len(err) > 0:
    #     raise RuntimeError('Something when wrong when trying to find file.')

    # Hard-code the CUDA path instead of searching for it
    out = "C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v11.8/bin"

    return out
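
The hard-coded path above breaks as soon as the CUDA Toolkit is installed somewhere else or upgraded. A possible alternative, shown here only as a sketch and not as part of the original patch, is to derive the directory from the `CUDA_PATH` environment variable that the CUDA Toolkit installer sets on Windows, falling back to the hard-coded path. The helper name `resolve_cuda_bin` is hypothetical.

```python
import os
from pathlib import PureWindowsPath

# Fallback: the same path that is hard-coded in the patch above.
DEFAULT_CUDA_BIN = "C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v11.8/bin"


def resolve_cuda_bin(env=None, default=DEFAULT_CUDA_BIN):
    """Return the CUDA bin directory, preferring the CUDA_PATH variable.

    Assumption: CUDA_PATH points at the toolkit root (e.g. ...\\CUDA\\v11.8),
    which is what the Windows CUDA Toolkit installer normally sets.
    """
    env = os.environ if env is None else env
    cuda_path = env.get("CUDA_PATH")
    if cuda_path:
        # PureWindowsPath handles backslashes regardless of the host OS
        return (PureWindowsPath(cuda_path) / "bin").as_posix()
    return default
```

Inside the patched function, `out = resolve_cuda_bin()` would then replace the literal string.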

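Go to the bitsandbytes install directory and edit part of cuda_setup/main.py (comment out the block starting at line 251, as follows)

# Disabled: there is no need to search the conda environment for CUDA libraries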
# if "CONDA_PREFIX" in candidate_env_vars:  
#     conda_libs_path = Path(candidate_env_vars["CONDA_PREFIX"]) / "bin"  
#  
#     conda_cuda_libs = find_cuda_lib_in(str(conda_libs_path))  
#     warn_in_case_of_duplicates(conda_cuda_libs)  
#  
#     if conda_cuda_libs:  
#         return next(iter(conda_cuda_libs))  
#  
#     conda_libs_path = Path(candidate_env_vars["CONDA_PREFIX"]) / "lib"  
#  
#     conda_cuda_libs = find_cuda_lib_in(str(conda_libs_path))  
#     warn_in_case_of_duplicates(conda_cuda_libs)  
#  
#     if conda_cuda_libs:  
#         return next(iter(conda_cuda_libs))  
#     CUDASetup.get_instance().add_log_entry(f'{candidate_env_vars["CONDA_PREFIX"]} did not contain '  
#         f'{CUDA_RUNTIME_LIBS} as expected! Searching further paths...', is_warning=True)

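Run test

Run python -m bitsandbytes in the Anaconda Prompt again and check that it completes without problems.

Run test

Model download

Download every file from the Hugging Face repository (EleutherAI/polyglot-ko-12.8b at main) and place them in your development folder.
For fine-tuning, you also need a dataset to train on.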
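
Instead of downloading the files by hand, the same result can probably be reached with `snapshot_download` from `huggingface_hub`, which is installed as a dependency of transformers. This is a sketch under assumptions: the target folder `models/polyglot-ko-12.8b` and the helper name `download_model` are made up for illustration, and the `local_dir` argument needs a reasonably recent `huggingface_hub`.

```python
from pathlib import Path

REPO_ID = "EleutherAI/polyglot-ko-12.8b"


def download_model(target_dir="models/polyglot-ko-12.8b", repo_id=REPO_ID):
    """Download every file of the repository into target_dir and return the local path."""
    # Imported lazily so this file can be imported without huggingface_hub installed
    from huggingface_hub import snapshot_download

    Path(target_dir).mkdir(parents=True, exist_ok=True)
    return snapshot_download(repo_id=repo_id, local_dir=target_dir)
```

The weights are large, so make sure the drive has enough free space before calling `download_model()`.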

Writing the code

Loading a local dataset (JSON) in the Hugging Face pipeline

import datasets
dataset = datasets.load_dataset('json', data_files='dataset.json')
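
To make the expected file layout concrete, here is a small sketch that writes a JSON Lines file and loads it back. The `instruction` / `output` field names and the sample record are assumptions for illustration only; use whatever schema your training code expects.

```python
import json

# Hypothetical sample records; the field names are an assumption.
SAMPLE_RECORDS = [
    {"instruction": "대한민국의 수도는 어디인가요?", "output": "대한민국의 수도는 서울입니다."},
    {"instruction": "1 더하기 1은?", "output": "2입니다."},
]


def write_jsonl(records, path="dataset.json"):
    """Write one JSON object per line, a layout the 'json' loader accepts."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            # ensure_ascii=False keeps the Korean text readable in the file
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path


def load_local_dataset(path="dataset.json"):
    """Load the file as a DatasetDict with a single 'train' split."""
    import datasets  # imported lazily; requires `pip install datasets`

    return datasets.load_dataset("json", data_files=path)
```

With both helpers, `load_local_dataset(write_jsonl(SAMPLE_RECORDS))["train"][0]` returns the first record as a plain dict.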

Loading a local model in the Hugging Face pipeline

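from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("path/to/model", quantization_config=bnb_config)  # bnb_config: a BitsAndBytesConfig created beforehand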
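
The line above relies on a `bnb_config` object that is not shown. A minimal sketch of how it might be built follows, using the 4-bit NF4 settings commonly seen in QLoRA-style examples. The specific values, the `device_map`, and the helper name `load_quantized_model` are assumptions, not settings confirmed by this post.

```python
# 4-bit settings often used for QLoRA-style loading (assumed values).
FOUR_BIT_SETTINGS = {
    "load_in_4bit": True,               # store the weights in 4-bit
    "bnb_4bit_quant_type": "nf4",       # NormalFloat4 quantization
    "bnb_4bit_use_double_quant": True,  # also quantize the quantization constants
}


def load_quantized_model(model_path):
    """Load a causal LM from a local folder with 4-bit bitsandbytes quantization."""
    # Imported lazily; requires torch, transformers and bitsandbytes
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    bnb_config = BitsAndBytesConfig(
        bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for the matmuls
        **FOUR_BIT_SETTINGS,
    )
    return AutoModelForCausalLM.from_pretrained(
        model_path,
        quantization_config=bnb_config,
        device_map={"": 0},  # place the whole model on GPU 0
    )
```

As a rough sanity check, 12.8B parameters at 4 bits each come to about 6.4 GB for the weights alone, so a 10 GB card leaves limited headroom for activations and the KV cache.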

Loading a local tokenizer in the Hugging Face pipeline

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("path/to/tokenizer")

# The folder must contain the following files:
#   config.json
#   special_tokens_map.json
#   tokenizer.json
#   tokenizer_config.json
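
A missing file in that folder usually surfaces as a fairly opaque loading error, so a quick pre-check can save time. The sketch below simply tests for the four files listed above; the helper name `missing_tokenizer_files` is hypothetical.

```python
from pathlib import Path

# The files the tokenizer folder is expected to contain (from the list above).
REQUIRED_FILES = (
    "config.json",
    "special_tokens_map.json",
    "tokenizer.json",
    "tokenizer_config.json",
)


def missing_tokenizer_files(folder, required=REQUIRED_FILES):
    """Return the required file names that are absent from folder."""
    folder = Path(folder)
    return [name for name in required if not (folder / name).is_file()]
```

An empty list means the folder is complete; otherwise the returned names tell you which files still need to be downloaded.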

For the rest of the training code, refer to the KoAlpaca training example.

References
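
For orientation, fine-tuning a 4-bit model typically means attaching LoRA adapters with peft before training. The sketch below shows that general pattern and is not KoAlpaca's exact code: the hyperparameter values are illustrative assumptions, and `add_lora_adapters` is a hypothetical helper. `query_key_value` is the fused attention projection in GPT-NeoX, the architecture Polyglot-ko uses.

```python
# Illustrative LoRA hyperparameters (assumed values, tune for your data).
LORA_SETTINGS = {
    "r": 8,                                 # rank of the low-rank update
    "lora_alpha": 32,                       # scaling factor
    "target_modules": ["query_key_value"],  # GPT-NeoX attention projection
    "lora_dropout": 0.05,
    "bias": "none",
    "task_type": "CAUSAL_LM",
}


def add_lora_adapters(model):
    """Prepare a quantized model for training and wrap it with LoRA adapters."""
    # Imported lazily; requires a peft version that provides
    # prepare_model_for_kbit_training
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

    model.gradient_checkpointing_enable()  # trade compute for memory
    model = prepare_model_for_kbit_training(model)
    return get_peft_model(model, LoraConfig(**LORA_SETTINGS))
```

Calling `print_trainable_parameters()` on the returned model shows how small the trainable fraction is compared with the frozen base weights.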

NLP LLM DeepLearning

💻️ MMMSK
