
C# - How to use an LLM model published on Hugging Face with Foundry Local

As covered in the previous post,

Windows + Foundry Local - Using AI models locally
; https://www.sysnet.pe.kr/2/0/13943

Foundry Local comes with about 17 models registered in its default catalog, but if you want (and as long as you can convert the model to the ONNX format), you are free to register additional ones. The official document below covers how to bring in a model published on Hugging Face,

Compile Hugging Face models to run on Foundry Local
; https://learn.microsoft.com/en-us/azure/ai-foundry/foundry-local/how-to/how-to-compile-hugging-face-models?tabs=Bash

Shall we try a quick hands-on? ^^




As the document describes, first prepare an environment where the Olive tool can run.

// WSL 2 Ubuntu 24.04 environment (or, on Windows, prepare an Anaconda3 environment)

// conda create -n huggingface-build python=3.10 -y
// conda activate huggingface-build
// conda remove -n huggingface-build --all -y

// Olive latest documentation - Getting started
// https://microsoft.github.io/Olive/getting-started/getting-started.html

// github.com/microsoft/Olive
// https://github.com/microsoft/Olive

// If you have a GPU
pip install olive-ai[gpu,finetune]
pip install transformers onnxruntime-genai-cuda

// If you only have a CPU
pip install olive-ai[cpu,finetune]
pip install transformers onnxruntime-genai

Next, issue a User access token on the Hugging Face site and log in once.

C:\temp> huggingface-cli login
...[omitted]...

C:\temp> huggingface-cli whoami
...[omitted]...

Now pick the model you want from the Hugging Face site. The document uses the "meta-llama/Llama-3.2-1B-Instruct" model as its example,

olive auto-opt --model_name_or_path meta-llama/Llama-3.2-1B-Instruct --trust_remote_code --output_path models/llama --device cpu --provider CPUExecutionProvider --use_ort_genai --precision int4 --log_level 1

but if you actually try it, the exercise fails with an error. There is an open issue about this,

Quickstart doesn't work. #1916
; https://github.com/microsoft/Olive/issues/1916

so until it is resolved, it's better to practice with a different model. ^^; The problem is that I don't know this field well. From what I can tell, not every model on Hugging Face can be converted to ONNX with Olive, which made it hard to pick a suitable one. Fortunately, the video below,

Windows Dev Chat - October 24, 2024
; https://www.youtube.com/live/lAc1fq_0ftw?t=775s

"Qwen/Qwen2.5-Math-1.5B-Instruct" 모델을 다루는데, 저 역시 해당 영상의 명령어로 잘 변환이 됐습니다.

(huggingface-build) testusr@testpc:/mnt/c/temp$ olive auto-opt --model_name_or_path Qwen/Qwen2.5-Math-1.5B-Instruct --trust_remote_code --output_path models/Qwen2.5-Math-1.5B-Instruct --device cpu --provider CPUExecutionProvider --precision int4 --use_model_builder --log_level 1
Loading HuggingFace model from Qwen/Qwen2.5-Math-1.5B-Instruct
[... 08:52:14,438] [INFO] [run.py:142:run_engine] Running workflow default_workflow
[... 08:52:14,450] [INFO] [cache.py:138:__init__] Using cache directory: /mnt/c/temp/.olive-cache/default_workflow
[... 08:52:14,478] [INFO] [accelerator_creator.py:217:create_accelerators] Running workflow on accelerator specs: cpu-cpu
[... 08:52:14,488] [INFO] [engine.py:223:run] Running Olive on accelerator: cpu-cpu
[... 08:52:14,490] [INFO] [engine.py:864:_create_system] Creating target system ...
[... 08:52:14,491] [INFO] [engine.py:867:_create_system] Target system created in 0.000242 seconds
[... 08:52:14,491] [INFO] [engine.py:879:_create_system] Creating host system ...
[... 08:52:14,491] [INFO] [engine.py:882:_create_system] Host system created in 0.000174 seconds
[... 08:52:14,743] [INFO] [engine.py:683:_run_pass] Running pass model_builder:modelbuilder
GroupQueryAttention (GQA) is used in this model.
Reading embedding layer
Reading decoder layer 0
Reading decoder layer 1
Reading decoder layer 2
...[omitted]...
Reading decoder layer 27
Reading final norm
Reading LM head
Saving ONNX model in /mnt/c/temp/.olive-cache/default_workflow/runs/27b66e29/models
Saving GenAI config in /mnt/c/temp/.olive-cache/default_workflow/runs/27b66e29/models
Saving processing files in /mnt/c/temp/.olive-cache/default_workflow/runs/27b66e29/models for GenAI
[... 09:05:24,768] [INFO] [engine.py:757:_run_pass] Pass model_builder:modelbuilder finished in 790.024721 seconds
[... 09:05:24,790] [INFO] [engine.py:683:_run_pass] Running pass extract_adapters:extractadapters
[... 09:05:43,445] [INFO] [extract_adapters.py:177:_run_for_config] No lora modules found in the model. Returning the original model.
[... 09:05:43,482] [INFO] [engine.py:757:_run_pass] Pass extract_adapters:extractadapters finished in 18.692435 seconds
[... 09:05:43,503] [INFO] [engine.py:241:run] Run history for cpu-cpu:
[... 09:05:43,504] [INFO] [engine.py:499:dump_run_history] Please install tabulate for better run history output
[... 09:05:43,514] [INFO] [cache.py:195:load_model] Loading model 030c6a62 from cache.
[... 09:06:53,028] [INFO] [engine.py:266:run] Saved output model to /mnt/c/temp/models/Qwen2.5-Math-1.5B-Instruct
Model is saved at /mnt/c/temp/models/Qwen2.5-Math-1.5B-Instruct

Now, the output above needs to be registered so Foundry Local can use it, and in fact this is done simply by changing Foundry Local's cache (working) directory.

C:\temp\models> foundry cache cd c:\temp\models
Restarting service...
🔴 Service is stopped.
🟢 Service is Started on http://localhost:5273, PID 60904!

In its current state, however, listing the model's information looks like this,

C:\temp\models> foundry cache ls
Models cached on device:
   Alias                         Model ID
💾 Model was not found in catalogmodel

여기서 "Model ID"는 ./models/Qwen2.5-Math-1.5B-Instruct/model 디렉터리에 "inference_model.json" 파일을 다음과 같이 생성해 바꿀 수 있습니다.

// inference_model.json does not exist yet, so create it

C:\temp\models> type .\Qwen2.5-Math-1.5B-Instruct\model\inference_model.json
{
  "Name": "Qwen2.5-Math-1.5B-Instruct",
  "PromptTemplate": {
    "assistant": "{Content}",
    "prompt": "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n{Content}<|im_end|>\n<|im_start|>assistant\n"
  }
}

C:\temp\models> foundry cache ls
Models cached on device:
   Alias                         Model ID
💾 Model was not found in catalogQwen2.5-Math-1.5B-Instruct

Unfortunately, the "Alias" still doesn't show up. ^^; I don't know yet how it can be set. (If anyone knows, please leave a comment. ^^;)
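As a side note, when the alias is unclear it can help to check which model IDs the endpoint itself reports. The following is only a rough sketch of mine (not from the official docs), assuming Foundry Local exposes the standard OpenAI-compatible /v1/models listing on the endpoint shown by "foundry service status":

using System.Net.Http;
using System.Text.Json;

// Sketch (assumption): query the OpenAI-compatible /v1/models endpoint and print
// the model IDs the local service reports, i.e. the exact IDs a client must use.
using HttpClient http = new HttpClient();
string json = await http.GetStringAsync("http://localhost:5273/v1/models");

using JsonDocument doc = JsonDocument.Parse(json);
foreach (JsonElement model in doc.RootElement.GetProperty("data").EnumerateArray())
{
    Console.WriteLine(model.GetProperty("id").GetString());
}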




In any case, once you've reached that state, you can now put together a C# client like the following and try it out.

using OpenAI;
using OpenAI.Chat;
using System.ClientModel;

namespace ConsoleApp1;

internal class Program
{
    // Install-Package OpenAI 
    static async Task Main(string[] args)
    {
        string ep = "http://localhost:5273/v1"; // Foundry Local's default endpoint (check it with the "foundry service status" command)
        string key = "OPENAI_API_KEY";
        string alias = "Qwen2.5-Math-1.5B-Instruct";

        OpenAIClientOptions options = new OpenAIClientOptions();
        options.Endpoint = new Uri(ep);

        ApiKeyCredential akc = new ApiKeyCredential(key);
        ChatClient client = new(alias, akc, options);

        ChatCompletion completion = client.CompleteChat("Why is the sky blue?");

        foreach (var message in completion.Content)
        {
            Console.WriteLine($"[{message.Kind}]: {message.Text}");
        }
    }
}

/* Execution result:
[Text]: The sky is blue because of a process called Rayleigh Scintometry. When light from the sun reaches the Earth's atmosphere, it gets spread out into a range of colors due to theatic particles in the air. This is because the atoms and molecules in the air can be Absorbed and re radiated at different frequencies, causing the light to be split into its component colors. The blue color is the result of the light being split into a range of colors, with the blue being the most dominant. This is why the sky appears blue.
*/
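By the way, since a reply from a local model can take a while to complete, receiving it as a stream may feel better. The snippet below is just my own sketch using the streaming API of the same OpenAI package (CompleteChatStreamingAsync); the endpoint and alias are assumed to be the same values used above:

using OpenAI;
using OpenAI.Chat;
using System.ClientModel;

// Sketch (assumption): a streaming variant of the call above, printing tokens as they arrive.
OpenAIClientOptions options = new OpenAIClientOptions() { Endpoint = new Uri("http://localhost:5273/v1") };
ChatClient client = new("Qwen2.5-Math-1.5B-Instruct", new ApiKeyCredential("OPENAI_API_KEY"), options);

await foreach (StreamingChatCompletionUpdate update in client.CompleteChatStreamingAsync("Why is the sky blue?"))
{
    foreach (ChatMessageContentPart part in update.ContentUpdate)
    {
        Console.Write(part.Text); // write each partial content chunk as it is received
    }
}
Console.WriteLine();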

Not bad, right? ^^ What's interesting is that the same thing is possible even without Foundry Local, by using an API that works with the ONNX model directly. In fact, the Olive documentation includes code like the following that uses the model above without Foundry Local.

using Microsoft.ML.OnnxRuntimeGenAI;

internal class Program
{
    // Install-Package Microsoft.ML.OnnxRuntimeGenAI

    private static void Main(string[] args)
    {
        string modelPath = @"C:\temp\models\Qwen2.5-Math-1.5B-Instruct\model";

        Console.Write("Loading model from " + modelPath + "...");
        using Model model = new(modelPath);
        Console.Write("Done\n");
        using Tokenizer tokenizer = new(model);
        using TokenizerStream tokenizerStream = tokenizer.CreateStream();


        while (true)
        {
            Console.Write("User:");

            string prompt = "<|im_start|>user\n" +
                            Console.ReadLine() +
                            "<|im_end|>\n<|im_start|>assistant\n";
            var sequences = tokenizer.Encode(prompt);

            using GeneratorParams gParams = new GeneratorParams(model);
            gParams.SetSearchOption("max_length", 200);
            using Generator generator = new(model, gParams);
            generator.AppendTokenSequences(sequences);

            Console.Out.Write("\nAI:");
            while (!generator.IsDone())
            {
                generator.GenerateNextToken();
                var token = generator.GetSequence(0)[^1];
                Console.Out.Write(tokenizerStream.Decode(token));
                Console.Out.Flush();
            }
            Console.WriteLine();
        }
    }
}

/* Execution result:
Loading model from E:\foundry_cache\models\Qwen2.5-Math-1.5B-Instruct\model...Done
User:Why is the sky blue?

AI:The sky is blue because of the way our eyes interpret the light that reaches us. When light from the sun reaches us, it is composed of all the colors of the visible spectrum. However, our eyes are not equally sensitive to all colors. The light that our eyes see as blue is actually a combination of many colors, with the blue color being the dominant one.

The sky is blue because the light that reaches us from the sun is a mixture of colors, and our eyes interpret this mixture to give us the appear blue. The exact composition of the light that reaches us from the sun is complex, but it is known that the blue color is the result of the light that is reflected by the Earth's surface and the air, which contains small particles that scatter the light. This process is known as Rayleigh Scating, and it is what gives us the blue color we see in the sky.

In summary, the sky
User:
*/
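The answer above is cut off mid-sentence because the sample caps max_length at 200. If you also want the ONNX path to use the same system prompt that the inference_model.json template provides, something along the following lines should work; this is only my own sketch reusing the same APIs as the code above, not part of the Olive docs:

using Microsoft.ML.OnnxRuntimeGenAI;

// Sketch (assumption): same model as above, but the prompt now includes the system
// message from the inference_model.json template and max_length is raised so the
// reply is less likely to be truncated. Adjust modelPath to your environment.
string modelPath = @"C:\temp\models\Qwen2.5-Math-1.5B-Instruct\model";

using Model model = new(modelPath);
using Tokenizer tokenizer = new(model);
using TokenizerStream tokenizerStream = tokenizer.CreateStream();

string question = "Why is the sky blue?";
string prompt = "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n" +
                "<|im_start|>user\n" + question + "<|im_end|>\n" +
                "<|im_start|>assistant\n";

using GeneratorParams gParams = new GeneratorParams(model);
gParams.SetSearchOption("max_length", 1024); // the original sample's 200 truncated the reply
using Generator generator = new(model, gParams);
generator.AppendTokenSequences(tokenizer.Encode(prompt));

while (!generator.IsDone())
{
    generator.GenerateNextToken();
    Console.Write(tokenizerStream.Decode(generator.GetSequence(0)[^1]));
}
Console.WriteLine();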




For reference, the document also shows that the "inference_model.json" file we created by hand above can be generated with Python code like this.

# generate_inference_model.py
# This script generates the inference_model.json file for the Llama-3.2 model.
import json
import os
from transformers import AutoTokenizer

model_path = "models/Qwen2.5-Math-1.5B-Instruct/model"

tokenizer = AutoTokenizer.from_pretrained(model_path)
chat = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "{Content}"},
]

template = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

json_template = {
  "Name": "Qwen2.5-Math-1.5B-Instruct",
  "PromptTemplate": {
    "assistant": "{Content}",
    "prompt": template
  }
}

json_file = os.path.join(model_path, "inference_model.json")

with open(json_file, "w") as f:
    json.dump(json_template, f, indent=2)

What matters in the code above is that it generates the (somewhat unfamiliar-looking) content that goes into the JSON file's "prompt" field,

{
  "Name": "Qwen2.5-Math-1.5B-Instruct",
  "PromptTemplate": {
    "assistant": "{Content}",
    "prompt": "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n{Content}<|im_end|>\n<|im_start|>assistant\n"
  },
  "Alias": "Qwen2.5-Math-1.5B"
}

from the familiar chat list.

chat = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "{Content}"},
]

Unless you need to change that part often, handling the JSON file directly is just as fine.





As mentioned at the start, if you run the Olive tool against the "meta-llama/Llama-3.2-1B-Instruct" model as the document describes,

// conda create -n foundry-local-build python=3.10 -y
// conda activate foundry-local-build

// pip install olive-ai[gpu,finetune]
// pip install transformers onnxruntime-genai-cuda

$ olive auto-opt --model_name_or_path meta-llama/Llama-3.2-1B-Instruct --trust_remote_code --output_path models/llama --device cpu --provider CPUExecutionProvider --use_ort_genai --precision int4 --log_level 1

you get warnings like these,

/home/testusr/miniconda3/envs/foundry-local-build/lib/python3.10/site-packages/transformers/cache_utils.py:556: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  or not self.key_cache[layer_idx].numel()  # the layer has no cache
/home/testusr/miniconda3/envs/foundry-local-build/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py:589: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if sequence_length != 1:
/home/testusr/miniconda3/envs/foundry-local-build/lib/python3.10/site-packages/transformers/cache_utils.py:539: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  elif (
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.

and then this error occurs.

Traceback (most recent call last):
  File "/home/testusr/miniconda3/envs/foundry-local-build/lib/python3.10/site-packages/transformers/convert_slow_tokenizer.py", line 1594, in extract_vocab_merges_from_model
    from tiktoken.load import load_tiktoken_bpe
ModuleNotFoundError: No module named 'tiktoken'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/testusr/miniconda3/envs/foundry-local-build/lib/python3.10/site-packages/transformers/convert_slow_tokenizer.py", line 1737, in convert_slow_tokenizer
    ).converted()
  File "/home/testusr/miniconda3/envs/foundry-local-build/lib/python3.10/site-packages/transformers/convert_slow_tokenizer.py", line 1631, in converted
    tokenizer = self.tokenizer()
  File "/home/testusr/miniconda3/envs/foundry-local-build/lib/python3.10/site-packages/transformers/convert_slow_tokenizer.py", line 1624, in tokenizer
    vocab_scores, merges = self.extract_vocab_merges_from_model(self.vocab_file)
  File "/home/testusr/miniconda3/envs/foundry-local-build/lib/python3.10/site-packages/transformers/convert_slow_tokenizer.py", line 1596, in extract_vocab_merges_from_model
    raise ValueError(
ValueError: `tiktoken` is required to read a `tiktoken` file. Install it with `pip install tiktoken`.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/testusr/miniconda3/envs/foundry-local-build/bin/olive", line 8, in <module>
    sys.exit(main())
  File "/home/testusr/miniconda3/envs/foundry-local-build/lib/python3.10/site-packages/olive/cli/launcher.py", line 64, in main
    service.run()
  File "/home/testusr/miniconda3/envs/foundry-local-build/lib/python3.10/site-packages/olive/cli/auto_opt.py", line 178, in run
    self._run_workflow()
  File "/home/testusr/miniconda3/envs/foundry-local-build/lib/python3.10/site-packages/olive/cli/base.py", line 42, in _run_workflow
    output = olive_run(run_config)
  File "/home/testusr/miniconda3/envs/foundry-local-build/lib/python3.10/site-packages/olive/workflows/run/run.py", line 255, in run
    return run_engine(package_config, run_config)
  File "/home/testusr/miniconda3/envs/foundry-local-build/lib/python3.10/site-packages/olive/workflows/run/run.py", line 199, in run_engine
    return engine.run(
  File "/home/testusr/miniconda3/envs/foundry-local-build/lib/python3.10/site-packages/olive/engine/engine.py", line 230, in run
    run_result = self.run_accelerator(
  File "/home/testusr/miniconda3/envs/foundry-local-build/lib/python3.10/site-packages/olive/engine/engine.py", line 313, in run_accelerator
    output_footprint = self._run_no_search(input_model_config, input_model_id, accelerator_spec, output_dir)
  File "/home/testusr/miniconda3/envs/foundry-local-build/lib/python3.10/site-packages/olive/engine/engine.py", line 356, in _run_no_search
    should_prune, signal, model_ids = self._run_passes(input_model_config, input_model_id, accelerator_spec)
  File "/home/testusr/miniconda3/envs/foundry-local-build/lib/python3.10/site-packages/olive/engine/engine.py", line 639, in _run_passes
    model_config, model_id = self._run_pass(
  File "/home/testusr/miniconda3/envs/foundry-local-build/lib/python3.10/site-packages/olive/engine/engine.py", line 740, in _run_pass
    output_model_config = host.run_pass(p, input_model_config, output_model_path)
  File "/home/testusr/miniconda3/envs/foundry-local-build/lib/python3.10/site-packages/olive/systems/local.py", line 29, in run_pass
    output_model = the_pass.run(model, output_model_path)
  File "/home/testusr/miniconda3/envs/foundry-local-build/lib/python3.10/site-packages/olive/passes/olive_pass.py", line 242, in run
    output_model = self._run_for_config(model, self.config, output_model_path)
  File "/home/testusr/miniconda3/envs/foundry-local-build/lib/python3.10/site-packages/olive/passes/onnx/conversion.py", line 140, in _run_for_config
    additional_files.extend(model.save_metadata(str(output_dir), exclude_load_keys=["quantization_config"]))
  File "/home/testusr/miniconda3/envs/foundry-local-build/lib/python3.10/site-packages/olive/model/handler/mixin/hf.py", line 107, in save_metadata
    get_tokenizer(output_dir)
  File "/home/testusr/miniconda3/envs/foundry-local-build/lib/python3.10/site-packages/olive/common/hf/utils.py", line 196, in get_tokenizer
    tokenizer = from_pretrained(AutoTokenizer, model_name_or_path, "tokenizer", **kwargs)
  File "/home/testusr/miniconda3/envs/foundry-local-build/lib/python3.10/site-packages/olive/common/hf/utils.py", line 85, in from_pretrained
    return cls.from_pretrained(get_pretrained_name_or_path(model_name_or_path, mlflow_dir), **kwargs)
  File "/home/testusr/miniconda3/envs/foundry-local-build/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 1032, in from_pretrained
    return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/home/testusr/miniconda3/envs/foundry-local-build/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2025, in from_pretrained
    return cls._from_pretrained(
  File "/home/testusr/miniconda3/envs/foundry-local-build/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2278, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/home/testusr/miniconda3/envs/foundry-local-build/lib/python3.10/site-packages/transformers/models/llama/tokenization_llama_fast.py", line 154, in __init__
    super().__init__(
  File "/home/testusr/miniconda3/envs/foundry-local-build/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 139, in __init__
    fast_tokenizer = convert_slow_tokenizer(self, from_tiktoken=True)
  File "/home/testusr/miniconda3/envs/foundry-local-build/lib/python3.10/site-packages/transformers/convert_slow_tokenizer.py", line 1739, in convert_slow_tokenizer
    raise ValueError(
ValueError: Converting from SentencePiece and Tiktoken failed, if a converter for SentencePiece is available, provide a model path with a SentencePiece tokenizer.model file.Currently available slow->fast converters: ['AlbertTokenizer', 'BartTokenizer', 'BarthezTokenizer', 'BertTokenizer', 'BigBirdTokenizer', 'BlenderbotTokenizer', 'CamembertTokenizer', 'CLIPTokenizer', 'CodeGenTokenizer', 'ConvBertTokenizer', 'DebertaTokenizer', 'DebertaV2Tokenizer', 'DistilBertTokenizer', 'DPRReaderTokenizer', 'DPRQuestionEncoderTokenizer', 'DPRContextEncoderTokenizer', 'ElectraTokenizer', 'FNetTokenizer', 'FunnelTokenizer', 'GPT2Tokenizer', 'HerbertTokenizer', 'LayoutLMTokenizer', 'LayoutLMv2Tokenizer', 'LayoutLMv3Tokenizer', 'LayoutXLMTokenizer', 'LongformerTokenizer', 'LEDTokenizer', 'LxmertTokenizer', 'MarkupLMTokenizer', 'MBartTokenizer', 'MBart50Tokenizer', 'MPNetTokenizer', 'MobileBertTokenizer', 'MvpTokenizer', 'NllbTokenizer', 'OpenAIGPTTokenizer', 'PegasusTokenizer', 'Qwen2Tokenizer', 'RealmTokenizer', 'ReformerTokenizer', 'RemBertTokenizer', 'RetriBertTokenizer', 'RobertaTokenizer', 'RoFormerTokenizer', 'SeamlessM4TTokenizer', 'SqueezeBertTokenizer', 'T5Tokenizer', 'UdopTokenizer', 'WhisperTokenizer', 'XLMRobertaTokenizer', 'XLNetTokenizer', 'SplinterTokenizer', 'XGLMTokenizer', 'LlamaTokenizer', 'CodeLlamaTokenizer', 'GemmaTokenizer', 'Phi3Tokenizer']

Following the error messages, you'll end up installing the "tiktoken" and "SentencePiece" modules as well, but the error still occurs all the same.

Traceback (most recent call last):
  File "C:\Users\testusr\anaconda3\envs\huggingface-build\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\testusr\anaconda3\envs\huggingface-build\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\testusr\anaconda3\envs\huggingface-build\Scripts\olive.exe\__main__.py", line 7, in <module>
  File "C:\Users\testusr\anaconda3\envs\huggingface-build\lib\site-packages\olive\cli\launcher.py", line 64, in main
    service.run()
  File "C:\Users\testusr\anaconda3\envs\huggingface-build\lib\site-packages\olive\cli\auto_opt.py", line 178, in run
    self._run_workflow()
  File "C:\Users\testusr\anaconda3\envs\huggingface-build\lib\site-packages\olive\cli\base.py", line 42, in _run_workflow
    output = olive_run(run_config)
  File "C:\Users\testusr\anaconda3\envs\huggingface-build\lib\site-packages\olive\workflows\run\run.py", line 255, in run
    return run_engine(package_config, run_config)
  File "C:\Users\testusr\anaconda3\envs\huggingface-build\lib\site-packages\olive\workflows\run\run.py", line 199, in run_engine
    return engine.run(
  File "C:\Users\testusr\anaconda3\envs\huggingface-build\lib\site-packages\olive\engine\engine.py", line 230, in run
    run_result = self.run_accelerator(
  File "C:\Users\testusr\anaconda3\envs\huggingface-build\lib\site-packages\olive\engine\engine.py", line 313, in run_accelerator
    output_footprint = self._run_no_search(input_model_config, input_model_id, accelerator_spec, output_dir)
  File "C:\Users\testusr\anaconda3\envs\huggingface-build\lib\site-packages\olive\engine\engine.py", line 356, in _run_no_search
    should_prune, signal, model_ids = self._run_passes(input_model_config, input_model_id, accelerator_spec)
  File "C:\Users\testusr\anaconda3\envs\huggingface-build\lib\site-packages\olive\engine\engine.py", line 639, in _run_passes
    model_config, model_id = self._run_pass(
  File "C:\Users\testusr\anaconda3\envs\huggingface-build\lib\site-packages\olive\engine\engine.py", line 740, in _run_pass
    output_model_config = host.run_pass(p, input_model_config, output_model_path)
  File "C:\Users\testusr\anaconda3\envs\huggingface-build\lib\site-packages\olive\systems\local.py", line 29, in run_pass
    output_model = the_pass.run(model, output_model_path)
  File "C:\Users\testusr\anaconda3\envs\huggingface-build\lib\site-packages\olive\passes\olive_pass.py", line 242, in run
    output_model = self._run_for_config(model, self.config, output_model_path)
  File "C:\Users\testusr\anaconda3\envs\huggingface-build\lib\site-packages\olive\passes\onnx\conversion.py", line 140, in _run_for_config
    additional_files.extend(model.save_metadata(str(output_dir), exclude_load_keys=["quantization_config"]))
  File "C:\Users\testusr\anaconda3\envs\huggingface-build\lib\site-packages\olive\model\handler\mixin\hf.py", line 107, in save_metadata
    get_tokenizer(output_dir)
  File "C:\Users\testusr\anaconda3\envs\huggingface-build\lib\site-packages\olive\common\hf\utils.py", line 196, in get_tokenizer
    tokenizer = from_pretrained(AutoTokenizer, model_name_or_path, "tokenizer", **kwargs)
  File "C:\Users\testusr\anaconda3\envs\huggingface-build\lib\site-packages\olive\common\hf\utils.py", line 85, in from_pretrained
    return cls.from_pretrained(get_pretrained_name_or_path(model_name_or_path, mlflow_dir), **kwargs)
  File "C:\Users\testusr\anaconda3\envs\huggingface-build\lib\site-packages\transformers\models\auto\tokenization_auto.py", line 1032, in from_pretrained
    return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "C:\Users\testusr\anaconda3\envs\huggingface-build\lib\site-packages\transformers\tokenization_utils_base.py", line 2025, in from_pretrained
    return cls._from_pretrained(
  File "C:\Users\testusr\anaconda3\envs\huggingface-build\lib\site-packages\transformers\tokenization_utils_base.py", line 2063, in _from_pretrained
    slow_tokenizer = (cls.slow_tokenizer_class)._from_pretrained(
  File "C:\Users\testusr\anaconda3\envs\huggingface-build\lib\site-packages\transformers\tokenization_utils_base.py", line 2278, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "C:\Users\testusr\anaconda3\envs\huggingface-build\lib\site-packages\transformers\models\llama\tokenization_llama.py", line 171, in __init__
    self.sp_model = self.get_spm_processor(kwargs.pop("from_slow", False))
  File "C:\Users\testusr\anaconda3\envs\huggingface-build\lib\site-packages\transformers\models\llama\tokenization_llama.py", line 198, in get_spm_processor
    tokenizer.Load(self.vocab_file)
  File "C:\Users\testusr\anaconda3\envs\huggingface-build\lib\site-packages\sentencepiece\__init__.py", line 961, in Load
    return self.LoadFromFile(model_file)
  File "C:\Users\testusr\anaconda3\envs\huggingface-build\lib\site-packages\sentencepiece\__init__.py", line 316, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
TypeError: not a string

This suggests something changed along the way. Looking at the model, its files were uploaded in bulk about 9 months ago,

meta-llama/Llama-3.2-1B-Instruct - Files and versions
; https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct/tree/main

so presumably the cause is a change in one of the components on the Olive side. Among them, transformers seemed the most important, so I tried switching its version,

pip install transformers==4.51.3

and while there is no error this time, the output is cluttered with messages like these.

[... 11:16:26,952] [INFO] [common.py:127:model_proto_to_file] Deleting existing external data file: /mnt/c/temp/.olive-cache/default_workflow/runs/2f303eca/models/model.onnx.data
[... 11:17:47,803] [INFO] [engine.py:757:_run_pass] Pass conversion:onnxconversion finished in 229.748523 seconds
[... 11:17:47,832] [INFO] [engine.py:683:_run_pass] Running pass genai_config_only:modelbuilder
GroupQueryAttention (GQA) is used in this model.
Saving GenAI config in /mnt/c/temp/.olive-cache/default_workflow/runs/333f1dcb/models
Saving processing files in /mnt/c/temp/.olive-cache/default_workflow/runs/333f1dcb/models for GenAI
[... 11:17:49,873] [INFO] [engine.py:757:_run_pass] Pass genai_config_only:modelbuilder finished in 2.040874 seconds
[... 11:17:49,895] [INFO] [engine.py:683:_run_pass] Running pass peephole_optimizer:onnxpeepholeoptimizer
... 11:18:34,665 onnxscript.rewriter.collapse_slices [INFO] - The value 'start', 'end', 'axis', 'step' is not statically known.
...[omitted]...
... 11:18:34,691 onnxscript.rewriter.collapse_slices [INFO] - The value 'start', 'end', 'axis', 'step' is not statically known.
... 11:18:34,692 onnxscript.rewriter.collapse_slices [INFO] - The value 'indices' is not statically known.
... 11:18:34,903 onnx_ir.passes.common.unused_removal [INFO] - Removed 226 unused nodes
Applied 4 of general pattern rewrite rules.
... 11:18:34,958 onnx_ir.passes.common.unused_removal [INFO] - No unused functions to remove
... 11:18:38,212 onnxscript.rewriter.collapse_slices [INFO] - The value 'start', 'end', 'axis', 'step' is not statically known.
...[omitted]...
... 11:18:38,242 onnxscript.rewriter.collapse_slices [INFO] - The value 'start', 'end', 'axis', 'step' is not statically known.
... 11:18:38,243 onnxscript.rewriter.collapse_slices [INFO] - The value 'indices' is not statically known.
... 11:18:38,437 onnx_ir.passes.common.unused_removal [INFO] - Removed 2 unused nodes
... 11:18:38,491 onnx_ir.passes.common.unused_removal [INFO] - No unused functions to remove
[... 11:18:41,538] [WARNING] [peephole_optimizer.py:256:onnxoptimizer_optimize] Please install `onnxoptimizer` to apply more optimization.
[... 11:20:16,163] [INFO] [engine.py:757:_run_pass] Pass peephole_optimizer:onnxpeepholeoptimizer finished in 146.268799 seconds
[... 11:20:16,184] [INFO] [engine.py:683:_run_pass] Running pass transformer_optimizer:orttransformersoptimization
... 11:21:01,694 onnx_model [INFO] - Removed 176 Cast nodes with output type same as input
... 11:21:02,821 fusion_base [INFO] - Fused SimplifiedLayerNormalization: 33
... 11:21:14,752 fusion_base [INFO] - Fused SkipSimplifiedLayerNormalization: 32
... 11:21:32,631 fusion_utils [INFO] - Remove reshape node /model/Reshape_1 since its input shape is same as output: [4]
... 11:21:32,631 fusion_utils [INFO] - Remove reshape node /model/Reshape_2 since its input shape is same as output: ['batch_size', 1, 'sequence_length', 'past_sequence_length + sequence_length']
... 11:21:32,631 fusion_utils [INFO] - Remove reshape node /model/rotary_emb/Reshape since its input shape is same as output: [3]
...[omitted]...
... 11:21:32,636 fusion_utils [INFO] - Remove reshape node /model/layers.15/self_attn/Reshape_5 since its input shape is same as output: [5]
... 11:21:32,899 onnx_model [INFO] - Removed 40 nodes
... 11:21:32,913 onnx_model_gpt2 [INFO] - postprocess: remove Reshape count: 0
... 11:21:32,960 onnx_model_bert [INFO] - opset version: 20
[... 11:23:05,998] [INFO] [engine.py:757:_run_pass] Pass transformer_optimizer:orttransformersoptimization finished in 169.813711 seconds
[... 11:23:06,018] [INFO] [engine.py:683:_run_pass] Running pass matmul4:onnxmatmul4quantizer
[... 11:24:15,682] [INFO] [engine.py:757:_run_pass] Pass matmul4:onnxmatmul4quantizer finished in 69.664231 seconds
[... 11:24:15,711] [INFO] [engine.py:683:_run_pass] Running pass extract_adapters:extractadapters
[... 11:24:26,766] [INFO] [extract_adapters.py:177:_run_for_config] No lora modules found in the model. Returning the original model.
[... 11:24:26,791] [INFO] [engine.py:757:_run_pass] Pass extract_adapters:extractadapters finished in 11.079207 seconds
[... 11:24:26,814] [INFO] [engine.py:241:run] Run history for cpu-cpu:
[... 11:24:26,815] [INFO] [engine.py:499:dump_run_history] Please install tabulate for better run history output
[... 11:24:26,830] [INFO] [cache.py:195:load_model] Loading model 36d160e9 from cache.
[... 11:25:00,572] [INFO] [engine.py:266:run] Saved output model to /mnt/c/temp/models/llama
Model is saved at /mnt/c/temp/models/llama

그래도 "inference_model.json" 파일을 생성해 두고,

c:\temp> type generate_inference_model.py
# generate_inference_model.py
# This script generates the inference_model.json file for the Llama-3.2 model.
import json
import os
from transformers import AutoTokenizer

model_path = "models/llama/model"

tokenizer = AutoTokenizer.from_pretrained(model_path)
chat = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "{Content}"},
]


template = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

json_template = {
  "Name": "llama-3.2",
  "PromptTemplate": {
    "assistant": "{Content}",
    "prompt": template
  }
}

json_file = os.path.join(model_path, "inference_model.json")

with open(json_file, "w") as f:
    json.dump(json_template, f, indent=2)

c:\temp> python generate_inference_model.py

c:\temp> type models\llama\model\inference_model.json
{
  "Name": "llama-3.2",
  "PromptTemplate": {
    "assistant": "{Content}",
    "prompt": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 18 Jun 2025\n\nYou are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n{Content}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
  }
}

accessing it from C# via the OpenAI package works fine, as shown below.

string ep = "http://localhost:5273/v1";
string key = "OPENAI_API_KEY";
string alias = "llama-3.2";

// ...[omitted]...

/* Output:
[Text]: The sky appears blue because of a phenomenon called Rayleigh scattering. It's the scattering of light by small particles in the atmosphere, such as nitrogen and oxygen gases.

When sunlight enters the Earth's atmosphere, it encounters these tiny particles. The shorter, blue wavelengths of light are scattered more than the longer, red wavelengths. This is why the sky typically appears blue during the day.

However, it's worth noting that the sky doesn't appear blue at night. This is because the sun has set, and the sky is now filled with stars. The stars are much brighter than the blue light, so the sky appears more like a dark canvas.

Additionally, the color of the sky can also be influenced by other factors, such as:

* Dust and pollution: These particles can scatter light in a way that makes the sky appear more hazy or gray.
* Water vapor: This gas can scatter light in a way that makes the sky appear more white or opaque.
* Atmospheric conditions: The presence of
*/
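For reference, the OpenAI package also lets you pass the system and user messages explicitly instead of a bare string. Here is a small sketch of mine (how the Foundry Local endpoint combines this with the system prompt already baked into the prompt template is something I haven't verified):

using OpenAI;
using OpenAI.Chat;
using System.ClientModel;

// Sketch (assumption): explicit system/user chat messages against the same local endpoint.
OpenAIClientOptions options = new OpenAIClientOptions() { Endpoint = new Uri("http://localhost:5273/v1") };
ChatClient client = new("llama-3.2", new ApiKeyCredential("OPENAI_API_KEY"), options);

List<ChatMessage> messages =
[
    new SystemChatMessage("You are a helpful assistant."),
    new UserChatMessage("Why is the sky blue?"),
];

ChatCompletion completion = client.CompleteChat(messages);
Console.WriteLine(completion.Content[0].Text);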

I can't tell what effect, if any, those messages in the middle had, but I'll wrap things up here anyway. ^^

(The attached file contains the example code from this post.)




