
C# - How to use an LLM model published on Hugging Face with Foundry Local

As covered in the previous post,

Windows + Foundry Local - Using AI models locally
; https://www.sysnet.pe.kr/2/0/13943

Foundry Local comes with about 17 models registered in its default catalog, but if you want (and as long as you can convert the model to the ONNX format), you are free to register additional ones. The official document below covers how to bring in a model published on Hugging Face,

Compile Hugging Face models to run on Foundry Local
; https://learn.microsoft.com/en-us/azure/ai-foundry/foundry-local/how-to/how-to-compile-hugging-face-models?tabs=Bash

Shall we try a quick hands-on? ^^




As the document describes, first prepare an environment where the Olive tool can run.

// WSL 2 Ubuntu 24.04 environment (or, on Windows, prepare an Anaconda3 environment)

// conda create -n huggingface-build python=3.10 -y
// conda activate huggingface-build
// conda remove -n huggingface-build --all -y

// Olive latest documentation - Getting started
// https://microsoft.github.io/Olive/getting-started/getting-started.html

// github.com/microsoft/Olive
// https://github.com/microsoft/Olive

// If you have a GPU
pip install olive-ai[gpu,finetune]
pip install transformers onnxruntime-genai-cuda

// If you only have a CPU
pip install olive-ai[cpu,finetune]
pip install transformers onnxruntime-genai

Next, issue a User access token on the Hugging Face site and log in once.

C:\temp> huggingface-cli login
...[omitted]...

C:\temp> huggingface-cli whoami
...[omitted]...

Now pick the model you want from the Hugging Face site. The document uses the "meta-llama/Llama-3.2-1B-Instruct" model as its example,

olive auto-opt --model_name_or_path meta-llama/Llama-3.2-1B-Instruct --trust_remote_code --output_path models/llama --device cpu --provider CPUExecutionProvider --use_ort_genai --precision int4 --log_level 1

but if you actually try it, the exercise fails with an error. There is an open issue about this,

Quickstart doesn't work. #1916
; https://github.com/microsoft/Olive/issues/1916

so until it is resolved, it's better to practice with a different model. ^^; The problem is that I don't know this field well. From what I can tell, not every model on Hugging Face can be converted to ONNX with Olive, which made it hard to pick a suitable one. Fortunately, the video below,

Windows Dev Chat - October 24, 2024
; https://www.youtube.com/live/lAc1fq_0ftw?t=775s

"Qwen/Qwen2.5-Math-1.5B-Instruct" 모델을 다루는데, 저 역시 해당 영상의 명령어로 잘 변환이 됐습니다.

(huggingface-build) testusr@testpc:/mnt/c/temp$ olive auto-opt --model_name_or_path Qwen/Qwen2.5-Math-1.5B-Instruct --trust_remote_code --output_path models/Qwen2.5-Math-1.5B-Instruct --device cpu --provider CPUExecutionProvider --precision int4 --use_model_builder --log_level 1
Loading HuggingFace model from Qwen/Qwen2.5-Math-1.5B-Instruct
[... 08:52:14,438] [INFO] [run.py:142:run_engine] Running workflow default_workflow
[... 08:52:14,450] [INFO] [cache.py:138:__init__] Using cache directory: /mnt/c/temp/.olive-cache/default_workflow
[... 08:52:14,478] [INFO] [accelerator_creator.py:217:create_accelerators] Running workflow on accelerator specs: cpu-cpu
[... 08:52:14,488] [INFO] [engine.py:223:run] Running Olive on accelerator: cpu-cpu
[... 08:52:14,490] [INFO] [engine.py:864:_create_system] Creating target system ...
[... 08:52:14,491] [INFO] [engine.py:867:_create_system] Target system created in 0.000242 seconds
[... 08:52:14,491] [INFO] [engine.py:879:_create_system] Creating host system ...
[... 08:52:14,491] [INFO] [engine.py:882:_create_system] Host system created in 0.000174 seconds
[... 08:52:14,743] [INFO] [engine.py:683:_run_pass] Running pass model_builder:modelbuilder
GroupQueryAttention (GQA) is used in this model.
Reading embedding layer
Reading decoder layer 0
Reading decoder layer 1
Reading decoder layer 2
...[omitted]...
Reading decoder layer 27
Reading final norm
Reading LM head
Saving ONNX model in /mnt/c/temp/.olive-cache/default_workflow/runs/27b66e29/models
Saving GenAI config in /mnt/c/temp/.olive-cache/default_workflow/runs/27b66e29/models
Saving processing files in /mnt/c/temp/.olive-cache/default_workflow/runs/27b66e29/models for GenAI
[... 09:05:24,768] [INFO] [engine.py:757:_run_pass] Pass model_builder:modelbuilder finished in 790.024721 seconds
[... 09:05:24,790] [INFO] [engine.py:683:_run_pass] Running pass extract_adapters:extractadapters
[... 09:05:43,445] [INFO] [extract_adapters.py:177:_run_for_config] No lora modules found in the model. Returning the original model.
[... 09:05:43,482] [INFO] [engine.py:757:_run_pass] Pass extract_adapters:extractadapters finished in 18.692435 seconds
[... 09:05:43,503] [INFO] [engine.py:241:run] Run history for cpu-cpu:
[... 09:05:43,504] [INFO] [engine.py:499:dump_run_history] Please install tabulate for better run history output
[... 09:05:43,514] [INFO] [cache.py:195:load_model] Loading model 030c6a62 from cache.
[... 09:06:53,028] [INFO] [engine.py:266:run] Saved output model to /mnt/c/temp/models/Qwen2.5-Math-1.5B-Instruct
Model is saved at /mnt/c/temp/models/Qwen2.5-Math-1.5B-Instruct

Now, the output above needs to be registered so Foundry Local can use it, and in fact this is done simply by changing Foundry Local's cache (working) directory.

C:\temp\models> foundry cache cd c:\temp\models
Restarting service...
🔴 Service is stopped.
🟢 Service is Started on http://localhost:5273, PID 60904!

In its current state, however, listing the model's information looks like this,

C:\temp\models> foundry cache ls
Models cached on device:
   Alias                         Model ID
💾 Model was not found in catalogmodel

여기서 "Model ID"는 ./models/Qwen2.5-Math-1.5B-Instruct/model 디렉터리에 "inference_model.json" 파일을 다음과 같이 생성해 바꿀 수 있습니다.

// inference_model.json does not exist yet, so create it

C:\temp\models> type .\Qwen2.5-Math-1.5B-Instruct\model\inference_model.json
{
  "Name": "Qwen2.5-Math-1.5B-Instruct",
  "PromptTemplate": {
    "assistant": "{Content}",
    "prompt": "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n{Content}<|im_end|>\n<|im_start|>assistant\n"
  }
}

C:\temp\models> foundry cache ls
Models cached on device:
   Alias                         Model ID
💾 Model was not found in catalogQwen2.5-Math-1.5B-Instruct

Unfortunately, the "Alias" still doesn't show up. ^^; I don't know yet how it can be set. (If anyone knows, please leave a comment. ^^;)
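As a side note, when the alias is unclear it can help to check which model IDs the endpoint itself reports. The following is only a rough sketch of mine (not from the official docs), assuming Foundry Local exposes the standard OpenAI-compatible /v1/models listing on the endpoint shown by "foundry service status":

using System.Net.Http;
using System.Text.Json;

// Sketch (assumption): query the OpenAI-compatible /v1/models endpoint and print
// the model IDs the local service reports, i.e. the exact IDs a client must use.
using HttpClient http = new HttpClient();
string json = await http.GetStringAsync("http://localhost:5273/v1/models");

using JsonDocument doc = JsonDocument.Parse(json);
foreach (JsonElement model in doc.RootElement.GetProperty("data").EnumerateArray())
{
    Console.WriteLine(model.GetProperty("id").GetString());
}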




In any case, once you've reached that state, you can now put together a C# client like the following and try it out.

using OpenAI;
using OpenAI.Chat;
using System.ClientModel;

namespace ConsoleApp1;

internal class Program
{
    // Install-Package OpenAI 
    static async Task Main(string[] args)
    {
        string ep = "http://localhost:5273/v1"; // Foundry Local's default endpoint (check it with the "foundry service status" command)
        string key = "OPENAI_API_KEY";
        string alias = "Qwen2.5-Math-1.5B-Instruct";

        OpenAIClientOptions options = new OpenAIClientOptions();
        options.Endpoint = new Uri(ep);

        ApiKeyCredential akc = new ApiKeyCredential(key);
        ChatClient client = new(alias, akc, options);

        ChatCompletion completion = client.CompleteChat("Why is the sky blue?");

        foreach (var message in completion.Content)
        {
            Console.WriteLine($"[{message.Kind}]: {message.Text}");
        }
    }
}

/* Execution result:
[Text]: The sky is blue because of a process called Rayleigh Scintometry. When light from the sun reaches the Earth's atmosphere, it gets spread out into a range of colors due to theatic particles in the air. This is because the atoms and molecules in the air can be Absorbed and re radiated at different frequencies, causing the light to be split into its component colors. The blue color is the result of the light being split into a range of colors, with the blue being the most dominant. This is why the sky appears blue.
*/
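By the way, since a reply from a local model can take a while to complete, receiving it as a stream may feel better. The snippet below is just my own sketch using the streaming API of the same OpenAI package (CompleteChatStreamingAsync); the endpoint and alias are assumed to be the same values used above:

using OpenAI;
using OpenAI.Chat;
using System.ClientModel;

// Sketch (assumption): a streaming variant of the call above, printing tokens as they arrive.
OpenAIClientOptions options = new OpenAIClientOptions() { Endpoint = new Uri("http://localhost:5273/v1") };
ChatClient client = new("Qwen2.5-Math-1.5B-Instruct", new ApiKeyCredential("OPENAI_API_KEY"), options);

await foreach (StreamingChatCompletionUpdate update in client.CompleteChatStreamingAsync("Why is the sky blue?"))
{
    foreach (ChatMessageContentPart part in update.ContentUpdate)
    {
        Console.Write(part.Text); // write each partial content chunk as it is received
    }
}
Console.WriteLine();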

Not bad, right? ^^ What's interesting is that the same thing is possible even without Foundry Local, by using an API that works with the ONNX model directly. In fact, the Olive documentation includes code like the following that uses the model above without Foundry Local.

using Microsoft.ML.OnnxRuntimeGenAI;

internal class Program
{
    // Install-Package Microsoft.ML.OnnxRuntimeGenAI

    private static void Main(string[] args)
    {
        string modelPath = @"C:\temp\models\Qwen2.5-Math-1.5B-Instruct\model";

        Console.Write("Loading model from " + modelPath + "...");
        using Model model = new(modelPath);
        Console.Write("Done\n");
        using Tokenizer tokenizer = new(model);
        using TokenizerStream tokenizerStream = tokenizer.CreateStream();


        while (true)
        {
            Console.Write("User:");

            string prompt = "<|im_start|>user\n" +
                            Console.ReadLine() +
                            "<|im_end|>\n<|im_start|>assistant\n";
            var sequences = tokenizer.Encode(prompt);

            using GeneratorParams gParams = new GeneratorParams(model);
            gParams.SetSearchOption("max_length", 200);
            using Generator generator = new(model, gParams);
            generator.AppendTokenSequences(sequences);

            Console.Out.Write("\nAI:");
            while (!generator.IsDone())
            {
                generator.GenerateNextToken();
                var token = generator.GetSequence(0)[^1];
                Console.Out.Write(tokenizerStream.Decode(token));
                Console.Out.Flush();
            }
            Console.WriteLine();
        }
    }
}

/* Execution result:
Loading model from E:\foundry_cache\models\Qwen2.5-Math-1.5B-Instruct\model...Done
User:Why is the sky blue?

AI:The sky is blue because of the way our eyes interpret the light that reaches us. When light from the sun reaches us, it is composed of all the colors of the visible spectrum. However, our eyes are not equally sensitive to all colors. The light that our eyes see as blue is actually a combination of many colors, with the blue color being the dominant one.

The sky is blue because the light that reaches us from the sun is a mixture of colors, and our eyes interpret this mixture to give us the appear blue. The exact composition of the light that reaches us from the sun is complex, but it is known that the blue color is the result of the light that is reflected by the Earth's surface and the air, which contains small particles that scatter the light. This process is known as Rayleigh Scating, and it is what gives us the blue color we see in the sky.

In summary, the sky
User:
*/
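The answer above is cut off mid-sentence because the sample caps max_length at 200. If you also want the ONNX path to use the same system prompt that the inference_model.json template provides, something along the following lines should work; this is only my own sketch reusing the same APIs as the code above, not part of the Olive docs:

using Microsoft.ML.OnnxRuntimeGenAI;

// Sketch (assumption): same model as above, but the prompt now includes the system
// message from the inference_model.json template and max_length is raised so the
// reply is less likely to be truncated. Adjust modelPath to your environment.
string modelPath = @"C:\temp\models\Qwen2.5-Math-1.5B-Instruct\model";

using Model model = new(modelPath);
using Tokenizer tokenizer = new(model);
using TokenizerStream tokenizerStream = tokenizer.CreateStream();

string question = "Why is the sky blue?";
string prompt = "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n" +
                "<|im_start|>user\n" + question + "<|im_end|>\n" +
                "<|im_start|>assistant\n";

using GeneratorParams gParams = new GeneratorParams(model);
gParams.SetSearchOption("max_length", 1024); // the original sample's 200 truncated the reply
using Generator generator = new(model, gParams);
generator.AppendTokenSequences(tokenizer.Encode(prompt));

while (!generator.IsDone())
{
    generator.GenerateNextToken();
    Console.Write(tokenizerStream.Decode(generator.GetSequence(0)[^1]));
}
Console.WriteLine();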




For reference, the document also shows that the "inference_model.json" file we created by hand above can be generated with Python code like this.

# generate_inference_model.py
# This script generates the inference_model.json file for the Llama-3.2 model.
import json
import os
from transformers import AutoTokenizer

model_path = "models/Qwen2.5-Math-1.5B-Instruct/model"

tokenizer = AutoTokenizer.from_pretrained(model_path)
chat = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "{Content}"},
]

template = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

json_template = {
  "Name": "Qwen2.5-Math-1.5B-Instruct",
  "PromptTemplate": {
    "assistant": "{Content}",
    "prompt": template
  }
}

json_file = os.path.join(model_path, "inference_model.json")

with open(json_file, "w") as f:
    json.dump(json_template, f, indent=2)

What matters in the code above is that it generates the (somewhat unfamiliar-looking) content that goes into the JSON file's "prompt" field,

{
  "Name": "Qwen2.5-Math-1.5B-Instruct",
  "PromptTemplate": {
    "assistant": "{Content}",
    "prompt": "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n{Content}<|im_end|>\n<|im_start|>assistant\n"
  },
  "Alias": "Qwen2.5-Math-1.5B"
}

from the familiar chat list.

chat = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "{Content}"},
]

Unless you need to change that part often, handling the JSON file directly is just as fine.





As mentioned at the start, if you run the Olive tool against the "meta-llama/Llama-3.2-1B-Instruct" model as the document describes,

// conda create -n foundry-local-build python=3.10 -y
// conda activate foundry-local-build

// pip install olive-ai[gpu,finetune]
// pip install transformers onnxruntime-genai-cuda

$ olive auto-opt --model_name_or_path meta-llama/Llama-3.2-1B-Instruct --trust_remote_code --output_path models/llama --device cpu --provider CPUExecutionProvider --use_ort_genai --precision int4 --log_level 1

you get warnings like these,

/home/testusr/miniconda3/envs/foundry-local-build/lib/python3.10/site-packages/transformers/cache_utils.py:556: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  or not self.key_cache[layer_idx].numel()  # the layer has no cache
/home/testusr/miniconda3/envs/foundry-local-build/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py:589: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if sequence_length != 1:
/home/testusr/miniconda3/envs/foundry-local-build/lib/python3.10/site-packages/transformers/cache_utils.py:539: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  elif (
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.

and then this error occurs.

Traceback (most recent call last):
  File "/home/testusr/miniconda3/envs/foundry-local-build/lib/python3.10/site-packages/transformers/convert_slow_tokenizer.py", line 1594, in extract_vocab_merges_from_model
    from tiktoken.load import load_tiktoken_bpe
ModuleNotFoundError: No module named 'tiktoken'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/testusr/miniconda3/envs/foundry-local-build/lib/python3.10/site-packages/transformers/convert_slow_tokenizer.py", line 1737, in convert_slow_tokenizer
    ).converted()
  File "/home/testusr/miniconda3/envs/foundry-local-build/lib/python3.10/site-packages/transformers/convert_slow_tokenizer.py", line 1631, in converted
    tokenizer = self.tokenizer()
  File "/home/testusr/miniconda3/envs/foundry-local-build/lib/python3.10/site-packages/transformers/convert_slow_tokenizer.py", line 1624, in tokenizer
    vocab_scores, merges = self.extract_vocab_merges_from_model(self.vocab_file)
  File "/home/testusr/miniconda3/envs/foundry-local-build/lib/python3.10/site-packages/transformers/convert_slow_tokenizer.py", line 1596, in extract_vocab_merges_from_model
    raise ValueError(
ValueError: `tiktoken` is required to read a `tiktoken` file. Install it with `pip install tiktoken`.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/testusr/miniconda3/envs/foundry-local-build/bin/olive", line 8, in <module>
    sys.exit(main())
  File "/home/testusr/miniconda3/envs/foundry-local-build/lib/python3.10/site-packages/olive/cli/launcher.py", line 64, in main
    service.run()
  File "/home/testusr/miniconda3/envs/foundry-local-build/lib/python3.10/site-packages/olive/cli/auto_opt.py", line 178, in run
    self._run_workflow()
  File "/home/testusr/miniconda3/envs/foundry-local-build/lib/python3.10/site-packages/olive/cli/base.py", line 42, in _run_workflow
    output = olive_run(run_config)
  File "/home/testusr/miniconda3/envs/foundry-local-build/lib/python3.10/site-packages/olive/workflows/run/run.py", line 255, in run
    return run_engine(package_config, run_config)
  File "/home/testusr/miniconda3/envs/foundry-local-build/lib/python3.10/site-packages/olive/workflows/run/run.py", line 199, in run_engine
    return engine.run(
  File "/home/testusr/miniconda3/envs/foundry-local-build/lib/python3.10/site-packages/olive/engine/engine.py", line 230, in run
    run_result = self.run_accelerator(
  File "/home/testusr/miniconda3/envs/foundry-local-build/lib/python3.10/site-packages/olive/engine/engine.py", line 313, in run_accelerator
    output_footprint = self._run_no_search(input_model_config, input_model_id, accelerator_spec, output_dir)
  File "/home/testusr/miniconda3/envs/foundry-local-build/lib/python3.10/site-packages/olive/engine/engine.py", line 356, in _run_no_search
    should_prune, signal, model_ids = self._run_passes(input_model_config, input_model_id, accelerator_spec)
  File "/home/testusr/miniconda3/envs/foundry-local-build/lib/python3.10/site-packages/olive/engine/engine.py", line 639, in _run_passes
    model_config, model_id = self._run_pass(
  File "/home/testusr/miniconda3/envs/foundry-local-build/lib/python3.10/site-packages/olive/engine/engine.py", line 740, in _run_pass
    output_model_config = host.run_pass(p, input_model_config, output_model_path)
  File "/home/testusr/miniconda3/envs/foundry-local-build/lib/python3.10/site-packages/olive/systems/local.py", line 29, in run_pass
    output_model = the_pass.run(model, output_model_path)
  File "/home/testusr/miniconda3/envs/foundry-local-build/lib/python3.10/site-packages/olive/passes/olive_pass.py", line 242, in run
    output_model = self._run_for_config(model, self.config, output_model_path)
  File "/home/testusr/miniconda3/envs/foundry-local-build/lib/python3.10/site-packages/olive/passes/onnx/conversion.py", line 140, in _run_for_config
    additional_files.extend(model.save_metadata(str(output_dir), exclude_load_keys=["quantization_config"]))
  File "/home/testusr/miniconda3/envs/foundry-local-build/lib/python3.10/site-packages/olive/model/handler/mixin/hf.py", line 107, in save_metadata
    get_tokenizer(output_dir)
  File "/home/testusr/miniconda3/envs/foundry-local-build/lib/python3.10/site-packages/olive/common/hf/utils.py", line 196, in get_tokenizer
    tokenizer = from_pretrained(AutoTokenizer, model_name_or_path, "tokenizer", **kwargs)
  File "/home/testusr/miniconda3/envs/foundry-local-build/lib/python3.10/site-packages/olive/common/hf/utils.py", line 85, in from_pretrained
    return cls.from_pretrained(get_pretrained_name_or_path(model_name_or_path, mlflow_dir), **kwargs)
  File "/home/testusr/miniconda3/envs/foundry-local-build/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 1032, in from_pretrained
    return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/home/testusr/miniconda3/envs/foundry-local-build/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2025, in from_pretrained
    return cls._from_pretrained(
  File "/home/testusr/miniconda3/envs/foundry-local-build/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2278, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/home/testusr/miniconda3/envs/foundry-local-build/lib/python3.10/site-packages/transformers/models/llama/tokenization_llama_fast.py", line 154, in __init__
    super().__init__(
  File "/home/testusr/miniconda3/envs/foundry-local-build/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 139, in __init__
    fast_tokenizer = convert_slow_tokenizer(self, from_tiktoken=True)
  File "/home/testusr/miniconda3/envs/foundry-local-build/lib/python3.10/site-packages/transformers/convert_slow_tokenizer.py", line 1739, in convert_slow_tokenizer
    raise ValueError(
ValueError: Converting from SentencePiece and Tiktoken failed, if a converter for SentencePiece is available, provide a model path with a SentencePiece tokenizer.model file.Currently available slow->fast converters: ['AlbertTokenizer', 'BartTokenizer', 'BarthezTokenizer', 'BertTokenizer', 'BigBirdTokenizer', 'BlenderbotTokenizer', 'CamembertTokenizer', 'CLIPTokenizer', 'CodeGenTokenizer', 'ConvBertTokenizer', 'DebertaTokenizer', 'DebertaV2Tokenizer', 'DistilBertTokenizer', 'DPRReaderTokenizer', 'DPRQuestionEncoderTokenizer', 'DPRContextEncoderTokenizer', 'ElectraTokenizer', 'FNetTokenizer', 'FunnelTokenizer', 'GPT2Tokenizer', 'HerbertTokenizer', 'LayoutLMTokenizer', 'LayoutLMv2Tokenizer', 'LayoutLMv3Tokenizer', 'LayoutXLMTokenizer', 'LongformerTokenizer', 'LEDTokenizer', 'LxmertTokenizer', 'MarkupLMTokenizer', 'MBartTokenizer', 'MBart50Tokenizer', 'MPNetTokenizer', 'MobileBertTokenizer', 'MvpTokenizer', 'NllbTokenizer', 'OpenAIGPTTokenizer', 'PegasusTokenizer', 'Qwen2Tokenizer', 'RealmTokenizer', 'ReformerTokenizer', 'RemBertTokenizer', 'RetriBertTokenizer', 'RobertaTokenizer', 'RoFormerTokenizer', 'SeamlessM4TTokenizer', 'SqueezeBertTokenizer', 'T5Tokenizer', 'UdopTokenizer', 'WhisperTokenizer', 'XLMRobertaTokenizer', 'XLNetTokenizer', 'SplinterTokenizer', 'XGLMTokenizer', 'LlamaTokenizer', 'CodeLlamaTokenizer', 'GemmaTokenizer', 'Phi3Tokenizer']

Following the error messages, you'll end up installing the "tiktoken" and "SentencePiece" modules as well, but the error still occurs all the same.

Traceback (most recent call last):
  File "C:\Users\testusr\anaconda3\envs\huggingface-build\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\testusr\anaconda3\envs\huggingface-build\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\testusr\anaconda3\envs\huggingface-build\Scripts\olive.exe\__main__.py", line 7, in <module>
  File "C:\Users\testusr\anaconda3\envs\huggingface-build\lib\site-packages\olive\cli\launcher.py", line 64, in main
    service.run()
  File "C:\Users\testusr\anaconda3\envs\huggingface-build\lib\site-packages\olive\cli\auto_opt.py", line 178, in run
    self._run_workflow()
  File "C:\Users\testusr\anaconda3\envs\huggingface-build\lib\site-packages\olive\cli\base.py", line 42, in _run_workflow
    output = olive_run(run_config)
  File "C:\Users\testusr\anaconda3\envs\huggingface-build\lib\site-packages\olive\workflows\run\run.py", line 255, in run
    return run_engine(package_config, run_config)
  File "C:\Users\testusr\anaconda3\envs\huggingface-build\lib\site-packages\olive\workflows\run\run.py", line 199, in run_engine
    return engine.run(
  File "C:\Users\testusr\anaconda3\envs\huggingface-build\lib\site-packages\olive\engine\engine.py", line 230, in run
    run_result = self.run_accelerator(
  File "C:\Users\testusr\anaconda3\envs\huggingface-build\lib\site-packages\olive\engine\engine.py", line 313, in run_accelerator
    output_footprint = self._run_no_search(input_model_config, input_model_id, accelerator_spec, output_dir)
  File "C:\Users\testusr\anaconda3\envs\huggingface-build\lib\site-packages\olive\engine\engine.py", line 356, in _run_no_search
    should_prune, signal, model_ids = self._run_passes(input_model_config, input_model_id, accelerator_spec)
  File "C:\Users\testusr\anaconda3\envs\huggingface-build\lib\site-packages\olive\engine\engine.py", line 639, in _run_passes
    model_config, model_id = self._run_pass(
  File "C:\Users\testusr\anaconda3\envs\huggingface-build\lib\site-packages\olive\engine\engine.py", line 740, in _run_pass
    output_model_config = host.run_pass(p, input_model_config, output_model_path)
  File "C:\Users\testusr\anaconda3\envs\huggingface-build\lib\site-packages\olive\systems\local.py", line 29, in run_pass
    output_model = the_pass.run(model, output_model_path)
  File "C:\Users\testusr\anaconda3\envs\huggingface-build\lib\site-packages\olive\passes\olive_pass.py", line 242, in run
    output_model = self._run_for_config(model, self.config, output_model_path)
  File "C:\Users\testusr\anaconda3\envs\huggingface-build\lib\site-packages\olive\passes\onnx\conversion.py", line 140, in _run_for_config
    additional_files.extend(model.save_metadata(str(output_dir), exclude_load_keys=["quantization_config"]))
  File "C:\Users\testusr\anaconda3\envs\huggingface-build\lib\site-packages\olive\model\handler\mixin\hf.py", line 107, in save_metadata
    get_tokenizer(output_dir)
  File "C:\Users\testusr\anaconda3\envs\huggingface-build\lib\site-packages\olive\common\hf\utils.py", line 196, in get_tokenizer
    tokenizer = from_pretrained(AutoTokenizer, model_name_or_path, "tokenizer", **kwargs)
  File "C:\Users\testusr\anaconda3\envs\huggingface-build\lib\site-packages\olive\common\hf\utils.py", line 85, in from_pretrained
    return cls.from_pretrained(get_pretrained_name_or_path(model_name_or_path, mlflow_dir), **kwargs)
  File "C:\Users\testusr\anaconda3\envs\huggingface-build\lib\site-packages\transformers\models\auto\tokenization_auto.py", line 1032, in from_pretrained
    return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "C:\Users\testusr\anaconda3\envs\huggingface-build\lib\site-packages\transformers\tokenization_utils_base.py", line 2025, in from_pretrained
    return cls._from_pretrained(
  File "C:\Users\testusr\anaconda3\envs\huggingface-build\lib\site-packages\transformers\tokenization_utils_base.py", line 2063, in _from_pretrained
    slow_tokenizer = (cls.slow_tokenizer_class)._from_pretrained(
  File "C:\Users\testusr\anaconda3\envs\huggingface-build\lib\site-packages\transformers\tokenization_utils_base.py", line 2278, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "C:\Users\testusr\anaconda3\envs\huggingface-build\lib\site-packages\transformers\models\llama\tokenization_llama.py", line 171, in __init__
    self.sp_model = self.get_spm_processor(kwargs.pop("from_slow", False))
  File "C:\Users\testusr\anaconda3\envs\huggingface-build\lib\site-packages\transformers\models\llama\tokenization_llama.py", line 198, in get_spm_processor
    tokenizer.Load(self.vocab_file)
  File "C:\Users\testusr\anaconda3\envs\huggingface-build\lib\site-packages\sentencepiece\__init__.py", line 961, in Load
    return self.LoadFromFile(model_file)
  File "C:\Users\testusr\anaconda3\envs\huggingface-build\lib\site-packages\sentencepiece\__init__.py", line 316, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
TypeError: not a string

This suggests something changed along the way. Looking at the model, its files were uploaded in bulk about 9 months ago,

meta-llama/Llama-3.2-1B-Instruct - Files and versions
; https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct/tree/main

so presumably the cause is a change in one of the components on the Olive side. Among them, transformers seemed the most important, so I tried switching its version,

pip install transformers==4.51.3

and while there is no error this time, the output is cluttered with messages like these.

[... 11:16:26,952] [INFO] [common.py:127:model_proto_to_file] Deleting existing external data file: /mnt/c/temp/.olive-cache/default_workflow/runs/2f303eca/models/model.onnx.data
[... 11:17:47,803] [INFO] [engine.py:757:_run_pass] Pass conversion:onnxconversion finished in 229.748523 seconds
[... 11:17:47,832] [INFO] [engine.py:683:_run_pass] Running pass genai_config_only:modelbuilder
GroupQueryAttention (GQA) is used in this model.
Saving GenAI config in /mnt/c/temp/.olive-cache/default_workflow/runs/333f1dcb/models
Saving processing files in /mnt/c/temp/.olive-cache/default_workflow/runs/333f1dcb/models for GenAI
[... 11:17:49,873] [INFO] [engine.py:757:_run_pass] Pass genai_config_only:modelbuilder finished in 2.040874 seconds
[... 11:17:49,895] [INFO] [engine.py:683:_run_pass] Running pass peephole_optimizer:onnxpeepholeoptimizer
... 11:18:34,665 onnxscript.rewriter.collapse_slices [INFO] - The value 'start', 'end', 'axis', 'step' is not statically known.
...[omitted]...
... 11:18:34,691 onnxscript.rewriter.collapse_slices [INFO] - The value 'start', 'end', 'axis', 'step' is not statically known.
... 11:18:34,692 onnxscript.rewriter.collapse_slices [INFO] - The value 'indices' is not statically known.
... 11:18:34,903 onnx_ir.passes.common.unused_removal [INFO] - Removed 226 unused nodes
Applied 4 of general pattern rewrite rules.
... 11:18:34,958 onnx_ir.passes.common.unused_removal [INFO] - No unused functions to remove
... 11:18:38,212 onnxscript.rewriter.collapse_slices [INFO] - The value 'start', 'end', 'axis', 'step' is not statically known.
...[omitted]...
... 11:18:38,242 onnxscript.rewriter.collapse_slices [INFO] - The value 'start', 'end', 'axis', 'step' is not statically known.
... 11:18:38,243 onnxscript.rewriter.collapse_slices [INFO] - The value 'indices' is not statically known.
... 11:18:38,437 onnx_ir.passes.common.unused_removal [INFO] - Removed 2 unused nodes
... 11:18:38,491 onnx_ir.passes.common.unused_removal [INFO] - No unused functions to remove
[... 11:18:41,538] [WARNING] [peephole_optimizer.py:256:onnxoptimizer_optimize] Please install `onnxoptimizer` to apply more optimization.
[... 11:20:16,163] [INFO] [engine.py:757:_run_pass] Pass peephole_optimizer:onnxpeepholeoptimizer finished in 146.268799 seconds
[... 11:20:16,184] [INFO] [engine.py:683:_run_pass] Running pass transformer_optimizer:orttransformersoptimization
... 11:21:01,694 onnx_model [INFO] - Removed 176 Cast nodes with output type same as input
... 11:21:02,821 fusion_base [INFO] - Fused SimplifiedLayerNormalization: 33
... 11:21:14,752 fusion_base [INFO] - Fused SkipSimplifiedLayerNormalization: 32
... 11:21:32,631 fusion_utils [INFO] - Remove reshape node /model/Reshape_1 since its input shape is same as output: [4]
... 11:21:32,631 fusion_utils [INFO] - Remove reshape node /model/Reshape_2 since its input shape is same as output: ['batch_size', 1, 'sequence_length', 'past_sequence_length + sequence_length']
... 11:21:32,631 fusion_utils [INFO] - Remove reshape node /model/rotary_emb/Reshape since its input shape is same as output: [3]
...[omitted]...
... 11:21:32,636 fusion_utils [INFO] - Remove reshape node /model/layers.15/self_attn/Reshape_5 since its input shape is same as output: [5]
... 11:21:32,899 onnx_model [INFO] - Removed 40 nodes
... 11:21:32,913 onnx_model_gpt2 [INFO] - postprocess: remove Reshape count: 0
... 11:21:32,960 onnx_model_bert [INFO] - opset version: 20
[... 11:23:05,998] [INFO] [engine.py:757:_run_pass] Pass transformer_optimizer:orttransformersoptimization finished in 169.813711 seconds
[... 11:23:06,018] [INFO] [engine.py:683:_run_pass] Running pass matmul4:onnxmatmul4quantizer
[... 11:24:15,682] [INFO] [engine.py:757:_run_pass] Pass matmul4:onnxmatmul4quantizer finished in 69.664231 seconds
[... 11:24:15,711] [INFO] [engine.py:683:_run_pass] Running pass extract_adapters:extractadapters
[... 11:24:26,766] [INFO] [extract_adapters.py:177:_run_for_config] No lora modules found in the model. Returning the original model.
[... 11:24:26,791] [INFO] [engine.py:757:_run_pass] Pass extract_adapters:extractadapters finished in 11.079207 seconds
[... 11:24:26,814] [INFO] [engine.py:241:run] Run history for cpu-cpu:
[... 11:24:26,815] [INFO] [engine.py:499:dump_run_history] Please install tabulate for better run history output
[... 11:24:26,830] [INFO] [cache.py:195:load_model] Loading model 36d160e9 from cache.
[... 11:25:00,572] [INFO] [engine.py:266:run] Saved output model to /mnt/c/temp/models/llama
Model is saved at /mnt/c/temp/models/llama

그래도 "inference_model.json" 파일을 생성해 두고,

c:\temp> type generate_inference_model.py
# generate_inference_model.py
# This script generates the inference_model.json file for the Llama-3.2 model.
import json
import os
from transformers import AutoTokenizer

model_path = "models/llama/model"

tokenizer = AutoTokenizer.from_pretrained(model_path)
chat = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "{Content}"},
]


template = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

json_template = {
  "Name": "llama-3.2",
  "PromptTemplate": {
    "assistant": "{Content}",
    "prompt": template
  }
}

json_file = os.path.join(model_path, "inference_model.json")

with open(json_file, "w") as f:
    json.dump(json_template, f, indent=2)

c:\temp> python generate_inference_model.py

c:\temp> type models\llama\model\inference_model.json
{
  "Name": "llama-3.2",
  "PromptTemplate": {
    "assistant": "{Content}",
    "prompt": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 18 Jun 2025\n\nYou are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n{Content}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
  }
}

accessing it from C# via the OpenAI package works fine, as shown below.

string ep = "http://localhost:5273/v1";
string key = "OPENAI_API_KEY";
string alias = "llama-3.2";

// ...[omitted]...

/* Output:
[Text]: The sky appears blue because of a phenomenon called Rayleigh scattering. It's the scattering of light by small particles in the atmosphere, such as nitrogen and oxygen gases.

When sunlight enters the Earth's atmosphere, it encounters these tiny particles. The shorter, blue wavelengths of light are scattered more than the longer, red wavelengths. This is why the sky typically appears blue during the day.

However, it's worth noting that the sky doesn't appear blue at night. This is because the sun has set, and the sky is now filled with stars. The stars are much brighter than the blue light, so the sky appears more like a dark canvas.

Additionally, the color of the sky can also be influenced by other factors, such as:

* Dust and pollution: These particles can scatter light in a way that makes the sky appear more hazy or gray.
* Water vapor: This gas can scatter light in a way that makes the sky appear more white or opaque.
* Atmospheric conditions: The presence of
*/
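For reference, the OpenAI package also lets you pass the system and user messages explicitly instead of a bare string. Here is a small sketch of mine (how the Foundry Local endpoint combines this with the system prompt already baked into the prompt template is something I haven't verified):

using OpenAI;
using OpenAI.Chat;
using System.ClientModel;

// Sketch (assumption): explicit system/user chat messages against the same local endpoint.
OpenAIClientOptions options = new OpenAIClientOptions() { Endpoint = new Uri("http://localhost:5273/v1") };
ChatClient client = new("llama-3.2", new ApiKeyCredential("OPENAI_API_KEY"), options);

List<ChatMessage> messages =
[
    new SystemChatMessage("You are a helpful assistant."),
    new UserChatMessage("Why is the sky blue?"),
];

ChatCompletion completion = client.CompleteChat(messages);
Console.WriteLine(completion.Content[0].Text);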

I can't tell what effect, if any, those messages in the middle had, but I'll wrap things up here anyway. ^^

(The attached file contains the example code from this post.)




