<div style='display: inline'>
<h1 style='font-family: Malgun Gothic, Consolas; font-size: 20pt; color: #006699; text-align: center; font-weight: bold'>Transformers (Neural Network Language Model Library) Course - Chapter 2 Code Execution Results</h1>
<p>
From the following course,<br />
<br />
<pre style='margin: 10px 0px 10px 10px; padding: 10px 0px 10px 10px; background-color: #fbedbb; overflow: auto; font-family: Consolas, Verdana;'>
Transformers (신경망 언어모델 라이브러리) 강좌
; <a target='tab' href='https://wikidocs.net/book/8056'>https://wikidocs.net/book/8056</a>
</pre>
<br />
and, within it, from the contents of Chapter 2,<br />
<br />
<pre style='margin: 10px 0px 10px 10px; padding: 10px 0px 10px 10px; background-color: #fbedbb; overflow: auto; font-family: Consolas, Verdana;'>
2장. 🤗Transformers 라이브러리 사용하기
; <a target='tab' href='https://wikidocs.net/166794'>https://wikidocs.net/166794</a>
</pre>
<br />
this post lists the results of running the included code on Google Colab (or in a local environment). ^^<br />
<br />
<hr style='width: 50%' /><br />
<br />
<pre style='margin: 10px 0px 10px 10px; padding: 10px 0px 10px 10px; background-color: #fbedbb; overflow: auto; font-family: Consolas, Verdana;'>
from transformers import AutoTokenizer, AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
# <a target='tab' href='https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english'>https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english</a>
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)

raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]

inputs = tokenizer(raw_inputs, <span style='color: blue; font-weight: bold'>padding=True, truncation=True</span>, <span style='color: blue; font-weight: bold'>return_tensors="pt"</span>)
print(inputs)

outputs = model(**inputs)
print(outputs.last_hidden_state.shape)

# Execution result
{'input_ids': tensor([[ 101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102],
        [ 101, 1045, 5223, 2023, 2061, 2172, 999, 102, 0, 0, 0, 0, 0, 0, 0, 0]]),
 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}
torch.Size([2, 16, 768])
# Batch size == 2
# Sequence length == 16 (len(inputs['input_ids'][0]))
# Hidden size == 768
</pre>
<br />
The "quantumaikr/KoreanLM" model, which I found in order to try the same thing in Korean, cannot be run on Colab because of its RAM requirements. (The free tier of Colab provides 12GB of memory, while the code below needs roughly 34GB of free memory.) So, to run the Korean-language version you have to set up a separate environment. (In my case, I set up pytorch + transformers on a Windows PC and tested there.)<br />
<br />
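By the way, if you only have the free 12GB tier, it might be worth trying to shrink the load itself before giving up on Colab. The following is just an untested sketch on my part: it relies on the "torch_dtype" and "low_cpu_mem_usage" options that from_pretrained accepts (the latter needs the accelerate package). Loading in half precision roughly halves the weight memory, but I cannot guarantee that it fits in 12GB.<br />
<br />
<pre style='margin: 10px 0px 10px 10px; padding: 10px 0px 10px 10px; background-color: #fbedbb; overflow: auto; font-family: Consolas, Verdana;'>
# (Untested sketch, not part of the course code)
# Try to lower the memory needed to load quantumaikr/KoreanLM.
import torch
from transformers import AutoTokenizer, AutoModel

checkpoint = "quantumaikr/KoreanLM"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(
    checkpoint,
    torch_dtype=torch.float16,   # load the weights as fp16 instead of fp32
    low_cpu_mem_usage=True,      # avoid a second full in-memory copy while loading (requires 'accelerate')
)
</pre>
<br />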
<pre style='margin: 10px 0px 10px 10px; padding: 10px 0px 10px 10px; background-color: #fbedbb; overflow: auto; font-family: Consolas, Verdana;'>
# <a target='tab' href='https://huggingface.co/quantumaikr/KoreanLM'>https://huggingface.co/quantumaikr/KoreanLM</a>
from transformers import AutoTokenizer, AutoModel

checkpoint = "quantumaikr/KoreanLM"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)

raw_inputs = [
    "KoreanLM은 괜찮은 오픈소스 프로젝트입니다.",
    "그렇긴 해도 예제 코드를 Colab에서 수행하지 못할 정도로 많은 RAM을 요구합니다.",
]

inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)

del inputs['token_type_ids']

outputs = model(**inputs)
print(outputs.last_hidden_state.shape)

# Output
{<span style='color: blue; font-weight: bold'>'input_ids'</span>: tensor([[ 1, 22467, 26369, 31354, 29871, 237, 183, 159, 239, 179, 177, 31354, 29871, 31346, 240, 151, 139, 31189, 30784, 29871, 240, 151, 135, 30906, 239, 163, 160, 31177, 239, 161, 136, 31063, 30709, 29889, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [ 1, 29871, 31607, 238, 163, 138, 237, 187, 183, 29871, 31435, 31136, 29871, 239, 155, 139, 31306, 29871, 239, 192, 151, 31493, 31517, 1530, 370, 31054, 31093, 29871, 30970, 240, 153, 140, 30944, 30811, 29871, 238, 173, 190, 240, 152, 163, 29871, 30852, 31136, 30906, 29871, 238, 170, 145, 31354, 18113, 31286, 29871, 31527, 31231, 31980, 31063, 30709, 29889]]),
 <span style='color: blue; font-weight: bold'>'token_type_ids'</span>: tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]),
 <span style='color: blue; font-weight: bold'>'attention_mask'</span>: tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
torch.Size([2, 59, 4096])
</pre>
<br />
For reference, the "quantumaikr/KoreanLM" tokenizer additionally returns a '<a target='tab' href='https://wikidocs.net/166801'>token_type_ids</a>' entry in its output (a DistilBERT tokenizer, for example, does not return token_type_ids), which causes the following error:<br />
<br />
<pre style='margin: 10px 0px 10px 10px; padding: 10px 0px 10px 10px; background-color: #fbedbb; overflow: auto; font-family: Consolas, Verdana;'>
Traceback (most recent call last):
  File "C:\python\llml\test\sc1.py", line 15, in &lt;module&gt;
    outputs = model(**inputs)
  File "C:\python\llml\test\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
TypeError: LlamaModel.forward() got an unexpected keyword argument 'token_type_ids'
</pre>
<br />
That is why, unlike the English "distilbert-base-uncased-finetuned-sst-2-english" test, the code above adds the "del inputs['token_type_ids']" deletion step.<br />
<br />
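Alternatively, instead of deleting the key after the fact, the tokenizer call itself accepts a "return_token_type_ids" argument. I have only verified the "del" approach above, but a sketch like the following (continuing with the tokenizer, model, and raw_inputs from the KoreanLM example) should avoid the error by never producing the key in the first place:<br />
<br />
<pre style='margin: 10px 0px 10px 10px; padding: 10px 0px 10px 10px; background-color: #fbedbb; overflow: auto; font-family: Consolas, Verdana;'>
# (Untested alternative sketch to the "del inputs['token_type_ids']" workaround above)
inputs = tokenizer(
    raw_inputs,
    padding=True,
    truncation=True,
    return_tensors="pt",
    return_token_type_ids=False,  # don't emit the key that LlamaModel.forward() rejects
)
outputs = model(**inputs)  # no deletion step needed
</pre>
<br />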
<hr style='width: 50%' /><br />
<br />
<pre style='margin: 10px 0px 10px 10px; padding: 10px 0px 10px 10px; background-color: #fbedbb; overflow: auto; font-family: Consolas, Verdana;'>
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]

inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)

outputs = model(**inputs)
print('outputs.logits.shape', outputs.logits.shape)
print('outputs.logits', outputs.logits)

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)  # <a target='tab' href='https://www.sysnet.pe.kr/2/0/11966'>softmax</a>
print('predictions', predictions)

print('model.config.id2label', model.config.id2label)

# Output
{'input_ids': ...[omitted]..., 'attention_mask': ...[omitted]...}
outputs.logits.shape torch.Size([2, 2])
outputs.logits tensor([[-1.5607,  1.6123],
        [ 4.1692, -3.3464]], grad_fn=&lt;AddmmBackward0&gt;)
predictions tensor([[4.0195e-02, 9.5980e-01],
        [9.9946e-01, 5.4418e-04]], grad_fn=&lt;SoftmaxBackward0&gt;)
model.config.id2label {0: 'NEGATIVE', 1: 'POSITIVE'}
</pre>
<br />
<pre style='margin: 10px 0px 10px 10px; padding: 10px 0px 10px 10px; background-color: #fbedbb; overflow: auto; font-family: Consolas, Verdana;'>
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

checkpoint = "quantumaikr/KoreanLM"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

raw_inputs = [
    "KoreanLM은 괜찮은 오픈소스 프로젝트입니다.",
    "그렇긴 해도 예제 코드를 Colab에서 수행하지 못할 정도로 많은 RAM을 요구합니다.",
]

inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)

del inputs['token_type_ids']

outputs = model(**inputs)
print('outputs.logits.shape', outputs.logits.shape)
print('outputs.logits', outputs.logits)

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print('predictions', predictions)

print('model.config.id2label', model.config.id2label)

# Output
{'input_ids': ...[omitted]..., 'token_type_ids': ...[omitted]..., 'attention_mask': ...[omitted]...}
outputs.logits.shape torch.Size([2, 2])
outputs.logits tensor([[-0.2934,  1.9842],
        [-0.8308,  2.1648]], grad_fn=&lt;IndexBackward0&gt;)
predictions tensor([[0.0930, 0.9070],
        [0.0476, 0.9524]], grad_fn=&lt;SoftmaxBackward0&gt;)
model.config.id2label {0: 'LABEL_0', 1: 'LABEL_1'}
</pre>
<br />
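Since model.config.id2label maps class indices to label names, the softmax probabilities above can be turned into readable predictions with a couple of extra lines. This is a small sketch of my own (not part of the course code), continuing from the "distilbert-base-uncased-finetuned-sst-2-english" classification example above:<br />
<br />
<pre style='margin: 10px 0px 10px 10px; padding: 10px 0px 10px 10px; background-color: #fbedbb; overflow: auto; font-family: Consolas, Verdana;'>
# (Sketch) Pick the most probable class per sentence and map it through id2label.
pred_ids = predictions.argmax(dim=-1)              # e.g. tensor([1, 0])
for text, pred_id, probs in zip(raw_inputs, pred_ids, predictions):
    label = model.config.id2label[pred_id.item()]  # 'POSITIVE' / 'NEGATIVE'
    print(f"{label} ({probs[pred_id].item():.4f}): {text}")

# Expected output (approximately):
# POSITIVE (0.9598): I've been waiting for a HuggingFace course my whole life.
# NEGATIVE (0.9995): I hate this so much!
</pre>
<br />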
<hr style='width: 50%' /><br />
<br />
<pre style='margin: 10px 0px 10px 10px; padding: 10px 0px 10px 10px; background-color: #fbedbb; overflow: auto; font-family: Consolas, Verdana;'>
from transformers import BertConfig, BertModel

config = BertConfig()
print(config)

model = BertModel.from_pretrained("bert-base-cased")
# <a target='tab' href='https://huggingface.co/bert-base-cased'>https://huggingface.co/bert-base-cased</a>

# Output
BertConfig {
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  <span style='color: blue; font-weight: bold'>"pad_token_id": 0,</span>
  "position_embedding_type": "absolute",
  "transformers_version": "4.29.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}
</pre>
<br />
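Incidentally, a BertConfig like the one above can also be used to construct a model directly. A model built that way starts from randomly initialized weights, so its outputs are meaningless until it is trained, whereas from_pretrained downloads trained weights; either kind of model can then be written to disk with save_pretrained. A quick sketch (the "my-bert" directory name is arbitrary):<br />
<br />
<pre style='margin: 10px 0px 10px 10px; padding: 10px 0px 10px 10px; background-color: #fbedbb; overflow: auto; font-family: Consolas, Verdana;'>
from transformers import BertConfig, BertModel

config = BertConfig()

# Randomly initialized model: same architecture as BERT, but untrained weights.
random_model = BertModel(config)

# Pretrained model: weights downloaded from the Hub.
pretrained_model = BertModel.from_pretrained("bert-base-cased")

# Both can be saved locally; "my-bert" is just an arbitrary directory name here.
pretrained_model.save_pretrained("my-bert")

# ...and loaded back later from that directory.
reloaded = BertModel.from_pretrained("my-bert")
</pre>
<br />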
<pre style='margin: 10px 0px 10px 10px; padding: 10px 0px 10px 10px; background-color: #fbedbb; overflow: auto; font-family: Consolas, Verdana;'>
from transformers import BertTokenizer, AutoTokenizer

bert_tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
auto_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

print(bert_tokenizer)
print(auto_tokenizer)

print(bert_tokenizer("Using a Transformer network is simple"))
print(auto_tokenizer("Using a Transformer network is simple"))

tokens = bert_tokenizer.tokenize("Using a Transformer network is simple")  # subword tokenization
# <a target='tab' href='https://huggingface.co/bert-base-cased/blob/main/vocab.txt'>https://huggingface.co/bert-base-cased/blob/main/vocab.txt</a>
print(tokens)

ids = bert_tokenizer.convert_tokens_to_ids(tokens)
print(ids)

decoded_string = bert_tokenizer.decode([7993, 170, 13809, 23763, 2443, 1110, 3014])
print(decoded_string)

# Output
BertTokenizer(name_or_path='bert-base-cased', vocab_size=28996, model_max_length=512, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True)
BertTokenizerFast(name_or_path='bert-base-cased', vocab_size=28996, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True)
{'input_ids': [101, 7993, 170, 13809, 23763, 2443, 1110, 3014, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
{'input_ids': [101, 7993, 170, 13809, 23763, 2443, 1110, 3014, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
['Using', 'a', 'Trans', '##former', 'network', 'is', 'simple']
[7993, 170, 13809, 23763, 2443, 1110, 3014]
Using a Transformer network is simple
</pre>
<br />
<hr style='width: 50%' /><br />
<br />
<pre style='margin: 10px 0px 10px 10px; padding: 10px 0px 10px 10px; background-color: #fbedbb; overflow: auto; font-family: Consolas, Verdana;'>
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
print("Input IDs:", ids)

input_ids = torch.tensor([ids])
print("Input IDs:", input_ids)

output = model(input_ids)
print("Logits:", output.logits)

# Output
Input IDs: [1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]
Input IDs: tensor([[ 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]])
Logits: tensor([[-2.7276,  2.8789]], grad_fn=&lt;AddmmBackward0&gt;)
</pre>
<br />
<pre style='margin: 10px 0px 10px 10px; padding: 10px 0px 10px 10px; background-color: #fbedbb; overflow: auto; font-family: Consolas, Verdana;'>
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence1_ids = [[200, 200, 200]]
sequence2_ids = [[200, 200]]
batched_ids = [
    [200, 200, 200],
    [200, 200, <span style='color: blue; font-weight: bold'>tokenizer.pad_token_id</span>],
]

print(model(torch.tensor(sequence1_ids)).logits)
print(model(torch.tensor(sequence2_ids)).logits)
print(model(torch.tensor(batched_ids)).logits)

batch_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

<span style='color: blue; font-weight: bold'>attention_mask = [
    [1, 1, 1],
    [1, 1, 0],
]</span>

outputs = model(torch.tensor(batch_ids), attention_mask=torch.tensor(attention_mask))
print(outputs.logits)

# Output
tensor([[ 1.5694, -1.3895]], grad_fn=&lt;AddmmBackward0&gt;)
tensor([[ 0.5803, -0.4125]], grad_fn=&lt;AddmmBackward0&gt;)
tensor([[ 1.5694, -1.3895],
        [ 1.3374, -1.2163]], grad_fn=&lt;AddmmBackward0&gt;)
tensor([[ 1.5694, -1.3895],
        [ 0.5803, -0.4125]], grad_fn=&lt;AddmmBackward0&gt;)
</pre>
<br />
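The point of the manual example above is that padded positions distort the logits unless they are masked out. In everyday use the tokenizer builds both input_ids and attention_mask for you, so the masking happens automatically; the following short sketch (mine, not from the course text) just confirms that before moving on to the padding options below:<br />
<br />
<pre style='margin: 10px 0px 10px 10px; padding: 10px 0px 10px 10px; background-color: #fbedbb; overflow: auto; font-family: Consolas, Verdana;'>
# (Sketch) Let the tokenizer produce input_ids and attention_mask together,
# so the pad positions are masked without any manual work.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequences = [
    "I've been waiting for a HuggingFace course my whole life.",
    "So have I!",
]

batch = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch)  # pad tokens are ignored thanks to batch['attention_mask']
print(outputs.logits)
</pre>
<br />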
<hr style='width: 50%' /><br />
<br />
<pre style='margin: 10px 0px 10px 10px; padding: 10px 0px 10px 10px; background-color: #fbedbb; overflow: auto; font-family: Consolas, Verdana;'>
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

# Pads the sequences up to the longest sequence length in the list.
model_inputs = tokenizer(sequences, padding="longest")
print(model_inputs)

# Pads the sequences up to the model max length.
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, padding="max_length")
print(model_inputs)

# Pads the sequences up to the specified max length.
model_inputs = tokenizer(sequences, padding="max_length", max_length=8)
print(model_inputs)

# Output
{'input_ids': [[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102], [101, 2061, 2031, 1045, 999, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]}
{'input_ids': [[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102, 0, 0, 0, ...[omitted: max_length]..., 0, 0, 0, 0, 0], [101, 2061, 2031, 1045, 999, 102, 0, 0, 0, 0, ...[omitted: max_length]..., 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, ...[omitted: max_length]..., 0, 0, 0], [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...[omitted: max_length]..., 0]]}
{'input_ids': [[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102], [101, 2061, 2031, 1045, 999, 102, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 0, 0]]}
</pre>
<br />
<pre style='margin: 10px 0px 10px 10px; padding: 10px 0px 10px 10px; background-color: #fbedbb; overflow: auto; font-family: Consolas, Verdana;'>
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

model_inputs = tokenizer(sequence)
print(model_inputs["input_ids"])

tokens = tokenizer.tokenize(sequence)
print(tokens)

ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

print(tokenizer.decode(model_inputs["input_ids"]))
print(tokenizer.decode(ids))

# Output
[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102]
['i', "'", 've', 'been', 'waiting', 'for', 'a', 'hugging', '##face', 'course', 'my', 'whole', 'life', '.']
[1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]
<span style='color: blue; font-weight: bold'>[CLS]</span> i've been waiting for a huggingface course my whole life. <span style='color: blue; font-weight: bold'>[SEP]</span>
i've been waiting for a huggingface course my whole life.
</pre>
</p><br />
<br />
<hr />
<span style='color: Maroon'>[I would like to share opinions about this article with you. If there is anything incorrect or insufficient, or if you have any questions, please feel free to leave a comment.]</span>
</div>