J & J - 정성태의 닷넷 이야기


<div style='display: inline'>
<h1 style='font-family: Malgun Gothic, Consolas; font-size: 20pt; color: #006699; text-align: center; font-weight: bold'>How to use Nori, the Korean morphological analyzer provided by Elasticsearch since 6.6</h1>

In the past, as described in
 
<pre style='margin: 10px 0px 10px 10px; padding: 10px 0px 10px 10px; background-color: #fbedbb; overflow: auto; font-family: Consolas, Verdana;' >
Installing a Korean morphological analyzer for elasticsearch on Windows
; <a target='tab' href='https://www.sysnet.pe.kr/2/0/11664'>https://www.sysnet.pe.kr/2/0/11664</a>
</pre>
 
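the Korean morphological analyzer forced me to stay pinned to elasticsearch 6.1.1 on Windows even when I wanted a newer version. That constraint is gone: since 6.6, elasticsearch provides an analyzer itself, so you can freely use the latest version. 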
 
<pre style='margin: 10px 0px 10px 10px; padding: 10px 0px 10px 10px; background-color: #fbedbb; overflow: auto; font-family: Consolas, Verdana;' >
6.7.2 Nori Korean morphological analyzer
; <a target='tab' href='https://esbook.kimjmin.net/06-text-analysis/6.7-stemming/6.7.2-nori'>https://esbook.kimjmin.net/06-text-analysis/6.7-stemming/6.7.2-nori</a>
</pre>
 
Now that I have also set up a cluster with Elasticsearch 7.9, 
 
<pre style='margin: 10px 0px 10px 10px; padding: 10px 0px 10px 10px; background-color: #fbedbb; overflow: auto; font-family: Consolas, Verdana;' >
Windows - How to run multiple ElasticSearch nodes from a single binary on a single machine
; <a target='tab' href='https://www.sysnet.pe.kr/2/0/12308'>https://www.sysnet.pe.kr/2/0/12308</a>
</pre>
 
I'll take this opportunity to install Nori as well. It is very simple: install the Nori plugin as described in "<a target='tab' href='https://esbook.kimjmin.net/06-text-analysis/6.7-stemming/6.7.2-nori'>6.7.2 Nori Korean morphological analyzer</a>". 
 
<pre style='margin: 10px 0px 10px 10px; padding: 10px 0px 10px 10px; background-color: #fbedbb; overflow: auto; font-family: Consolas, Verdana;' >
D:\elk\elasticsearch&gt; .\bin\elasticsearch-plugin install analysis-nori
-&gt; Installing analysis-nori
-&gt; Downloading analysis-nori from elastic
[=================================================] 100%
-&gt; Installed analysis-nori

// To remove the plugin:
// .\bin\elasticsearch-plugin remove analysis-nori
</pre>
 
Then restart elasticsearch; during startup you should see a message like the following. 
 
<pre style='margin: 10px 0px 10px 10px; padding: 10px 0px 10px 10px; background-color: #fbedbb; overflow: auto; font-family: Consolas, Verdana;' >
[2020-09-02T11:07:48,023][INFO ][o.e.p.PluginsService ] [TESTPC] loaded plugin [analysis-nori]
</pre>
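
Besides reading the startup log, the same check can be done over HTTP. The following is a minimal sketch, assuming the node listens on http://localhost:9200 and that the rows returned by GET /_cat/plugins?format=json carry a "component" field with the plugin name. The helper names and the sample row (including its version value) are illustrative and not part of the original setup.

```python
import json
from urllib.request import urlopen


def has_plugin(cat_plugins_rows, plugin_name="analysis-nori"):
    """Return True if any row of a parsed _cat/plugins response names the plugin."""
    return any(row.get("component") == plugin_name for row in cat_plugins_rows)


def fetch_plugins(base_url="http://localhost:9200"):
    """Fetch GET /_cat/plugins?format=json and return the parsed list of rows.

    Not called below, because it needs a running node.
    """
    with urlopen(f"{base_url}/_cat/plugins?format=json") as resp:
        return json.load(resp)


# Hand-written sample row; the values are illustrative only.
sample_rows = [{"name": "TESTPC", "component": "analysis-nori", "version": "7.9.0"}]
print(has_plugin(sample_rows))
```

Against a live node, has_plugin(fetch_plugins()) would perform the same check on every node's plugin list.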
 
<hr style='width: 50%' /> 
 
Before applying the tokenizer to an index, you can check whether it splits tokens in a way that suits your purpose with commands like the following. 
 
<div style='BACKGROUND-COLOR: #ccffcc; padding: 10px 10px 5px 10px; MARGIN: 0px 10px 10px 10px; FONT-FAMILY: Malgun Gothic, Consolas, Verdana; COLOR: #005555'>
curl -X POST "http://localhost:9200/_analyze" -H "Content-Type: application/json" -d "{ \"tokenizer\": \"nori_tokenizer\", \"text\": \"논쟁이 주를 이룹니다.\" }" 
 
{"tokens":[{"token":"논쟁","start_offset":0,"end_offset":2,"type":"word","position":0},{"token":"이","start_offset":2,"end_offset":3,"type":"word","position":1},{"token":"주","start_offset":4,"end_offset":5,"type":"word","position":2},{"token":"를","start_offset":5,"end_offset":6,"type":"word","position":3},{"token":"이루","start_offset":7,"end_offset":11,"type":"word","position":4},{"token":"ㅂ니다","start_offset":7,"end_offset":11,"type":"word","position":5}]} 
</div> 
 
<div style='BACKGROUND-COLOR: #ccffcc; padding: 10px 10px 5px 10px; MARGIN: 0px 10px 10px 10px; FONT-FAMILY: Malgun Gothic, Consolas, Verdana; COLOR: #005555'>
curl -X POST "http://localhost:9200/_analyze" -H "Content-Type: application/json" -d "{ \"tokenizer\": \"nori_tokenizer\", \"text\": \"동해물과 백두산이\" }" 
 
{"tokens":[{"token":"동해","start_offset":0,"end_offset":2,"type":"word","position":0},{"token":"물","start_offset":2,"end_offset":3,"type":"word","position":1},{"token":"과","start_offset":3,"end_offset":4,"type":"word","position":2},{"token":"백두","start_offset":5,"end_offset":7,"type":"word","position":3},{"token":"산","start_offset":7,"end_offset":8,"type":"word","position":4},{"token":"이","start_offset":8,"end_offset":9,"type":"word","position":5}]} 
</div> 
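
The raw JSON responses are hard to read at a glance. As a small sketch (the helper names are my own), the following pulls out just the token strings and, using the offsets, the matching slices of the input text. The response is the one shown for "동해물과 백두산이", reduced to the fields used here.

```python
import json


def token_texts(analyze_response):
    """Return the token strings of a parsed _analyze response, in order."""
    return [t["token"] for t in analyze_response["tokens"]]


def surface_forms(text, analyze_response):
    """Slice the original text by each token's start/end offsets.

    Python indexes by code point, which lines up with the reported offsets
    for this sample because every character is in the Basic Multilingual Plane.
    """
    return [text[t["start_offset"]:t["end_offset"]] for t in analyze_response["tokens"]]


text = "동해물과 백두산이"
raw = (
    '{"tokens":['
    '{"token":"동해","start_offset":0,"end_offset":2},'
    '{"token":"물","start_offset":2,"end_offset":3},'
    '{"token":"과","start_offset":3,"end_offset":4},'
    '{"token":"백두","start_offset":5,"end_offset":7},'
    '{"token":"산","start_offset":7,"end_offset":8},'
    '{"token":"이","start_offset":8,"end_offset":9}'
    ']}'
)
response = json.loads(raw)

print(token_texts(response))
print(surface_forms(text, response))
```

For this sample the two lists are identical. For the first example they would differ, since "이루" and "ㅂ니다" both report the offsets 7 to 11 of the surface form "이룹니다".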
 
As you can see, the default nori_tokenizer splits tokens too finely to suit general-purpose use. Setting the "decompound_mode" option to "none" changes the result as follows: 
 
<div style='BACKGROUND-COLOR: #ccffcc; padding: 10px 10px 5px 10px; MARGIN: 0px 10px 10px 10px; FONT-FAMILY: Malgun Gothic, Consolas, Verdana; COLOR: #005555'>
curl -X POST "http://localhost:9200/_analyze" -H "Content-Type: application/json" -d "{ \"tokenizer\": { \"type\": \"nori_tokenizer\", \"decompound_mode\": \"none\" }, \"text\": \"논쟁이 주를 이룹니다.\" }"
 
 
{"tokens":[{"token":"논쟁","start_offset":0,"end_offset":2,"type":"word","position":0},{"token":"이","start_offset":2,"end_offset":3,"type":"word","position":1},{"token":"주","start_offset":4,"end_offset":5,"type":"word","position":2},{"token":"를","start_offset":5,"end_offset":6,"type":"word","position":3},{"token":"이룹니다","start_offset":7,"end_offset":11,"type":"word","position":4}]} 
</div> 
 
<div style='BACKGROUND-COLOR: #ccffcc; padding: 10px 10px 5px 10px; MARGIN: 0px 10px 10px 10px; FONT-FAMILY: Malgun Gothic, Consolas, Verdana; COLOR: #005555'>
curl -X POST "http://localhost:9200/_analyze" -H "Content-Type: application/json" -d "{ \"tokenizer\": { \"type\": \"nori_tokenizer\", \"decompound_mode\": \"none\" }, \"text\": \"동해물과 백두산이\" }"
 
 
{"tokens":[{"token":"동해","start_offset":0,"end_offset":2,"type":"word","position":0},{"token":"물","start_offset":2,"end_offset":3,"type":"word","position":1},{"token":"과","start_offset":3,"end_offset":4,"type":"word","position":2},{"token":"백두 산","start_offset":5,"end_offset":8,"type":"word","position":3},{"token":"이","start_offset":8,"end_offset":9,"type":"word","position":4}]} 
</div> 
 
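It still bothers me a little that "동해물" is split into "동해" + "물" and "백두산" becomes "백두 산"... but this is somewhat better than the default mode, "discard", and (since there is no real alternative) setting "none" seems to be the best choice. 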
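
To compare the modes side by side without hand-escaping quotes for cmd, the request bodies can be generated instead. This is a sketch only: the helper name is my own, and the third mode, "mixed", comes from the Elasticsearch nori_tokenizer documentation rather than from the tests above.

```python
import json

# Values accepted by the nori_tokenizer "decompound_mode" option.
DECOMPOUND_MODES = ("none", "discard", "mixed")


def analyze_body(text, decompound_mode):
    """Build the JSON body for POST /_analyze with an inline nori_tokenizer."""
    if decompound_mode not in DECOMPOUND_MODES:
        raise ValueError(f"unknown decompound_mode: {decompound_mode}")
    return {
        "tokenizer": {"type": "nori_tokenizer", "decompound_mode": decompound_mode},
        "text": text,
    }


# Print one request body per mode; ensure_ascii=False keeps the Hangul readable.
for mode in DECOMPOUND_MODES:
    print(json.dumps(analyze_body("동해물과 백두산이", mode), ensure_ascii=False))
```

Each printed line can be saved to a UTF-8 file and passed to curl with -d @file, which avoids the \" escaping entirely.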
 
<hr style='width: 50%' /> 
 
With the tokenizer option decided, let's apply it to an index. The steps follow this article: 
 
<pre style='margin: 10px 0px 10px 10px; padding: 10px 0px 10px 10px; background-color: #fbedbb; overflow: auto; font-family: Consolas, Verdana;' >
4.2 CRUD - Create, Read, Update, Delete
; <a target='tab' href='https://esbook.kimjmin.net/04-data/4.2-crud'>https://esbook.kimjmin.net/04-data/4.2-crud</a>
</pre>
 
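First, create an index whose analyzer combines the nori tokenizer with the "html_strip" character filter and the "lowercase" token filter. Note that "decompound_mode" is a tokenizer setting, so it is declared on a custom tokenizer (named "nori_none_tokenizer" here, an arbitrary name) that the analyzer then references, rather than on the analyzer itself. 
 
<pre style='margin: 10px 0px 10px 10px; padding: 10px 0px 10px 10px; background-color: #fbedbb; overflow: auto; font-family: Consolas, Verdana;' >
c:\temp&gt; curl -XPUT "http://localhost:9200/my_org/" -H "Content-Type: application/json" -d "{ \"settings\":{ \"analysis\":{ \"tokenizer\":{ \"nori_none_tokenizer\": { \"type\": \"nori_tokenizer\", \"decompound_mode\": \"none\" } }, \"analyzer\":{ \"nori_analyzer\": { \"type\": \"custom\", \"tokenizer\": \"nori_none_tokenizer\", \"char_filter\":[ \"html_strip\" ], \"filter\": [ \"lowercase\" ] } } } } }"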

{"acknowledged":true,"shards_acknowledged":true,"index":"my_org"}
</pre>
 
Next, define the document structure (mapping): 
 
<pre style='margin: 10px 0px 10px 10px; padding: 10px 0px 10px 10px; background-color: #fbedbb; overflow: auto; font-family: Consolas, Verdana;' >
c:\temp&gt; curl -XPUT "http://localhost:9200/my_org/_mapping" -H "Content-Type: application/json" -d "{ \"properties\" : { \"name\" : {\"type\" : \"text\", \"index\" : \"false\"}, \"age\" : {\"type\" : \"integer\"}, \"address\" : {\"type\" : \"text\", \"analyzer\": \"nori_analyzer\" }, \"registered\" : {\"type\" : \"date\"} } }"

{"acknowledged":true}
</pre>
 
Then insert the sample data: 
 
<pre style='margin: 10px 0px 10px 10px; padding: 10px 0px 10px 10px; background-color: #fbedbb; overflow: auto; font-family: Consolas, Verdana;' >
/* doc_data1.json
{
 "name": "tester1",
 "age": 16,
 "address": "동해물과 백두산이 &lt;h&gt;스키마를&lt;/h&gt;",
 "registered": "2017-04-29T10:16:00"
}
*/

/* doc_data2.json
{
    "name": "tester2",
    "age": 15,
    "address": "&lt;span title='test'&gt;김이지 Shine&lt;/span&gt;",
    "registered": "2017-04-29T10:16:00"
}
*/

C:\temp&gt; curl -XPUT "http://localhost:9200/my_org/_doc/1" -H "Content-Type: application/json" -d @doc_data1.json
{"_index":"my_org","_type":"_doc","_id":"1","_version":1,"result":"created","_shards":{"total":2,"successful":2,"failed":0},"_seq_no":0,"_primary_term":1}

C:\temp&gt; curl -XPUT "http://localhost:9200/my_org/_doc/2" -H "Content-Type: application/json" -d @doc_data2.json
{"_index":"my_org","_type":"_doc","_id":"2","_version":1,"result":"created","_shards":{"total":2,"successful":2,"failed":0},"_seq_no":1,"_primary_term":1}
</pre>
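
Before searching, it can help to confirm what the index's analyzer actually produces. The sketch below builds a request for the index-scoped _analyze API, which runs a named analyzer from the index settings against some text. The helper names are my own, the base URL is the one used throughout this post, and nothing is sent unless send() is called against a running node.

```python
import json
from urllib.request import Request, urlopen


def index_analyze_request(index, analyzer, text, base_url="http://localhost:9200"):
    """Build a POST {index}/_analyze request that uses an analyzer defined on the index."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return Request(
        f"{base_url}/{index}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request and return the parsed JSON response (needs a running node)."""
    with urlopen(request) as resp:
        return json.load(resp)


req = index_analyze_request("my_org", "nori_analyzer", "김이지 Shine")
print(req.get_method(), req.full_url)
```

Passing the full address value of doc_data2.json, markup included, should also show whether "html_strip" removed the tags and "lowercase" folded "Shine".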
 
Now you can search as follows. 
 
<pre style='margin: 10px 0px 10px 10px; padding: 10px 0px 10px 10px; background-color: #fbedbb; overflow: auto; font-family: Consolas, Verdana;' >
C:\temp&gt; curl -XGET "http://localhost:9200/my_org/_search" -H "Content-Type: application/json" -d "{ \"query\": { \"match\": { \"address\": \"동해물\" } } }"

{"took":12,"timed_out":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":{"value":1,"relation":"eq"},"max_score":0.5753642,"hits":[{"_index":"my_org","_type":"_doc","_id":"1","_score":0.5753642,"_source":{ "name" : "tester", "age": 16, "address": "동해물과 백두산이 &lt;h&gt;스키마를&lt;/h&gt;", "registered":"2017-04-29T10:16:00" }}]}}
</pre>
 
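However, a search for "서해물" also matches. ^^; Presumably the query is tokenized into "서해" + "물", and because a match query combines its terms with OR by default, the shared token "물" alone is enough to hit the document (note that the score is exactly half of the previous one). 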
 
<pre style='margin: 10px 0px 10px 10px; padding: 10px 0px 10px 10px; background-color: #fbedbb; overflow: auto; font-family: Consolas, Verdana;' >
C:\temp&gt; curl -XGET "http://localhost:9200/my_org/_search" -H "Content-Type: application/json" -d "{ \"query\": { \"match\": { \"address\": \"서해물\" } } }"
{"took":3,"timed_out":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":{"value":1,"relation":"eq"},"max_score":0.2876821,"hits":[{"_index":"my_org","_type":"_doc","_id":"1","_score":0.2876821,"_source":{ "name" : "tester", "age": 16, "address": "동해물과 백두산이 &lt;h&gt;스키마를&lt;/h&gt;", "registered":"2017-04-29T10:16:00" }}]}}
</pre>
 
Fortunately, "서해" on its own does not match. 
 
<pre style='margin: 10px 0px 10px 10px; padding: 10px 0px 10px 10px; background-color: #fbedbb; overflow: auto; font-family: Consolas, Verdana;' >
C:\temp&gt; curl -XGET "http://localhost:9200/my_org/_search" -H "Content-Type: application/json" -d "{ \"query\": { \"match\": { \"address\": \"서해\" } } }"
{"took":2,"timed_out":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":{"value":0,"relation":"eq"},"max_score":null,"hits":[]}}
</pre>
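
If partial hits such as the "서해물" result are unwanted, one option is the match query's "operator" parameter, which requires every query token to be present when set to "and". The sketch below only builds the query body. It assumes that "서해물" is analyzed into "서해" + "물", which I have not verified against the cluster, and the helper name is my own.

```python
import json


def match_query(field, text, operator="or"):
    """Build a match query body; operator "and" requires all analyzed terms to match."""
    if operator not in ("or", "and"):
        raise ValueError(f"unknown operator: {operator}")
    return {"query": {"match": {field: {"query": text, "operator": operator}}}}


# Request body for an AND search on the address field.
print(json.dumps(match_query("address", "서해물", operator="and"), ensure_ascii=False))
```

Under that assumption, the AND form would no longer return document 1 for "서해물", while "동해물" would still match because both of its tokens are indexed.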

<hr />[I would like to share opinions on this post with you. If anything is wrong, incomplete, or unclear, please leave a comment at any time.]

</div>
