
Error type: 848. .NET Core/5+ - Process terminated. Couldn't find a valid ICU package installed on the system
; https://www.sysnet.pe.kr/2/0/13266

.NET: 2153. C# - How to use a user-built ICU DLL
; https://www.sysnet.pe.kr/2/0/13430

C/C++: 184. C++ - Example code that uses the ICU DLL (Windows)
; https://www.sysnet.pe.kr/2/0/13796

C/C++: 185. C++ - Problems with converting string case using transform + std::tolower/toupper
; https://www.sysnet.pe.kr/2/0/13797

.NET: 2308. C# - Converting string case with the ICU library
; https://www.sysnet.pe.kr/2/0/13800

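C++ - Problems with converting string case using transform + std::tolower/toupper

(Some Unicode characters in this post may not render correctly in mobile web browsers.)

If you search for how to convert the case of a string in C++, many answers explain it with the following code.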

// tolower function for C++ strings
// https://stackoverflow.com/questions/3403844/tolower-function-for-c-strings

std::string str = "wHatEver";
std::transform(str.begin(), str.end(), str.begin(), ::tolower);

However, the following article explains the problems with that code very well.

A popular but wrong way to convert a string to uppercase or lowercase
; https://devblogs.microsoft.com/oldnewthing/20241007-00/?p=110345

Let's summarize it briefly. ^^

First, tolower is not an "addressable function". So, strictly speaking, it should be wrapped in a lambda as follows.

std::wstring name;

std::transform(name.begin(), name.end(), name.begin(),
    [](auto c) { return std::tolower(c); });

But there is still a problem. The tolower used above is a function that handles unsigned char values, that is, narrow characters.

// C:\Program Files (x86)\Windows Kits\10\Source\10.0.22621.0\ucrt\convert\tolower_toupper.cpp
extern "C" int __cdecl tolower(int const c)
{
    return __acrt_locale_changed()
        ? _tolower_l(c, nullptr)
        : __ascii_tolower(c);
}

__forceinline int __CRTDECL __ascii_tolower(int const _C)
{
    if (_C >= 'A' && _C <= 'Z')
    {
        return _C - ('A' - 'a');
    }
    return _C;
}

So, when handling wchar_t (wide characters), towlower is the correct function to use.

// C:\Program Files (x86)\Windows Kits\10\Source\10.0.22621.0\ucrt\convert\towlower.cpp
extern "C" wint_t __cdecl towlower (wint_t c)
{
    return _towlower_l(c, nullptr);
}

Even so, tolower works reasonably well here, because its parameter is an int, which can accept a wchar_t value, and in most cases all you expect is for the English letters 'A' to 'Z' to become 'a' to 'z'.

For that reason, using it in the transform code is not much of a problem in Visual C++, at least for now.

However, tolower/towlower cannot handle the full range of Unicode characters. Both functions convert a single char/wchar_t at a time, so they cannot handle surrogate pairs.
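For narrow std::string data, one more detail is worth a sketch: the C standard requires the value passed to tolower to be representable as unsigned char (or to be EOF), so a negative char must not be passed through as-is. The helper name to_lower_narrow below is hypothetical and only illustrates the cast; it assumes a single-byte encoding and is not a Unicode-aware conversion.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper: lowercases a narrow string one byte at a time.
// In the default "C" locale this only changes 'A'..'Z'.
std::string to_lower_narrow(std::string s)
{
    std::transform(s.begin(), s.end(), s.begin(),
        [](unsigned char c) {
            // Receiving the element as unsigned char guarantees that the
            // value handed to std::tolower is never negative.
            return static_cast<char>(std::tolower(c));
        });
    return s;
}
```

Calling to_lower_narrow("wHatEver") gives "whatever"; bytes outside ASCII are left untouched in the "C" locale.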

As an example, the article "A popular but wrong way to convert a string to uppercase or lowercase" uses the character '𐲀', U+10C80 (OLD HUNGARIAN CAPITAL LETTER A), and its lowercase counterpart '𐳀', U+10CC0 (OLD HUNGARIAN SMALL LETTER A).

In UTF-16, U+10C80 is encoded as the surrogate pair "0xD803 0xDC80", which takes 4 bytes. transform therefore processes the two 2-byte units separately, and the lowercase conversion fails.
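As a side note, the surrogate pair quoted above can be derived with the standard UTF-16 arithmetic. The following is a minimal sketch; the helper name to_surrogate_pair is hypothetical, and it assumes the input is already in the range U+10000..U+10FFFF.

```cpp
#include <utility>

// Hypothetical helper: splits a supplementary code point (U+10000..U+10FFFF)
// into its UTF-16 high and low surrogates. No range checking is done.
std::pair<char16_t, char16_t> to_surrogate_pair(char32_t cp)
{
    const char32_t v = cp - 0x10000;                      // 20-bit value
    return { static_cast<char16_t>(0xD800 + (v >> 10)),   // upper 10 bits
             static_cast<char16_t>(0xDC00 + (v & 0x3FF)) }; // lower 10 bits
}
```

For U+10C80 this yields 0xD803 and 0xDC80, and for U+10CC0 it yields 0xD803 and 0xDCC0. The program that follows feeds the first pair through transform one unit at a time.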

#include <iostream>
#include <string>
#include <algorithm>
#include <cwchar>   // wprintf
#include <cwctype>  // towlower

#include <fcntl.h>
#include <io.h>

int main(int, char* [])
{
    (void)_setmode(_fileno(stdout), _O_U16TEXT);

    std::wstring name = L"TEST한글𐲀";

    std::transform(name.begin(), name.end(), name.begin(),
        [](auto c) {
            wprintf(L"%c == %d (%x)\n", c, c, c);
            return towlower(c); 
        });

    std::wcout << name << std::endl;
}

/* Output
T == 84 (54)
E == 69 (45)
S == 83 (53)
T == 84 (54)
한 == 54620 (d55c)
글 == 44544 (ae00)
  == 55299 (d803)
  == 56448 (dc80)
test한글𐲀
*/

As shown, '𐲀' was passed to towlower split into two units, d803 and dc80, so it was not lowercased correctly. Solving that would require wchar_t to be handled as 4 bytes, but interestingly ^^, even then some cases are said to remain problematic.

One peculiar case is a character that is a single character in lowercase but becomes two characters in uppercase. For example, the lowercase U+00DF (ß) LATIN SMALL LETTER SHARP S becomes "SS", two 'S' characters, when uppercased. That is, uppercasing the string "Straße" yields "STRASSE".

Similarly, the lowercase U+FB02 (ﬂ) LATIN SMALL LIGATURE FL becomes the two characters "FL" when uppercased. In other words, case conversion can even change the length of the string.
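To illustrate why a length-preserving transform cannot express these mappings, here is a minimal sketch that builds a new string instead of writing in place. The function name to_upper_with_expansion is hypothetical, and the table covers only the two characters mentioned above; it is not a substitute for a real case-mapping library.

```cpp
#include <cwctype>
#include <string>

// Hypothetical sketch: uppercases into a separate string so that one input
// character may expand into two. Only U+00DF and U+FB02 are special-cased;
// everything else falls back to the per-character std::towupper.
std::wstring to_upper_with_expansion(const std::wstring& src)
{
    std::wstring dst;
    dst.reserve(src.size());
    for (wchar_t c : src)
    {
        switch (c)
        {
        case L'\u00DF': dst += L"SS"; break;  // LATIN SMALL LETTER SHARP S
        case L'\uFB02': dst += L"FL"; break;  // LATIN SMALL LIGATURE FL
        default:
            dst += static_cast<wchar_t>(std::towupper(c));
        }
    }
    return dst;
}
```

With this sketch, L"Stra\u00DFe" (6 characters) becomes L"STRASSE" (7 characters), which is exactly the kind of result std::transform cannot produce when the output range is the input range.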

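Another peculiar case: in French, two characters, a 'lowercase letter' plus an 'accent character', can be rendered (by the font) as a single 'accented lowercase letter'. Converting it to uppercase likewise yields 'uppercase letter' + 'accent character'.

Lowercase 'à' (\x0061\x0300) == 'a' U+0061 (LATIN SMALL LETTER A) + U+0300 (COMBINING GRAVE ACCENT)

Uppercase 'À' (\x0041\x0300) == 'A' U+0041 (LATIN CAPITAL LETTER A) + U+0300 (COMBINING GRAVE ACCENT)

Consequently, a single 16-bit value cannot determine the uppercase or lowercase form of a character, so an approach like transform + tolower/towlower, which attempts a simple 1:1 mapping, cannot fundamentally solve these problems.

So a dedicated function that can take context into account is needed, and on Windows LCMapStringEx is said to serve that purpose.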
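As a sketch of the first half of that argument: once UTF-16 is decoded into 32-bit code points, U+10C80 becomes a single unit that a 1:1 table can map. The names decode_utf16 and lower_code_point are hypothetical, and the table deliberately contains only the one pair named in this post. The expanding cases such as ß and ﬂ remain out of reach of any 1:1 table.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decodes UTF-16 into code points.
// Unpaired surrogates are copied through unchanged.
std::u32string decode_utf16(const std::u16string& s)
{
    std::u32string out;
    for (std::size_t i = 0; i < s.size(); ++i)
    {
        char32_t u = s[i];
        const bool high = (u >= 0xD800 && u <= 0xDBFF);
        if (high && i + 1 < s.size()
            && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF)
        {
            // Combine the high and low surrogate into one code point.
            u = 0x10000 + ((u - 0xD800) << 10) + (s[i + 1] - 0xDC00);
            ++i;
        }
        out.push_back(u);
    }
    return out;
}

// Hypothetical one-entry "table": only the pair discussed in this post.
char32_t lower_code_point(char32_t cp)
{
    return cp == 0x10C80 ? 0x10CC0 : cp;
}
```

For the string u"T\U00010C80", the UTF-16 length is 3 code units while decode_utf16 returns 2 code points, and lower_code_point maps the second one to U+10CC0.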

LCMapStringEx function (winnls.h)
; https://learn.microsoft.com/en-us/windows/win32/api/winnls/nf-winnls-lcmapstringex

So I actually tried it:

#include <iostream>
#include <algorithm>
#include <windows.h>

#include <fcntl.h>
#include <io.h>

int main(int, char* [])
{
    (void)_setmode(_fileno(stdout), _O_U16TEXT);

    LPCWSTR localeName = LOCALE_NAME_INVARIANT;
    // LPCWSTR localeName = L"hu-HU";

    {
        std::wstring name = L"TEST한글\xd803\xdc80";
        DWORD dwMapFlags = LCMAP_LOWERCASE | LCMAP_LINGUISTIC_CASING;

        // First call: query the required buffer size in wchar_t units
        // (cchSrc == -1, so the count includes the terminating null).
        int needChars = LCMapStringEx(localeName, dwMapFlags,
            name.c_str(), -1, nullptr, 0, nullptr, nullptr, 0);
        if (needChars == 0)
        {
            return 1; // LCMapStringEx failed
        }
        wchar_t* buffer = new wchar_t[needChars];

        memset(buffer, 0, needChars * sizeof(wchar_t));
        int result = LCMapStringEx(localeName, dwMapFlags,
            name.c_str(), (int)name.length(), buffer, needChars, nullptr, nullptr, 0);

        std::wcout << name << std::endl;
        std::wcout << buffer << std::endl;
        delete[] buffer;
    }
    wprintf(L"\n");
}

/* Output:
TEST한글𐲀
test한글𐲀
*/

Comparing the output of name and buffer, '𐲀' was not converted to the lowercase U+10CC0 (OLD HUNGARIAN SMALL LETTER A).

Is there perhaps some other combination of flags to pass to the function? ^^; I searched for other examples and found a post that, like mine, references "A popular but wrong way to convert a string to uppercase or lowercase".

How To Convert Unicode Strings to Lower Case and Upper Case in C++
; https://giodicanio.com/2024/10/09/how-to-convert-unicode-strings-to-lower-case-and-upper-case-in-c-plus-plus/

The author also published the related code on GitHub,

GiovanniDicanio/StringCaseConversion
; https://github.com/GiovanniDicanio/StringCaseConversion/tree/main

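so I was glad to take a look, but ^^; it was the same code as mine. It seems he could publish it with confidence ^^; only because he had not run the test I did, and so believed it worked.

Taking the commendable attitude of passing the problem along ^^, I casually raised an issue,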

It doesn't work in a context-free manner. #1
; https://github.com/GiovanniDicanio/StringCaseConversion/issues/1

and he sidestepped it by linking to a Stack Overflow question. ^^

Why does LCMapStringEx fail to convert OLD HUNGARIAN CAPITAL LETTER A (U+10C80) to lowercase?
; https://stackoverflow.com/questions/79142316/why-does-lcmapstringex-fail-to-convert-old-hungarian-capital-letter-a-u10c80

In the end, my conclusion, or rather my suspicion, is that Raymond Chen himself wrote the article without ever having tested this case with LCMapStringEx.

Still, the ICU library mentioned in Raymond Chen's article,

If you need to perform a case mapping on a string, you can use LCMapStringEx with LCMAP_LOWERCASE or LCMAP_UPPERCASE, possibly with other flags like LCMAP_LINGUISTIC_CASING. If you use the International Components for Unicode (ICU) library, you can use u_strToUpper and u_strToLower.

worked well, and I covered it in the previous post.

C++ - Example code that uses the ICU DLL (Windows)
; https://www.sysnet.pe.kr/2/0/13796

That post only tested conversion to lowercase. If you write a similar test for uppercase as follows,

wchar_t pText2[] = L"\x00DF, \x0061\x0300, \xFB02";
UChar* upperStr;

{
    uText = (UChar*)pText2;

    // First, find out the buffer size needed to hold the uppercased result.
    length = u_strToUpper(NULL, 0, uText, -1, nullptr, &errorCode);

    if (errorCode != U_ZERO_ERROR && errorCode != U_BUFFER_OVERFLOW_ERROR)
    {
        printf("Error: (ICU) %s\n", u_errorName(errorCode));
        return 1;
    }

    errorCode = U_ZERO_ERROR;

    if (length < 1)
    {
        printf("Error: length less than 1.\n");
        return 1;
    }

    // Allocate the buffer that will hold the UTF-16 (2-byte) uppercase text
    upperStr = (UChar*)malloc((length + 1) * sizeof(UChar));

    if (!upperStr)
    {
        printf("Error: unable to allocate memory (3).\n");
        return 1;
    }
}

{
    retLength = u_strToUpper(upperStr, length + 1, uText, -1, nullptr, &errorCode);

    if (errorCode != U_ZERO_ERROR)
    {
        printf("Error: (ICU) %s\n", u_errorName(errorCode));
        free(upperStr);
        return 1;
    }

    WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE), pText2, wcslen(pText2), NULL, NULL);
    printf(", len(lower_text): %zu\n", wcslen(pText2));

    WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE), upperStr, retLength, NULL, NULL);
    printf(", len(upper_text): %d\n", retLength);

    free(upperStr);
}

/* Output:
ß, à, ﬂ, len(lower_text): 8
SS, À, FL, len(upper_text): 10
*/

As shown, the cases mentioned in "A popular but wrong way to convert a string to uppercase or lowercase" were converted correctly.

I recall seeing a post somewhere asking whether the wcslwr function could be used instead.

{
    wchar_t buffer[1024] = L"TEST한글𐳀𐲀";

    _locale_t loc = _create_locale(LC_ALL, "hu-HU");
    _wcslwr_s_l(buffer, wcslen(buffer) + 1, loc);
    std::wcout << buffer << std::endl;
}

The result is the same as with LCMapStringEx, because the wcslwr function calls LCMapStringEx internally.

// In contrast, in ReactOS the wcslwr function calls tolower.
// https://doxygen.reactos.org/d2/d20/wcslwr_8c_source.html
// https://doxygen.reactos.org/d3/d42/ctype_8c_source.html#l00901

// C:\Program Files (x86)\Windows Kits\10\Source\10.0.22621.0\ucrt\string\wcslwr.cpp

static errno_t __cdecl _wcslwr_s_l_stat (
        _Inout_updates_z_(sizeInWords) wchar_t * wsrc,
        size_t sizeInWords,
        _locale_t plocinfo
        )
{

    // ...[omitted]...

    /* Inquire size of wdst string */
    if ( (dstsize = __acrt_LCMapStringW(
                    plocinfo->locinfo->locale_name[LC_CTYPE],
                    LCMAP_LOWERCASE,
                    wsrc,
                    -1,
                    nullptr,
                    0
                    )) == 0 )
    {
        errno = EILSEQ;
        return errno;
    }

    // ...[omitted]...

    /* Map wrc string to wide-character wdst string in alternate case */
    if (__acrt_LCMapStringW(
                plocinfo->locinfo->locale_name[LC_CTYPE],
                LCMAP_LOWERCASE,
                wsrc,
                -1,
                wdst.get(),
                dstsize
                ) != 0)
    {
        /* Copy wdst string to user string */
        return wcscpy_s(wsrc, sizeInWords, wdst.get());
    }
    // ...[omitted]...
}

// C:\Program Files (x86)\Windows Kits\10\Source\10.0.22621.0\ucrt\internal\winapi_thunks.cpp
extern "C" int WINAPI __acrt_LCMapStringEx(
    LPCWSTR          const locale_name,
    DWORD            const flags,
    LPCWSTR          const source,
    int              const source_count,
    LPWSTR           const destination,
    int              const destination_count,
    LPNLSVERSIONINFO const version,
    LPVOID           const reserved,
    LPARAM           const sort_handle
    )
{
    if (auto const lc_map_string_ex = try_get_LCMapStringEx()) // lc_map_string_ex == {KernelBase.dll!LCMapStringEx(void)}
    {
        return lc_map_string_ex(locale_name, flags, source, source_count, destination, destination_count, version, reserved, sort_handle);
    }
    // ...[omitted]...
}

Finally, since 2017 the uppercase of 'ß' (U+00DF: LATIN SMALL LETTER SHARP S) can reportedly be written as 'ẞ' (U+1E9E: LATIN CAPITAL LETTER SHARP S) rather than 'SS'. The current ICU library, however, still converts according to the earlier rule and produces "SS".


[First published: 11/1/2024]
[Last modified: 2/11/2025]

This work is licensed under the Creative Commons Korea Attribution-NonCommercial-NoDerivs 2.0 Korea license.

by SeongTae Jeong, mailto:techsharer at outlook.com
