Lazy Diary @ Hatena Blog

PowerShell / Java / miscellaneous things about software development, Tips & Gotchas. CC BY-SA 4.0/Apache License 2.0

.NET cannot distinguish Shift_JIS from MS932 (Windows-31J)

Context:

The Japanese character encodings Shift_JIS (シフトJIS) and Microsoft code page 932 (a.k.a. MS932, registered with IANA as Windows-31J) are slightly different. For example, the full-width cent sign (¢) is encoded as 0x8191 in both Shift_JIS and MS932, but Shift_JIS maps it to Unicode U+00A2, whereas MS932 maps it to U+FFE0.
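The mapping difference can be observed from a runtime that ships both tables. Below is a minimal Java sketch, assuming a JDK that includes the "Shift_JIS" and "windows-31j" charsets; the class name CentSignMapping and the helper decodeFirstCodePoint are made up for this illustration.

```java
import java.nio.charset.Charset;

class CentSignMapping {
    // Hypothetical helper: decode the bytes with the named charset
    // and return the first Unicode code point of the result.
    static int decodeFirstCodePoint(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName)).codePointAt(0);
    }

    public static void main(String[] args) {
        // 0x8191 is the cent sign in both Shift_JIS and MS932.
        byte[] cent = {(byte) 0x81, (byte) 0x91};

        // Shift_JIS mapping: expected to print U+00A2.
        System.out.printf("Shift_JIS   -> U+%04X%n",
                decodeFirstCodePoint(cent, "Shift_JIS"));

        // MS932 mapping: expected to print U+FFE0.
        System.out.printf("windows-31j -> U+%04X%n",
                decodeFirstCodePoint(cent, "windows-31j"));
    }
}
```

The same two bytes yield different code points depending only on which charset name is requested.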

Problem:

The .NET CLR cannot distinguish Shift_JIS from MS932. For example, in PowerShell, [System.Text.Encoding]::GetEncoding(932) and [System.Text.Encoding]::GetEncoding("shift_jis") return the same Encoding object (both use the MS932 mapping).

This causes a problem when you need to check whether a Unicode code point is available in the Shift_JIS character set: .NET cannot report this accurately. It matters in particular because non-Windows Shift_JIS environments (e.g. HP-UX, AIX) and the Japanese technical standard (JIS X 0221) use the Shift_JIS mapping. If you convert the Shift_JIS/MS932 character 0x8191 (¢) to Unicode, the result should be U+00A2 (Shift_JIS mapping), but in a .NET environment you get U+FFE0 (MS932 mapping).

Solution:

There is no simple workaround in the .NET CLR. If you must distinguish these encodings, use a special-purpose character encoding library, write your own encoding converter, or choose Java instead of .NET.
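As a rough illustration of the Java route, the availability check can be sketched with the standard CharsetEncoder.canEncode method. This assumes the JDK ships the "Shift_JIS" and "windows-31j" charsets; the class ShiftJisAvailability and the helper isEncodable are hypothetical names. The exact result for each code point depends on the JDK's mapping tables, so the sketch prints the results rather than stating them.

```java
import java.nio.charset.Charset;

class ShiftJisAvailability {
    // Hypothetical helper: true if the named charset's encoder
    // reports that it can encode the given Unicode code point.
    static boolean isEncodable(int codePoint, String charsetName) {
        String s = new String(Character.toChars(codePoint));
        return Charset.forName(charsetName).newEncoder().canEncode(s);
    }

    public static void main(String[] args) {
        int[] codePoints = {0x00A2, 0xFFE0};          // CENT SIGN, FULLWIDTH CENT SIGN
        String[] charsets = {"Shift_JIS", "windows-31j"};

        // Print the availability of each code point in each charset.
        for (String cs : charsets) {
            for (int cp : codePoints) {
                System.out.printf("%-11s U+%04X encodable: %b%n",
                        cs, cp, isEncodable(cp, cs));
            }
        }
    }
}
```

Because the two charsets are separate objects with separate tables, the check is made against the mapping you actually asked for, which is what .NET cannot offer here.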