Lazy Diary @ Hatena Blog

PowerShell / Java / miscellaneous things about software development, Tips & Gochas. CC BY-SA 4.0/Apache License 2.0

How to separate a string into codepoint-wise characters with PowerShell

Context:

You have a Unicode string that contains both ASCII and non-ASCII characters, and you want to split it into individual characters.

Problem:

If you split the string with the code below:

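# Splitting on an empty pattern yields an empty string at each end of the array.
$TemporaryArray = $InputString -split "";
# Drop the first and last (empty) elements.
$ResultArray = $TemporaryArray[1..($TemporaryArray.Length-2)];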

You will have a problem: characters that are represented as a surrogate pair (U+10000 to U+10FFFF) are split into a high surrogate and a low surrogate, neither of which is a character on its own.
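To see the problem concretely, here is a small sketch. The sample string and the choice of U+1F600 are my own assumptions for illustration; any codepoint above U+FFFF should behave the same way.

```powershell
# U+1F600 lies outside the BMP, so it is stored as two UTF-16 code units.
$InputString = "a" + [char]::ConvertFromUtf32(0x1F600) + "b"

$TemporaryArray = $InputString -split ""
$ResultArray = $TemporaryArray[1..($TemporaryArray.Length - 2)]

# Three characters went in, but four elements come out.
$ResultArray.Count

# Show each element as a UTF-16 code unit: the two middle ones are lone surrogates.
$ResultArray | ForEach-Object { 'U+{0:X4}' -f [int][char]$_ }
```

The two middle elements are U+D83D and U+DE00, the high and low surrogate of the original character.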

Reason:

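The PowerShell -split operator is not surrogate-pair aware, and this seems to be by design. It matches with a .NET regular expression, which works on UTF-16 code units rather than on codepoints, so an empty pattern also matches between the two halves of a surrogate pair.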
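You can check how such a character is stored with the following sketch. The variable name $Emoji and the sample codepoint are assumptions for illustration only.

```powershell
# Build a one-codepoint string from U+1F600.
$Emoji = [char]::ConvertFromUtf32(0x1F600)

# A single codepoint, but the string holds two UTF-16 code units.
$Emoji.Length

# The first unit is a high surrogate, the second a low surrogate.
[char]::IsHighSurrogate($Emoji[0])
[char]::IsLowSurrogate($Emoji[1])

# Reading both units together gives back the original codepoint.
'U+{0:X}' -f [char]::ConvertToUtf32($Emoji, 0)
```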

Solution:

First convert the string into a UTF-32 byte array, then split the array into codepoints (4 bytes each), and convert each codepoint back to a String object.

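$ResultArray = @();
# In UTF-32, every codepoint takes exactly 4 bytes.
$InputStringBytes = [Text.Encoding]::UTF32.GetBytes($InputString);
for ($i=0; $i -lt $InputStringBytes.Length; $i+=4) {
     # Decode one 4-byte codepoint back to a String (2 chars for a surrogate pair).
     $ResultArray += [Text.Encoding]::UTF32.GetString($InputStringBytes, $i, 4);
}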
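If you need this in more than one place, the loop above could be wrapped in a function. This is only a sketch: the name Split-ByCodepoint is hypothetical, and it writes each codepoint to the pipeline instead of appending to an array.

```powershell
# Split-ByCodepoint is a hypothetical helper name, not a built-in cmdlet.
function Split-ByCodepoint {
    param([string]$InputString)

    $Bytes = [Text.Encoding]::UTF32.GetBytes($InputString)
    for ($i = 0; $i -lt $Bytes.Length; $i += 4) {
        # Emit one codepoint (4 bytes in UTF-32) as a String.
        [Text.Encoding]::UTF32.GetString($Bytes, $i, 4)
    }
}

# Usage: @() keeps the result an array even for a one-character input.
$Sample = "a" + [char]::ConvertFromUtf32(0x1F600) + "b"
$Chars = @(Split-ByCodepoint $Sample)
$Chars.Count       # 3
$Chars[1].Length   # 2 (the surrogate pair stays together)
```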

Limitation:

This method splits a string at every codepoint, so a character that consists of two or more codepoints (a Unicode ligature or other combining sequence) is wrongly split into its individual codepoints. You can use icu.net to handle such characters.
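As an alternative to icu.net (and not what this article's solution uses), .NET itself ships System.Globalization.StringInfo, which enumerates "text elements" and keeps a base character together with its combining marks. The sketch below assumes a sample string of my own. How far it goes depends on the runtime: on .NET 5 and later it follows Unicode extended grapheme clusters, while the .NET Framework used by Windows PowerShell handles fewer cases.

```powershell
# "e" followed by U+0301 (combining acute accent): two codepoints, one visible character.
$InputString = "e" + [char]0x0301 + "x"

$Enumerator = [System.Globalization.StringInfo]::GetTextElementEnumerator($InputString)
$TextElements = @()
while ($Enumerator.MoveNext()) {
    # Each text element is a base character plus any combining marks attached to it.
    $TextElements += $Enumerator.GetTextElement()
}

$TextElements.Count       # 2
$TextElements[0].Length   # 2 (the "e" and its accent stay together)
```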