Context:
You have a Unicode string that contains both ASCII and non-ASCII characters, and you want to split it into individual characters.
Problem:
If you split the string with the code below:
$TemporaryArray = $InputString -split ""; $ResultArray = $TemporaryArray[1..($TemporaryArray.length-2)];
You will have a problem: any character represented as a surrogate pair (U+10000 to U+10FFFF) is split into its high surrogate and its low surrogate, neither of which is a character on its own.
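To see the effect, here is a minimal sketch that applies the same split to a short string. The sample input (the letter "a" followed by U+1F600) is an assumption chosen for illustration; any character above U+FFFF should behave the same way.

```powershell
# Sample input (an assumption for illustration): "a" followed by U+1F600,
# which UTF-16 stores as the surrogate pair D83D DE00.
$InputString = "a" + [char]::ConvertFromUtf32(0x1F600)

$TemporaryArray = $InputString -split ""
$ResultArray = $TemporaryArray[1..($TemporaryArray.Length - 2)]

# Two characters were expected, but the pair has been split in half.
$ResultArray.Count                              # 3
[char]::IsHighSurrogate($ResultArray[1], 0)     # True
[char]::IsLowSurrogate($ResultArray[2], 0)      # True
```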
Reason:
The PowerShell -split operator is not surrogate-pair aware: it works on UTF-16 code units rather than on code points. This seems to be by design.
Solution:
First convert the string into a UTF-32 byte array, then split that array into code points (4 bytes each), and convert each code point back into a String object.
$ResultArray = @()
$InputStringBytes = [Text.Encoding]::UTF32.GetBytes($InputString)
# Each UTF-32 code point occupies exactly 4 bytes.
for ($i = 0; $i -lt $InputStringBytes.Length; $i += 4) {
    $ResultArray += [Text.Encoding]::UTF32.GetString($InputStringBytes, $i, 4)
}
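As a usage sketch, the same steps can be wrapped in a small helper and checked against a known input. The function name Split-CodePoint and the sample string are hypothetical, introduced here for illustration only.

```powershell
# Hypothetical helper wrapping the UTF-32 approach described above.
function Split-CodePoint([string]$InputString) {
    $Bytes = [Text.Encoding]::UTF32.GetBytes($InputString)
    for ($i = 0; $i -lt $Bytes.Length; $i += 4) {
        # Emit each 4-byte code point as its own string.
        [Text.Encoding]::UTF32.GetString($Bytes, $i, 4)
    }
}

# Sample input (an assumption): "a" followed by U+1F600.
$Sample = "a" + [char]::ConvertFromUtf32(0x1F600)
$Parts = @(Split-CodePoint $Sample)

$Parts.Count        # 2
$Parts[1].Length    # 2 (the surrogate pair stays together in one string)
```

Writing each string to the output stream, instead of appending with +=, avoids copying the array on every iteration, which may matter for long inputs.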
Limitation:
This method splits a string into individual code points, so Unicode ligatures (which consist of two or more code points) are incorrectly broken apart into their component code points. If you need to keep them together, you can use icu.net.
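If adding icu.net is not an option, the built-in .NET class System.Globalization.StringInfo may cover simpler cases. The sketch below is a swapped-in alternative, not the icu.net approach, and the function name Split-TextElement is hypothetical. How many code points it groups together depends on the .NET version underneath PowerShell, so treat it as a starting point rather than a full replacement.

```powershell
# Hypothetical helper: split a string into .NET "text elements".
function Split-TextElement([string]$InputString) {
    $Enumerator = [System.Globalization.StringInfo]::GetTextElementEnumerator($InputString)
    while ($Enumerator.MoveNext()) {
        # Current is one text element, e.g. a base character plus its combining marks.
        $Enumerator.Current
    }
}

# Sample input (an assumption): "e" + COMBINING ACUTE ACCENT (U+0301) + "a".
$Sample = "e" + [char]0x0301 + "a"
$Parts = @(Split-TextElement $Sample)

$Parts.Count        # 2
$Parts[0].Length    # 2 (base letter and combining mark kept together)
```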