Lazy Diary @ Hatena Blog

PowerShell / Java / miscellaneous things about software development, Tips & Gotchas. CC BY-SA 4.0/Apache License 2.0

How to separate a string into codepoint-wise characters with PowerShell

Context:

You have a Unicode string that contains non-ASCII characters as well as ASCII characters. You want to split that string into individual characters.

Problem:

If you split the string with the code below:

$TemporaryArray = $InputString -split "";
$ResultArray = $TemporaryArray[1..($TemporaryArray.length-2)];

You will have a problem: a character that is represented as a surrogate pair (U+10000 to U+10FFFF) is split into its high surrogate and its low surrogate, neither of which is a valid character on its own.

Reason:

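The PowerShell -split operator is not surrogate-pair aware, and this seems to be by design: .NET strings are sequences of UTF-16 code units, and the regex behind -split can match between any two code units.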
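
The following is a minimal sketch that makes the problem visible. It assumes nothing beyond standard .NET calls; the variable name $Parts is only for illustration, and U+2000B (𠀋) is used as a sample character outside the BMP.

```powershell
# U+2000B (𠀋) lies outside the BMP, so UTF-16 stores it as a surrogate pair.
$InputString = [char]::ConvertFromUtf32(0x2000B)
$InputString.Length                        # 2 (UTF-16 code units, not characters)

# -split "" also returns an empty first and last element; drop them here.
$Parts = ($InputString -split "") -ne ""
$Parts.Count                               # 2

# Each element is half of the pair and is not a valid character by itself.
[char]::IsHighSurrogate($Parts[0], 0)      # True
[char]::IsLowSurrogate($Parts[1], 0)       # True
```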

Solution:

First convert the string into a UTF-32 byte array, then split it into code points (4 bytes each), and convert each of them back to a String object.

$ResultArray = @();
$InputStringBytes = [Text.Encoding]::UTF32.GetBytes($InputString);
for ($i=0; $i -lt $InputStringBytes.length; $i+=4) {
     $ResultArray += [Text.Encoding]::UTF32.GetString($InputStringBytes, $i, 4);
}
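
As a quick check of this technique, the sketch below wraps the same loop in a helper and applies it to a mixed string. The function name Split-CodePoint and the sample string are hypothetical and only for illustration.

```powershell
# Hypothetical helper that wraps the UTF-32 technique shown above.
function Split-CodePoint {
  Param(
    [Parameter(Mandatory=$true)]
    [string] $InputString
  )
  $Bytes = [Text.Encoding]::UTF32.GetBytes($InputString)
  for ($i = 0; $i -lt $Bytes.Length; $i += 4) {
    # Each 4-byte group is one code point; decode it back to a string.
    [Text.Encoding]::UTF32.GetString($Bytes, $i, 4)
  }
}

# "a", U+2000B (𠀋, outside the BMP), "b"
$Sample = "a" + [char]::ConvertFromUtf32(0x2000B) + "b"
$Sample.Length                       # 4 UTF-16 code units

$Parts = @(Split-CodePoint $Sample)
$Parts.Count                         # 3 code points
$Parts[1].Length                     # 2: the surrogate pair stays together
```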

Limitation:

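This method splits a string into individual code points, so anything that consists of two or more code points (such as a Unicode ligature, or a base letter followed by a combining mark) is still wrongly broken apart into its code points. You can use icu.net if you need to keep such sequences together.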
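
For simple cases, the .NET class System.Globalization.StringInfo may be enough instead of icu.net. The sketch below is an illustration under that assumption, not a replacement for ICU: which sequences count as one text element depends on the .NET version (for example, emoji joined by ZWJ are only grouped on newer runtimes). The variable names are hypothetical.

```powershell
# "e" followed by U+0301 COMBINING ACUTE ACCENT:
# two code points, but one character as the reader sees it.
$Combined = "e" + [char]0x0301

$Info = New-Object System.Globalization.StringInfo -ArgumentList $Combined
$Info.LengthInTextElements            # 1

# Collect each text element as its own string.
$Elements = @()
for ($i = 0; $i -lt $Info.LengthInTextElements; $i++) {
  $Elements += $Info.SubstringByTextElements($i, 1)
}
$Elements.Count                       # 1
$Elements[0].Length                   # 2 (both code points are kept together)
```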

Difference in behavior between String#split() in Java and the -split operator in PowerShell

Both String#split() in Java and the -split operator in PowerShell take a regex as an argument and split a string into an array, but they behave differently when you pass an empty string as the argument.

In Java:

System.out.println("abc".split("").length); // -> 3

Whereas in PowerShell:

PS > ("abc" -split "").Length  # -> 5

Because ("abc" -split "") returns @("","a","b","c","").

Also in PowerShell:

PS > ("abc".split("")).Length  # -> 1

Because split("") here is the .NET String.Split() method, which does not take a regex, and with an empty separator it does not split "abc" at all.
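
If you want the Java-like result of three elements in PowerShell, one possible approach is sketched below. The variable names are only for illustration. Note that both variants still work on UTF-16 code units, so they also break surrogate pairs apart.

```powershell
# -split "" yields an empty string at both ends; filter them out.
$Filtered = ("abc" -split "") -ne ""
$Filtered.Count                 # 3

# ToCharArray() returns [char] values (UTF-16 code units) instead of strings.
$Chars = "abc".ToCharArray()
$Chars.Length                   # 3
```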

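Which is more important: work or greetings?

  • (A) Work is more important than greetings.
  • (B) Greetings and work are about equally important.
  • (C) Greetings are more important than work.

Quoted from Hisao Nakai (中井久夫), 「治療文化論」, p. 104:

In Tokyo, being able to greet people properly is quite important, on a par with working; in Nagoya, being able to work matters more than greetings.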

How to extract non-MS932 (Shift_JIS) compliant characters from a string

function Get-NonMS932CompliantCharacter {
  Param(
    [Parameter(ValueFromPipeline=$true,Mandatory=$true)]
    [string] $TargetString
  )
  process {
    $TargetStringBytes = [Text.Encoding]::UTF32.GetBytes($TargetString);
    for ($i=0; $i -lt $TargetStringBytes.Length; $i+=4) {
        $TargetChar = [Text.Encoding]::UTF32.GetString($TargetStringBytes, $i, 4);
        $MS932Bytes = [Text.Encoding]::GetEncoding(932).GetBytes($TargetChar);
        $MS932Char = [Text.Encoding]::GetEncoding(932).GetString($MS932Bytes,0,$MS932Bytes.Length)
        if ($TargetChar -ne $MS932Char) {
            $TargetChar
        }
    }
  }
}

Example:

PS > "あえうえお①𩸽X𠀋か㐂" | Get-NonMS932CompliantCharacter
𩸽
𠀋
㐂

tr equivalent in PowerShell (Unicode surrogate pair-aware)

There is no straightforward tr equivalent on Windows, so I made a function that you can use like the tr command. This tr function handles Unicode characters, including surrogate pairs.

function tr {
  Param(
    [Parameter(ValueFromPipeline=$true,Mandatory=$true)]
    [string] $TargetString,
    [Parameter(Mandatory=$true)]
    [string] $FromString,
    [Parameter(Mandatory=$true)]
    [string] $ToString
  )
  begin {
    # [-split ""] splits a surrogate pair into two invalid characters,
    # so the code below is not suitable for this purpose
    # $FromStringArray = $FromString -split "";
    # $FromStringArray = $FromStringArray[1..($FromStringArray.length-2)];

    # Split string into character array
    $FromStringArray = @();
    $FromStringBytes = [Text.Encoding]::UTF32.GetBytes($FromString);
    for ($i=0; $i -lt $FromStringBytes.length; $i+=4) {
         $FromStringArray += [Text.Encoding]::UTF32.GetString($FromStringBytes, $i, 4);
    }

    $ToStringArray = @();
    $ToStringBytes = [Text.Encoding]::UTF32.GetBytes($ToString);
    for ($i=0; $i -lt $ToStringBytes.length; $i+=4) {
         $ToStringArray += [Text.Encoding]::UTF32.GetString($ToStringBytes, $i, 4);
    }
  }
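  process {
    # Translate code point by code point, so that a character that has already
    # been translated is not translated again by a later mapping
    # (e.g. -FromString "ab" -ToString "ba" swaps the two characters).
    $Result = New-Object System.Text.StringBuilder;
    $TargetStringBytes = [Text.Encoding]::UTF32.GetBytes($TargetString);
    for ($i=0; $i -lt $TargetStringBytes.Length; $i+=4) {
        $TargetChar = [Text.Encoding]::UTF32.GetString($TargetStringBytes, $i, 4);
        $Index = [Array]::IndexOf($FromStringArray, $TargetChar);
        if ($Index -ge 0 -and $Index -lt $ToStringArray.Length) {
            $TargetChar = $ToStringArray[$Index];
        }
        [void] $Result.Append($TargetChar);
    }
    $Result.ToString()
  }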
}

Example:

PS > @("𩸽𠀋", "あいうえおあお") | tr -FromString "𩸽𠀋うえお" -ToString "○𡶷ウエオ"

○𡶷
あいウエオあオ