Lazy Diary @ Hatena Blog

PowerShell / Java / miscellaneous things about software development, Tips & Gotchas. CC BY-SA 4.0/Apache License 2.0

Get CodePoint-Property Pairs from Scripts.txt on Unicode.org

Context

You want to build a list that pairs each Unicode code point with its character property (the General Category abbreviation), like this:

00009,Cc
00020,Zs
00021,Po
00024,Sc
...

Solution with PowerShell

You can generate the list from ftp://ftp.unicode.org/Public/UNIDATA/Scripts.txt. Save the file in the current directory as Scripts.txt, then run this PowerShell pipeline:

# Skip comment lines and blank lines; only data lines remain.
Get-Content ./Scripts.txt |
  Where-Object { ($_ -notlike "#*") -and ($_ -notlike "") } |
  ForEach-Object {
    # A data line looks like "0021..0023    ; Common # Po   [3] ...".
    # Guard on -match so a non-matching line cannot reuse stale $Matches.
    if ($_ -match '(?<CodePoint>[0-9A-F]+(\.\.[0-9A-F]+)?)\s+;\s+(\w+) # (?<PatternName>\w+)') {
      $CodePoint = $Matches.CodePoint
      $PatternName = $Matches.PatternName
      if ($CodePoint -like '*..*') {
        # Expand a range into one output line per code point.
        $StartCodePoint = [Convert]::ToInt32(($CodePoint -split "\.\.")[0], 16)
        $EndCodePoint = [Convert]::ToInt32(($CodePoint -split "\.\.")[1], 16)
        $StartCodePoint..$EndCodePoint | ForEach-Object {
          $_.ToString('X5') + ',' + $PatternName
        }
      } else {
        # A single code point: emit it as a zero-padded, 5-digit hex value.
        [Convert]::ToInt32($CodePoint, 16).ToString('X5') + ',' + $PatternName
      }
    }
  } |
  Sort-Object | Get-Unique | Out-File DetailedPropList.txt -Encoding utf8
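
To see what the pipeline does to an individual line, the parsing step can be pulled out into a small function and run on a couple of sample lines. This is only a sketch: the function name Expand-ScriptsLine and the named groups Start, End and Category are hypothetical and do not appear in the script above, and the two sample lines are assumed to follow the Scripts.txt layout (code point or range, script name, then the category abbreviation after "#").

```powershell
# Hypothetical helper (not part of the original script): turns one data line
# into one "CODEPOINT,Category" string per code point.
function Expand-ScriptsLine {
  param([string]$Line)
  if ($Line -match '^(?<Start>[0-9A-F]+)(\.\.(?<End>[0-9A-F]+))?\s*;\s*\w+\s*#\s*(?<Category>\w+)') {
    $Category = $Matches.Category
    $Start = [Convert]::ToInt32($Matches.Start, 16)
    # A single code point is treated as a range of length one.
    $End = if ($Matches.ContainsKey('End')) { [Convert]::ToInt32($Matches.End, 16) } else { $Start }
    foreach ($cp in $Start..$End) {
      $cp.ToString('X5') + ',' + $Category
    }
  }
}

# Sample lines assumed to be in the Scripts.txt format.
$sample = @(
  '0020          ; Common # Zs       SPACE',
  '0021..0023    ; Common # Po   [3] EXCLAMATION MARK..NUMBER SIGN'
)
$pairs = $sample | ForEach-Object { Expand-ScriptsLine $_ }
$pairs
```

The single code point yields one pair and the range yields three, so $pairs holds four strings, from 00020,Zs to 00023,Po. Lines that do not match the pattern produce no output at all.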