Lazy Diary @ Hatena Blog

PowerShell / Java / miscellaneous things about software development, Tips & Gotchas. CC BY-SA 4.0/Apache License 2.0

charset

There are no properties for ordinary characters in PropList.txt

Problem: When I ran the script at the following URL against PropList.txt from unicode.org, the result file did not contain character properties for ordinary characters like 'x', 'y', or 'z'. http://satob.hatenablog.com/entry/2017/11/21/002957 Reason: PropList.t…

Get CodePoint-Property Pair from Scripts.txt on Unicode.org

Context: You want to make a list of pairs of a Unicode code point and its character property, like below: 00009,Cc 00020,Zs 00021,Po 00024,Sc ... Solution with PowerShell: You can make the list from ftp://ftp.unicode.org/Public/UNIDATA/PropList.…

Supported character encodings in Get-Content and Import-Csv (in PowerShell 2.0/4.0)

Tested on Windows 7 (Japanese). Import-Csv does not have an -Encoding option in PowerShell 2.0. There is no option for UTF-32BE in PowerShell 2.0. (Note: PowerShell ISE can handle UTF-32BE files.) Import-Csv does not support Unknown and Strin…

You cannot use -Encoding option with Import-Csv in PowerShell 2.0

Context: You use PowerShell 2.0 (Windows 7 or Windows Server 2008 R2). You want to read a CSV file that contains non-ASCII characters. Problem: In PowerShell 2.0, the Import-Csv cmdlet doesn’t have an -Encoding option. Solution: If you want to read …
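
One common workaround, which may or may not be the one the truncated excerpt goes on to describe, is to let Get-Content decode the file and pipe the decoded lines to ConvertFrom-Csv. The sketch below is only an illustration: the temporary file name and the sample rows are hypothetical, and the non-ASCII text is built from code points so that the script itself stays ASCII-only.

```powershell
# Hypothetical sample file; in practice $path would point at your existing CSV.
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'sample-utf8.csv'
$tokyo = "$([char]0x6771)$([char]0x4EAC)"   # two kanji, U+6771 U+4EAC
$utf8NoBom = New-Object System.Text.UTF8Encoding($false)
[System.IO.File]::WriteAllLines($path, [string[]]@('name,city', "sato,$tokyo"), $utf8NoBom)

# Get-Content has an -Encoding parameter even where Import-Csv does not,
# so decode with Get-Content and parse the resulting lines with ConvertFrom-Csv.
$rows = @(Get-Content -Path $path -Encoding UTF8 | ConvertFrom-Csv)

$rows[0].city
```

The trade-off is that Get-Content splits the file into lines before parsing, so a quoted field containing an embedded line break may need different handling.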

Difference of acceptable parameters for -Encoding option

The acceptable values for the -Encoding option differ among Get-Content, Set-Content, Export-Csv, Import-Csv, and Out-File. # cmdlet Default ASCII UTF-7 UTF-8 UTF-16LE UTF-16BE UTF-32LE UTF-32BE Byte Default OEM String Unknown 1 Get-Conte…

How to write result of ConvertTo-Csv to a file in UTF-8 without BOM

Context: You want to write the result of ConvertTo-Csv in UTF-8 encoding without a BOM. For example, you need a file that can be read by a Java program (the Java File API cannot handle a BOM in UTF-8 encoded files). UTF-8 in PowerShell, e.g. ConvertTo-Csv…
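
As a minimal sketch of one way to do this (not necessarily the approach the full post takes), the lines from ConvertTo-Csv can be written with a .NET UTF8Encoding instance constructed with its BOM flag set to false. The sample object and the output path are hypothetical.

```powershell
# Hypothetical data and output path, for illustration only.
$data = @(New-Object PSObject -Property @{ Id = 1; Name = 'alpha' })
$outPath = Join-Path ([System.IO.Path]::GetTempPath()) 'no-bom.csv'

# UTF8Encoding($false) means "do not emit a byte order mark".
$utf8NoBom = New-Object System.Text.UTF8Encoding($false)
$lines = [string[]]($data | ConvertTo-Csv -NoTypeInformation)
[System.IO.File]::WriteAllLines($outPath, $lines, $utf8NoBom)

# Check the first bytes: a UTF-8 BOM would be EF BB BF.
$bytes = [System.IO.File]::ReadAllBytes($outPath)
$hasBom = ($bytes.Length -ge 3 -and $bytes[0] -eq 0xEF -and $bytes[1] -eq 0xBB -and $bytes[2] -eq 0xBF)
$hasBom
```

An absolute path is used here on purpose: .NET methods resolve relative paths against the process working directory, which can differ from the current PowerShell location.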

.NET cannot distinguish Shift_JIS from MS932(Windows-31J)

Context: The Japanese character encoding Shift_JIS (シフトJIS) and Microsoft code page 932 (a.k.a. MS932; Windows-31J in the IANA registry) are slightly different. For example, the full-width cent sign (¢) is 0x8191 in both Shift_JIS and MS932, but it is mappe…

How to get information about the character at a given code point from the Ministry of Justice (Japan) 戸籍統一文字情報 site with PowerShell

Problem: If you want to look up information about a kanji (such as its readings, or whether it can be used in a child's name), the Ministry of Justice's 戸…

CharsetEncoder#canEncode() equivalent for PowerShell

Context: You want to test whether a code point is valid in a specific character encoding. Problem: .NET has no equivalent of Java's CharsetEncoder#canEncode(). Solution: If you want to test whether a character is valid in…
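
One way to approximate canEncode() in PowerShell, offered here as a sketch and not as the post's own solution, is to request the encoding with exception fallbacks and treat an EncoderFallbackException as "cannot encode". Test-CanEncode is a hypothetical helper name. On .NET Core based PowerShell, code-page encodings such as Shift_JIS may additionally need [System.Text.Encoding]::RegisterProvider() to be called first, so the example sticks to us-ascii.

```powershell
# Test-CanEncode is a hypothetical helper name, not a built-in cmdlet.
function Test-CanEncode {
    Param(
        [Parameter(Mandatory=$true)] [string] $Text,
        [Parameter(Mandatory=$true)] [string] $EncodingName
    )
    # Ask for an encoding that throws instead of silently substituting '?'.
    $encFallback = [System.Text.EncoderFallback]::ExceptionFallback
    $decFallback = [System.Text.DecoderFallback]::ExceptionFallback
    $enc = [System.Text.Encoding]::GetEncoding($EncodingName, $encFallback, $decFallback)
    try {
        [void] $enc.GetBytes($Text)
        return $true
    } catch {
        # Only an encoder fallback failure means "not encodable"; rethrow anything else.
        if ($_.Exception.GetBaseException() -is [System.Text.EncoderFallbackException]) {
            return $false
        }
        throw
    }
}

Test-CanEncode -Text 'abc' -EncodingName 'us-ascii'                    # True
Test-CanEncode -Text ([string][char]0x3042) -EncodingName 'us-ascii'   # False (U+3042 is not ASCII)
```

Unlike the default replacement fallback, the exception fallback makes the failure observable, which is what lets a round trip through GetBytes() act as a validity test.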

System.Text.Encoding.GetEncodings() does not show all available encodings after calling RegisterProvider()

Problem: If you want to use character encodings other than the default registered encodings, you have to call [System.Text.Encoding]::RegisterProvider([System.Text.CodePagesEncodingProvider]::Instance). But even if you call that method, th…

How to convert from a code point (U+xxxx) to a code point in another character encoding

function Convert-CodePoint {
    Param(
        [Parameter(ValueFromPipeline=$true,Mandatory=$true)]
        [string] $CodePoint,
        [Parameter(ValueFromPipeline=$false,Mandatory=$true)]
        $From,
        [Parameter(ValueFromPipeline=$false,Mandatory=$true)]
        $To
    )
    begin { …

How to convert from a code point (U+xxxx) to a character

function ConvertFrom-CodePoint {
    Param(
        [Parameter(ValueFromPipeline=$true,Mandatory=$true)]
        [string] $CodePoint,
        [Parameter(ValueFromPipeline=$false,Mandatory=$true)]
        $From
    )
    begin {
        [System.Text.Encoding]::RegisterProvider([System.Text.C…