Lazy Diary @ Hatena Blog

PowerShell / Java / miscellaneous things about software development, Tips & Gotchas. CC BY-SA 4.0/Apache License 2.0

What does "US-ASCII only" mean in Java regexp?

Java's Pattern class (regexp) supports POSIX character classes such as \p{XDigit}. They are very useful when you want to check hex strings. In the Java API documentation, the POSIX character classes are marked "(US-ASCII only)". What does that mean? https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html#sum

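By default, these classes match only US-ASCII characters. If you set the UNICODE_CHARACTER_CLASS flag, for example with the embedded flag (?U), they follow the Unicode definitions instead.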

System.out.println("F8FF".matches("\\p{XDigit}+"));         // true
System.out.println("F8FF".matches("(?U)\\p{XDigit}+"));     // true
System.out.println("F8FF".matches("[0-9a-fA-F]+"));         // true
System.out.println("F8FF".matches("(?U)[0-9a-fA-F]+"));     // true
// The input below is full-width "Ｆ８ＦＦ" (U+FF26 U+FF18 U+FF26 U+FF26)
System.out.println("Ｆ８ＦＦ".matches("\\p{XDigit}+"));     // false
System.out.println("Ｆ８ＦＦ".matches("(?U)\\p{XDigit}+")); // true
System.out.println("Ｆ８ＦＦ".matches("[0-9a-fA-F]+"));     // false
System.out.println("Ｆ８ＦＦ".matches("(?U)[0-9a-fA-F]+")); // false

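As you can see, when you specify (?U), \p{XDigit} also matches non-ASCII (full-width) characters, for example the full-width digits U+FF10-FF19. An explicit range such as [0-9a-fA-F] is not affected by the flag. Matching full-width characters may be correct for \p{XDigit} in the POSIX sense, but in practice a hex-string check should usually reject them. So I think you don’t have to worry about the UNICODE_CHARACTER_CLASS flag: the default (US-ASCII only) behavior is what you want here.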
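
For reference, here is a minimal sketch of the same comparison using precompiled patterns. It assumes you would rather pass Pattern.UNICODE_CHARACTER_CLASS to Pattern.compile() than embed (?U) in the pattern. The class name HexCheck and the method names isAsciiHex / isUnicodeHex are made up for this example. The full-width input is written with Unicode escapes so the result does not depend on the source file encoding.

```java
import java.util.regex.Pattern;

public class HexCheck {
    // Default behavior: \p{XDigit} is US-ASCII only, i.e. [0-9a-fA-F].
    private static final Pattern ASCII_HEX = Pattern.compile("\\p{XDigit}+");

    // Same pattern, with the flag passed to compile() instead of embedding (?U).
    private static final Pattern UNICODE_HEX =
            Pattern.compile("\\p{XDigit}+", Pattern.UNICODE_CHARACTER_CLASS);

    // Hypothetical helper: true only for ASCII hex strings.
    public static boolean isAsciiHex(String s) {
        return ASCII_HEX.matcher(s).matches();
    }

    // Hypothetical helper: also accepts Unicode hex digits such as full-width ones.
    public static boolean isUnicodeHex(String s) {
        return UNICODE_HEX.matcher(s).matches();
    }

    public static void main(String[] args) {
        String ascii = "F8FF";
        String fullWidth = "\uFF26\uFF18\uFF26\uFF26"; // full-width "F8FF"

        System.out.println(isAsciiHex(ascii));       // true
        System.out.println(isAsciiHex(fullWidth));   // false
        System.out.println(isUnicodeHex(ascii));     // true
        System.out.println(isUnicodeHex(fullWidth)); // true
    }
}
```

If you only need to validate hex strings, isAsciiHex (the default behavior) is probably all you need.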