POSIX character classes in regexp

2010-12-07

Here is a string `"Hello, world!^["' that containing escape character. How can I remove escape character from this string? Let's get start it.

Note: raw escape char cannot see in web browser, so in this article I use two-bytes string `^[' as a single-byte escape char.

First way, put escape char literally in regexp.

    my $str = "Hello, world!^[";
    $str =~ s/^[//;

This is ok, but this way isn't copy-and-paste friendly.

Second way, use octal/hex char instead of raw escape sequence.

    my $str = "Hello, world!^[";
    $str =~ s/\x1B//;            # \x1B is escape char
    my $str = "Hello, world!^[";
    $str =~ s/\033//;            # \033 is escape char

These are also ok, but I think `\x1B' and `\033' are bit unreadable.

Third way, use POSIX character class `[[:print:]]'.

    my $str = "Hello, world!^[";
    $str =~ s/[^[:print:]]//;

Good. This is more readable than second way.

POSIX character classes looks like `[:name:]'. `[:' and `:]' are delimiter. `name' is self-descriptive. POSIX character classes are only available in bracketed character class. So valid regexp has continuous brackets looks like `[[:print:]]'.

POSIX character classes are just sets of characters, similar to a bracketed character class like `[aiueo]'. You can use POSIX character classes with any other parts of normal characters and meta characters.

    /[[:alpha:]]/;      # matches all of alphabet characters, same as [a-zA-Z].
    /[cntrl[:cntrl:]]/; # matches control characters and five lower case alphabets `c', `n', `t', `r', `l'.
    /[^[:print:]]/;     # negation of [[:print:]], matches all non-printable characters.

For more details, see perlre and perlrecharclass.

Kensuke Kaneko (@kyanny)