Encode::UTF8Mac makes you happy while handling file names on Mac OS X.

tomi-ru
2010-12-24

This entry describes Encode::UTF8Mac.

Summary

Use Encode::UTF8Mac to handle file names on Mac OS X.

use Encode;
use Encode::UTF8Mac; # provides 'utf-8-mac'

my $filename = Encode::decode('utf-8-mac', '<filename from MacOS>');


Mac OS X and UTF-8

On Mac OS X, the filesystem uses UTF-8 encoding. This is better news than the Microsoft code pages of the Win32 world.

use autodie;
use utf8;
use Encode;

system(touch => Encode::encode('utf-8', '問題ない.txt'));

But we must keep in mind that "UTF-8 on Mac OS X" is based on NFD-normalized Unicode.


What is the NFD?

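In Unicode, some characters have multiple representations.

For example, "é" can be written as the single code point U+00E9, or as "e" followed by the combining acute accent U+0301. Likewise, "だ" can be written as U+3060, or as "た" (U+305F) followed by the combining voiced sound mark U+3099.

Simply put, NFD (Normalization Form D, canonical decomposition) converts the "single code point" style to the "one character as two code points" style.

NFC (Normalization Form C, canonical composition) does the reverse.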
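The two forms can be compared directly with the core Unicode::Normalize module. This is only a minimal sketch; the variable names are made up for illustration.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $composed   = "\x{e9}";    # "é" as a single code point
my $decomposed = "e\x{301}";  # "e" followed by COMBINING ACUTE ACCENT

# The two strings render the same, but they are different Perl strings.
print "equal before normalization? ",
    ($composed eq $decomposed ? "yes" : "no"), "\n";

# NFD splits the composed form; NFC joins the decomposed form.
print "length after NFD: ", length(NFD($composed)),   "\n";
print "length after NFC: ", length(NFC($decomposed)), "\n";
print "equal after NFC? ",
    (NFC($decomposed) eq $composed ? "yes" : "no"), "\n";
```

Comparing file names only makes sense after both sides have been brought to the same normalization form.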


The Mac OS X filesystem uses UTF-8 based on NFD-normalized Unicode

We usually use the NFC style, but OS X uses the NFD style.

When a string is passed to the file system, it is transparently converted to the NFD style.

use autodie;
use utf8;
use Encode;

system(touch => Encode::encode('utf-8', 'pokémon.txt'));
system(touch => Encode::encode('utf-8', 'だん.txt'));

The code above passes UTF-8 bytes (NFC style) to the filesystem, and OS X NFD-normalizes them automatically.

That's all right.

But you have to be careful when you receive file names from the file system.

use Encode;
my @files = map { Encode::decode('utf-8', $_) } glob('*.txt');

use Data::Dumper;
warn Dumper \@files;

# $VAR1 = [
#           "\x{305f}\x{3099}\x{3093}.txt",
#           "poke\x{301}mon.txt"
#         ];

Woot, NFD style! I want NFC style text X(

Don't panic. We can use Unicode::Normalize.

use Encode;
use Unicode::Normalize;
my @files = map { NFC( Encode::decode('utf-8', $_) ) } glob('*.txt');

use Data::Dumper;
warn Dumper \@files;

# $VAR1 = [
#           "\x{3060}\x{3093}.txt",
#           "pok\x{e9}mon.txt"
#         ];


Is it enough?

Unfortunately, No.

OS X does not follow the NFD specification exactly (for compatibility with older Mac systems). Some characters are left undecomposed.

http://developer.apple.com/library/mac/#qa/qa2001/qa1173.html

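For example, the CJK compatibility ideograph U+FA1B (a variant of 福, U+798F) is stored as-is, although standard Unicode normalization would replace it with U+798F.

Most of these special characters are rare Kanji, so unless you use Asian characters, the problem will not happen. But even if you are not in Asia, such characters may be included in your iTunes music directory.

If you simply use Unicode::Normalize::NFC(), you may get a file name you do not expect.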

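use autodie;
use utf8;
use Encode;
use Unicode::Normalize;

# "\x{FA1B}" is the compatibility form of 福; OS X stores it undecomposed
system(touch => Encode::encode('utf-8', "七\x{FA1B}神.txt"));

my @files = map { NFC( Encode::decode('utf-8', $_) ) } glob('*.txt');

# NFC has replaced U+FA1B with U+798F, so the name no longer matches
warn @files; # => "七\x{798F}神.txt", not "七\x{FA1B}神.txt"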
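The effect can be reproduced without touching the filesystem. The following sketch uses only the core Unicode::Normalize module and checks what standard normalization does to the compatibility ideograph U+FA1B; the variable names are illustrative.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $compat = "\x{FA1B}";  # CJK COMPATIBILITY IDEOGRAPH-FA1B

my $nfc = NFC($compat);
my $nfd = NFD($compat);

# Both standard forms map the compatibility ideograph to U+798F.
printf "NFC gives U+%04X\n", ord $nfc;
printf "NFD gives U+%04X\n", ord $nfd;

# So a name read back from OS X and passed through NFC() is changed.
print "unchanged by NFC? ", ($nfc eq $compat ? "yes" : "no"), "\n";
```

Because the mapping is one-way, there is no way to recover the original U+FA1B from the normalized string afterwards.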


How to handle Mac-specific normalized UTF-8?

Use Encode::UTF8Mac. It provides a "utf-8-mac" encoding that converts between the two styles automatically, leaving the special characters untouched.

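use Encode;
use Encode::UTF8Mac; # provides 'utf-8-mac'

my @files = map { Encode::decode('utf-8-mac', $_) } glob('*.txt');

warn @files; # => "七福神.txt" (the compatibility ideograph U+FA1B is preserved)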
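To give a feel for what "convert, except special characters" means, here is a rough sketch built on the core Unicode::Normalize module only. It is not the Encode::UTF8Mac implementation. The helper name mac_like_nfc is hypothetical, and the excluded ranges (U+2000-U+2FFF, U+F900-U+FAFF, U+2F800-U+2FAFF) are an assumption that should be checked against the Apple Q&A linked above.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

# Hypothetical helper: compose to NFC, but leave alone the code points
# that OS X is assumed not to decompose (ranges are an assumption).
sub mac_like_nfc {
    my ($text) = @_;
    # Apply NFC to each run of characters outside the excluded ranges;
    # characters inside the ranges pass through unchanged.
    $text =~ s/([^\x{2000}-\x{2FFF}\x{F900}-\x{FAFF}\x{2F800}-\x{2FAFF}]+)/NFC($1)/ge;
    return $text;
}

# A decoded name as it might come back from the filesystem:
# U+FA1B kept as-is, "é" and "だ" in decomposed form.
my $from_fs = "\x{4e03}\x{fa1b}\x{795e} poke\x{301}mon \x{305f}\x{3099}\x{3093}";
my $fixed   = mac_like_nfc($from_fs);

print "compatibility ideograph kept? ",
    (index($fixed, "\x{fa1b}") >= 0 ? "yes" : "no"), "\n";
print "accent composed? ",
    (index($fixed, "pok\x{e9}mon") >= 0 ? "yes" : "no"), "\n";
```

In real code, prefer the 'utf-8-mac' encoding from Encode::UTF8Mac over a hand-rolled helper like this one.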


One more thing

I hope that Encode::Locale will support this.

When that happens, just use the "locale" encoding to handle file names, whether on Mac, Windows, or Linux.


-- Naoki Tomita (tomi-ru)