Skip to main content

Opening files with unicode (japanese/chinese) characters in filename using Perl

(Read a complete article on the matter and much more
"Unicode issues regarding the Window OS file system and their handling from Perl)

Currently there is no way to manipulate a file named using Unicode characters,by using Perl's built in functions.
The perl 5.10 todo wish list states that functions like chdir, opendir, readdir, readlink, rename, rmdir e.g
"could potentially accept Unicode filenames either as input or output".
Windows default encoding is UTF-16LE,but the console 'dir' command will only return ANSI names.Thus unicode characters are replaced with "?"
,even if you invoke the console using the unicode switch (cmd.exe /u),change the codepage to 65001 which is utf8 on windows and use lucida console true type font which supports unicode.
A workaround is to use the COM facilities provided by windows (in this case Scripting.FileSystemObject) which provide a much higher level of abstraction or use the Win32 api calls.
I tried to read a file with japanese characters in the filename which resides in the current folder and then move the file to another folder.
The filename is "は群馬県高崎市を拠点に、様々なメディ.txt"
Since opendir ,readdir,rename etc do not support unicode you have to reside to the Scripting.FileSystemObject methods and properties which accept unicode.

This is the actual code which moves all files with .txt extension:

use Win32::OLE qw(in);
use Devel::Peek;

#CP_UTF8 is very important as it translates between Perl strings and U
+nicode strings used by the OLE interface

Win32::OLE->Option(CP => Win32::OLE::CP_UTF8);  

$obj = Win32::OLE->new('Scripting.FileSystemObject');

$folder = $obj->GetFolder(".");

$collection= $folder->{Files};

mkdir ("c:\\newfolder")||die;

foreach $value (in $collection) {
$filename= %$value->{Name};
next if ($filename !~ /.txt/);
Dump("$filename");  #check if the utf8 flag is on
$file=$obj->GetFile("$filename"); 
$file->Move("c:\\newfolder\\$filename");
print Win32::OLE->LastError() || "success\n\n";
}

If all goes well then a new folder -suprsingly- called 'newfolder' in C: drive would have been created and the japanese named filed should been have moved there.

This will only work if you have the asian languages (regional setings) support enabled and you should be able to see the japanase name in explorer as above

Comments

Anonymous said…
Thanks a lot,saved me loads of time.Exactly what I needed!
Daku said…
Spent ages trying to rename a file using Win32::Unicode / Win32API::File with no luck.
This worked without any hassle, thanks :)

Popular posts from this blog

Book Review : How To Create Pragmatic, Lightweight Languages

At last, a guide that makes creating a language with its associated baggage of lexers, parsers and compilers, accessible to mere mortals, rather to a group of a few hardcore eclectics as it stood until now.

The first thing that catches the eye, is the subtitle:

The unix philosophy applied to language design, for GPLs and DSLs"
What is meant by "unix philosophy" ?. It's taking simple, high quality components and combining them together in smart ways to obtain a complex result; the exact approach the book adopts.
I'm getting ahead here, but a first sample of this philosophy becomes apparent at the beginnings of Chapter 5 where the Parser treats and calls the Lexer like  unix's pipes as in lexer|parser. Until the end of the book, this pipeline is going to become larger, like a chain, due to the amount of components that end up interacting together.

The book opens by putting things into perspective in Chapter 1: Motivation: why do you want to build lan…

How Much Gameplay Can You Pack In Just 13K?

Given our expectations of Xbox games, you might consider writing a game within a 13K limit, which is the challenge for the annual js13K competition far too restrictive. Its results are now out and prove that it is possible to produce a game that is fun to play. 

Back in the tape loading days and on platforms the likes of Commodore64 games came in sizes of 4K or less. As proof of concept, here's a list of a few such 4K titles, copied over from Lemon64 's archive:
Alien SidestepBug CrusherDot GobblerClose EncountersDot Gobbler v2GridrunnerLaser CyclesMarios BrewerySpace ActionSpace RicoshayTank WarsHesmon64Retro Ball  Fast forward to now, at a time when Javascript's eating the world by making all sorts of applications or  games available to everyone through the medium of the browser, rendering the need of dedicated platforms and Operating systems obsolete, 13K is sufficient enough to pack both gameplay AND cool graphics due to the advanced browser engines and HTML5.

Hour of Code 2017 Introduces App Lab

t's the time of year when the world-class Hour of Code once more commences; just an hour for introducing coding to the uninitiated, having them complete self guided tutorials. But is a hour sufficient? What can a beginner actually code within this limit? The answer is a bit more complicated than that, so let's find out all about it! Integrated into the larger, worldwide, annual Computer Science Education week, this year taking place December 4-10, Hour of Code's novel mission has always been to get everybody coding, aged from 4 to 104, by providing: "a one-hour introduction to computer science, designed to demystify code, showing that anybody can learn the basics, and broadening participation in the field of computer science". But first of all, why this obsession with Computer Science, in particular in getting  kids as young as 4 to learn to code? The answer is simple. Nowadays code is everywhere around us, from desktop computers to mobile phones and, thanks to w…