Topic: MAC OS Unicode File Name

nhatdung

  • New Member
  • *
  • Posts: 37
MAC OS Unicode File Name
« on: April 09, 2011, 10:38:38 am »
Hi!
I just ported my app from Linux to Mac OS, and I have a problem with Unicode file names.
I use FindFirst to enumerate all files in a directory and get their names. The strange thing is that every Unicode character is encoded differently from Windows/Linux.
Example:
the character "ệ" (code = 7879) has length 3, and its three chars have the codes 101, 803, 770.

I have attached two files: one contains the character as produced on Mac OS, the other as produced on Windows. They display the same, but their sizes differ; both are encoded in Unicode.

Is there any way to convert between the two forms?

« Last Edit: April 09, 2011, 10:40:18 am by nhatdung »

Phil

  • Hero Member
  • *****
  • Posts: 2737
Re: MAC OS Unicode File Name
« Reply #1 on: April 09, 2011, 06:23:31 pm »
If you look at the content of those files, you can see what's happening:

hexdump -C char_MAC_OS.txt
00000000  ff fe 65 00 23 03 02 03                           |..e.#...|

hexdump -C char_WINDOWS.txt
00000000  ff fe c7 1e                                       |....|

Both files are UTF-16 with a little-endian byte order mark at the start (ff fe). The Windows file then contains the single UTF-16 char ệ (1e c7, i.e. U+1EC7). The Mac file, however, contains the normal lower-case e followed by two UTF-16 combining diacritical chars (U+0323 dot below, U+0302 circumflex above). Combined, those 3 chars display the same as the single ệ char.
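
To check those byte pairs without hexdump, they can also be decoded by hand. What follows is only a minimal Free Pascal sketch: the program and the DumpUtf16LE helper are made-up names for illustration, and the byte values are copied from the two hexdumps above. It skips the ff fe byte order mark and prints each little-endian 16-bit code unit, which should give 101, 803 and 770 for the Mac file and 7879 for the Windows file.

```pascal
program DecodeUtf16LE;
{$mode objfpc}{$H+}

uses
  SysUtils;

// Prints every UTF-16 code unit of a little-endian byte sequence,
// skipping a leading FF FE byte order mark if there is one.
// (Hypothetical helper, not part of FPC or the LCL.)
procedure DumpUtf16LE(const Bytes: array of Byte);
var
  i: Integer;
  CodeUnit: Word;
begin
  i := 0;
  if (Length(Bytes) >= 2) and (Bytes[0] = $FF) and (Bytes[1] = $FE) then
    i := 2;
  while i + 1 <= High(Bytes) do
  begin
    // Little-endian: the low byte comes first, the high byte second.
    CodeUnit := Bytes[i] or (Word(Bytes[i + 1]) shl 8);
    WriteLn('U+', IntToHex(CodeUnit, 4), ' = ', CodeUnit);
    Inc(i, 2);
  end;
end;

begin
  WriteLn('char_MAC_OS.txt:');
  DumpUtf16LE([$FF, $FE, $65, $00, $23, $03, $02, $03]);
  WriteLn('char_WINDOWS.txt:');
  DumpUtf16LE([$FF, $FE, $C7, $1E]);
end.
```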

If you got those 3 UTF-16 chars on Mac via FindFirst, that would sound like an FPC problem that you need to solve.

Thanks.

-Phil



nhatdung

  • New Member
  • *
  • Posts: 37
Re: MAC OS Unicode File Name
« Reply #2 on: April 09, 2011, 07:28:24 pm »
Hmm, so is this a problem with FindFirst/FindFirstUTF8 on Mac, or is that just how the Mac system works?

Phil

  • Hero Member
  • *****
  • Posts: 2737
Re: MAC OS Unicode File Name
« Reply #3 on: April 09, 2011, 07:40:47 pm »
How did you enter ệ into the file name?

Try using Character Viewer to insert ệ to be sure that you're entering a single char.

Thanks.

-Phil

Jonas Maebe

  • Hero Member
  • *****
  • Posts: 1059
Re: MAC OS Unicode File Name
« Reply #4 on: April 09, 2011, 08:31:29 pm »
Quote
I just ported my app from Linux to Mac OS, and I have a problem with Unicode file names.
I use FindFirst to enumerate all files in a directory and get their names. The strange thing is that every Unicode character is encoded differently from Windows/Linux.
Example:
the character "ệ" (code = 7879) has length 3, and its three chars have the codes 101, 803, 770.
The Mac way is called "decomposed Unicode". The reason they do that is so that you cannot have two different files in a directory that are both called ệ. On Linux, you can (one encoded without decomposition, one with -- EDIT: you can actually have at least 4 different files named ệ on Linux, since you can decompose it in at least three different ways).

They could of course also have standardized on the composed form, but I guess it's quicker to convert all possible precomposed characters into decomposed ones than to convert all possible partially precomposed and decomposed characters into precomposed ones.

See http://developer.apple.com/library/mac/#qa/qa2001/qa1235.html for more info.

You can pass precomposed strings to file API functions on Mac OS X, but the system will internally convert them to decomposed form before using them.

Quote
I have attached two files: one contains the character as produced on Mac OS, the other as produced on Windows. They display the same, but their sizes differ; both are encoded in Unicode.

Is there any way to convert between the two forms?
If you only want to compare file names, use the LCL function filectrl.CompareFilenames (or filectrl.CompareFilenamesIgnoreCase).

I don't know whether there's a standard Lazarus function to convert to precomposed/decomposed form. The link above contains a C example that you can convert to Pascal on Mac OS X though (using the declarations from the MacOSAll unit).
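
For reference, here is a rough, untested sketch of how the CFStringNormalize route could look in Pascal. It assumes that the MacOSAll declarations mirror the C API one to one (CFStringCreateMutable, CFStringAppendCharacters, CFStringNormalize, CFStringGetLength, CFStringGetCharacters, CFRangeMake, CFRelease); NormalizeUnicode and the demo program are made-up names. The main point is that CFStringNormalize expects a real CFMutableString object created by CoreFoundation; handing it a raw memory buffer is a likely way to end up with EXC_BAD_ACCESS.

```pascal
program NormalizeDemo;
{$mode objfpc}{$H+}

uses
  MacOSAll;  // Mac OS X only

// Returns S converted to the requested normalization form, e.g.
// kCFStringNormalizationFormC (precomposed) or
// kCFStringNormalizationFormD (decomposed).
// Hypothetical helper; signatures are assumed to match the C headers.
function NormalizeUnicode(const S: UnicodeString;
  Form: CFStringNormalizationForm): UnicodeString;
var
  Ref: CFMutableStringRef;
  Len: CFIndex;
begin
  Result := '';
  if S = '' then
    Exit;
  // Create an empty mutable CFString (nil = default allocator,
  // 0 = no length limit) and copy the UTF-16 code units into it.
  Ref := CFStringCreateMutable(nil, 0);
  if Ref = nil then
    Exit;
  try
    CFStringAppendCharacters(Ref, UniCharPtr(PUnicodeChar(S)), Length(S));
    CFStringNormalize(Ref, Form);
    // Copy the normalized code units back into a Pascal string.
    Len := CFStringGetLength(Ref);
    SetLength(Result, Len);
    if Len > 0 then
      CFStringGetCharacters(Ref, CFRangeMake(0, Len), UniCharPtr(@Result[1]));
  finally
    CFRelease(Ref);
  end;
end;

var
  Decomposed, Composed: UnicodeString;
begin
  // e + combining dot below + combining circumflex, as returned on the Mac
  Decomposed := UnicodeString('e') + WideChar($0323) + WideChar($0302);
  Composed := NormalizeUnicode(Decomposed, kCFStringNormalizationFormC);
  if Composed <> '' then
    WriteLn(Length(Composed), ' code unit(s), first code = ', Ord(Composed[1]));
end.
```

If the assumptions hold, the demo should report a single code unit with code 7879. Note that the KD form is a compatibility decomposition, so it would not produce the precomposed character; form C is the one that composes.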
« Last Edit: April 09, 2011, 08:48:13 pm by jmaebe »

nhatdung

  • New Member
  • *
  • Posts: 37
Re: MAC OS Unicode File Name
« Reply #5 on: April 10, 2011, 08:26:44 am »
I need the exact file name in order to SHA-1 it :(, that's why I need to convert the file name.
I followed the link
http://developer.apple.com/library/mac/#qa/qa2001/qa1235.html
but it seems I did something wrong :(
Does anyone know how to use CFStringNormalize? I always get EXC_BAD_ACCESS.
Code:
var s : WideString;
    sz : integer;
    P : CFMutableStringRef;
begin
  s := edit1.Text;
  sz := length(s) * sizeof(WORD);
  GetMem(P, sz);
  System.Move(s[1], P^, sz);
  CFStringNormalize(P, kCFStringNormalizationFormKD);
  FreeMem(P);     
I also tried method 2 from that page, but it didn't work.
This is the original code:

Code:
static OSStatus ConvertUnicodeToCanonical(
            Boolean precomposed,
            const UniChar *inputBuf, ByteCount inputBufLen,
            UniChar *outputBuf, ByteCount outputBufSize,
            ByteCount *outputBufLen)
    /* As is standard with the Unicode Converter,
    all lengths are in bytes. */
{
    OSStatus            err;
    OSStatus            junk;
    TextEncodingVariant variant;
    UnicodeToTextInfo   uni;
    UnicodeMapping      map;
    ByteCount           junkRead;

    assert(inputBuf     != NULL);
    assert(outputBuf    != NULL);
    assert(outputBufLen != NULL);

    if (precomposed) {
        variant = kUnicodeCanonicalCompVariant;
    } else {
        variant = kUnicodeCanonicalDecompVariant;
    }
    map.unicodeEncoding = CreateTextEncoding(kTextEncodingUnicodeDefault,
                                             kUnicodeNoSubset,
                                             kTextEncodingDefaultFormat);
    map.otherEncoding   = CreateTextEncoding(kTextEncodingUnicodeDefault,
                                             variant,
                                             kTextEncodingDefaultFormat);
    map.mappingVersion  = kUnicodeUseLatestMapping;

    uni = NULL;

    err = CreateUnicodeToTextInfo(&map, &uni);
    if (err == noErr) {
        err = ConvertFromUnicodeToText(uni, inputBufLen, inputBuf,
                                       kUnicodeDefaultDirectionMask,
                                       0, NULL, NULL, NULL,
                                       outputBufSize, &junkRead,
                                       outputBufLen, outputBuf);
    }

    if (uni != NULL) {
        junk = DisposeUnicodeToTextInfo(&uni);
        assert(junk == noErr);
    }

    return err;
}

and this is my converted code:

Code:
function ConvertUnicodeToCanonical(UniChar: string;var OutBuf : string): integer;
var v : TextEncodingVariant;
    map : UnicodeMapping;
    uni : UnicodeToTextInfo;
    junkRead : ByteCount;
    outputBufLen, outputBufSize : ByteCount;
    outputBuf, P : UniCharPTR;
begin
  result := -1;
  v := kUnicodeCanonicalCompVariant;
  //v := kUnicodeCanonicalDecompVariant;
  map.unicodeEncoding := CreateTextEncoding(kTextEncodingUnicodeDefault,
                                            kUnicodeNoSubset,
                                            kTextEncodingDefaultFormat);
  map.otherEncoding   := CreateTextEncoding(kTextEncodingUnicodeDefault,
                                            v,
                                            kTextEncodingDefaultFormat);
  map.mappingVersion  := kUnicodeUseLatestMapping;

  uni := nil;

  result := CreateUnicodeToTextInfo(@map, uni);
  OutBuf := '';
  if (result = 0) then
  begin
      outputBufSize := 10240;
      GetMem(outputBuf, outputBufSize);
      P := UniCharPTR(PWideChar(Unichar));
      result := ConvertFromUnicodeToText(uni, Length(UniChar), P,
                                     kUnicodeDefaultDirectionMask,
                                     0, nil, nil, nil,
                                     outputBufSize, junkRead,
                                     outputBufLen, outputBuf);
      outBuf := PChar(outputBuf);
      FreeMem(outputBuf);
  end;


  if (uni <> nil) then
  begin
    DisposeUnicodeToTextInfo(uni);
  end;
end;   

Jonas Maebe

  • Hero Member
  • *****
  • Posts: 1059
Re: MAC OS Unicode File Name
« Reply #6 on: April 10, 2011, 11:59:48 am »
Quote
I need the exact file name in order to SHA-1 it

On Mac OS X, the exact filename is the decomposed form. That's how it is stored on disk. If you are going to always force a precomposed form, your program will be wrong more often than not.
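
To illustrate why the form matters for hashing, here is a small sketch using the sha1 unit from FPC's fcl-hash package (assumed to be available in your FPC version). The two constants are the UTF-8 bytes of the precomposed ệ (U+1EC7) and of the decomposed sequence e, U+0323, U+0302; the program and constant names are made up. Because the byte sequences differ, the digests differ too, so a program that compares hashes across platforms would have to agree on one form and normalize to it before hashing.

```pascal
program HashFileNames;
{$mode objfpc}{$H+}

uses
  sha1;  // from fcl-hash

const
  // UTF-8 encoding of U+1EC7 (precomposed form)
  ComposedName   = #$E1#$BB#$87;
  // UTF-8 encoding of U+0065 U+0323 U+0302 (decomposed form)
  DecomposedName = 'e'#$CC#$A3#$CC#$82;

var
  HashComposed, HashDecomposed: String;
begin
  HashComposed   := SHA1Print(SHA1String(ComposedName));
  HashDecomposed := SHA1Print(SHA1String(DecomposedName));
  WriteLn('composed:   ', HashComposed);
  WriteLn('decomposed: ', HashDecomposed);
  // Same visible name, different bytes, therefore different digests.
  if HashComposed <> HashDecomposed then
    WriteLn('Digests differ: pick one form and normalize before hashing.');
end.
```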

I don't have time right now to look at your code.

 
