* * *

Author Topic: [SOLVED] Reading bytes of data that represent Unicode UTF-16, little endian.  (Read 1807 times)

Gizmo

  • Sr. Member
  • ****
  • Posts: 304
Dudes...I'm struggling...I have read http://wiki.freepascal.org/UTF-8 and no doubt the answer is there but I'm struggling to spot it. Perhaps someone can help.

I have a series of raw bytes that are filenames stored as UTF-16 Unicode. The bytes are part of an array. I'm trying to code something that loops through the appropriate number of bytes of that array and returns the ASCII version of those bytes. However, I keep getting blank fields!

Code: [Select]
try
            // First, get the length of the filename as a decimal value from byte 184.
            // Windows filenames are stored as Unicode, though, so this value
            // will be double, when it comes to byte reading. This bit works OK.
            ShortFileNameLength := pCint8(@MyArray.OtherData[184])^;
            // Jump over byte 185 and start at byte 186, that contains the first filename letter
            StartPosition := 186;
            EndPosition := StartPosition + (ShortFileNameLength*2);
            ShowMessage(IntToStr(ShortFileNameLength));  // Returns '11', which is the number of letters in filename, so this bit works OK.
            ShowMessage(IntToHex(MFTRecordToParse.OtherData[184],2)); // Returns 0x0B, so this bit works OK too. So we know the filename is 11 characters in length, but stored as Unicode, so 22 bytes in reality, with every other byte being 0x00 (thus ShortFileNameLength*2 above).
            for j := MyArray.OtherData[StartPosition] to MyArray.OtherData[EndPosition] do
              begin
                ShortFileNameRAW := Chr(MFTRecordToParse.OtherData[j]);  // Get the char value of each byte
                ShortFileName := ShortFileName + ShortFileNameRAW;  // Add each char as a string
              end;
          finally
            ShowMessage(UTF8ToAnsi(ShortFileName)); // Given that filename will be a Unicode UTF-16 value "T e x t . t x t", use UTF8 conversion
          end;             
« Last Edit: April 18, 2012, 09:13:04 pm by tedsmith »

KpjComp

  • Hero Member
  • *****
  • Posts: 680
Re: Reading bytes of data that represent Unicode UTF-16, little endian.
« Reply #1 on: April 18, 2012, 12:33:56 am »
Quote from: Gizmo
Dudes...I'm struggling...I have read http://wiki.freepascal.org/UTF-8 and no doubt the answer is there but I'm struggling to spot it. Perhaps someone can help.

Wouldn't this link be more appropriate ->
http://wiki.freepascal.org/LCL_Unicode_Support

Gizmo

  • Sr. Member
  • ****
  • Posts: 304
Re: Reading bytes of data that represent Unicode UTF-16, little endian.
« Reply #2 on: April 18, 2012, 12:38:44 am »
D'Oh...that's the one I meant to link to! I have them both open in separate tabs. I'm reading the section 'Accessing UTF8 characters', which seems to be similar to what I need, but I'm trying to work out what's going on there. I was hoping there might be an easier way....e.g. "Feed in X bytes from array to function XYZ and cast them as a Unicode string and then display them"
« Last Edit: April 18, 2012, 12:43:34 am by tedsmith »

KpjComp

  • Hero Member
  • *****
  • Posts: 680
Re: Reading bytes of data that represent Unicode UTF-16, little endian.
« Reply #3 on: April 18, 2012, 12:55:31 am »
To be honest, I'm not a Unicode expert.

But from what I can gather, you want to convert UTF-16 to an AnsiString.

So I assume you could convert UTF-16 to UTF-8, then from UTF-8 to ANSI.

eg.

Code: [Select]
ShortFileName :=utf8ToAnsi(UTF16ToUTF8(PUnicodeString(@MyArray.OtherData[StartPosition])^));

PS: Also, from what I can gather, assuming that every character in UTF-16 takes two bytes is incorrect; characters outside the Basic Multilingual Plane take four bytes (a surrogate pair).
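
To illustrate the four-byte case, here is a minimal sketch, assuming Free Pascal with the UnicodeString type available. The program name and the chosen code point (U+1F600) are only illustrative and do not come from the original data.

```pascal
program SurrogateDemo;
{$mode objfpc}{$H+}
var
  s: UnicodeString;
begin
  // U+1F600 lies outside the Basic Multilingual Plane, so UTF-16 stores it
  // as a surrogate pair: high surrogate $D83D followed by low surrogate $DE00.
  s := '';
  s := s + WideChar($D83D);
  s := s + WideChar($DE00);
  // Length counts 16-bit code units, not characters.
  WriteLn('Code units: ', Length(s));
  WriteLn('Bytes: ', Length(s) * SizeOf(WideChar));
end.
```

Judging by this thread, the length byte in the record counts 16-bit units, so multiplying it by two should still give the right byte count even when a surrogate pair is present.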

ludob

  • Hero Member
  • *****
  • Posts: 1173
Re: Reading bytes of data that represent Unicode UTF-16, little endian.
« Reply #4 on: April 18, 2012, 08:35:53 am »
If your unicode data is zero terminated you can just do:
Code: [Select]
ShortFileName := pwidechar(@MFTRecordToParse.OtherData[StartPosition])
What is this doing? It takes the address of your first Unicode character and says that it is a pointer to a widechar (UCS-2 char). Since pwidechar, like pchar, is assignment compatible with strings, the compiler will convert the zero-terminated widechar string to a string using the system encoding. Depending on your system encoding and the final encoding you need, you can add a UTF8encode:
Code: [Select]
ShortFileName := UTF8encode(pwidechar(@MFTRecordToParse.OtherData[StartPosition]))

KpjComp

  • Hero Member
  • *****
  • Posts: 680
Re: Reading bytes of data that represent Unicode UTF-16, little endian.
« Reply #5 on: April 18, 2012, 12:50:18 pm »
Oops, I've just realized you also stipulated endianness.

If you're running the application on Windows, then you're little endian anyway.
But if you're grabbing this data and running on a big-endian machine, then you will first want to convert the values to native endian.

Look at LEtoN for how to do this. I think you could loop through the part of your array where the string is stored, cast each pair of bytes as a PWord and call LEtoN on it.

Alternatively, I think you could also place a BOM tag at the front of the string.
E.g. 0xFF,0xFE,...,...,,...

I've not really used Unicode that much, my comments might be the blind leading the blind here. :)
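
The LEtoN loop described above could be sketched roughly as follows. This is only an illustration: the Raw array is hypothetical sample data standing in for the relevant slice of OtherData, and the sketch assumes every character fits in one 16-bit unit.

```pascal
program LEtoNDemo;
{$mode objfpc}{$H+}
const
  // "Text" stored as UTF-16 little endian (hypothetical sample data)
  Raw: array[0..7] of Byte = ($54, $00, $65, $00, $78, $00, $74, $00);
var
  W: UnicodeString;
  i, Count: Integer;
begin
  Count := Length(Raw) div 2;   // number of 16-bit code units
  SetLength(W, Count);
  for i := 0 to Count - 1 do
    // Read each 16-bit unit and convert it from little endian to native order.
    // LEtoN does nothing on little-endian hosts and swaps bytes on big-endian ones.
    W[i + 1] := WideChar(LEtoN(PWord(@Raw[i * 2])^));
  WriteLn(UTF8Encode(W));
end.
```

On platforms that fault on unaligned reads, combining the two bytes by hand (low byte plus high byte shifted left by 8) may be the safer choice than the PWord cast.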

Gizmo

  • Sr. Member
  • ****
  • Posts: 304
Re: Reading bytes of data that represent Unicode UTF-16, little endian.
« Reply #6 on: April 18, 2012, 09:12:49 pm »
ludob...as easy as that! Yes, your suggestion worked. My earlier mistake, I think, was not quite realising and understanding the types of variables needed for Unicode and how to cast them.

To improve just slightly, I have the following:

Code: [Select]
ShortFileName := UTF8encode(WideCharLenToString(@MFTRecordToParse.OtherData[186],ShortFileNameLength))

which ensures only the actual filename length is pulled out (which is calculated and passed as ShortFileNameLength) as opposed to any remaining data from earlier entries (record slack) which was being returned at the tail end of the filename.

Many thanks as always gents. I do value your time and contribution.
« Last Edit: April 18, 2012, 09:25:07 pm by tedsmith »

KpjComp

  • Hero Member
  • *****
  • Posts: 680
So you don't require UTF-16 decoding, but UCS-2 decoding.
I blame Microsoft for inventing its own standards, and just adding to the confusion that is Unicode. :)

ludob

  • Hero Member
  • *****
  • Posts: 1173
Re: Reading bytes of data that represent Unicode UTF-16, little endian.
« Reply #8 on: April 19, 2012, 08:51:43 am »
Quote from: Gizmo
To improve just slightly, I have the following:

Code: [Select]
ShortFileName := UTF8encode(WideCharLenToString(@MFTRecordToParse.OtherData[186],ShortFileNameLength))
The problem with this is that WideCharLenToString converts to an ANSI string, which results in data loss for characters outside the system encoding. UTF8encode won't bring them back. If the widechars are not zero terminated, which is possible on Windows when a length is provided, and you want to support all UCS-2 characters, you had better do:
Code: [Select]
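var ws:widestring;
...
// Cast to pwidechar so the compiler copies the widechar data into ws
ws:= pwidechar(@MFTRecordToParse.OtherData[186]);
// Cut off anything beyond the stored filename length
setlength(ws,ShortFileNameLength);
ShortFileName := UTF8encode(ws);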
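
A variant that never depends on a zero terminator is to copy exactly the known number of widechars. The sketch below assumes the SetString overload for WideString that, as far as I know, the FPC RTL provides, and a little-endian host. Raw and NameLen are hypothetical stand-ins for OtherData[186] onwards and ShortFileNameLength.

```pascal
program SetStringDemo;
{$mode objfpc}{$H+}
const
  // "Text.txt" as UTF-16LE, followed by leftover slack and no zero terminator
  // (hypothetical sample data)
  Raw: array[0..19] of Byte = (
    $54, $00, $65, $00, $78, $00, $74, $00,
    $2E, $00, $74, $00, $78, $00, $74, $00,
    $58, $00, $59, $00);
  NameLen = 8;   // filename length in 16-bit units
var
  ws: WideString;
begin
  // Copy exactly NameLen widechars; the trailing slack is never read.
  SetString(ws, PWideChar(@Raw[0]), NameLen);
  // No ANSI conversion takes place, so no characters are lost.
  WriteLn(UTF8Encode(ws));
end.
```

If that overload turns out not to be available in a given FPC version, SetLength followed by Move of NameLen * SizeOf(WideChar) bytes should achieve the same result.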


 
