
Author Topic: [SOLVED] How to determine the character set returned by FindFirst / FindNext ?  (Read 539 times)

Hartmut

  • Sr. Member
  • ****
  • Posts: 280
In Germany we have 7 special characters, called "Umlaute". These are "Ä Ö Ü ä ö ü ß". They can occur in filenames.

If I write a simple console program (I use FPC 3.0.4 on Windows 7), then FindFirst / FindNext return file names in the ANSI charset:

Code: Pascal
unit unit1;

{$mode objfpc}{$H+}

interface

procedure showfiles(pattern: ansistring);

implementation

uses sysutils;

function hexString(s: ansistring): ansistring;
   {returns 's' as a hex-string}
   var z: ansistring;
       i: longint;
   begin
   z:=''; for i:=1 to length(s) do  z:=z + ' ' + hexStr(ord(s[i]),2);
   exit(z);
   end;

procedure showfiles(pattern: ansistring);
   {shows all files which match to 'pattern'}
   var SR: TSearchRec;
   begin
   if FindFirst(pattern,faAnyfile,SR) = 0 then
      repeat writeln(SR.Name, hexString(SR.Name));
      until  FindNext(SR) <> 0;
   FindClose(SR);
   end;

end.

Code: Pascal
program project1;

{$mode objfpc}{$H+}

uses unit1;

begin
showfiles('d:\tst\xx_*.*');
end.

The result for a file with special characters (file is attached) is ANSI:
Code:
>project1.exe
xx_äöüÄÖÜ.txt 78 78 5F E4 F6 FC C4 D6 DC 2E 74 78 74

But as soon as I write a (minimal) GUI application using the same "unit1", FindFirst / FindNext now return file names in UTF-8:

Code: Pascal
program project2;

{$mode objfpc}{$H+}
{$apptype console} {necessary for writeln}

uses
 Interfaces, // this includes the LCL widgetset
 unit1;

begin
showfiles('d:\tst\xx_*.*');
end.

The result for the same file with special characters is now UTF8:
Code:
>project2.exe
xx_äöüÄÖÜ.txt 78 78 5F C3 A4 C3 B6 C3 BC C3 84 C3 96 C3 9C 2E 74 78 74

So I have 2 questions:
 - what is the minimal Unit, which "switches" the charset returned by FindFirst / FindNext from ANSI to UTF8?
 - is there a way (e.g. a function or global variable or conditional) to determine in a common unit (like "unit1"), whether this "switching" unit is used somewhere in the whole program so that FindFirst / FindNext returns UTF8 instead of ANSI?

I'm a beginner with character sets and codepages. Thanks in advance. I attached my 2 small projects and a demo file with German special characters in the filename.
« Last Edit: October 19, 2019, 04:01:58 pm by Hartmut »

Thaddy

  • Hero Member
  • *****
  • Posts: 9142
Re: How to determine the character set returned by FindFirst / FindNext ?
« Reply #1 on: October 19, 2019, 02:08:54 pm »
FindFirst/Next are system calls, so they are in the system codepage, which you can query.

lucamar

  • Hero Member
  • *****
  • Posts: 2075
Re: How to determine the character set returned by FindFirst / FindNext ?
« Reply #2 on: October 19, 2019, 02:47:42 pm »
For more information about character set conversions see here: Unicode Support (reference for the System unit) and Unicode and codepage awareness (ref. for the SysUtils unit).

IIRC, the LCL makes some changes on startup so that all the internal processing is done in UTF-8, while the RTL (as used in console programs) remains mostly "agnostic" and uses RawByteStrings to prevent any modification of the values returned by the OS.

marcov

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 7493
Re: How to determine the character set returned by FindFirst / FindNext ?
« Reply #3 on: October 19, 2019, 02:49:54 pm »
Quote
So I have 2 questions:
 - what is the minimal Unit, which "switches" the charset returned by FindFirst / FindNext from ANSI to UTF8?

The LazUTF8 unit (components/lazutils/lazutf8), but every unit that depends on it will pull it in as well.

See https://wiki.freepascal.org/Lazarus_with_FPC3.0_without_UTF-8_mode for a workaround.

Quote
- is there a way (e.g. a function or global variable or conditional) to determine in a common unit (like "unit1"), whether this "switching" unit is used somewhere in the whole program so that FindFirst / FindNext returns UTF8 instead of ANSI?

Check DefaultFileSystemCodePage for the value 65001 (CP_UTF8); if it matches, the encoding is UTF-8.

Quote
I'm a beginner to character sets and codepages. Thanks in advance. I attached my 2 small projects and a demo file with german special characters in the filename.

One warning/tip: don't rely on how the output looks to judge whether something is OK. Seeing umlauts doesn't mean it is OK. It is only OK if it really IS the encoding you think it should be.

It is easy to get confused if you print some result (to console or file) and the wrong encoding then gets "corrected" by the console or by Notepad, which might assume a different encoding than your program. A simple workaround is to dump a string with an accent in the first few letters to a file, and look at it with a hex editor instead of Notepad.
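
A minimal sketch of that workaround, assuming FPC 3.x. The helper name DumpRaw and the output file dump.bin are hypothetical; the search pattern is the one from the first post. It writes the bytes of the first matching file name to a file unchanged, so a hex editor shows the encoding that FindFirst really returned:

```pascal
program dumpname;

{$mode objfpc}{$H+}

uses
  Classes, SysUtils;

{ Hypothetical helper (not from this thread): writes the bytes of 's'
  to 'filename' exactly as they are stored in memory. RawByteString is
  used so that passing the string does not trigger a codepage conversion. }
procedure DumpRaw(const filename: string; const s: RawByteString);
var
  fs: TFileStream;
begin
  fs := TFileStream.Create(filename, fmCreate);
  try
    if Length(s) > 0 then
      fs.WriteBuffer(s[1], Length(s));
  finally
    fs.Free;
  end;
end;

var
  SR: TSearchRec;
begin
  { Pattern taken from the first post; adjust it to your own test folder. }
  if FindFirst('d:\tst\xx_*.*', faAnyFile, SR) = 0 then
    DumpRaw('dump.bin', SR.Name);
  FindClose(SR);
end.
```

For the demo file, a dump starting with 78 78 5F E4 would be ANSI (codepage 1252), while 78 78 5F C3 A4 would be UTF-8, matching the two hex outputs in the first post.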
« Last Edit: October 19, 2019, 02:54:35 pm by marcov »

marcov

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 7493
Re: How to determine the character set returned by FindFirst / FindNext ?
« Reply #4 on: October 19, 2019, 02:57:15 pm »
Quote
FindFirst/Next are system calls, so they are in the system codepage, which you can query.

No, the system calls have "File" added to the name, e.g. FindFirstFileA.

FindFirst/FindNext/FindClose are a Delphi wrapper for those calls.

And in FPC's case they also convert the encoding since FPC 3.x. The original concept and code were written by Jonas (including the distinction between DefaultFileSystemCodePage and DefaultSystemCodePage), and then Michael and I implemented it for all targets.

Hartmut

  • Sr. Member
  • ****
  • Posts: 280
Re: How to determine the character set returned by FindFirst / FindNext ?
« Reply #5 on: October 19, 2019, 04:01:26 pm »
I got very good answers to my questions. To determine the character set returned by FindFirst / FindNext I can use system.DefaultSystemCodePage. On my system it is 1252 for ANSI or 65001 for UTF-8. system.DefaultFileSystemCodePage is 65001 in both cases, so it does not work for me.
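
A minimal sketch of that check, assuming FPC 3.x, where DefaultSystemCodePage, DefaultFileSystemCodePage and the constant CP_UTF8 (65001) come from the System unit. The function name NamesAreUtf8 is hypothetical, and the conclusion it draws is based only on the behaviour observed in this thread:

```pascal
program cpcheck;

{$mode objfpc}{$H+}

{ Hypothetical helper (not from this thread): reports whether the default
  ansistring codepage of the running program is UTF-8. In the tests above
  this was 65001 with the LCL linked in and 1252 without it. }
function NamesAreUtf8: boolean;
begin
  Result := DefaultSystemCodePage = CP_UTF8;
end;

begin
  writeln('DefaultSystemCodePage     = ', DefaultSystemCodePage);
  writeln('DefaultFileSystemCodePage = ', DefaultFileSystemCodePage);
  if NamesAreUtf8 then
    writeln('FindFirst / FindNext names are expected in UTF-8')
  else
    writeln('FindFirst / FindNext names are expected in the ANSI codepage');
end.
```

The same comparison could be placed in a common unit like "unit1", so the unit does not need to know whether the LCL is part of the program.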

I will read the recommended links to learn more.

Thanks a lot to all who helped me. This is a great forum.