Author Topic: [Windows] UTF8 encoding with ReadLn  (Read 1488 times)

bstewart

  • New Member
  • *
  • Posts: 14
Re: [Windows] UTF8 encoding with ReadLn
« Reply #15 on: January 20, 2026, 08:44:15 pm »
This works without a manifest:

Code: Pascal
    program test;

    {$CODEPAGE UTF8}
    {$MODE OBJFPC}
    {$MODESWITCH UNICODESTRINGS}  // String = UnicodeString in this unit

    uses
      windows;

    var
      S: string;

    begin
      SetMultiByteConversionCodePage(CP_UTF8);
      Write('Enter δείγμα: ');
      ReadLn(S);
      WriteLn('S := ' + S);
    end.

This means, though, that there's an implicit conversion happening. The goal is to read and write UTF8 strings without any conversions.
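
To make that implicit conversion concrete, here is a minimal sketch (not code from this thread). With String = UnicodeString the text is held in memory as UTF-16, so it has to be transcoded whenever UTF-8 bytes go in or out. The program below performs the same kind of transcoding explicitly with the RTL's UTF8Encode and UTF8Decode; the program and variable names are illustrative only.

```pascal
program convdemo;

{$MODE OBJFPC}{$H+}

var
  U: UnicodeString;   // UTF-16 in memory
  B: UTF8String;      // UTF-8 bytes in memory

begin
  // "δείγμα" spelled as code points, so the source file encoding does not matter
  U := #$03B4#$03B5#$03AF#$03B3#$03BC#$03B1;

  // Explicit UTF-16 -> UTF-8 transcoding (the kind of step done implicitly on output)
  B := UTF8Encode(U);

  WriteLn('UTF-16 code units: ', Length(U));          // 6
  WriteLn('UTF-8 bytes:       ', Length(B));          // 12
  WriteLn('Code page of B:    ', StringCodePage(B));  // 65001

  // Explicit UTF-8 -> UTF-16 transcoding (the kind of step done implicitly on input)
  WriteLn('Round trip intact: ', UTF8Decode(B) = U);  // TRUE
end.
```

Avoiding conversions altogether would mean keeping the data in a UTF8String from the console read to the console write, which is what the rest of the thread is about.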

LV

  • Sr. Member
  • ****
  • Posts: 412
Re: [Windows] UTF8 encoding with ReadLn
« Reply #16 on: January 20, 2026, 09:05:18 pm »
@LV - run your sample without calling the extra Windows APIs and without a resource (*.res) file.

Running this code

├─ utf8_2
   ├─ test.lpr

Code: Pascal
    program test;

    {$MODE OBJFPC}{$H+}
    {$codepage utf8}

    uses
      SysUtils, Windows;

    var
      S: String;

    begin
      // Switch the console itself to UTF-8
      SetConsoleCP(CP_UTF8);
      SetConsoleOutputCP(CP_UTF8);

      // Tell the RTL that strings and the standard text files are UTF-8
      SetMultiByteConversionCodePage(CP_UTF8);
      SetTextCodePage(Output, CP_UTF8);
      SetTextCodePage(Input, CP_UTF8);

      Write('Enter 🤔 δείγμα 😊: ');
      ReadLn(S);
      WriteLn('S := ', S);

      WriteLn('Windows 11; FPC 3.2.2');
      ReadLn;
    end.

from the cmd

Code: Text
    @echo off
    set FPC="C:\lazarus\fpc\3.2.2\bin\x86_64-win64\fpc.exe"

    %FPC% -Twin64 -Px86_64 -otest64.exe S:\utf8_2\test.lpr
    %FPC% -Twin32 -Pi386    -otest32.exe S:\utf8_2\test.lpr

    echo Done
    pause

🫡

bstewart

  • New Member
  • *
  • Posts: 14
Re: [Windows] UTF8 encoding with ReadLn
« Reply #17 on: January 20, 2026, 09:08:55 pm »
Correct; as I acknowledged, it works when you call SetMultiByteConversionCodePage.

However, as I'll say again, this works because there's an implicit conversion happening.

As I wrote previously: The goal is for it to work without implicit conversions.

andersonscinfo

  • Full Member
  • ***
  • Posts: 156
Re: [Windows] UTF8 encoding with ReadLn
« Reply #18 on: January 20, 2026, 11:38:48 pm »
Quote
Quote from: bstewart on Today at 07:46:59 pm (https://forum.lazarus.freepascal.org/index.php/topic,73290.msg574592.html#msg574592)
How do we fix ReadLn without a manifest so it runs on older platforms? Or is it not possible?

Unfortunately, **it's not possible to reliably fix `ReadLn` for UTF-8 input on older Windows versions without a manifest**. Here’s why:

The issue stems from how the Windows console (cmd.exe) handles Unicode input/output before Windows 10 version 1903. Prior to this, the console was fundamentally designed around ANSI codepages, and even when you set `chcp 65001`, the underlying I/O functions used by the RTL often encounter bugs or limitations with multi-byte sequences.

### Why the Manifest Works

The manifest enables the "ANSI codepage is UTF-8" feature introduced in Windows 10 1903. This tells the OS to treat all ANSI APIs (like those used by FPC’s `ReadLn`) as UTF-8, effectively making them work correctly with Unicode strings. Without this, the console’s internal handling of UTF-8 sequences breaks, leading to garbled output like `de??µa`.
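
One way to check at run time whether the manifest setting took effect is to ask Windows for the process's active ANSI code page. The following is a minimal sketch (not from the thread) that uses the Windows unit's GetACP binding; the program name is made up for illustration.

```pascal
program acpcheck;

{$MODE OBJFPC}{$H+}

uses
  Windows;

begin
  // GetACP returns the active ANSI code page of the current process.
  // When the UTF-8 manifest setting is honored it reports 65001 (CP_UTF8).
  if GetACP = CP_UTF8 then
    WriteLn('ANSI code page is UTF-8 (65001)')
  else
    WriteLn('ANSI code page is ', GetACP, '; ANSI APIs are not using UTF-8');
end.
```

On Windows versions that predate the feature the manifest entry is ignored, so the second branch would be expected there.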

### Alternatives for Older Platforms

If you need compatibility with older Windows versions (7/8/early 10), here are some workarounds:

1. **Use Windows API (`ReadConsoleW`)**
Standard `ReadLn` relies on ANSI system calls. To properly read Unicode on older Windows, you must bypass the standard input mechanism and use `ReadConsoleW`, which reads UTF-16 (WideString) directly from the console buffer. You can then convert the result to UTF-8.
```pascal
program test;
{$MODE OBJFPC}{$H+}
{$CODEPAGE UTF8} // the source file (and its string literals) is UTF-8
uses Windows, SysUtils;

var
  hStdIn: THandle;
  dwRead: DWORD;
  Buffer: array[0..1023] of WideChar;
  WStr: WideString;
  S: UTF8String; // keeps the UTF-8 bytes; a plain String could be converted back to ANSI
begin
  // Set the console to UTF-8 and tell the RTL that Output carries UTF-8,
  // so WriteLn does not convert to the code page detected at startup
  SetConsoleOutputCP(CP_UTF8);
  SetTextCodePage(Output, CP_UTF8);
  Write('Enter δείγμα: ');

  hStdIn := GetStdHandle(STD_INPUT_HANDLE);
  dwRead := 0;

  // Read WideChars directly to avoid ANSI conversion issues
  if ReadConsoleW(hStdIn, @Buffer[0], Length(Buffer), dwRead, nil) then
  begin
    SetString(WStr, PWideChar(@Buffer[0]), dwRead);

    // Trim CR/LF from the end manually
    while (Length(WStr) > 0) and (WStr[Length(WStr)] < #32) do
      Delete(WStr, Length(WStr), 1);

    S := UTF8Encode(WStr); // Convert UTF-16 to UTF-8
  end;

  // Pass S as a separate argument: concatenating it with an ANSI literal
  // could convert it to the default system code page first
  WriteLn('S := ', S);
end.
```


2. **Use a Different Console Emulator**
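Console emulators such as ConEmu provide better Unicode support than the default console window, even on older Windows versions. Windows Terminal does too, but it is only available on recent Windows 10 builds and later, so it does not help on Windows 7/8. Either way, this shifts the burden to the user rather than solving the problem programmatically.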
3. **Switch to GUI Applications**
If feasible, consider building a GUI application using Lazarus or another framework. GUI controls handle Unicode input much more reliably across different Windows versions.

### Conclusion

While there are workarounds, none are as seamless as using the manifest approach. For legacy support on older Windows, using `ReadConsoleW` is the most robust programmatic solution, as the standard `ReadKey` or `ReadLn` functions are prone to encoding errors with multi-byte characters in those environments.

Hope this helps! 😊

bstewart

  • New Member
  • *
  • Posts: 14
Re: [Windows] UTF8 encoding with ReadLn
« Reply #19 on: January 21, 2026, 12:08:42 am »
Quote
Standard ReadLn relies on ANSI system calls.

...and thus this issue. In my thinking, {$MODESWITCH UNICODESTRINGS} should use the W-suffix API calls on the Windows side. Perhaps this is coming at some point.

Thaddy

  • Hero Member
  • *****
  • Posts: 18729
  • To Europe: simply sell USA bonds: dollar collapses
Re: [Windows] UTF8 encoding with ReadLn
« Reply #20 on: January 21, 2026, 10:42:15 am »
No, because different Windows versions use different code pages. You either have to explicitly set the terminal's code page to UTF-8 (65001) in your Windows version, or use LV's suggestion, which is the only correct one:
Code: Pascal
    SetConsoleCP(CP_UTF8);
    SetConsoleOutputCP(CP_UTF8);
    SetMultiByteConversionCodePage(CP_UTF8);
    SetTextCodePage(Output, CP_UTF8);
    SetTextCodePage(Input, CP_UTF8);

For a good example of how this works, see the console code by forum member Warfley:
https://github.com/Warfley/LazTermUtils
That code is well worth studying for the matter at hand.

It also shows how to keep your code compatible across multiple platforms.
(And yes, he also uses LV's suggestion above, because that is how it should be done.)

The point is that SetMultiByteConversionCodePage alone is not enough to cover full console input/output (ReadLn, WriteLn and related routines) on Windows.

(Just to be sure: if you use that code from a GUI app, make sure there is a console in the first place, either through the Windows API or by compiling with {$apptype console}; the latter automatically creates a console window in GUI apps.)

LV did it a bit better, though, by also calling SetTextCodePage(Input, CP_UTF8), which is what your ReadLn needs.
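
As a rough sketch of how those five calls could be bundled while keeping the program compilable on other platforms (not code from this thread, and not taken from LazTermUtils): the helper name EnableUtf8Console is hypothetical, and the use of cwstring on Unix is an assumption about what is needed there for string conversions.

```pascal
program utf8console;

{$MODE OBJFPC}{$H+}
{$CODEPAGE UTF8}

uses
  {$IFDEF UNIX}cwstring,{$ENDIF}     // widestring manager for conversions on Unix
  {$IFDEF WINDOWS}Windows,{$ENDIF}   // SetConsoleCP / SetConsoleOutputCP
  SysUtils;

// Hypothetical helper: bundles the five calls quoted above.
procedure EnableUtf8Console;
begin
  {$IFDEF WINDOWS}
  SetConsoleCP(CP_UTF8);         // console input code page
  SetConsoleOutputCP(CP_UTF8);   // console output code page
  {$ENDIF}
  SetMultiByteConversionCodePage(CP_UTF8);  // RTL default for AnsiString conversions
  SetTextCodePage(Output, CP_UTF8);         // WriteLn emits UTF-8
  SetTextCodePage(Input, CP_UTF8);          // ReadLn treats incoming bytes as UTF-8
end;

var
  S: String;

begin
  EnableUtf8Console;
  Write('Enter δείγμα: ');
  ReadLn(S);
  WriteLn('S := ', S);
end.
```

The two console calls are Windows API functions, so they sit behind the WINDOWS define; the remaining three are System unit routines and compile everywhere.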
« Last Edit: January 21, 2026, 10:55:49 am by Thaddy »

PascalDragon

  • Hero Member
  • *****
  • Posts: 6321
  • Compiler Developer
Re: [Windows] UTF8 encoding with ReadLn
« Reply #21 on: January 22, 2026, 09:51:05 pm »
Quote
Standard ReadLn relies on ANSI system calls.

...and thus this issue. In my thinking, {$MODESWITCH UNICODESTRINGS} should use the W-suffix API calls on the Windows side. Perhaps this is coming at some point.

No, because {$ModeSwitch UnicodeStrings} is unit-local and therefore does not influence functions in other units (most importantly the Windows unit).
Also, it's best simply not to use that switch. In the development version the correct approach is to compile the RTL for Unicode, which makes it use the W-suffix functions and String = UnicodeString throughout.

bstewart

  • New Member
  • *
  • Posts: 14
Re: [Windows] UTF8 encoding with ReadLn
« Reply #22 on: January 26, 2026, 05:56:56 pm »
Quote
No, because {$ModeSwitch UnicodeStrings} is only unit local, thus does not influence functions in other units (most importantly the Windows unit).

Of course; what I was trying to say (rather poorly) is that the RTL and Windows unit should be compiled to use Unicode strings (i.e., the 'W'-suffix API calls).

Quote
Also it's best to simply not use that switch.

If the entire RTL can be compiled on Windows using the 'W'-suffix APIs, then I would agree.

Just to state it explicitly, one of the primary purposes of using the 'W' APIs [Unicode] is to avoid these kinds of issues.

Quote
In the development version the correct approach is to compile the RTL for Unicode...

Is there a how-to guide somewhere on how to accomplish this?
« Last Edit: January 27, 2026, 03:25:50 pm by bstewart »

PascalDragon

  • Hero Member
  • *****
  • Posts: 6321
  • Compiler Developer
Re: [Windows] UTF8 encoding with ReadLn
« Reply #23 on: January 27, 2026, 08:58:11 pm »
Quote
Also it's best to simply not use that switch.

If the entire RTL can be compiled on Windows using the 'W'-suffix APIs, then I would agree.

Just to state it explicitly, one of the primary purposes of using the 'W' APIs [Unicode] is to avoid these kinds of issues.

You currently get more confusion and more conversions by using that modeswitch, because the rest of the RTL still uses String = AnsiString. The unit where you enable the switch therefore has to convert (e.g. when you use a TStringList), whereas with the switch disabled the strings are mostly passed through unchanged.
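
A minimal sketch of that effect (not from this thread), assuming a stock FPC 3.2.2 RTL in which Classes is compiled with String = AnsiString; the program name is illustrative:

```pascal
program modeswitchdemo;

{$MODE OBJFPC}
{$MODESWITCH UNICODESTRINGS}  // in this unit only: String = UnicodeString

uses
  Classes;

var
  L: TStringList;
  S: String;   // UnicodeString here because of the modeswitch

begin
  L := TStringList.Create;
  try
    S := 'abc';
    // TStringList.Add takes the RTL's String (AnsiString in a non-Unicode RTL),
    // so the compiler inserts a UnicodeString -> AnsiString conversion here
    L.Add(S);
    // ...and an AnsiString -> UnicodeString conversion on the way back
    S := L[0];
    WriteLn(S);
  finally
    L.Free;
  end;
end.
```

Without the modeswitch both sides are AnsiString and the two assignments involve no conversion.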

Quote
In the development version the correct approach is to compile the RTL for Unicode...

Is there a how-to guide somewhere on how to accomplish this?

Here.

 
