
Author Topic: using the CODEPAGE correctly?  (Read 29365 times)

Bart

  • Hero Member
  • *****
  • Posts: 5647
    • Bart en Mariska's Webstek
Re: using the CODEPAGE correctly?
« Reply #30 on: June 22, 2016, 11:24:20 pm »
Summa summarum:
 Do not define {$Codepage UTF8} and assign constants only to an AnsiString. Then everything works as magic.

IIRC, if your source code is saved in UTF8 encoding and your string constants have any non-ASCII character, then you should insert a codepage define, or save the source code as UTF8-with-BOM.
If you do not do so, 'CÇ' in your source code will just be #$43#$C3#$87, which on codepage 1252 is 'CÃ‡' (and 'CÃ╬' on my OEM-codepage which is used by the console), so something completely different than you want it to be.

See: http://wiki.lazarus.freepascal.org/FPC_Unicode_support#Source_file_codepage

And from http://wiki.lazarus.freepascal.org/FPC_Unicode_support#String_constants
Quote
From the above it follows that to ensure predictable interpretation of string constants in your source code, it is best to either include an explicit {$codepage xxx} directive (or use the equivalent -Fc command line option), or to save the source code in UTF-8 with a BOM.
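
For a plain FPC program (not the Lazarus UTF-8 setup discussed further down the thread), the directive route from the quote could look roughly like this. It is only a sketch: the program name and the variable are made up for illustration, and it assumes the file really is saved as UTF-8.

```pascal
program CodepageDirective;
{$mode objfpc}{$H+}
{$codepage utf8}   // assumption: this source file is saved in UTF-8

var
  U: UnicodeString;   // hypothetical variable, just for the demonstration
begin
  // With the directive the compiler reads the bytes #$C3#$87 as one character
  U := 'CÇ';
  WriteLn(Length(U));          // 2 UTF-16 code units
  WriteLn(Ord(U[2]) = $C7);    // TRUE, the second unit is U+00C7
end.
```

The same effect should be reachable with the -Fc command line option mentioned in the quote instead of the directive in the source.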

Bart

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4650
  • I like bugs.
Re: using the CODEPAGE correctly?
« Reply #31 on: June 23, 2016, 12:01:04 am »
Bart, I don't know why you deliberately confuse people now.
This discussion is about the default Unicode system in Lazarus, which from FPC's point of view is a hack, thus their documentation does not apply to this one detail.
You have tested the system and you know how it works.
« Last Edit: June 23, 2016, 12:19:14 am by JuhaManninen »
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

Bart

  • Hero Member
  • *****
  • Posts: 5647
    • Bart en Mariska's Webstek
Re: using the CODEPAGE correctly?
« Reply #32 on: June 23, 2016, 11:24:31 am »
Quote
This discussion is about the default Unicode system in Lazarus, which from FPC's point of view is a hack, thus their documentation does not apply to this one detail.

OK, I forgot this is the Lazarus-specific section of the forum.
However, inserting {$codepage UTF8} AFAIK does not do any harm, as long as the source code is actually saved in UTF8 encoding.
It may produce unnecessary conversions though (which will be lossless).

I build all my projects with -FcUTF8 and our "Utf8InRTL hack" and use diacritics all over the place without any problem.
And yes, I only use "String", never UTF8String.
Then again I do not only write GUI applications, and there it matters.
All my comments above were based on pure fpc programs.

From your posts I get that you do not use the codepage identifier at all, and you have no problem with unicode constants in Lazarus at all. This is even better.

Bart

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4650
  • I like bugs.
Re: using the CODEPAGE correctly?
« Reply #33 on: June 23, 2016, 01:18:46 pm »
Quote
However, inserting {$codepage UTF8} AFAIK does not do any harm, as long as the source code is actually saved in UTF8 encoding.
It may produce unnecessary conversions though (which will be lossless).

Yes, it does harm sometimes. This was discussed just 3 months ago on the mailing list:
  http://free-pascal-lazarus.989080.n3.nabble.com/Lazarus-Feature-Request-Insert-codepage-UTF8-per-default-td4047868.html

There I gave a link to the forum thread that made us understand {$codepage UTF8} should not be used by default:
  http://forum.lazarus.freepascal.org/index.php?topic=30022

You participated in the discussion. I thought you had understood the issue by now. Apparently not.
It would be unfortunate to start this same discussion again in a thread where new users are confused about complex Unicode and just need clear instructions.
So please start a new thread in the forum or on the mailing list if you want to discuss the details more.

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4650
  • I like bugs.
Re: using the CODEPAGE correctly?
« Reply #34 on: June 23, 2016, 01:53:53 pm »
Quote
One wish - maybe this is also a proposal - is that the core developers, who are the most familiar with this subject concerning FPC & Lazarus, would write up a short rule-of-thumb FAQ on what this transition to the new system of UTF-8/UTF-16 and AnsiString is and what it is not. In layman's terms, so that us mere mortals could maybe understand its effects a bit better.

This wiki page tries to do it:
  http://wiki.freepascal.org/Better_Unicode_Support_in_Lazarus
It has sections for "Compatibility with Unicode Delphi" and "Compatibility with LCL in Lazarus 1.x".

The good news is that the current system is more Delphi compatible than the old system was.
More good news is that the same old Length(), Pos() etc. can often be used with Unicode, too, as demonstrated here:
  http://wiki.freepascal.org/UTF8_strings_and_characters
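
A minimal sketch of why the byte-oriented functions keep working on UTF-8 text, as long as you search for whole substrings and treat the results as byte positions. The sample text and all names are made up for illustration; the ñ is written as its two UTF-8 bytes so the result does not depend on how the source file is saved.

```pascal
program PosOnUtf8;
{$mode objfpc}{$H+}

const
  // 'año nuevo' in UTF-8; the ñ is spelled out as its two bytes #$C3#$B1
  Sample = 'a'#$C3#$B1'o nuevo';

var
  S, Needle: string;
  P: SizeInt;
begin
  S := Sample;
  Needle := 'nuevo';
  P := Pos(Needle, S);                  // byte index, not character index
  WriteLn(P);                           // 6, because ñ occupies bytes 2 and 3
  WriteLn(Copy(S, P, Length(Needle)));  // nuevo
  WriteLn(Length(S));                   // 10 bytes for 9 characters
end.
```

The index returned by Pos() can be fed straight back into Copy(), because both count bytes. Only when you need a position in visible characters do the UTF8Length()/UTF8Copy() style functions from LazUTF8 come into play.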

The bad news is that Unicode is complex. Once you must implement a Unicode-aware text editor or document viewer etc., you must learn the details. There is plenty of information available on the net; the FPC/Lazarus wiki is not the right place for it.

One more piece of good news: I plan to port my encoding-agnostic string functions and iterators to Delphi.
It will then be possible to maintain advanced Unicode code for both Delphi and FPC/Lazarus, too. As an extra benefit the UTF-16 code will be as robust and correct as its UTF-8 counterpart.

loopbreaker

  • New Member
  • *
  • Posts: 32
Re: using the CODEPAGE correctly?
« Reply #35 on: June 23, 2016, 07:28:31 pm »
Quote
FPC and Lazarus are made by volunteer people who don't force anybody to use their products.

I volunteered to find the solution to the current Lazarus problems.
All Lazarus users would profit and this case would be closed finally.
Utf8 supporters would have a safe and simple solution:
String can be alias of utf8String in unit scope. (requires FPC support)

Your utf8 alternative (forever?) is unsafe and needs a wiki page with workarounds if possible (if source is not closed).

Bart

  • Hero Member
  • *****
  • Posts: 5647
    • Bart en Mariska's Webstek
Re: using the CODEPAGE correctly?
« Reply #36 on: June 23, 2016, 11:12:04 pm »
Quote
You participated in the discussion. I thought you had understood the issue by now. Apparently not.

I won't contribute to this thread anymore.

Bart

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4650
  • I like bugs.
Re: using the CODEPAGE correctly?
« Reply #37 on: June 24, 2016, 12:57:58 am »
Quote
I volunteered to find the solution to the current Lazarus problems.
All Lazarus users would profit and this case would be closed finally.
Utf8 supporters would have a safe and simple solution:
String can be alias of utf8String in unit scope. (requires FPC support)

Exactly, it is an FPC issue. You are not the first person with that idea. I also mentioned in one thread here that it would be the cleanest solution in the long run.
Right now it is not realistic. The FPC team is building the UTF-16 RTL and other missing parts. Maybe you too will be happy with the fully Delphi-compatible system when it comes out.

A new compiler mode with String = UTF8String can happen after the UTF-16 solution is ready and Unicode is replacing the old system codepages everywhere. Even then it needs somebody to do it. Nothing happens without people who actually implement things, you know.
The Lazarus UTF-8 hack can be seen as a proof of concept helping to design the new mode. Now the old codepages are assumed in many places, all code must be studied. It is not trivial.
If you really want to volunteer to improve things, please study the code in FPC and/or Lazarus and provide patches.
I don't think anybody will be against a new compiler string UTF-8 mode if you decide to implement it.

Quote
Your utf8 alternative (forever?) is unsafe and needs a wiki page with workarounds if possible (if source is not closed).

It is not unsafe! Yes, it needs some workarounds and it is not 100% Delphi compatible, but it is as safe as other Unicode solutions.
From your earlier message:
Quote
It is unsafe because many units (also closed source, third parties) can change the global codepage (the heart of your string) any time.
That is nonsense. How can a unit in your project do such thing without you knowing? Even if you use third party libs without source, you must trust them. If a unit really wants to be malicious, it can do much worse things than change a codepage. Just use your imagination ...
Besides, how is it the fault of Lazarus if you have malicious third party libs?
Quote
And it is also unsafe at compiletime as you can see with "Don't use {$Codepage UTF8}", and I could continue with parameter passing
between libraries and other things.
"Don't use {$Codepage UTF8}" does not make it any less safe. Why would it? You are quite desperate to find something to complain.
Parameter passing goes fine when you use a proper string type (e.g. UnicodeString) or convert a system codepage string with a SetCodePage() call. Why do I need to repeat this sentence so many times?
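
One way to read the SetCodePage() remark, as a rough sketch only: the case where bytes arrive without a proper encoding tag and you know they are UTF-8. The program name and the variable are hypothetical; SetCodePage(), StringCodePage() and CP_UTF8 come from the FPC 3.x system unit.

```pascal
program TagAsUtf8;
{$mode objfpc}{$H+}

var
  Raw: RawByteString;   // hypothetical buffer standing in for library output
begin
  // Pretend these two bytes (UTF-8 for 'ñ') came from a library that
  // hands over UTF-8 data without tagging it as such.
  Raw := #$C3#$B1;
  // Convert = False: only the code page tag changes, the bytes stay as they are.
  SetCodePage(Raw, CP_UTF8, False);
  WriteLn(StringCodePage(Raw) = CP_UTF8);  // TRUE
  WriteLn(Length(Raw));                    // 2
end.
```

Passing True as the last argument would instead convert the bytes from their current code page to the requested one, which on Unix-like systems needs a widestring manager such as the cwstring unit.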

The current Unicode situation is this:

Many years were wasted discussing and arguing about FPC's Unicode solution. Because RTL and other libs do not support UTF-16 yet, a clever solution was created so that Lazarus can change the String default encoding and continue to use UTF-8 as always but in a more clever way. And frankly, it works better than anybody could expect.
Anyway, that is the solution we will have in near future. Just bashing it does not help anybody. It only wastes time and energy from many people, including both you and me.

If you decide to still use Lazarus in the future, please attach your problematic code and an explanation of the problem so it can be solved.
Using Delphi is also a valid option if Lazarus does not suit your needs, as you mentioned yourself.

lainz

  • Hero Member
  • *****
  • Posts: 4738
  • Web, Desktop & Android developer
    • https://lainz.github.io/
Re: using the CODEPAGE correctly?
« Reply #38 on: June 24, 2016, 01:18:32 am »
Hi JuhaManninen, I'm posting this here because it is related. As you say, it is better not to use the UTF-8 codepage directive; I get it.

I have a real-world example of what I was doing in older Lazarus. This is the code:

Code: Pascal
// needs: uses Classes, SysUtils, StrUtils;

{ CASE 1 }
function UTF8UpperFirst(Value: UTF8String): UTF8String;
var
  temp: WideString;
begin
  temp := UTF8Decode(Value);
  if Length(temp) > 0 then
    temp[1] := WideUpperCase(temp[1])[1];
  Result := UTF8Encode(temp);
end;

{ CASE 2 }
function FormatoString(Valor: UTF8String): UTF8String;
begin
  { Removes duplicate spaces,
    removes leading and trailing spaces,
    capitalizes the first letter }
  Result := UTF8UpperFirst(Trim(DelSpace1(Valor)));
end;

{ CASE 3: somewhere in my code... }
var
  s: TStringList;
  html: string;

s.Text := UTF8Decode(html);
s.SaveToFile(dialogoGuardar.FileName);

I'm using LCL TStringGrid / TEdit to read text with Spanish characters like ñ or á é í ó ú ü, nothing more than that.

How can I convert (if needed) each case to the newest Lazarus without using the codepage directive? Thanks.

wp

  • Hero Member
  • *****
  • Posts: 13265
Re: using the CODEPAGE correctly?
« Reply #39 on: June 24, 2016, 01:27:45 am »
This has been working "all" the time (tested with Laz 1.4.4/fpc 2.6.4 and Laz trunk/fpc 3.0):
Code: Pascal
uses
  LazUtf8;

function UTF8UpperFirst(Value: string): string;
begin
  Result := UTF8UpperString(UTF8Copy(Value, 1, 1)) + UTF8Copy(Value, 2, UTF8Length(Value));
end;

lainz

  • Hero Member
  • *****
  • Posts: 4738
  • Web, Desktop & Android developer
    • https://lainz.github.io/
Re: using the CODEPAGE correctly?
« Reply #40 on: June 24, 2016, 03:34:43 am »
Thanks wp.

BTW, I tested that old code and it seems to work the same without any modifications. Of course, with your code the conversions are not needed, so I'm switching to your function.

I just removed:
Code: Pascal
s.Text := UTF8Decode(html);

Because conversion is not needed, no more! :)

And added this new first line, because it's saved in a format for an editor that needs the BOM:
Code: Pascal
uses LConvEncoding;

s.Add(UTF8BOM);

And that's all, everything is compatible. Thanks lazarus + fpc devs.  ::)

loopbreaker

  • New Member
  • *
  • Posts: 32
Re: using the CODEPAGE correctly?
« Reply #41 on: June 24, 2016, 08:20:49 am »
Quote
Exactly, it is an FPC issue. You are not the first person with that idea. I also mentioned in one thread here that it would be the cleanest solution in the long run.

Ok, I did not know this, it always seemed you were an opponent.
To sum up: we already have the utf16 option (switch) per unit scope (impossible in Delphi),
now utf8 would be another one. String is just an optional alias at unit level, nothing more.

But one also needs to know when to grab a chance. Years have passed,
people disappeared, even my time to visit these groups is very limited. I'm a Delphi developer
and I don't need or use Lazarus now. The change in FPC would be very small, it's just an alias thing.
And the current situation "can" drive away potential users. "In the long run" is not enough.

About the current solution, I've said all and it was not restricted to my personal needs, but for all.

malcome

  • Jr. Member
  • **
  • Posts: 81
Re: using the CODEPAGE correctly?
« Reply #42 on: June 24, 2016, 09:55:35 am »
If you want to feel happy,
  • Never use the UTF8String type. Use the plain string type.
  • You have to use UTF8Decode(), UTF8Encode(), etc. as you have so far. Don't trust the automatic string codepage conversion.


Do not ask me why. It is magic.
« Last Edit: June 24, 2016, 10:02:36 am by malcome »

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4650
  • I like bugs.
Re: using the CODEPAGE correctly?
« Reply #43 on: June 24, 2016, 11:47:20 am »
Quote
You have to use UTF8Decode(), UTF8Encode(), etc. as you have so far. Don't trust the automatic string codepage conversion.

That is not true. I remember this was discussed with you already.
You don't need UTF8Decode() or UTF8Encode() any more. Assignment between string variables always goes right thanks to the dynamic encoding info.
Assigning constants is trickier but can be solved easily, too.

Looks like you still have not understood this system. How come? Please open a new thread and attach your code there.

BeniBela

  • Hero Member
  • *****
  • Posts: 948
    • homepage
Re: using the CODEPAGE correctly?
« Reply #44 on: June 24, 2016, 11:54:04 am »
Quote
Exactly, it is an FPC issue. You are not the first person with that idea. I also mentioned in one thread here that it would be the cleanest solution in the long run.

It is the obvious idea.

Just like shortstring is handled.

I never imagined someone could be so stupid as to implement it otherwise


Quote
Even if you use third party libs without source, you must trust them. If a unit really wants to be malicious, it can do much worse things than change a codepage. Just use your imagination ...

Especially for Unicode there are many ways

E.g this one is evil:

Code:
var
  oldA2U: procedure(source: pchar; cp: TSystemCodePage; var dest: unicodestring; len: SizeInt);

// Replacement hook: runs the original conversion, then silently turns the first '_' into '-'
procedure Ansi2UnicodeMoveProc(source: pchar; cp: TSystemCodePage; var dest: unicodestring; len: SizeInt);
var
  i: Integer;
begin
  oldA2U(source, cp, dest, len);
  for i := 1 to length(dest) do
    if dest[i] = '_' then begin
      dest[i] := '-';
      break;
    end;
end;


  // installed somewhere at startup, e.g. in a unit's initialization section:
  oldA2U := widestringmanager.Ansi2UnicodeMoveProc;
  widestringmanager.Ansi2UnicodeMoveProc := @Ansi2UnicodeMoveProc;



Now if you write

Code:
writeln(utf8string('x_y_z'));
It prints

Code:
'x-y_z'
They will never notice it

Which begs the question: wtf is it using UTF-16 to print UTF-8 for?

This shows the string might be corrupted in any way during an assignment, not just the encoding
« Last Edit: June 24, 2016, 11:57:10 am by BeniBela »

 
