
Author Topic: using the CODEPAGE correctly?  (Read 29033 times)

rvk

  • Hero Member
  • *****
  • Posts: 6886
Re: using the CODEPAGE correctly?
« Reply #15 on: June 21, 2016, 09:17:18 pm »
O, wow... and even I get confused.

I would have thought this would result in 2 but it results in 3 in Laz/FPC trunk (default settings):
Code: Pascal
var
  s1: WideString;
begin
  s1 := 'CÇ';
  showmessage(IntToStr(Length(s1)));
end;

And if I put {$Codepage UTF8} at the top it does result in 2.

It seems Length() takes the WideString parameter and converts it back to string in which case it is 3. And if you put codepage UTF8 it does not  %) %) I've noticed that excessive converting of strings internally before (and it can really slow things down).

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4631
  • I like bugs.
Re: using the CODEPAGE correctly?
« Reply #16 on: June 22, 2016, 12:14:07 am »
Quote
Unicodestring is usually defined as UTF16 or UCS2. Is there actually a language which uses UTF-8 for type Unicodestring?

Not with that type name for sure. Some languages have built-in types or libraries for UTF-8.
As I understand it, many C and other programs were ported from ASCII to UTF-8 while keeping the old byte-array types. That requires the fewest changes, and lots of old code continues to work.
UTF-8 specific functions must then be called when needed, much like Lazarus did until recently.
Now Lazarus + FPC3 is quite an advanced combination. It provides transparent UTF-8 support!

Quote
I would have thought this would result in 2 but it results in 3 in Laz/FPC trunk (default settings):
Code: Pascal
var
  s1: WideString;
begin
  s1 := 'CÇ';
  showmessage(IntToStr(Length(s1)));
end;
And if I put {$Codepage UTF8} at the top it does result in 2.

You should read the String_Literals wiki page I linked earlier. I will not write the same things here again.

Quote
It seems Length() takes the WideString parameter and converts it back to string in which case it is 3. And if you put codepage UTF8 it does not  %) %)

No, Length() returns the actual length of the WideString variable, but its data is screwed and corrupt.
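
One plausible way to see where the 3 comes from, as a minimal sketch: the UTF-8 encoding of 'CÇ' is three bytes, and if each byte is widened into its own UTF-16 code unit (a single-byte reading of the literal), the WideString really does hold three code units. The helper name WidenBytes is hypothetical and only illustrates that reading; it is not what the compiler literally calls. The bytes are written out explicitly so the result does not depend on the source file encoding.

```pascal
program WidenVsDecode;
{$mode objfpc}{$H+}

// Hypothetical helper: turns every byte into its own UTF-16 code unit,
// which is what a single-byte (Latin-1 style) reading of the data does.
function WidenBytes(const Raw: RawByteString): UnicodeString;
var
  i: SizeInt;
begin
  SetLength(Result, Length(Raw));
  for i := 1 to Length(Raw) do
    Result[i] := WideChar(Ord(Raw[i]));
end;

var
  utf8Bytes: RawByteString;
begin
  // The UTF-8 encoding of 'CÇ' is three bytes: $43, $C3, $87.
  SetLength(utf8Bytes, 3);
  utf8Bytes[1] := #$43;
  utf8Bytes[2] := #$C3;
  utf8Bytes[3] := #$87;

  WriteLn('widened byte by byte: ', Length(WidenBytes(utf8Bytes)));  // 3 code units
  WriteLn('decoded as UTF-8:     ', Length(UTF8Decode(utf8Bytes)));  // 2 code units
end.
```

In both cases Length() simply reports the number of code units stored; only the second string holds the intended text.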

Quote
I've noticed that excessive converting of strings internally before (and it can really slow things down).

There are no internal conversions when you use the same string type for variables and don't change their encodings explicitly.
There are no internal conversions when you assign a constant to a String variable using default Lazarus settings (without {$Codepage UTF8} ).
There is one conversion when you assign a constant to a WideString or UnicodeString variable with {$Codepage UTF8} defined, but it happens at compile time and does not really slow things down.

BTW, most often you should use UnicodeString instead of WideString, as noted by Marco in one thread. That includes most Windows API calls. This is off topic, however ...
« Last Edit: June 22, 2016, 12:21:36 am by JuhaManninen »
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

lainz

  • Hero Member
  • *****
  • Posts: 4738
  • Web, Desktop & Android developer
    • https://lainz.github.io/
Re: using the CODEPAGE correctly?
« Reply #17 on: June 22, 2016, 01:02:57 am »
Code: Pascal
var
  s, t: UnicodeString;
begin
  s := 'ñandú';
  t := s[1];
  ShowMessage(t); // shows ñ
  ShowMessage(IntToStr(Length(s))); // shows 5
end;

but I still need the {$Codepage UTF8}

Edit: The same result without using codepage and using normal string

Code: Pascal
uses
  LazUTF8; // provides UTF8Copy and UTF8Length

var
  s, t: string;
begin
  s := 'ñandú';
  t := UTF8Copy(s, 1, 1); // first codepoint, not first byte
  ShowMessage(t); // shows ñ
  ShowMessage(IntToStr(LazUTF8.UTF8Length(s))); // shows 5
end;
« Last Edit: June 22, 2016, 03:02:02 am by lainz »
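
For readers without LazUtils at hand, the difference between the byte length and the codepoint count of a UTF-8 string can be sketched in plain FPC. The helper Utf8CodepointCount below is a hypothetical illustration, not the LazUTF8 implementation, and it assumes the input is valid UTF-8. The text 'ñandú' is spelled as explicit bytes so the result does not depend on {$codepage} or on the encoding of the source file.

```pascal
program CountCodepoints;
{$mode objfpc}{$H+}

// Counts UTF-8 codepoints by skipping continuation bytes (10xxxxxx).
// Hypothetical helper for illustration; assumes S holds valid UTF-8.
function Utf8CodepointCount(const S: RawByteString): SizeInt;
var
  i: SizeInt;
begin
  Result := 0;
  for i := 1 to Length(S) do
    if (Ord(S[i]) and $C0) <> $80 then
      Inc(Result);
end;

var
  s: RawByteString;
begin
  // 'ñandú' as UTF-8 bytes: ñ = $C3 $B1, ú = $C3 $BA.
  s := #$C3#$B1'and'#$C3#$BA;
  WriteLn('bytes:      ', Length(s));             // 7
  WriteLn('codepoints: ', Utf8CodepointCount(s)); // 5
end.
```

Length() on a String counts bytes, which is why the UTF8* functions are needed whenever the number of codepoints matters.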

ArtLogi

  • Full Member
  • ***
  • Posts: 194
Re: using the CODEPAGE correctly?
« Reply #18 on: June 22, 2016, 01:20:50 am »
Owww. Sounds like a total *bleep* for backward compatibility and a nightmare for legacy systems maintenance. Geesh.  >:( I need to start archiving old VMs and versions, and maybe a laptop or two. lol.

This is not specific to Lazarus or FPC, but about the general transition from codepages to a UTF-xx system.
« Last Edit: June 22, 2016, 01:23:47 am by ArtLogi »
While Record is a drawer and method is a clerk, when both are combined to same space it forms an concept of office, which is alias for a great suffering.

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4631
  • I like bugs.
Re: using the CODEPAGE correctly?
« Reply #19 on: June 22, 2016, 08:06:15 am »
@lainz, these things were discussed a lot when the new Unicode system was under construction.
Just please read the wiki!

Again:
{$Codepage UTF8} is not needed if you always assign constants to String variables.
If you must assign a constant to a UnicodeString variable then you need it, yes.
Assignment between variables always works correctly.

@ArtLogi, data in the old codepages requires explicit conversion. That is the only thing that broke with the new Unicode system.
The conversion is easy with SetCodePage() function calls.
The ideal solution is to convert all data to Unicode, but that is not always possible, I understand.
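
A minimal sketch of such a conversion, assuming the legacy data is Windows-1252 and that a widestring manager is available (on Unix that means including cwstring, as in the wiki example). The first SetCodePage() call only labels the bytes, the second one actually converts them. If the conversion succeeds, the single byte $C4 ('Ä' in CP1252) should come out as the two UTF-8 bytes C3 84.

```pascal
program LegacyToUtf8;
{$mode objfpc}{$H+}
{$ifdef unix}
uses cwstring;   // codepage conversion on Unix needs a widestring manager
{$endif}

var
  s: RawByteString;
  i: Integer;
begin
  // One byte of legacy data: $C4 is 'Ä' in Windows-1252 (assumed encoding).
  SetLength(s, 1);
  s[1] := #$C4;

  // Step 1: only label the bytes as CP1252 (Convert = False leaves them untouched).
  SetCodePage(s, 1252, False);
  // Step 2: convert the payload to UTF-8 (Convert = True rewrites the bytes).
  SetCodePage(s, CP_UTF8, True);

  Write('bytes after conversion:');
  for i := 1 to Length(s) do
    Write(' ', HexStr(Ord(s[i]), 2));
  WriteLn;
end.
```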

Otherwise code maintenance, including code between Delphi and Lazarus, is easier now than ever. Our new Unicode system is almost Delphi compatible when used right.

Now even advanced portable Unicode is possible. See my thread about encoding agnostic functions.
« Last Edit: June 22, 2016, 10:24:01 am by JuhaManninen »

loopbreaker

  • New Member
  • *
  • Posts: 32
Re: using the CODEPAGE correctly?
« Reply #20 on: June 22, 2016, 09:19:44 am »
Quote
Could you give an example of use? I am very confused !!!

The reason is that lazarus devs here want utf8 via changing systemcodepage (runtime) instead of using the proper utf8string (compiletime). I've given up hope already.

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4631
  • I like bugs.
Re: using the CODEPAGE correctly?
« Reply #21 on: June 22, 2016, 10:22:59 am »
Quote
The reason is that lazarus devs here want utf8 via changing systemcodepage (runtime) instead of using the proper utf8string (compiletime). I've given up hope already.

What exactly makes you hopeless?
Do you honestly think that every programmer should change every "String" to a proper "UTF8String" in all their programs?

Some people claim that the UTF-8 system is unusable. However they have refused to give example code.
From now on, please include code samples when you have problems and we can find solutions.
The problems I have seen could be solved by using proper string types (like UnicodeString) for API calls and by converting old codepage data with  SetCodePage().

loopbreaker

  • New Member
  • *
  • Posts: 32
Re: using the CODEPAGE correctly?
« Reply #22 on: June 22, 2016, 10:59:32 am »
Quote
Do you honestly think that every programmer should change every "String" to a proper "UTF8String" in all their programs?

If people want utf8 via "String" (and not a different alias as I can see from your reply), then it is very obvious that they want the compiler option: type String = Utf8String; There is no other conclusion and it would be much safer than the current "solution".

BeniBela

  • Hero Member
  • *****
  • Posts: 947
    • homepage
Re: using the CODEPAGE correctly?
« Reply #23 on: June 22, 2016, 11:37:08 am »

Quote
Do you honestly think that every programmer should change every "String" to a proper "UTF8String" in all their programs?

UTF8String is older than FPC 3.0.

They should have made that change years ago.

Quote
If people want utf8 via "String" (and not a different alias as I can see from your reply), then it is very obvious that they want the compiler option: type String = Utf8String; There is no other conclusion and it would be much safer than the current "solution".

True.

Although I just noticed a lot of older libraries use ansistring everywhere instead of string. They would break.

ArtLogi

  • Full Member
  • ***
  • Posts: 194
Re: using the CODEPAGE correctly?
« Reply #24 on: June 22, 2016, 12:36:20 pm »
Edit: Let's put this line at the beginning as well, so read this post as if it were nonsense  :) :

Probably I'm just confused, mixing things together and ill-informed.


I mean, if the change also affects the RTL, and the whole system is based on this "a character is a string" idea, and the need arises to write a program that converts some random ASCII-based data (codepage known or guessed) from one form to another, then it can't be managed easily with RTL routines like wordcount() etc., since it would require ANSI/ASCII -> UTF8 -> ANSI/ASCII conversions in pretty much every function call? The same goes for terminal-style communication with legacy hardware, etc.

Or will there still be something like a legacy string format for such cases? But then there is still the RTL problem that characters are strings of "random" length, which makes it really questionable to work with legacy (ASCII) strings, since e.g. the thousand separator returns a string of Random(length(string))?

Probably I'm just confused, mixing things together and ill-informed.

PS. What I also don't understand is why this needs to be implemented the way it is, instead of just introducing a new basic type to the language: str16 and/or str8 working alongside the old string and ansistring, with the existing functions overloaded for the new type, as is done with integer and int64. The language itself stays in 8-bit English, right (Repeat..Until will not become кайталап чейин)??? Of course things need to change, because the world is changing around us, but doesn't this kind of change also cause a total breakdown of the whole infrastructure (code snippets, tutorials, manuals and libraries, and their own code snippets, tutorials and manuals), which would mean returning to the stone age as far as the supporting infrastructure of programming is concerned?

In my eyes this new system (based on what I think it is) breaks Pascal's core idea of a clear, easy-to-read, strongly logical structure, since it now introduces a seemingly randomly morphing string type? Based on this example:
Code:
program project1;
{$codepage utf8}
{$mode objfpc}{$H+}
{$ifdef unix}
uses cwstring;
{$endif}
var
  a,b,c: string;
begin
  a:='ä';
  b:='='#$C3#$A4; // #$C3#$A4 is UTF-8 for ä
  c:='ä='#$C3#$A4; // after non ascii 'ä' the compiler interprets #$C3 as widechar.
  writeln(a,b); // writes ä=ä
  writeln(c);   // writes ä=Ã¤
end.

http://wiki.freepascal.org/Better_Unicode_Support_in_Lazarus#String_Literals


(Yep, I haven't read any of the aforementioned lengthy historical conversations, since I'm a newcomer.)
« Last Edit: June 22, 2016, 04:25:55 pm by ArtLogi »

loopbreaker

  • New Member
  • *
  • Posts: 32
Re: using the CODEPAGE correctly?
« Reply #25 on: June 22, 2016, 01:12:23 pm »
Yes, UTF-16, UTF-8 and AnsiString should work side by side on all systems (Windows and Linux), without changing a global system codepage at runtime.
The question remains what to do with the alias "String".

What I had in mind is that app developers can define "String" as an alias with unit scope, for UTF-16 (this is already possible) or, alternatively, for UTF-8. Both options (UTF-16 or UTF-8) would be on par.

In general, developers of shared code should prefer encoding-specific type names (not "String"), which work without a special unit header. Example: inside the LCL (which is already UTF-8!), there should be a naming convention to use UTF8String, for the sake of readability and for copying code snippets between different environments.

Personally I never use "String", but this is a matter of preference.

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4631
  • I like bugs.
Re: using the CODEPAGE correctly?
« Reply #26 on: June 22, 2016, 03:57:28 pm »
Quote
Probably I'm just confused, mixing things together and ill-informed.

Yes, exactly! Your arguments were rather nonsense.

About the code snippets, tutorials, manuals and libraries you mentioned:
Delphi's makers have provided code snippets and tutorials that claim their UnicodeString is source compatible with AnsiString and should still be indexed as before with
Code: Pascal
Str[i]
They have fed the common misconception that UTF-16 is a fixed-width encoding. The result is even more broken UTF-16 code out there.
Our UTF-8 solution is more robust. No false promises are given. Code must always be written correctly because multi-byte codepoints are so common.
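
To make the variable width of UTF-16 concrete, here is a small sketch. EncodeUtf16 is a hypothetical helper written for this illustration, and it assumes the argument is a valid Unicode scalar value (not a surrogate, not above $10FFFF). Codepoints above $FFFF need two code units, so Str[i] can land on half a codepoint.

```pascal
program Utf16Width;
{$mode objfpc}{$H+}

// Hypothetical helper: encodes one Unicode codepoint as UTF-16.
// Assumes CodePoint is a valid scalar value (no range checks here).
function EncodeUtf16(CodePoint: Cardinal): UnicodeString;
begin
  if CodePoint < $10000 then
  begin
    // Basic Multilingual Plane: one code unit.
    SetLength(Result, 1);
    Result[1] := WideChar(Word(CodePoint));
  end
  else
  begin
    // Supplementary planes: a high and a low surrogate.
    Dec(CodePoint, $10000);
    SetLength(Result, 2);
    Result[1] := WideChar(Word($D800 + (CodePoint shr 10)));
    Result[2] := WideChar(Word($DC00 + (CodePoint and $3FF)));
  end;
end;

var
  s: UnicodeString;
begin
  WriteLn('U+00C7  code units: ', Length(EncodeUtf16($00C7)));   // 1
  s := EncodeUtf16($1F600);
  WriteLn('U+1F600 code units: ', Length(s));                    // 2
  WriteLn('s[1] = $', HexStr(Ord(s[1]), 4));                     // D83D, a lone surrogate
end.
```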

The sample code from wiki was added by Mattias to explain some corner cases. Maybe it should be deleted, it only confuses people.
The recommended way to assign a constant to a String variable is without {$codepage utf8} as was repeated many times.

Quote
(Yep, I haven't read any of the aforementioned lengthy historical conversations, since I'm a newcomer.)

Yes, that is a problem with many people.
The heated Unicode discussion continued in FPC lists for many many years. I believe FPC developers were exhausted and frustrated after that.
Now they have a plan for a Delphi-compatible UTF-16 solution, which could have been ready by now if so many people had not fought them during the process.

We have a working UTF-8 solution now but some people want to emphasize it is not usable. For some reason those people refuse to show any code to back their claims. All the actual problems I have seen could be solved easily.
Partly this is a psychological issue. Some people just want to oppose and fight others using whatever excuse. It reminds me of the "poor development process" of Lazarus which Git would solve. Well, Git is supported now, but still nobody has offered a forked Git repo during the 5-6 years...

So hey, let's stop the useless whining and do something productive.
From now on please attach example code when you have real problems with Unicode.

ArtLogi

  • Full Member
  • ***
  • Posts: 194
Re: using the CODEPAGE correctly?
« Reply #27 on: June 22, 2016, 05:30:58 pm »
Thank you for your response.


Quote
(Yep, I haven't read any of the aforementioned lengthy historical conversations, since I'm a newcomer.)

Yes, that is a problem with many people.
The heated Unicode discussion continued in FPC lists for many many years. I believe FPC developers were exhausted and frustrated after that.
.
That would be fully understandable and expected.

One wish (maybe this is also a proposal) is that the core developers, who are the most familiar with this subject as it concerns FPC and Lazarus, would write up a short rule-of-thumb FAQ on what this transition to the new system of UTF-8, UTF-16 and AnsiString is and what it is not, in layman's terms, so that us mere mortals could understand its effects a bit better.
« Last Edit: June 22, 2016, 05:35:49 pm by ArtLogi »

loopbreaker

  • New Member
  • *
  • Posts: 32
Re: using the CODEPAGE correctly?
« Reply #28 on: June 22, 2016, 05:37:20 pm »
Quote
We have a working UTF-8 solution now but some people want to emphasize it is not usable.

It's below the quality standards of companies.
It is unsafe because many units (including closed-source, third-party ones) can change the global codepage
(the heart of your string) at any time. And it is also unsafe at compiletime as you can
see with "Don't use {$Codepage UTF8}", and I could continue with parameter passing
between libraries and other things. I rather stay with Delphi than waste my time here...

Also in regards to your iterator: There is no!! point in writing an iterator beyond
codepoints. There are many reasons why multiple codepoints can combine,
it's not only diacritics (where one base character can have a group of them).
There are also different granularity needs (draw, caret, ...). Uniscribe
(and equivalents on other systems) provide advanced iterators,
no need to reinvent them and also rarely needed.
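
The simplest case of codepoints combining can be sketched as follows: "é" can be stored either as one precomposed codepoint or as a base letter plus a combining accent. The snippet only shows the storage side and makes no attempt at grapheme iteration, which is left to the platform text engines mentioned above.

```pascal
program CombiningMark;
{$mode objfpc}{$H+}

var
  precomposed, decomposed: UnicodeString;
begin
  // U+00E9: "e with acute" as a single codepoint.
  SetLength(precomposed, 1);
  precomposed[1] := WideChar($00E9);

  // U+0065 followed by U+0301 (combining acute accent): two codepoints
  // that a renderer normally shows as one character.
  SetLength(decomposed, 2);
  decomposed[1] := WideChar($0065);
  decomposed[2] := WideChar($0301);

  WriteLn('precomposed code units: ', Length(precomposed)); // 1
  WriteLn('decomposed code units:  ', Length(decomposed));  // 2
end.
```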

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4631
  • I like bugs.
Re: using the CODEPAGE correctly?
« Reply #29 on: June 22, 2016, 09:20:05 pm »
Quote
I rather stay with Delphi than waste my time here...

So why do you waste your time here then?
Using Delphi is a perfectly valid choice. FPC and Lazarus are made by volunteer people who don't force anybody to use their products.

 
