
Author Topic: Multilanguage UpCase and LowerCase  (Read 2824 times)

Thaddy

  • Hero Member
  • *****
  • Posts: 18729
  • To Europe: simply sell USA bonds: dollar collapses
Re: Multilanguage UpCase and LowerCase
« Reply #15 on: June 10, 2025, 07:11:33 am »
I am not really familiar with the UTF-8 functions that Lazarus provides, but I have a Unicode RTL installed, and there UpperCase and LowerCase seem to work correctly. That is UTF-16, though.
I do not have a Lazarus installation that uses the UTF-16 RTL, so your mileage may vary.
Also note that misprinted characters may be due to wrong terminal settings: the terminal/console must also support Unicode (UTF-8, UTF-16 or UTF-32), so you should always check that first (e.g. with chcp on Windows, or with locale on Linux).
« Last Edit: June 10, 2025, 07:29:23 am by Thaddy »
If Europe sells their USA bonds the USD will collapse. Europe can affort that given average state debts. The USA can't affort that. Just an advice...

gues1

  • Guest
Re: Multilanguage UpCase and LowerCase
« Reply #16 on: June 10, 2025, 09:24:41 am »
I'll start by saying that I know less than zero about Unicode in Lazarus and FPC. I use other tools that don't have these problems (which is why I wasn't up to date on the Unicode rules).
What I said and wrote concerns Windows with Lazarus 4.0 stable and FPC 3.2.2, where code pages and localization (except for some elements such as punctuation marks) are fortunately a distant memory and belong to history.
The functions used are the ones indicated in the posts, which I believe rely on FPC for their Unicode handling.
The malfunction, if you can call it that, perhaps comes down to a single simple update that is not present in the stable version of FPC.

However, those who want to write their own function can rely on the attached file (original link at the bottom), which matches the code points of all uppercase / lowercase characters.
Exceptions such as changes in string length are also listed (I had never encountered any).
A simple matching table would solve UPPERCASE and LOWERCASE.
A second file lists the special cases, for those who want to refine everything further.

https://www.unicode.org/Public/UCD/latest/ucd/CaseFolding.txt
https://www.unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt
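
A minimal Python sketch of the "simple matching table" idea, assuming the line layout used by CaseFolding.txt (code; status; mapping; # name). The sample lines are embedded so the script runs offline, and `parse_case_folding`, `fold` and `codepoints` are hypothetical helper names, not part of any library. Keep in mind that CaseFolding.txt describes case folding (for caseless comparison); the one-to-one upper/lower mappings are in UnicodeData.txt and the length-changing ones in SpecialCasing.txt.

```python
# Sample lines in the CaseFolding.txt layout: <code>; <status>; <mapping>; # <name>
SAMPLE = """\
# comment lines and blank lines are skipped
0041; C; 0061; # LATIN CAPITAL LETTER A
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
04C3; C; 04C4; # CYRILLIC CAPITAL LETTER KA WITH HOOK
1E9E; F; 0073 0073; # LATIN CAPITAL LETTER SHARP S
1E9E; S; 00DF; # LATIN CAPITAL LETTER SHARP S
"""


def parse_case_folding(text, statuses=("C", "F")):
    """Build {char: folded string} from the lines whose status is in `statuses`.

    C = common, F = full (may change length), S = simple, T = Turkic.
    """
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not line:
            continue
        code, status, mapping = [field.strip() for field in line.split(";")[:3]]
        if status in statuses:
            table[chr(int(code, 16))] = "".join(
                chr(int(cp, 16)) for cp in mapping.split()
            )
    return table


def fold(s, table):
    """Apply the table character by character; unmapped characters pass through."""
    return "".join(table.get(ch, ch) for ch in s)


def codepoints(s):
    """Return the code points of s as 'U+XXXX' strings (ASCII-safe output)."""
    return [f"U+{ord(ch):04X}" for ch in s]


full = parse_case_folding(SAMPLE)                # C + F: length may grow
simple = parse_case_folding(SAMPLE, ("C", "S"))  # C + S: always one-to-one

text = "A\u04c3\u1e9e"  # A, CYRILLIC CAPITAL KA WITH HOOK, CAPITAL SHARP S
print(codepoints(fold(text, full)))
print(codepoints(fold(text, simple)))
```

With the C + F statuses the capital sharp s folds to two characters, which is the kind of length change mentioned above; with C + S every character maps to exactly one character.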
« Last Edit: June 10, 2025, 09:27:05 am by gues1 »

Thaddy

  • Hero Member
  • *****
  • Posts: 18729
  • To Europe: simply sell USA bonds: dollar collapses
Re: Multilanguage UpCase and LowerCase
« Reply #17 on: June 10, 2025, 09:33:17 am »
I do not see the problem. My fifth language (skipping the classics) is Lithuanian, which is notoriously difficult with its 32-letter alphabet, and that simply works with the UTF-8 functions from Lazarus.

A post of mine from about a year ago handles simplified Chinese and others. Please check your terminal; I do not think FPC/Lazarus are at fault.
Otherwise provide a simple full example of what you are missing and what "does not work."

Only then can I provide you with an example that solves your coding issue and display issue. I suspect it is just the display issue.
« Last Edit: June 10, 2025, 09:52:40 am by Thaddy »

gues1

  • Guest
Re: Multilanguage UpCase and LowerCase
« Reply #18 on: June 10, 2025, 10:02:53 am »
Quote from: Thaddy
Otherwise provide a simple full example on what you are missing and "does not work."
Only then I can provide you with an example that solves your coding issue and display issue. I suspect it is just the display issue.

Maybe I didn't explain myself well. But it's not a problem.

I have already posted the solution (for Windows) for the community's use; it is extracted from "Pascalscript".

The problem I reported here anyway is that the standard functions, as also posted in this topic, do not convert these characters:
Code: Pascal
    uses LazUTF8;

    procedure TForm1.Button1Click(Sender: TObject);
    begin
      // Shows the original string, then its upper-case and lower-case forms, one per line
      ShowMessage('ӄӥҗ' + #13#10 + UTF8UpperCase('ӄӥҗ') + #13#10 + UTF8LowerCase('ӄӥҗ'));
    end;
Don't tell me that this is a Windows problem, a terminal problem or anything else: use "WideUpperCase" and "WideLowerCase" ... they work.

But I don't want to start anything that might cause controversy.
For me the question is closed.
Bye

Thaddy

  • Hero Member
  • *****
  • Posts: 18729
  • To Europe: simply sell USA bonds: dollar collapses
Re: Multilanguage UpCase and LowerCase
« Reply #19 on: June 10, 2025, 10:14:02 am »
Use the predefined LineEnding instead of #13#10

munair

  • Hero Member
  • *****
  • Posts: 887
  • compiler developer @SharpBASIC
    • SharpBASIC
Re: Multilanguage UpCase and LowerCase
« Reply #20 on: June 10, 2025, 11:49:07 am »
Years ago I ported Lazarus' utf8 upper/lower case conversion routines to a FreeBASIC library and found that not all languages/characters are covered. There are currently 290,000+ unicode characters with data points supporting 160+ languages/scripts. Libraries will probably never be complete.
It's only logical.

CM630

  • Hero Member
  • *****
  • Posts: 1615
  • Не съм сигурен, че те разбирам.
    • http://sourceforge.net/u/cm630/profile/
Re: Multilanguage UpCase and LowerCase
« Reply #21 on: June 10, 2025, 12:01:21 pm »
Quote from: gues1
The problem I reported here anyway is that the standard functions, as also posted in this topic, do not convert these characters:
...
Code: Pascal
    uses LazUTF8;

    procedure TForm1.Button1Click(Sender: TObject);
    begin
      ShowMessage('ӄӥҗ' + #13#10 + UTF8UpperCase('ӄӥҗ') + #13#10 + UTF8LowerCase('ӄӥҗ'));
    end;
...
But I don't want to start anything that might cause controversy.
...

Controversy? This seems like nothing but a bug to me.
But the better place to report it is not "here" but the bug tracker. Have you created a report?

But here is something that might be controversial:
Code: Pascal
    ShowMessage('ӄӥҗѝьß' + #13#10 + UTF8UpperCase('ӄӥҗѝьß') + #13#10 + UTF8LowerCase('ӄӥҗѝьß'));

AFAIK, there is already a capital eszett, but UTF8UpperCase('ß') converts it to SS. (The case is similar to the Cyrillic letter ь, which is never used at the beginning of a word; it is even named small "er".)
« Last Edit: June 10, 2025, 12:16:52 pm by CM630 »
Лазар 4,4 32 bit (sometimes 64 bit); FPC3,2,2

gues1

  • Guest
Re: Multilanguage UpCase and LowerCase
« Reply #22 on: June 10, 2025, 12:36:47 pm »
Quote from: CM630
Controversy? This seems like nothing but a bug to me.

If someone says that everything is perfect, that it works 100% and that it is my issue ... what am I supposed to say?
In this forum, when you write something that might help (maybe that is the thinking here), talking about bugs or improvements makes it seem, if you don't have more than 1000 posts, that you only want to create problems.
So it is better to stop the conversation.

Quote from: munair
Years ago I ported Lazarus' utf8 upper/lower case conversion routines to a FreeBASIC library and found that not all languages/characters are covered. There are currently 290,000+ unicode characters with data points supporting 160+ languages/scripts. Libraries will probably never be complete.

That is right, of course, as always. But maybe there are some better things one can do (or propose). I don't mean to say "let the core developers do the work"; we can do something ourselves, a little bit. Especially if those things work better "outside".
In this case, under Windows things should work better (actually no, sorry, they were already working well) without calling in the "core developers".

Quote from: CM630
Code: Pascal
    ShowMessage('ӄӥҗѝьß' + #13#10 + UTF8UpperCase('ӄӥҗѝьß') + #13#10 + UTF8LowerCase('ӄӥҗѝьß'));

AFAIK, there is already a capital eszett, but UTF8UpperCase('ß') converts it to SS (the case is similar to the cyrillic letter ь, which is never used in the beginning of the word, it is even named small “er”.)

'SS' is the correct uppercase according to the standard (code point $00DF -> $0053 $0053). It is one of the cases where the length grows.
Other forms are listed below (the glyphs come from a PDF and may not be rendered correctly):
Lowercase letters
Quote
00DF ß LATIN SMALL LETTER SHARP S
= Eszett
• German
• not used in Swiss High German
• uppercase is “SS” (standard case mapping),
alternatively 1E9E ẞ
• typographically the glyph for this character can
be based on a ligature of 017F ſ with either
0073 s or with an old-style glyph for 007A z
(the latter similar in appearance to 0292 ʒ ).
Both forms exist interchangeably today.
→ 017F ſ latin small letter long s
→ 0292 ʒ latin small letter ezh
→ 03B2 β greek small letter beta
→ 1E9E ẞ latin capital letter sharp s
→ A7B5 ꞵ latin small letter beta
→ A7D7 ꟗ latin small letter middle scots s
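
As a cross-check outside FPC, here is a small Python sketch of the length-changing mappings listed in the quote. It relies only on Python's built-in str methods, which apply the Unicode full case mappings; `codepoints` is a hypothetical helper for printing. Whether UTF8UpperCase is meant to behave identically is not something this thread settles.

```python
def codepoints(s):
    """Return the code points of s as 'U+XXXX' strings (ASCII-safe output)."""
    return [f"U+{ord(ch):04X}" for ch in s]


sharp_s = "\u00df"          # LATIN SMALL LETTER SHARP S
capital_sharp_s = "\u1e9e"  # LATIN CAPITAL LETTER SHARP S

upper = sharp_s.upper()          # full uppercase mapping: two capital S
folded = sharp_s.casefold()      # case folding: two small s
lower = capital_sharp_s.lower()  # U+1E9E lowercases back to U+00DF

print("upper:   ", codepoints(upper), "length", len(sharp_s), "->", len(upper))
print("casefold:", codepoints(folded))
print("lower:   ", codepoints(lower))
```

The uppercase result is U+0053 U+0053 ("SS"), while U+0073 U+0073 ("ss") is the case-folded form from CaseFolding.txt, so the two files should not be mixed up when building a table.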

CM630

  • Hero Member
  • *****
  • Posts: 1615
  • Не съм сигурен, че те разбирам.
    • http://sourceforge.net/u/cm630/profile/
Re: Multilanguage UpCase and LowerCase
« Reply #23 on: June 10, 2025, 01:03:06 pm »
Quote from: gues1
...
If someone says that everything is perfect, that it works 100% and that it is my issue ... what am I supposed to say?
In this forum, when you write something that might help (maybe that is the thinking here), talking about bugs or improvements makes it seem, if you don't have more than 1000 posts, that you only want to create problems.
So it is better to stop the conversation.
...

I have 1385 posts in this forum; in approx. 1300 of them I asked questions, and many of them were stupid. 90 % of my questions were answered (with the remaining 85 posts I tried to help someone else).
If for some reason you are reluctant to create bug reports yourself, give me the list of the unhandled characters and I will create the report.
It might be rejected, but it might be accepted, fully or partially.
Indeed, if it is not, I would rather stop using the built-in *case routines, as long as a working third-party solution is available.

Thaddy

  • Hero Member
  • *****
  • Posts: 18729
  • To Europe: simply sell USA bonds: dollar collapses
Re: Multilanguage UpCase and LowerCase
« Reply #24 on: June 10, 2025, 01:14:37 pm »
Quote from: munair
Years ago I ported Lazarus' utf8 upper/lower case conversion routines to a FreeBASIC library and found that not all languages/characters are covered. There are currently 290,000+ unicode characters with data points supporting 160+ languages/scripts. Libraries will probably never be complete.

Lazarus and FPC use the official tables verbatim, so that is not the case; it is even false information.
If it does not work, it is your fault.

Warfley

  • Hero Member
  • *****
  • Posts: 2038
Re: Multilanguage UpCase and LowerCase
« Reply #25 on: June 10, 2025, 01:14:53 pm »
Quote from: gues1
If someone says that everything is perfect, that it works 100% and that it is my issue ... what am I supposed to say?
In this forum, when you write something that might help (maybe that is the thinking here), talking about bugs or improvements makes it seem, if you don't have more than 1000 posts, that you only want to create problems.
So it is better to stop the conversation.

This forum is also not the place to report bugs: first it will spark discussions, and it will not end up on any to-do list of the developers until it is posted to the GitLab issues.

So the best way is to skip that and just file a bug report on GitLab. Give a short explanation, an example of what goes wrong and what the expected result should be, and you should be fine.

Here's the link for Lazarus or LCL: https://gitlab.com/freepascal.org/lazarus/lazarus/-/issues
And for FPC, RTL and FCL: https://gitlab.com/freepascal.org/fpc/source/-/issues

CM630

  • Hero Member
  • *****
  • Posts: 1615
  • Не съм сигурен, че те разбирам.
    • http://sourceforge.net/u/cm630/profile/
Re: Multilanguage UpCase and LowerCase
« Reply #26 on: June 10, 2025, 01:49:54 pm »
...
+ 1

Quote from: Thaddy
...
If it does not work it is your fault,

It does not work, and it is not his/her fault. If the standard indeed says that it shall not work, the fault lies with (those who made) the standard.
Whether FPC/Lazarus should follow a faulty standard is another issue.
« Last Edit: June 10, 2025, 02:11:50 pm by CM630 »

gues1

  • Guest
Re: Multilanguage UpCase and LowerCase
« Reply #27 on: June 10, 2025, 02:28:38 pm »
Quote from: Warfley
Quote from: gues1
If someone says that everything is perfect, that it works 100% and that it is my issue ... what am I supposed to say?
In this forum, when you write something that might help (maybe that is the thinking here), talking about bugs or improvements makes it seem, if you don't have more than 1000 posts, that you only want to create problems.
So it is better to stop the conversation.

This forum is also not the place to report bugs: first it will spark discussions, and it will not end up on any to-do list of the developers until it is posted to the GitLab issues.

So the best way is to skip that and just file a bug report on GitLab. Give a short explanation, an example of what goes wrong and what the expected result should be, and you should be fine.

Here's the link for Lazarus or LCL: https://gitlab.com/freepascal.org/lazarus/lazarus/-/issues
And for FPC, RTL and FCL: https://gitlab.com/freepascal.org/fpc/source/-/issues

Maybe I don't speak comprehensible English:
I have already made my point. No one except @CM630 and me (I think it is incomplete work) says that this is a bug, and one of the most valued members of this forum said that everything is OK and that it is my fault, so what report should I make? I am the last person in the world who should open a bug report.
I know nothing about Lazarus / FPC, so no report is going to be opened by me.

As you said, this is a forum, a place to talk about these things, and I never meant (I wrote about this) to summon the "core developers" here. In the end, for me this is not an issue and it works as expected (badly, but as expected).

CM630

  • Hero Member
  • *****
  • Posts: 1615
  • Не съм сигурен, че те разбирам.
    • http://sourceforge.net/u/cm630/profile/
Re: Multilanguage UpCase and LowerCase
« Reply #28 on: June 10, 2025, 02:39:41 pm »
Some other people also told you:
forum ≠ bugtracker
Just give me the list of what is wrong, and I will write the bug report.
Meanwhile in Python:
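
The Python output this post refers to is not preserved in the text. A minimal sketch of such a check, assuming nothing beyond Python's built-in str.upper(), str.lower() and the standard unicodedata module, could look like this; the character names printed come from Python's own Unicode database, not from the original post.

```python
import unicodedata

# The thread's test string, written as escapes: U+04C4, U+04E5, U+0497
sample = "\u04c4\u04e5\u0497"

for ch in sample:
    up = ch.upper()  # each of these three maps to a single capital letter
    print(
        f"U+{ord(ch):04X} {unicodedata.name(ch)}"
        f" -> U+{ord(up):04X} {unicodedata.name(up)}"
    )

upper_sample = sample.upper()
round_trip = upper_sample.lower()  # lowercasing the capitals restores the input
print(round_trip == sample)
```

A list like the one this prints (code point, name, expected capital) is also the kind of sample data a bug report would need.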

paweld

  • Hero Member
  • *****
  • Posts: 1568
Re: Multilanguage UpCase and LowerCase
« Reply #29 on: June 10, 2025, 02:40:04 pm »
Quote from: gues1
I have already made my point. No one except @CM630 and me (I think it is incomplete work) says that this is a bug, and one of the most valued members of this forum said that everything is OK and that it is my fault, so what report should I make? I am the last person in the world who should open a bug report.
I know nothing about Lazarus / FPC, so no report is going to be opened by me.

As you said, this is a forum, a place to talk about these things, and I never meant (I wrote about this) to summon the "core developers" here. In the end, for me this is not an issue and it works as expected (badly, but as expected).

No one said it's not a bug (except maybe @Thaddy); it's just that not everyone uses all Unicode code points, so not everyone is in a position to observe such a problem.
Since you wrote that you encountered problems with Cyrillic, Arabic and Chinese characters, you are the best person to report the error. In the report, state in which cases the error manifests itself (sample characters); it is also good to state what the correct behavior should be. The more information, the better.
Best regards / Pozdrawiam
paweld

 
