
Author Topic: Multilanguage UpCase and LowerCase  (Read 2663 times)

LemonParty

  • Sr. Member
  • ****
  • Posts: 393
Multilanguage UpCase and LowerCase
« on: June 09, 2025, 02:35:36 pm »
If I need UpCase and LowerCase for languages other than English, should I implement a separate function for each language, or is there a ready-made solution for that?
Lazarus v. 4.99. FPC v. 3.3.1. Windows 11

CM630

  • Hero Member
  • *****
  • Posts: 1581
  • I'm not sure I understand you.
    • http://sourceforge.net/u/cm630/profile/
Re: Multilanguage UpCase and LowerCase
« Reply #1 on: June 09, 2025, 02:43:59 pm »
Are there any languages, other than Turkish, for which the built-in UpperCase and LowerCase do not work properly?
Lazarus 4.4 32 bit (sometimes 64 bit); FPC 3.2.2

paweld

  • Hero Member
  • *****
  • Posts: 1561
Re: Multilanguage UpCase and LowerCase
« Reply #2 on: June 09, 2025, 03:03:29 pm »
Use UTF8UpperCase and UTF8LowerCase from the LazUTF8 unit.
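A minimal sketch of how these two functions might be called, assuming the project depends on the LazUtils package and the source file is saved as UTF-8. The optional second argument (a language code such as 'tr') is, as far as I know, how LazUTF8 switches on Turkish rules; treat that as an assumption and check it against your Lazarus version. Console output of non-ASCII text may look garbled on Windows, so no expected output is shown.

```pascal
program Utf8CaseDemo;
{$mode objfpc}{$H+}

uses
  LazUTF8; // part of the LazUtils package shipped with Lazarus

var
  S: string;
begin
  // The literal is stored as UTF-8 bytes because the source file is UTF-8
  S := 'привет мир';

  // Case conversion that works on UTF-8 encoded text
  WriteLn(UTF8UpperCase(S));
  WriteLn(UTF8LowerCase(S));

  // Assumption: the optional language argument enables Turkish rules
  // (dotted and dotless i); verify this in your LazUTF8 version.
  WriteLn(UTF8UpperCase('istanbul', 'tr'));
end.
```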
Best regards / Pozdrawiam
paweld

gues1

  • Guest
Re: Multilanguage UpCase and LowerCase
« Reply #3 on: June 09, 2025, 03:10:38 pm »
There are no rules about upper or lower case in Unicode. Case makes no sense in many languages (look at Chinese, for example).

Do not rely blindly on generic conversions; verify the results for every language you support.

CM630

  • Hero Member
  • *****
  • Posts: 1581
  • I'm not sure I understand you.
    • http://sourceforge.net/u/cm630/profile/
Re: Multilanguage UpCase and LowerCase
« Reply #4 on: June 09, 2025, 03:52:58 pm »
...This makes no sense in many languages (look at Chinese, for example)...
It makes no sense in Chinese, but does it cause problems?
The code below seems to run fine.
Code: Pascal
// uses ..., LazUTF8, ...;  (in the uses clause of the unit)
procedure TForm1.Button1Click(Sender: TObject);
begin
  // Each message shows the original, the uppercased and the lowercased text
  ShowMessage('汉语' + #13#10 + UTF8UpperCase('汉语') + #13#10 + UTF8LowerCase('汉语'));
  ShowMessage('юЯ汉语' + #13#10 + UTF8UpperCase('юЯ汉语') + #13#10 + UTF8LowerCase('юЯ汉语'));
end;

gues1

  • Guest
Re: Multilanguage UpCase and LowerCase
« Reply #5 on: June 09, 2025, 04:18:17 pm »
It makes no sense in Chinese, but does it cause problems?
The code below seems to be executed fine.
Of course, it doesn't cause any problem.

But you don't get what you need. And while this is not an issue for Chinese, Ancient Greek, Hebrew or similar languages, it may be for others.

Take care: in Arabic and other languages the lower or upper case has a particular meaning. A failed conversion can change the meaning of the phrase.

Try to convert these to uppercase (the uppercase forms exist, of course, and are defined as uppercase):

'ӄӥҗ'

This is the lowercase code point of one of them: #$04E5
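A small check along these lines might look like the sketch below. It assumes the source file is saved as UTF-8 and LazUtils is available; the expected uppercase string is built from the Unicode pairs U+04C4/U+04C3, U+04E5/U+04E4 and U+0497/U+0496. The constant names are made up for the example, and the result depends on the Lazarus version, so none is claimed here.

```pascal
program CheckExtendedCyrillic;
{$mode objfpc}{$H+}

uses
  LazUTF8; // LazUtils package

const
  LowerText = 'ӄӥҗ'; // U+04C4, U+04E5, U+0497
  UpperText = 'ӃӤҖ'; // U+04C3, U+04E4, U+0496

begin
  // Compare the library result with the uppercase forms defined by Unicode
  if UTF8UpperCase(LowerText) = UpperText then
    WriteLn('All three characters were converted')
  else
    WriteLn('At least one character was not converted');
end.
```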
« Last Edit: June 09, 2025, 04:24:46 pm by gues1 »

gues1

  • Guest
Re: Multilanguage UpCase and LowerCase
« Reply #6 on: June 09, 2025, 06:15:13 pm »
I found these functions in PascalScript; on Windows they do excellent work:

Code: Pascal
// Needs Windows and SysUtils in the uses clause
// (CharUpperBuffW, CharLowerBuffW, Win32Platform, AnsiUpperCase).

function WideUpperCase(const S: WideString): WideString;
var
  Len: Integer;
begin
  // CharUpperBuffW is stubbed out on Win9x platforms
  if Win32Platform = VER_PLATFORM_WIN32_NT then
  begin
    Len := Length(S);
    // Copy S into Result, then convert the copy in place
    SetString(Result, PWideChar(S), Len);
    if Len > 0 then CharUpperBuffW(Pointer(Result), Len);
  end
  else
    Result := AnsiUpperCase(S);
end;

function WideLowerCase(const S: WideString): WideString;
var
  Len: Integer;
begin
  // CharLowerBuffW is stubbed out on Win9x platforms
  if Win32Platform = VER_PLATFORM_WIN32_NT then
  begin
    Len := Length(S);
    // Copy S into Result, then convert the copy in place
    SetString(Result, PWideChar(S), Len);
    if Len > 0 then CharLowerBuffW(Pointer(Result), Len);
  end
  else
    Result := AnsiLowerCase(S);
end;

If you use those instead of UTF8UpperCase, the characters I showed in my last post are converted to uppercase.
It works only on Windows, but it works.


paweld

  • Hero Member
  • *****
  • Posts: 1561
Re: Multilanguage UpCase and LowerCase
« Reply #7 on: June 09, 2025, 06:47:30 pm »
The only problem I associate with UTF8(Lower/Upper)Case functions is the error that occurs when a lowercase letter (code point) has a different length (number of bytes) than an uppercase letter (code point).
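To illustrate what "different length" means here, the sketch below prints the UTF-8 byte counts for two case pairs defined by Unicode: U+0131 (dotless i) uppercases to the ASCII letter I, and U+023A (A with stroke) lowercases to U+2C65. The bytes are written out explicitly so the example does not depend on the source file encoding; the constant names are made up for the example.

```pascal
program CaseByteLengths;
{$mode objfpc}{$H+}

const
  SmallDotlessI: string = #$C4#$B1;     // U+0131, 2 bytes in UTF-8
  CapitalI: string = 'I';               // U+0049, 1 byte in UTF-8
  CapitalAStroke: string = #$C8#$BA;    // U+023A, 2 bytes in UTF-8
  SmallAStroke: string = #$E2#$B1#$A5;  // U+2C65, 3 bytes in UTF-8

begin
  // Uppercasing shrinks the string by one byte
  WriteLn(Length(SmallDotlessI), ' -> ', Length(CapitalI));     // prints "2 -> 1"
  // Lowercasing grows the string by one byte
  WriteLn(Length(CapitalAStroke), ' -> ', Length(SmallAStroke)); // prints "2 -> 3"
end.
```

A conversion routine that writes into a buffer of the original size cannot handle either case correctly, which is why the result string has to be allowed to change length.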

dsiders

  • Hero Member
  • *****
  • Posts: 1510
Re: Multilanguage UpCase and LowerCase
« Reply #8 on: June 09, 2025, 06:57:44 pm »
The only problem I associate with UTF8(Lower/Upper)Case functions is the error that occurs when a lowercase letter (code point) has a different length (number of bytes) than an uppercase letter (code point).

To my knowledge, differing code point lengths have been addressed in the trunk version (main). There are more than likely still mapping issues for upper- and lowercase conversions... but you can thank the Unicode standard for that over-engineered nonsense.

gues1

  • Guest
Re: Multilanguage UpCase and LowerCase
« Reply #9 on: June 09, 2025, 07:34:20 pm »
The only problem I associate with UTF8(Lower/Upper)Case functions is the error that occurs when a lowercase letter (code point) has a different length (number of bytes) than an uppercase letter (code point).
I don't use UpperCase on natural-language text very often, but I have used it (which is how I know that Lazarus has difficulty with the characters shown).
I have worked with other tools on Windows, but I never saw uppercase and lowercase forms with different code lengths there (I use UTF-16). I did see completely different codes between them, I think because of the extended and supplemental blocks of the BMP.

To my knowledge, differing code point lengths have been addressed in the trunk version (main). There are more than likely still mapping issues for upper- and lowercase conversions... but you can thank the Unicode standard for that over-engineered nonsense.

Maybe some people think Unicode is not a good instrument, but I think it is the best we have now. Look at the web: only with Unicode is it possible to display it correctly in every language.

Without Unicode we would still be talking about code pages ...

Warfley

  • Hero Member
  • *****
  • Posts: 2037
Re: Multilanguage UpCase and LowerCase
« Reply #10 on: June 09, 2025, 07:36:39 pm »
There are no rules about upper or lower case in Unicode. Case makes no sense in many languages (look at Chinese, for example).

Do not rely blindly on generic conversions; verify the results for every language you support.
That's not true. The Unicode Consortium defines character properties, including the properties Uppercase and Lowercase. It also provides the Unicode Character Database, a collection of text files that contain all that information in machine-readable form, as lists (or rather sets of ranges) of all those characters.

All the information is already collected and specified there. Nobody can be expected to know all the ins and outs of all those languages, so the Unicode Consortium has done it for you.

How well the Lazarus UTF8 functions follow the Unicode standard I don't know, though; it could very well be that they are not fully implemented.
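As a rough illustration of what "machine-readable" means, the sketch below pulls the simple uppercase mapping out of one line in the format of the database file UnicodeData.txt, where fields are separated by semicolons, field 0 is the code point and field 12 is the simple uppercase mapping. The sample line is the entry for U+04E5 as I remember it, and UcdField is a helper made up for this example, not part of FPC or Lazarus.

```pascal
program UcdLineDemo;
{$mode objfpc}{$H+}

// Hypothetical helper: returns the field with the given 0-based index
// from a semicolon-separated line, or '' if the line has too few fields.
function UcdField(const Line: string; Index: Integer): string;
var
  Rest: string;
  P: SizeInt;
begin
  Rest := Line;
  // Drop one leading field (and its separator) per step
  while Index > 0 do
  begin
    P := Pos(';', Rest);
    if P = 0 then
      Exit('');
    Delete(Rest, 1, P);
    Dec(Index);
  end;
  // The wanted field now runs up to the next separator, if any
  P := Pos(';', Rest);
  if P = 0 then
    Result := Rest
  else
    Result := Copy(Rest, 1, P - 1);
end;

const
  // Sample line in UnicodeData.txt format (entry for U+04E5)
  Sample = '04E5;CYRILLIC SMALL LETTER I WITH DIAERESIS;Ll;0;L;' +
           '0438 0308;;;;N;;;04E4;;04E4';

begin
  WriteLn('Code point:       ', UcdField(Sample, 0));
  WriteLn('Simple uppercase: ', UcdField(Sample, 12));
end.
```

A real implementation would read the whole file once and build a lookup table; special cases that change the number of characters live in a separate file (SpecialCasing.txt).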

Thaddy

  • Hero Member
  • *****
  • Posts: 18711
  • To Europe: simply sell USA bonds: dollar collapses
Re: Multilanguage UpCase and LowerCase
« Reply #11 on: June 09, 2025, 07:46:22 pm »
Yes, that answer is correct: it is defined in the standard(s).
I believe FPC adheres to them, not 100% sure about Lazarus, though, but probably also yes.
If Europe sells its USA bonds, the USD will collapse. Europe can afford that given average state debts. The USA can't afford that. Just a piece of advice...

gues1

  • Guest
Re: Multilanguage UpCase and LowerCase
« Reply #12 on: June 09, 2025, 08:11:40 pm »
That's not true. The Unicode Consortium defines character properties, including the properties Uppercase and Lowercase. It also provides the Unicode Character Database, a collection of text files that contain all that information in machine-readable form, as lists (or rather sets of ranges) of all those characters.
All the information is already collected and specified there. Nobody can be expected to know all the ins and outs of all those languages, so the Unicode Consortium has done it for you.
How well the Lazarus UTF8 functions follow the Unicode standard I don't know, though; it could very well be that they are not fully implemented.
You are right, it has been a long time since I last looked at the Unicode standards.

This is the reference: https://www.unicode.org/faq/casemap_charprop.html

I'll have a look at what is new (especially for Arabic).

I believe FPC adheres to them, not 100% sure about Lazarus, though, but probably also yes.

I already posted some characters that aren't converted by Lazarus / FPC.

But surely this may be improved in the future.

EDIT: reposted the link, it didn't work.
« Last Edit: June 09, 2025, 09:06:16 pm by gues1 »

CM630

  • Hero Member
  • *****
  • Posts: 1581
  • I'm not sure I understand you.
    • http://sourceforge.net/u/cm630/profile/
Re: Multilanguage UpCase and LowerCase
« Reply #13 on: June 09, 2025, 08:37:19 pm »
Without bug reports things might never get fixed.

gues1

  • Guest
Re: Multilanguage UpCase and LowerCase
« Reply #14 on: June 09, 2025, 08:59:25 pm »
Without bug reports things might never get fixed.
A bug report should be filled out by someone who knows the topic very well, in order to provide the right input and information.
In addition, you should also provide a tip or some additional suggestions for solving the problem.
And only a person who knows the development environment and the topic of the problem well can provide this.

I'm off-topic now, so it's better to move on.
