
Author Topic: Should UpperCase return the same result as UTF8UpperCase?  (Read 7092 times)

RayoGlauco

  • Full Member
  • ***
  • Posts: 176
  • Beers: 1567
Should UpperCase return the same result as UTF8UpperCase?
« on: December 18, 2017, 03:48:42 pm »
Hello!

I have been updating my code to remove references to UTF8 functions like DirectoryExistsUTF8 and replace them with the "standard" functions (like DirectoryExists), but I see that UpperCase does not work the same way as UTF8UpperCase. Characters like ñ ç á è ô ü are not converted to Ñ Ç Á È Ô Ü by UpperCase.

Code: Pascal
  UTF8UpperCase('test ñ ç á è ô ü') = 'TEST Ñ Ç Á È Ô Ü'
  UpperCase('test ñ ç á è ô ü') = 'TEST ñ ç á è ô ü'

I think UpperCase should convert these characters. Is this correct?

I tested this on 64-bit Windows 10 with 32-bit Lazarus 1.8.
To err is human, but to really mess things up, you need a computer.

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11383
  • FPC developer.
Re: Should UpperCase return the same result as UTF8UpperCase?
« Reply #1 on: December 18, 2017, 04:10:38 pm »
I don't know exactly what happens. It seems that the literal confuses the compiler. This worked for me:

Code: Pascal
  var x : ansistring;
  begin
    memo1.lines.add(UTF8UpperCase('test ñ ç á è ô ü')); //  'TEST Ñ Ç Á È Ô Ü'
    x:='test ñ ç á è ô ü';
    memo1.lines.add(UnicodeUpperCase(x));
  end;

Please file a bug so that it is not forgotten.

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4459
  • I like bugs.
Re: Should UpperCase return the same result as UTF8UpperCase?
« Reply #2 on: December 18, 2017, 04:31:42 pm »
Quote
I think UpperCase should convert these characters. Is this correct?
No.
UpperCase is the "dummy" ASCII-compatible function. You must use AnsiUpperCase instead.
It is documented. This page:
 http://wiki.freepascal.org/Unicode_Support_in_Lazarus#Coming_from_older_Lazarus_.2B_LCL_versions
says: "Most UTF8...() string functions can be replaced with the Delphi compatible Ansi...() functions. For example UTF8UpperCase() -> AnsiUpperCase()."
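
The difference between the three functions can be seen side by side in a small console program. This is only a minimal sketch under some assumptions: the source file is saved as UTF-8 (the Lazarus default), the project lists the LazUtils package as a requirement so that LazUTF8 is available, and the program name is made up for illustration. How the accented characters display depends on the console's own encoding, so the output is best redirected to a file or shown in a memo.

```pascal
program UpperCaseSketch; // hypothetical name, for illustration only

{$mode objfpc}{$H+}

uses
  {$ifdef unix}cwstring,{$endif} // widestring manager, so the Ansi* routines can handle non-ASCII text on Unix
  SysUtils,                      // UpperCase, AnsiUpperCase
  LazUTF8;                       // UTF8UpperCase (LazUtils package)

var
  S: string;
begin
  // Assumption: this source file is saved as UTF-8.
  S := 'test ñ ç á è ô ü';

  WriteLn('UpperCase:     ', UpperCase(S));     // converts only the ASCII letters a..z
  WriteLn('AnsiUpperCase: ', AnsiUpperCase(S)); // the replacement suggested by the wiki
  WriteLn('UTF8UpperCase: ', UTF8UpperCase(S)); // the LazUTF8 function used in the first post
end.
```

According to the results reported in the first post, the UpperCase line should leave ñ ç á è ô ü untouched while the UTF8UpperCase line converts them. Per the wiki text quoted above, AnsiUpperCase is expected to behave like UTF8UpperCase in a Lazarus UTF-8 setup.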

See also the Delphi documentation for those functions:
 http://docwiki.embarcadero.com/Libraries/Berlin/en/System.AnsiStrings.UpperCase
 http://docwiki.embarcadero.com/Libraries/Berlin/en/System.SysUtils.AnsiUpperCase

Marcov may have found some other issue. I did not test it now.
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11383
  • FPC developer.
Re: Should UpperCase return the same result as UTF8UpperCase?
« Reply #3 on: December 18, 2017, 05:13:38 pm »
My bad. I knew this was the case, but I couldn't find AnsiUpperCase where I expected it (I looked in StrUtils; it is in SysUtils).

Thaddy

  • Hero Member
  • *****
  • Posts: 14201
  • Probably until I exterminate Putin.
Re: Should UpperCase return the same result as UTF8UpperCase?
« Reply #4 on: December 18, 2017, 05:29:37 pm »
Quote
My bad. I knew this was the case but couldn't find ansiuppercase where I expected it (in strutils, not sysutils)
The name is also a bit of a misnomer, considering the string is AnsiString(CP_UTF8);
Specialize a type, not a var.

RayoGlauco

  • Full Member
  • ***
  • Posts: 176
  • Beers: 1567
Re: Should UpperCase return the same result as UTF8UpperCase?
« Reply #5 on: December 18, 2017, 06:33:14 pm »
Quote
"Most UTF8...() string functions can be replaced with the Delphi compatible Ansi...() functions. For example UTF8UpperCase() -> AnsiUpperCase()."

Thanks! It's clear now. :)
To err is human, but to really mess things up, you need a computer.

munair

  • Hero Member
  • *****
  • Posts: 798
  • compiler developer @SharpBASIC
    • SharpBASIC
Re: Should UpperCase return the same result as UTF8UpperCase?
« Reply #6 on: December 18, 2017, 06:42:35 pm »
Quote
My bad. I knew this was the case but couldn't find ansiuppercase where I expected it (in strutils, not sysutils)
It is also a bit of a misnomer considering it is AnsiString(CP_UTF8);

Further down the page it says:
Code: Pascal
  UTF8String = type AnsiString(CP_UTF8);

Doesn't make it easier to understand the difference.
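
One way to make the difference more tangible is to ask the RTL which code page each string actually carries at run time. The following is a minimal sketch, not a statement about what any particular setup will print: the program name and variable names are made up for illustration, and the values shown depend on the platform and on whether a unit such as LazUTF8 has changed DefaultSystemCodePage.

```pascal
program CodePageSketch; // hypothetical name, for illustration only

{$mode objfpc}{$H+}

var
  A: AnsiString;    // code page follows the default system code page
  U: UTF8String;    // declared in the RTL as AnsiString(CP_UTF8)
  R: RawByteString; // no fixed code page; keeps whatever the assigned string carries
begin
  A := 'abc';
  U := 'abc';
  R := U; // no conversion: R keeps the code page that U carries

  WriteLn('DefaultSystemCodePage: ', DefaultSystemCodePage);
  WriteLn('AnsiString:            ', StringCodePage(A));
  WriteLn('UTF8String:            ', StringCodePage(U));
  WriteLn('RawByteString:         ', StringCodePage(R));
end.
```

The values are printed as numeric code page identifiers; CP_UTF8 is 65001.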
keep it simple

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11383
  • FPC developer.
Re: Should UpperCase return the same result as UTF8UpperCase?
« Reply #7 on: December 18, 2017, 06:56:45 pm »
Quote
My bad. I knew this was the case but couldn't find ansiuppercase where I expected it (in strutils, not sysutils)
It is also a bit of a misnomer considering it is AnsiString(CP_UTF8);

I haven't seen anything with that type. It is barely used in the RTL; all encoding support uses RawByteString.

SymbolicFrank

  • Hero Member
  • *****
  • Posts: 1313
Re: Should UpperCase return the same result as UTF8UpperCase?
« Reply #8 on: December 19, 2017, 09:13:01 am »
Ok, so it seems that the programs I write only look like they work, but that they actually don't. Because everything I do with strings seems to be wrong and deprecated.

I read the link, but I don't understand it.

Can anyone tell me how to write programs that handle (mostly unicode) strings, that actually work as intended with fpc + Lazarus trunk (and work the same way when compiled on Linux and Windows)? Because I lost it.

Thaddy

  • Hero Member
  • *****
  • Posts: 14201
  • Probably until I exterminate Putin.
Re: Should UpperCase return the same result as UTF8UpperCase?
« Reply #9 on: December 19, 2017, 09:25:41 am »
The problem is that you refer to "mostly unicode", but Unicode is a family of encodings. Which one do you mean? UTF-8 (variable length, 1-4 bytes), UTF-16 (variable length, 2 or 4 bytes) or UTF-32 (fixed length, 4 bytes)? Modern Delphi, for example, is supposed to use UTF-16, yet half of its RTL actually treats strings as UCS-2, the fixed-length 2-byte predecessor of UTF-16.
The only consolation is that UTF-8 and UTF-16 are supposed to convert back and forth losslessly.
« Last Edit: December 19, 2017, 09:37:16 am by Thaddy »
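
The variable-length point can be checked with a very small program. This is a sketch under assumptions: it uses UTF8Length from LazUTF8 (LazUtils package) and UTF8Decode from the RTL, the program name is made up, and the test string is written as explicit bytes so that the result does not depend on how the source file is saved.

```pascal
program LengthSketch; // hypothetical name, for illustration only

{$mode objfpc}{$H+}

uses
  LazUTF8; // UTF8Length (LazUtils package)

var
  S: string;
  W: UnicodeString;
begin
  // 'a' followed by 'ñ', the latter given as its two UTF-8 bytes $C3 $B1.
  S := 'a'#$C3#$B1;

  // Convert the UTF-8 bytes to a UTF-16 string.
  W := UTF8Decode(S);

  WriteLn('UTF-8 bytes:       ', Length(S));      // 3
  WriteLn('UTF-8 code points: ', UTF8Length(S));  // 2
  WriteLn('UTF-16 code units: ', Length(W));      // 2
end.
```

Length counts bytes for a UTF-8 string and 16-bit units for a UnicodeString, so neither is a character count in general.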
Specialize a type, not a var.

SymbolicFrank

  • Hero Member
  • *****
  • Posts: 1313
Re: Should UpperCase return the same result as UTF8UpperCase?
« Reply #10 on: December 19, 2017, 10:03:24 am »
Yes, I know all that. And I know there is no such thing as a Unicode char anymore, just a variable length Unicode code point, and some encoding, like UTF8 or UTF32.

First (up to Lazarus 1.4) it was ASCII + codepages only, with WideString conversions and such; with 1.6 it was UTF8 only (and a checkmark somewhere to turn it off); now it is... what? It says string = AnsiString when you hover your mouse over it, but doesn't show the declaration. And no help (but I probably have to install that separately). What type is AnsiString? Rawbyte? Which is typeless (just a byte array)?

So, first I had to use the UTF8* functions, then the normal ones, then the string helpers start at 0, and now the normal ones are replaced by the Ansi* ones? Something like that?

Half my programs consist of manipulating strings, so I would really want to know what I'm doing.

Thaddy

  • Hero Member
  • *****
  • Posts: 14201
  • Probably until I exterminate Putin.
Re: Should UpperCase return the same result as UTF8UpperCase?
« Reply #11 on: December 19, 2017, 10:12:33 am »
Well, actually Juha convinced me *somewhat* that if you develop using Lazarus all should be OK. That means everything is UTF8.
For console applications you need to pull in lazutf8 from the lazutils package. That *should* always work.
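
A minimal console skeleton along those lines might look like the sketch below. It assumes the project lists LazUtils as a required package; the program name is made up for illustration. The check only reports what the default string code page is at run time. Whether LazUTF8 has switched it to UTF-8 depends on the Lazarus/FPC versions and on how LazUtils was built.

```pascal
program ConsoleUtf8Sketch; // hypothetical name, for illustration only

{$mode objfpc}{$H+}

uses
  SysUtils,
  LazUTF8; // pulled in from the LazUtils package, as suggested above

begin
  // Report the code page that plain "string" variables use by default.
  if DefaultSystemCodePage = CP_UTF8 then
    WriteLn('Default string code page is UTF-8')
  else
    WriteLn('Default string code page is ', DefaultSystemCodePage);
end.
```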
Specialize a type, not a var.

SymbolicFrank

  • Hero Member
  • *****
  • Posts: 1313
Re: Should UpperCase return the same result as UTF8UpperCase?
« Reply #12 on: December 19, 2017, 11:05:09 am »
Ok, thanks.

But should I use the Ansi* functions? AnsiUpperCase instead of UpperCase? AnsiPos? AnsiCopy?

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4459
  • I like bugs.
Re: Should UpperCase return the same result as UTF8UpperCase?
« Reply #13 on: December 19, 2017, 01:44:01 pm »
Quote
But should I use the Ansi* functions? AnsiUpperCase instead of UpperCase?
Yes, just like you would use them in Delphi.
The message of this page:
 http://wiki.freepascal.org/Unicode_Support_in_Lazarus
is that things are amazingly compatible with Delphi when you use "String" type, despite their different encodings.

Quote
AnsiPos? AnsiCopy?
Where did you find those? Read:
 http://wiki.freepascal.org/UTF8_strings_and_characters
and
 http://wiki.freepascal.org/Unicode_Support_in_Lazarus#CodePoint_functions_for_encoding_agnostic_code
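
To illustrate why byte indexes and code point indexes differ for searching and copying, here is a small sketch. It is an assumption-laden example, not a quote from the wiki pages: it uses UTF8Pos, UTF8Copy and UTF8Length from LazUTF8 (LazUtils package), the program name is made up, and the test string is written as explicit UTF-8 bytes so the result does not depend on the source file encoding.

```pascal
program CodePointSketch; // hypothetical name, for illustration only

{$mode objfpc}{$H+}

uses
  LazUTF8; // UTF8Pos, UTF8Copy, UTF8Length (LazUtils package)

var
  S: string;
begin
  // "año", with 'ñ' given as its two UTF-8 bytes $C3 $B1.
  S := 'a'#$C3#$B1'o';

  WriteLn('Pos (byte index):           ', Pos('o', S));                // 4
  WriteLn('UTF8Pos (code point index): ', UTF8Pos('o', S));            // 3
  WriteLn('UTF8Length (code points):   ', UTF8Length(S));              // 3
  WriteLn('Bytes in UTF8Copy(S, 2, 1): ', Length(UTF8Copy(S, 2, 1)));  // 2
end.
```

Plain Pos and Copy work on bytes, which is fine when searching for ASCII text or when the result of Pos is fed straight back into Copy; the UTF8* variants are needed when the index is meant as a code point position.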
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

 
