* * *

Author Topic: SelLength incorrect value for text containing characters > $FFFF  (Read 3126 times)

Thaddy

  • Hero Member
  • *****
  • Posts: 4741
Re: SelLength incorrect value for text containing characters > $FFFF
« Reply #15 on: December 05, 2017, 09:28:34 pm »
Fix what? Fix to UTF16? Drop UCS? See my remark on the bug report. Note e.g. Win2000 is still UCS and widely used in industrial and banking applications. We can't just drop that.
"Logically, no number of positive outcomes at the level of experimental testing can confirm a scientific theory, but a single counterexample is logically decisive."

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 3257
  • I like bugs.
Re: SelLength incorrect value for text containing characters > $FFFF
« Reply #16 on: December 05, 2017, 11:09:28 pm »
Fixing UTF-16 should be enough. If the OS does not support it then it will not work whatever LCL does.
LCL however should try to pass valid UTF-16 for Windows based on user input.
« Last Edit: December 05, 2017, 11:11:04 pm by JuhaManninen »

tomitomy

  • Full Member
  • ***
  • Posts: 200
Re: SelLength incorrect value for text containing characters > $FFFF
« Reply #17 on: December 06, 2017, 06:47:42 am »
I also have some reports that have not been dealt with.
If the reports are important to you, change something in them, maybe add a simple comment "ping" or so to bring it up to the top of the report list. If a report has not been assigned to a developer yet, it could easily go unnoticed if the developer who would normally take care of it does not see it within the first days, because later it will be buried by many other reports.

And most important: Reports with a poor description of the issue, referring to an old version, missing a demo project, missing exact steps to reproduce, etc., have a high chance of being forgotten.

Maybe it's because my description is too bad; I'm not good at using English. I don't know much about Lazarus, and I don't even know what categories to choose when I submit a report. I don't know how to modify the report; I didn't see the "Edit" button. I don't know how to bring my report up to the top of the report list. All I can do is add a note. :-[

molly

  • Hero Member
  • *****
  • Posts: 2019
Re: SelLength incorrect value for text containing characters > $FFFF
« Reply #18 on: December 06, 2017, 06:56:17 am »
I don't know how to bring my report up to the top of the report list. All I can do is add a note. :-[
And adding a note is what wp was referring to. Just add the word "ping" to that note (or preferably use the words: "Any news? Is there anything I am able to do to speed up progress?") and the report will be listed at the top of the list again  :)

tomitomy

  • Full Member
  • ***
  • Posts: 200
Re: SelLength incorrect value for text containing characters > $FFFF
« Reply #19 on: December 06, 2017, 07:06:37 am »
I don't know how to bring my report up to the top of the report list. All I can do is add a note. :-[
And adding a note is what wp was referring to. Just add the word "ping" to that note (or preferably use the words: "Any news? Is there anything I am able to do to speed up progress?") and the report will be listed at the top of the list again  :)

Thank you, molly, I'll try it.  :)

wp

  • Hero Member
  • *****
  • Posts: 3949
Re: SelLength incorrect value for text containing characters > $FFFF
« Reply #20 on: December 06, 2017, 10:23:45 am »
I'm not good at using English.
I can assure you, it is certainly not errors in English that make a poor bug report. Usually people do not provide enough information: they do not describe what happens or how a bug is triggered. Always attach a little demo which shows the bug; this is also useful later if a fix for another bug has side effects on this one. But please don't attach the current project in which you see the bug; nobody will work through a foreign program. Just put yourself in the position of the developer: what would I need to know to understand the issue?

As for the other points that you mention: Yes, I never understood why a normal user does not have permission to edit his own report. So you can only add a new comment, or add/remove yourself to/from the list of users monitoring the issue, etc., to bring a report back up to the top. Categorization is less important; developers usually fix it if it is not correct. One point though: at least the selection between FPC and Lazarus should be correct, because these are the two main projects and a developer usually does not work in both, but even this is usually fixed internally.
Lazarus trunk / fpc 3.0.4 / all 32-bit on Win-10

fedkad

  • New member
  • *
  • Posts: 38
Re: SelLength incorrect value for text containing characters > $FFFF
« Reply #21 on: December 06, 2017, 10:32:02 am »
Thank you, JuhaManninen!

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 3257
  • I like bugs.
Re: SelLength incorrect value for text containing characters > $FFFF
« Reply #22 on: December 06, 2017, 12:10:55 pm »
I don't know how to bring my report up to the top of the report list. All I can do is add a note. :-[
You can also upload a patch that fixes the issue. :)
It is the best way to draw attention. Remember, this is a FOSS project done by volunteers. There is no guarantee that a particular issue gets fixed even if you report it well.

tomitomy

  • Full Member
  • ***
  • Posts: 200
Re: SelLength incorrect value for text containing characters > $FFFF
« Reply #23 on: December 06, 2017, 12:43:32 pm »
I'm not good at using English.
I can assure you, it is certainly not errors in English that make a poor bug report. Usually people do not provide enough information: they do not describe what happens or how a bug is triggered. Always attach a little demo which shows the bug; this is also useful later if a fix for another bug has side effects on this one. But please don't attach the current project in which you see the bug; nobody will work through a foreign program. Just put yourself in the position of the developer: what would I need to know to understand the issue?

As for the other points that you mention: Yes, I never understood why a normal user does not have permission to edit his own report. So you can only add a new comment, or add/remove yourself to/from the list of users monitoring the issue, etc., to bring a report back up to the top. Categorization is less important; developers usually fix it if it is not correct. One point though: at least the selection between FPC and Lazarus should be correct, because these are the two main projects and a developer usually does not work in both, but even this is usually fixed internally.

Thank you, wp, I have added a test file to my reports.

I don't know how to bring my report up to the top of the report list. All I can do is add a note. :-[
You can also upload a patch that fixes the issue. :)
It is the best way to draw attention. Remember, this is a FOSS project done by volunteers. There is no guarantee that a particular issue gets fixed even if you report it well.

Thank you, JuhaManninen, if I find a way to fix the bug, I will upload it.  :)

ASerge

  • Sr. Member
  • ****
  • Posts: 470
Re: SelLength incorrect value for text containing characters > $FFFF
« Reply #24 on: December 06, 2017, 05:59:09 pm »
You confuse UTF16 with UCS, the latter indeed being limited to exactly two bytes.
What does "confuse UTF16 with UCS" mean? I said that the UnicodeChar size is 2 bytes. If in doubt, run Writeln(SizeOf(UnicodeChar)). These double-byte characters are used in the Windows API. But logical symbols (CodePoint), of course, can consist of more than one character.
You came up with an error that is not there and then exposed it, so why do you blame me for it?
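
For readers who want to try the check suggested above, here is a minimal sketch, assuming Free Pascal in objfpc mode. The program name is made up for illustration. It prints the size of UnicodeChar and the length of a string holding the single code point $1D6C1, built from its two UTF-16 code units so that the result does not depend on the source file encoding.

```pascal
program CodeUnitSize;
{$mode objfpc}{$H+}

var
  S: UnicodeString;
begin
  // U+1D6C1 written as an explicit UTF-16 surrogate pair.
  SetLength(S, 2);
  S[1] := WideChar($D835);
  S[2] := WideChar($DEC1);

  WriteLn('SizeOf(UnicodeChar) = ', SizeOf(UnicodeChar)); // 2
  WriteLn('Length(S)           = ', Length(S));           // 2 code units, 1 code point
end.
```

Length counts UnicodeChar elements, so one code point above $FFFF is reported as 2.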

fedkad

  • New member
  • *
  • Posts: 38
Re: SelLength incorrect value for text containing characters > $FFFF
« Reply #25 on: December 06, 2017, 06:39:43 pm »
I am sorry, ASerge and Thaddy, but I cannot understand what you are arguing about.

As far as I know, in Lazarus or Free Pascal we are using UTF-8 characters ( https://en.wikipedia.org/wiki/UTF-8 ), which means that whatever the number of bytes (i.e., 1, 2, 3, or 4) used to represent the Unicode "code point" for that character, the utf8length function should return the number of "code points" (characters). The utf8length function works correctly in this respect.

However, the problem we are discussing here is that SelLength, SelText, and SelStart work incorrectly when there are characters with Unicode value > $FFFF in the string containing or preceding the selection. And a single character such as 𝛁 (code: $1D6C1) cannot be selected properly by using the keyboard.
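
To make the mismatch concrete, here is a rough sketch, assuming Free Pascal only (no LazUTF8). CodePointCount and Utf16UnitCount are hypothetical helpers written for this illustration; they are not how utf8length or the LCL are implemented, and they assume well-formed UTF-8. The sample is 'a', the code point $1D6C1, then 'b', given as raw UTF-8 bytes.

```pascal
program Utf8Counts;
{$mode objfpc}{$H+}

const
  // 'a', U+1D6C1, 'b' as UTF-8 bytes.
  Sample: array[0..5] of Byte = ($61, $F0, $9D, $9B, $81, $62);

// Hypothetical helper: code points in well-formed UTF-8.
// Every byte that is not a continuation byte (10xxxxxx) starts one.
function CodePointCount(const Bytes: array of Byte): Integer;
var
  i: Integer;
begin
  Result := 0;
  for i := Low(Bytes) to High(Bytes) do
    if (Bytes[i] and $C0) <> $80 then
      Inc(Result);
end;

// Hypothetical helper: UTF-16 code units needed for the same text.
// A lead byte of the form 11110xxx starts a code point above $FFFF,
// which takes a surrogate pair, i.e. two code units.
function Utf16UnitCount(const Bytes: array of Byte): Integer;
var
  i: Integer;
begin
  Result := 0;
  for i := Low(Bytes) to High(Bytes) do
    if (Bytes[i] and $C0) <> $80 then
      if (Bytes[i] and $F8) = $F0 then
        Inc(Result, 2)
      else
        Inc(Result);
end;

begin
  WriteLn('code points:       ', CodePointCount(Sample)); // 3
  WriteLn('UTF-16 code units: ', Utf16UnitCount(Sample)); // 4
end.
```

The two counts differ by one for each code point above $FFFF, which would explain a selection length that is off by the number of such code points.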

ASerge

  • Sr. Member
  • ****
  • Posts: 470
Re: SelLength incorrect value for text containing characters > $FFFF
« Reply #26 on: December 06, 2017, 06:56:30 pm »
However, the problem we are discussing here is that SelLength, SelText, and SelStart work incorrectly when there are characters with Unicode value > $FFFF in the string containing or preceding the selection.
Because they rely on the Windows API, which counts not in code points but in double-byte characters. SelectAll does not access the Windows API but does the operation itself, in code points. Hence the difference.

fedkad

  • New member
  • *
  • Posts: 38
Re: SelLength incorrect value for text containing characters > $FFFF
« Reply #27 on: December 07, 2017, 09:25:10 am »
After the new release I tested the code and saw that the problem still persists in Lazarus 1.8.0.

I will also test the code in Windows 10 + Lazarus 1.8.0 (64-bit) as soon as possible.

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 3257
  • I like bugs.
Re: SelLength incorrect value for text containing characters > $FFFF
« Reply #28 on: December 07, 2017, 10:39:58 am »
The problem now is that people use different words for the same thing, and the same word (character) for different things.
ASerge used "character" for Pascal Char type.
fedkad used it for .. I am not sure what.
In Unicode a "character" can mean about 7 different things. The word must be used very carefully and its meaning should be clarified then.

As far as I know, in Lazarus or Free Pascal we are using UTF-8 characters ( https://en.wikipedia.org/wiki/UTF-8 )
Lazarus uses UTF-8 encoding in AnsiString. Don't use "characters" here, it is wrong.
FPC aims for the Delphi compatible UTF-16 strings with mode DelphiUnicode.
The default encoding of AnsiString is not Unicode at all.
Quote
... which means that whatever the number of bytes (i.e., 1, 2, 3, or 4) used to represent the Unicode "code point" for that character, the utf8length function should return the number of "code points" (characters).
"Character" should not be used in the meaning of "codepoint". Unicode is confusing enough without such extra confusion. What do you call combining codepoints then?
What does this mean: the Unicode "code point" for that character?

I think the best meanings for "character" are:
1. Pascal Char, for historical reasons. In Unicode terms it represents a codeunit and is useful in many situations.
2. User perceived character. This involves combining codepoints, glyphs, ligatures and whatever.

Because they rely on the Windows API, which counts not in code points but in double-byte characters. SelectAll does not access the Windows API but does the operation itself, in code points. Hence the difference.
OK, "double-byte character" here means a codeunit or Pascal WideChar.
The problem came with codepoints outside the Unicode BMP, which means surrogate pairs in UTF-16.
The LCL-Win32 binding code should take care of it and pass the values to WinAPI. It does not; somebody should fix it.
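
As an illustration only of the kind of conversion involved, here is a sketch assuming Free Pascal in objfpc mode. CodePointsInRange is a hypothetical helper, not LCL code and not the actual fix. It assumes well-formed UTF-16, a 1-based start index of at least 1, and a range that does not split a surrogate pair.

```pascal
program SelUnitsToCodePoints;
{$mode objfpc}{$H+}

// Hypothetical helper: number of code points covered by UnitCount
// UTF-16 code units starting at the 1-based index StartUnit.
function CodePointsInRange(const S: UnicodeString;
  StartUnit, UnitCount: Integer): Integer;
var
  i, LastUnit: Integer;
begin
  Result := 0;
  LastUnit := StartUnit + UnitCount - 1;
  if LastUnit > Length(S) then
    LastUnit := Length(S);
  for i := StartUnit to LastUnit do
    // A low (trailing) surrogate completes the pair begun by the
    // previous unit, so it does not start a new code point.
    if (Ord(S[i]) < $DC00) or (Ord(S[i]) > $DFFF) then
      Inc(Result);
end;

var
  S: UnicodeString;
begin
  // 'a', U+1D6C1 as a surrogate pair, 'b'.
  SetLength(S, 4);
  S[1] := WideChar($0061);
  S[2] := WideChar($D835);
  S[3] := WideChar($DEC1);
  S[4] := WideChar($0062);

  WriteLn(CodePointsInRange(S, 2, 2)); // 1: the pair is one code point
  WriteLn(CodePointsInRange(S, 1, 4)); // 3: the whole text
end.
```

A real binding would also have to convert in the other direction (code points to code units) before passing a selection to the widget, and would need to decide what to do with unpaired surrogates.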

Munair

  • Sr. Member
  • ****
  • Posts: 299
  • Keep it simple.
    • Ditrianum
Re: SelLength incorrect value for text containing characters > $FFFF
« Reply #29 on: December 08, 2017, 10:01:12 am »
The default encoding of AnsiString is not Unicode at all.
Indeed. Unicode has been a problem with Windows from the beginning. UTF-8 is still not supported by the Win API. Why Microsoft chose UTF-16 over UTF-8 (apart from the fact that it comes closest to UCS2) is still beyond me. And don't come with the crap that UTF-8 wasn't around at the time. Microsoft simply made a choice, and an expensive one if you ask me.

Regarding UTF-16, we can read: "UTF-16 (16-bit Unicode Transformation Format) is a character encoding capable of encoding all 1,112,064 valid code points of Unicode. The encoding is variable-length, as code points are encoded with one or two 16-bit code units." That's two or four bytes.
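
The "two 16-bit code units" case can be worked through for the code point from this thread. This is a small sketch assuming Free Pascal in objfpc mode. ToSurrogates is a name invented here, and it only handles code points in the range $10000..$10FFFF.

```pascal
program SurrogateSplit;
{$mode objfpc}{$H+}

// Splits a code point in $10000..$10FFFF into its UTF-16 surrogate pair.
procedure ToSurrogates(CodePoint: Cardinal; out HighUnit, LowUnit: Word);
var
  V: Cardinal;
begin
  V := CodePoint - $10000;                // 20 significant bits remain
  HighUnit := Word($D800 + (V shr 10));   // top 10 bits
  LowUnit  := Word($DC00 + (V and $3FF)); // bottom 10 bits
end;

var
  H, L: Word;
begin
  ToSurrogates($1D6C1, H, L);
  WriteLn(HexStr(H, 4), ' ', HexStr(L, 4)); // D835 DEC1
end.
```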

UTF-8 has two main advantages: it is ASCII compatible and it eliminates the endianness problem.
« Last Edit: December 08, 2017, 10:42:03 am by Munair »
Lazarus 1.6.2, testing 1.8, FPC 3.0.0; Debian 9 KDE 5.8.6 x64; Windows 7 x64, PC-DOS2000

 
