Let's continue looking at how to mess with the text entry components (TEdit, TMemo, TRichMemo if you have it, etc), specifically at the cursors of the components. We'll take TEdit as an example and look at the extra complexities of the memos later.
The TEdit component has a cursor position given by the property SelStart. (The length of the highlighted text, if any, is then recorded by SelLength.) You would think that SelStart would be the number of characters between the start of the line and the cursor, and that SelLength would likewise be a number of characters. You would be wrong. Let's look at why.
Originally there was ASCII which encoded characters using the low seven bits of an eight-bit byte — the numbers from 0 to 127. This situation couldn't last.
FPC uses the UTF8 encoding. This represents the ASCII characters in one byte each, as before, and further characters using two, three, or four bytes. You don't need to know the exact details.
Microsoft went a different way, and since FPC mimics Delphi, which was written for Windows, elements of Microsoft's preferred system still need to be taken into account when writing FPC. And they used the encoding UTF16. This represents every character as either one or two sixteen-bit "words".
So: the cursor position SelStart of a TEdit is given by the number of sixteen-bit words it would take to UTF16-encode the characters between the cursor and the start of the line (and SelLength is given in the same metric). Meanwhile when we ask for the actual Text field of the TEdit we're given a UTF8 string. And the original Pascal function Length, applied to a UTF8-encoded string, returns the number of bytes it takes up. (If it happens to also be ASCII, this will also be the number of characters, otherwise not.)
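To make the three metrics concrete, here is a minimal console sketch that measures one string in bytes, in characters, and in sixteen-bit words. It assumes the LazUTF8 unit that ships with Lazarus (in the LazUtils package) is on the compiler's search path. The test string is spelled out as raw UTF8 bytes so the result doesn't depend on how the source file is saved, and the smiley is just an illustrative character that needs two words in UTF16.

```pascal
program ThreeLengths;
{$mode objfpc}{$H+}
uses
  LazUTF8; // for UTF8Length and UTF8ToUTF16

var
  s: string;
begin
  // 'a' followed by U+1F600 (a smiley), written as its four UTF8 bytes.
  s := 'a'#$F0#$9F#$98#$80;
  WriteLn(Length(s));              // bytes of UTF8: 5
  WriteLn(UTF8Length(s));          // characters: 2
  WriteLn(Length(UTF8ToUTF16(s))); // sixteen-bit words of UTF16: 3
end.
```

Three different answers for the same two-character string, and the last one is the metric SelStart uses.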
To help us deal with this situation we have the Pascal libraries LazUTF8 and LazUTF16. These contain the functions utf8length and utf16length. The naming of these functions is sheer madness and confused the heck out of me. Now it's your turn.
utf8length takes a string encoded in utf8 and says how many characters are represented by the encoding.
utf16length takes a string encoded in utf8 and tells you how many sixteen-bit words it would take to represent it if it was encoded in utf16 instead.
This means that if we want to put the cursor at a given position in our (UTF8-encoded) Text string, it's quite easy. Suppose we want to stick it after the 远 in 望远镜座. Then we can set the cursor position to be utf16length('望远').
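As a sketch of that calculation, the console program below computes the value you would assign to SelStart. WordsBefore is a hypothetical helper name, not part of any library. It uses Length(UTF8ToUTF16(...)) because that expression counts sixteen-bit words by construction; it is worth checking that utf16length agrees with it in your version of LazUtils for characters that need two words. The prefix is spelled out as raw UTF8 bytes so the result doesn't depend on the source file's encoding.

```pascal
program CursorFromPrefix;
{$mode objfpc}{$H+}
uses
  LazUTF8; // for UTF8ToUTF16

// Hypothetical helper: the SelStart value that places the cursor
// immediately after Prefix, measured in sixteen-bit UTF16 words.
function WordsBefore(const Prefix: string): Integer;
begin
  Result := Length(UTF8ToUTF16(Prefix));
end;

begin
  // The two characters 望远, three UTF8 bytes each.
  WriteLn(WordsBefore(#$E6#$9C#$9B#$E8#$BF#$9C)); // 2
  // In a form you would write something like:
  //   AEdit.SelStart := WordsBefore(ThePrefix);
end.
```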
Going the other way takes a little more work — or at least if there's an easier way than what follows I haven't found it. Maybe it's in a library somewhere.
The two-word UTF16 encodings use exclusively the hexadecimal ranges $D800 - $DBFF for the first of the two words and $DC00 - $DFFF for the second. All we have to do is go through the string one 16-bit word at a time looking for stuff like that, and count each such pair of words as one character. The following function does just that, converting a position in the string in the UTF16 metric to the number of characters to the left of that position.
// Requires LazUTF8 in the uses clause (for utf8toutf16).
function utf8cur(x: integer; s: string): integer;
var
  WS: UnicodeString;
  i, j: integer;
begin
  WS := utf8toutf16(s); // We convert the string to utf16
  // As FPC knows a UnicodeString has word elements, length(WS) is the number of words, not bytes.
  if x > length(WS) then
    x := length(WS); // A position past the end of the text counts the whole string.
  j := 0;
  for i := 1 to x do
    // So we count every word except the first word of each two-word pair.
    if not ((ord(WS[i]) >= $D800) and (ord(WS[i]) < $DC00)) then
      j := j + 1;
  utf8cur := j;
end;
So if we have a TEdit called AEdit, for example, then the substring of characters to the left of the cursor is given by utf8copy(AEdit.Text, 1, utf8cur(AEdit.SelStart, AEdit.Text))
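One possible shortcut, offered as a sketch rather than as the established way: since SelStart counts UTF16 words, you can cut the UTF16 form of the text at that word and convert the piece back to UTF8, which avoids counting characters at all. LeftOfCursor is a hypothetical name; utf8toutf16 and utf16toutf8 come from LazUTF8. The sketch assumes the position never falls between the two words of a pair.

```pascal
program LeftOfCursorDemo;
{$mode objfpc}{$H+}
uses
  LazUTF8; // for UTF8ToUTF16, UTF16ToUTF8 and UTF8Length

// Hypothetical helper: the UTF8 text to the left of a cursor position
// given in sixteen-bit UTF16 words (the metric SelStart uses).
function LeftOfCursor(const s: string; SelStart: Integer): string;
begin
  Result := UTF16ToUTF8(Copy(UTF8ToUTF16(s), 1, SelStart));
end;

var
  s, head: string;
begin
  // 'a', then U+1F600 as its four UTF8 bytes, then 'b'.
  s := 'a'#$F0#$9F#$98#$80'b';
  head := LeftOfCursor(s, 3);  // cursor sits between the smiley and 'b'
  WriteLn(Length(head));       // 5 bytes: 'a' plus the four-byte smiley
  WriteLn(UTF8Length(head));   // 2 characters
end.
```

With a TEdit called AEdit, the call would be LeftOfCursor(AEdit.Text, AEdit.SelStart).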
That was a lot of explanation for a few lines of code, but you do understand it now. This is almost everything you need to know to get your own code to work with cursors, except we also need to talk about memos ...