Here are a few simplified generalities about Unicode and Free Pascal/Lazarus (sorry if you already know them).
First of all, I don't know exactly what you mean by "Unicode", but Unicode should be considered as a standard with several specifications (see http://en.wikipedia.org/wiki/Unicode). Basically, it means that Unicode defines several character encodings, and that all of them belong to the Unicode standard.
So, when you say "Unicode", it may refer to different types of character encoding.
The two main Unicode character encodings relevant to Free Pascal are UTF-8 and UTF-16:
- UTF-8: each "character" is coded using 1 to 4 bytes. For compatibility with ASCII, all "character codes" up to 127 are encoded using 1 byte and match the ASCII table: it means that for all these "characters", ASCII = UTF-8.
- UTF-16: each "character" is coded using one or, where needed, two pairs of bytes (i.e. 2 or 4 bytes).
Note: "character" is an improper term when dealing with Unicode specifications; "code point" should be used instead. I've used "character" for a better comparison purpose, as you are familiar with ASCII/ANSI.
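To make the byte counts above concrete, here is a minimal console sketch. The helpers Utf8SeqLen and CountUnits are hypothetical names of mine, not RTL or LCL functions. It assumes the input is well-formed UTF-8, and the test string is written as raw bytes so that the result does not depend on the encoding of the source file.

```pascal
program Utf8Sizes;
{$mode objfpc}{$H+}

// Hypothetical helper: byte length of the UTF-8 sequence whose first byte is Lead.
function Utf8SeqLen(Lead: Byte): Integer;
begin
  if Lead < $80 then
    Result := 1                 // 0xxxxxxx: identical to ASCII
  else if (Lead and $E0) = $C0 then
    Result := 2                 // 110xxxxx
  else if (Lead and $F0) = $E0 then
    Result := 3                 // 1110xxxx
  else if (Lead and $F8) = $F0 then
    Result := 4                 // 11110xxx
  else
    Result := 1;                // stray continuation or invalid byte: step over it
end;

// Hypothetical helper: count the code points of a UTF-8 string and the number
// of 16-bit units the same text would need in UTF-16.
procedure CountUnits(const S: AnsiString; out CodePoints, Utf16Units: Integer);
var
  I, N: Integer;
begin
  CodePoints := 0;
  Utf16Units := 0;
  I := 1;
  while I <= Length(S) do
  begin
    N := Utf8SeqLen(Ord(S[I]));
    Inc(CodePoints);
    if N = 4 then
      Inc(Utf16Units, 2)        // code point above U+FFFF: two 16-bit units
    else
      Inc(Utf16Units, 1);
    Inc(I, N);
  end;
end;

var
  S: AnsiString;
  CP, U16: Integer;
begin
  // 'a', U+00E9, U+20AC and U+1F600, given as raw UTF-8 bytes
  S := 'a'#$C3#$A9#$E2#$82#$AC#$F0#$9F#$98#$80;
  CountUnits(S, CP, U16);
  WriteLn('UTF-8 bytes:       ', Length(S));
  WriteLn('Code points:       ', CP);
  WriteLn('UTF-16 code units: ', U16);
end.
```

For this string the program should report 10 bytes, 4 code points and 5 UTF-16 code units: the four code points take 1, 2, 3 and 4 bytes in UTF-8, and only the last one needs two 16-bit units in UTF-16.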
Back to Free Pascal/Lazarus:
. Lazarus and the LCL use UTF-8 as a standard "everywhere" by default: strings, source code, form source, text control properties, etc. In the LCL, by default the 'String' type is identical to the 'UTF8String' type.
. Free Pascal: up to the (current) 2.6.4 version, 'String' means 'AnsiString'. In practice, it means that the RTL functions use ANSI strings by default, and that you may have to convert your strings when calling them inside Lazarus (i.e. UTF-8->ANSI, ANSI->UTF-8). Free Pascal versions after 2.6.4 (i.e. 2.7.x/2.8.x/3.0) offer better Unicode support.
. Windows: the Windows API uses ANSI (system code page, in fact) or WideString types. The current Free Pascal version uses the ANSI API version, while the LCL is able to use both the ANSI and the WideString versions. The LCL internally does the UTF-8<-->ANSI or UTF-8<-->WideString conversion when dealing with the Windows API. For simplification, the WideString type may be treated as the UTF-16 type, though technically it's not exactly true, due to differences in surrogate processing (as far as I've understood the trick).
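As a rough illustration of the UTF-8<-->ANSI conversions mentioned above, here is a minimal sketch for the FPC 2.6.4 situation, where RTL routines expect ANSI strings. FileExistsFromUTF8 and CurrentDirAsUTF8 are hypothetical wrapper names of mine, and 'readme.txt' is just a placeholder file name. UTF8ToAnsi, AnsiToUTF8, FileExists and GetCurrentDir are the standard RTL routines.

```pascal
program AnsiBridge;
{$mode objfpc}{$H+}
uses
  SysUtils;

// Hypothetical wrapper: pass a UTF-8 file name to an ANSI-based RTL routine.
function FileExistsFromUTF8(const Utf8Name: AnsiString): Boolean;
begin
  Result := FileExists(UTF8ToAnsi(Utf8Name));
end;

// Hypothetical wrapper: bring an ANSI result of the RTL back to UTF-8,
// e.g. before showing it in an LCL control.
function CurrentDirAsUTF8: AnsiString;
begin
  Result := AnsiToUTF8(GetCurrentDir);
end;

begin
  // 'readme.txt' is a placeholder name (assumption)
  WriteLn('Exists: ', FileExistsFromUTF8('readme.txt'));
  WriteLn('Current dir (UTF-8): ', CurrentDirAsUTF8);
end.
```

Keep in mind that the UTF-8->ANSI direction is lossy: any character missing from the system code page cannot survive the conversion. Lazarus also ships UTF-8 variants of several such routines (FileExistsUTF8, for instance), which avoid writing wrappers like these by hand.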
For more information:
- http://wiki.freepascal.org/Character_and_string_types
- http://wiki.freepascal.org/UTF8_strings_and_characters
- http://wiki.freepascal.org/FPC_Unicode_support
- http://wiki.freepascal.org/LCL_Unicode_Support
Your case:
Sorry, I can't answer your questions directly (except for the last one = 4: it would be a very bad idea).
As far as I understand, your problem comes from the fact that you use "non UTF-8 aware" functions like "Length" with UTF-8 strings. If you have a look at the 2nd of my links (http://wiki.freepascal.org/UTF8_strings_and_characters), you'll see that you may or may not have to use specific UTF-8 functions (or conversions), depending on what you are doing.
The most common mistake with UTF-8 strings is to assume that characters are always 1 byte long. For instance, "Pos(searchcharacter, wholestring) + 1" won't necessarily give you the position of the next character after searchcharacter in wholestring.
And don't mix "UTF-8 aware" functions/properties (like the "SelStart" property) with "non UTF-8 aware" functions (like "Pos" or "Length") without any precautions.
I guess that wisely using the UTF8 versions of the Free Pascal functions you are using (like UTF8Pos, UTF8Length, ...) might be a solution for you. Only where it's really necessary, of course: you don't have to modify all your code everywhere.
For instance, with a string containing the non-ASCII characters "é" and "à" (the project has a form with a pushbutton and a memo control):
uses
  ..., LazUTF8;

procedure TForm1.Button1Click(Sender: TObject);
var
  Stru: String; // = UTF8String
  Stra: AnsiString;
  Strw: WideString;
  //
  OutRes: String;
begin
  Stru := '1234567890éabcdeàfghij';
  Stra := UTF8ToAnsi(Stru);
  Strw := UTF8Decode(Stru);
  //
  OutRes := '';
  OutRes := OutRes + 'Length Stru=' + IntToStr(Length(Stru)) + ' UTF8Length Stru=' + IntToStr(UTF8Length(Stru)) + sLineBreak;
  OutRes := OutRes + 'Length Stra=' + IntToStr(Length(Stra)) + sLineBreak;
  OutRes := OutRes + 'Length Strw=' + IntToStr(Length(Strw)) + sLineBreak + sLineBreak;
  //
  OutRes := OutRes + 'Pos é (non UTF8)=' + IntToStr(Pos('é', Stru)) + sLineBreak;
  OutRes := OutRes + 'Pos é (UTF8)=' + IntToStr(UTF8Pos('é', Stru)) + sLineBreak + sLineBreak;
  OutRes := OutRes + 'Pos à (non UTF8)=' + IntToStr(Pos('à', Stru)) + sLineBreak;
  OutRes := OutRes + 'Pos à (UTF8)=' + IntToStr(UTF8Pos('à', Stru)) + sLineBreak + sLineBreak;
  //
  OutRes := OutRes + 'Extract (non UTF8)=' + Copy(Stru, Pos('é', Stru) + 1, 4) + sLineBreak;
  OutRes := OutRes + 'Extract (non UTF8 OK)=' + Copy(Stru, Pos('é', Stru) + Length('é'), 4) + sLineBreak;
  OutRes := OutRes + 'Extract (UTF8)=' + UTF8Copy(Stru, UTF8Pos('é', Stru) + 1, 4) + sLineBreak + sLineBreak;
  //
  Memo1.Text := OutRes;
end;
The result is:
Length Stru=24 UTF8Length Stru=22
Length Stra=22
Length Strw=22
Pos é (non UTF8)=11
Pos é (UTF8)=11
Pos à (non UTF8)=18
Pos à (UTF8)=17
Extract (non UTF8)=?abc
Extract (non UTF8 OK)=abcd
Extract (UTF8)=abcd
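Note the incorrect result for 'Extract (non UTF8)', and the differences between the Length/UTF8Length and Pos/UTF8Pos results: "é" and "à" each take 2 bytes in UTF-8, so the byte-based functions count 24 bytes and find "à" at byte 18, while the UTF-8 aware ones count 22 characters and find "à" at character 17.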