2. Lazarus saves the information about the chosen charsets (per unit) in project1.lpi and/or project1.lps.
FPC picks those files up implicitly (or is explicitly told to compile one of those files instead of project1.lpr) and learns the encoding from them.
3. Both Lazarus and FPC extract charset-related heuristics into a separate package, and both use it.
But that is fragile.
I saved an LCL unit in the 1251 charset and compiled it - and FPC screwed things up even worse. Notice how it still "converts" by just adding a zero byte.
Basically, the FPC conversion de facto becomes a loop of "read byte, write word".
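
In pseudo-code, the observed symptom amounts to something like this (a sketch of the behaviour, not of FPC's actual source; CP1251ToUnicode is a hypothetical lookup table):

// What the broken path effectively does: each source byte is
// zero-extended into a UTF-16 code unit, so CP1251 'А' ($C0)
// comes out as U+00C0 ('À') instead of U+0410.
procedure NaiveWiden(const Src: array of Byte; var Dst: array of Word);
var
  i: Integer;
begin
  for i := 0 to High(Src) do      // Dst must be at least as long as Src
    Dst[i] := Src[i];             // $C0 -> $00C0, not $0410
end;

// A codepage-aware conversion would instead map each byte through
// the real 1251 -> Unicode table:
//   Dst[i] := CP1251ToUnicode[Src[i]];  // $C0 -> $0410 ('А')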
The Delphi EXE shows the AnsiString letters in the $C0-$DF byte range, and the UTF-16 string in the $0410-$042F word range.
FPC instead, when fed the same ANSI (1251) encoded sources, keeps the MBCS stream the same, but saves the UTF-16 stream in a broken $00C0-$00DF word range.
It seems to me FPC assumes that a "non-marked" source file is Latin-1 or Windows-1250 or something. If FPC queried the platform's charset instead, it would perhaps get this right.
I speculate (though it is not hard to check) that when loading sources without a BOM, Lazarus first tries to decode from UTF-8, and if that fails, it assumes the platform's current non-Unicode charset - just like TStringList.LoadFromFile and TIniFile do.
This heuristic seems reasonable, and perhaps FPC could adopt it; I think it lives in the FCL, which is shared by both FPC and Lazarus.
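
A minimal sketch of that fallback, with names invented by me (IsValidUTF8 is the kind of check sketched later in this post; DefaultSystemCodePage and CP_UTF8 are real FPC RTL identifiers):

uses SysUtils;  // for TBytes

function GuessSourceEncoding(const Bytes: TBytes): TSystemCodePage;
begin
  if IsValidUTF8(Bytes) then
    Result := CP_UTF8
  else
    Result := DefaultSystemCodePage;  // e.g. 1251 on a Russian Windows
end;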
But a better choice, I believe, would be for Lazarus to explicitly store the charset info, and for FPC to use it.
Unless that would lead to problems with Git or SVN.
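
For example, the per-unit entries in project1.lpi could grow a charset element. The <SourceCharset> name below is pure invention; nothing like it exists today:

<Unit3>
  <Filename Value="unit1.pas"/>
  <IsPartOfProject Value="True"/>
  <SourceCharset Value="CP1251"/>  <!-- hypothetical future hook -->
</Unit3>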
------
Is it reliable? Can Lazarus be tricked into mis-detecting the charset? Probably it can - but only in some very fringe cases that no one cares about.
Windows extended-ASCII letters start at $C0 and go up to $FF.
In UTF-8 those letters map to two-byte sequences like $D0 $A0 or $D1 $80, where the second byte must be in the $80-$BF continuation range - a $Cx byte can never be a "continuation byte", so a run of 1251 letters is a decoding error.
So, while telling different MBCS codepages from one another is really problematic, telling UTF-8 from non-UTF-8 is not hard.
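
A sketch of such a check, assuming nothing beyond the UTF-8 byte layout (it verifies lead/continuation structure only and skips the finer overlong/surrogate rules; TBytes comes from SysUtils):

// A CP1251 text with two consecutive letters (e.g. $C0 $C1) fails here,
// because the byte after a lead byte must be in $80..$BF.
function IsValidUTF8(const Bytes: TBytes): Boolean;
var
  i, Tail: Integer;
  b: Byte;
begin
  Result := False;
  i := 0;
  while i <= High(Bytes) do
  begin
    b := Bytes[i];
    if b <= $7F then
      Tail := 0                        // plain ASCII
    else if (b >= $C2) and (b <= $DF) then
      Tail := 1
    else if (b >= $E0) and (b <= $EF) then
      Tail := 2
    else if (b >= $F0) and (b <= $F4) then
      Tail := 3
    else
      Exit;                            // $80..$C1, $F5..$FF: never a lead
    while Tail > 0 do
    begin
      Inc(i);
      if (i > High(Bytes)) or (Bytes[i] < $80) or (Bytes[i] > $BF) then
        Exit;                          // truncated or bad continuation
      Dec(Tail);
    end;
    Inc(i);
  end;
  Result := True;
end;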
But then, what is the platform-default non-Unicode charset? I remember from pre-Unicode Linux that it was not all that easy - there were even special libraries back then to facilitate it :-)
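
For reference, "asking the platform" today is a one-liner on Windows and RTL-mediated elsewhere (GetACP and DefaultSystemCodePage are real identifiers; the Unix comment simplifies the history):

uses {$ifdef windows}Windows, {$endif}SysUtils;

function PlatformAnsiCodepage: TSystemCodePage;
begin
  {$ifdef windows}
  Result := GetACP;                    // e.g. 1251 on a Russian Windows
  {$else}
  // On Unix the FPC RTL derives this from LANG/LC_CTYPE at startup;
  // historically that was the job of nl_langinfo(CODESET) and friends.
  Result := DefaultSystemCodePage;
  {$endif}
end;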
So, if FPC and Lazarus do not want to come together and make an explicit specification of charsets in the project files, I can suggest the following heuristics, for every source file:
1.1. Check the BOM; if one is there, assume an "authoritative" charset.
1.2. Try to transcode from UTF-8 (or UTF-7) to UTF-16 or anything else. If that happened without errors, assume UTF-8 (or UTF-7) as an "authoritative" charset.
1.3. Ask the platform whether there is a default non-Unicode charset. If there is, assume it as an "unreliable" (speculative) charset.
The above are executed short-circuit, "first true exits" (sketched below).
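
In code, step 1 might look like this (TCharsetGuess, the trust levels and HasBOM are invented for illustration; IsValidUTF8 is the sketch from above, CP_NONE and CP_UTF8 are real RTL constants):

type
  TTrust = (tNone, tSpeculative, tAuthoritative);

  TCharsetGuess = record
    Codepage: TSystemCodePage;
    Trust: TTrust;
  end;

// 1.1 - 1.3 as a short-circuit cascade: the first hit wins.
// HasBOM is a hypothetical helper that recognizes UTF-8/16/32 BOMs.
function DetectFromContent(const Bytes: TBytes): TCharsetGuess;
begin
  Result.Trust := tNone;
  Result.Codepage := CP_NONE;
  if HasBOM(Bytes, Result.Codepage) then           // 1.1
    Result.Trust := tAuthoritative
  else if IsValidUTF8(Bytes) then                  // 1.2
  begin
    Result.Codepage := CP_UTF8;
    Result.Trust := tAuthoritative;
  end
  else if DefaultSystemCodePage <> CP_NONE then    // 1.3
  begin
    Result.Codepage := DefaultSystemCodePage;
    Result.Trust := tSpeculative;                  // "unreliable"
  end;
end;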
2.0. Parse the .LPI and .LPS files for charset information. Currently there is nothing there, so this is a placeholder for a future hook.
2.1. Look for {$codepage } or {$modeswitch systemcodepage} or other related pragmas. If found, assume one more "authoritative" charset.
2.2. Check the FPC command line for charset-related options (-Fc). It is a bad source, because it cannot specify different per-file charsets, only one global switch. If found, assume an "unreliable" (speculative) charset.
The above are all evaluated; none short-circuits (sketched below).
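
A sketch of step 2 under the same invented types (FindCodepageDirective and GetCmdLineCodepage are hypothetical helpers):

type
  TGuessArray = array of TCharsetGuess;

function GatherDeclaredGuesses(const Source: string): TGuessArray;
var
  cp: TSystemCodePage;

  procedure Add(ATrust: TTrust);
  begin
    SetLength(Result, Length(Result) + 1);
    Result[High(Result)].Codepage := cp;
    Result[High(Result)].Trust := ATrust;
  end;

begin
  Result := nil;
  // 2.0: the .lpi/.lps lookup would go here once such a hook exists.
  // 2.1: a {$codepage xxx} (or related) directive is authoritative.
  if FindCodepageDirective(Source, cp) then
    Add(tAuthoritative);
  // 2.2: the global -Fc command-line codepage cannot vary per file,
  // so treat it as speculative only.
  if GetCmdLineCodepage(cp) then
    Add(tSpeculative);
end;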
3.1. If all the detected charsets match: great, happy compiling.
3.2. If all the "authoritative" charsets (one or more) match but the "speculative" ones do not: emit a low-priority warning and compile the unit with the "authoritative" one.
3.3. If no "authoritative" charset was found but the "speculative" ones agree: emit an Information or a low-priority Warning and compile.
3.4. If no charset was found at all: emit an error and stop, but make this error overridable in fpc.cfg. Or the opposite: make it a high-profile warning that can be uplifted to an error in the cfg.
3.5. If no "authoritative" charset was found and the "speculative" ones disagree: emit an error and stop.
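
And the decision matrix over the combined guesses, again as a sketch (Authoritative, Speculative, AllAgree, Warn, Info and Err are hypothetical helpers; only the branching matters):

// Returns True if the unit may be compiled with codepage cp.
function DecideCodepage(const Guesses: TGuessArray;
  out cp: TSystemCodePage): Boolean;
begin
  Result := False;
  if Length(Guesses) = 0 then
    Err('no charset info at all; overridable in fpc.cfg')          // 3.4
  else if Length(Authoritative(Guesses)) > 0 then
  begin
    if AllAgree(Authoritative(Guesses), cp) then
    begin
      Result := True;                  // 3.1 if everything matches
      if not AllAgree(Guesses, cp) then
        Warn('speculative guesses differ; using declared charset'); // 3.2
    end
    else
      Err('authoritative charsets conflict');  // a case 3.x leaves open
  end
  else if AllAgree(Speculative(Guesses), cp) then
  begin
    Info('charset was only guessed; consider {$codepage}');        // 3.3
    Result := True;
  end
  else
    Err('speculative charset guesses disagree');                   // 3.5
end;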