While testing wp's program, I found that the character set is not converted correctly, so I dug into the code a little further.
I think the conversion from the ANSI code page to UTF-8 is done character by character (that is, byte by byte), judging from the code below, and I'm afraid this causes problems for multi-byte characters such as Korean. CP949 encodes a Korean character in two bytes, whereas UTF-8 mostly needs three. I think converting the whole chunk of text at once, rather than char by char, may solve the problem. Am I wrong?
procedure TRtf2HtmlConverter.DoCharSet;
begin
  case FParser.rtfMinor of
    rtfMacCharSet:
      FCodePage := encodingCPMac;
    rtfAnsiCharSet:
      // FCodePage := encodingCP1252;
      FCodePage := encodingANSI; // <== changed here
    rtfPCCharSet:
      FCodePage := encodingCP437;
    rtfPcaCharSet:
      FCodePage := encodingCP850;
    otherwise
      FCodePage := encodingUTF8; // Changed here too, but this does not have any effect.
      // FCodePage := encodingCP949;
      // there is also a \ansicpgN, but it is not supported by RTFPars
  end;
end;
Our Windows system uses CP949, not CP1252. That should not matter as long as I use encodingANSI, because the system will interpret it as CP949.
With this change, English characters are displayed correctly, but Korean characters are not: they disappear. Setting FCodePage := encodingCP949 explicitly doesn't work either.
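To rule out a wrong default, a quick check of what the "ANSI" encoding resolves to on a given machine might look like the sketch below. It assumes that encodingANSI is mapped to whatever GetDefaultTextEncoding (LConvEncoding unit, LazUtils package) reports; I have not verified that inside the converter itself.

```pascal
program ShowAnsiEncoding;
{$mode objfpc}{$H+}

uses
  LConvEncoding; // part of the LazUtils package

begin
  // Prints the name of the code page LConvEncoding regards as the
  // system default. On a Korean Windows installation I would expect
  // this to name CP949, but that is an assumption to be checked.
  WriteLn('Default text encoding: ', GetDefaultTextEncoding);
end.
```

If this reports the expected code page, the code page selection is probably fine and the problem lies in how the text is fed to the converter.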
So, looking for a possible reason, I found the following.
procedure TRtf2HtmlConverter.DoText;
var
  c: char;
  s: string;
begin
  if (FCurrText = '') then
  begin
    // This is the very first character -- we must write the HTML header.
    if FOutput.Count = 0 then
      WriteHeader;
    if FSpanOpen then
      CloseSpan;
  end;
  c := chr(FParser.RTFMajor);
  if c = #0 then // last character
  begin
    WriteFooter;
    exit;
  end;
  if c > #127 then
  begin
    s := ConvertEncoding(c, FCodePage, encodingUTF8); // <== I think this matters.
    FCurrText := FCurrText + s;
  end
  else
    FCurrText := FCurrText + c;
  if FParDelayed then
  begin
    WritePar(FParAttrib);
    FParDelayed := False;
    FParOpen := True;
  end;
  if FSpanDelayed then
  begin
    WriteSpan(FCurrCharAttrib);
    FSpanDelayed := False;
    FSpanOpen := True;
  end;
end;
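To illustrate the concern with DoText, here is a minimal stand-alone sketch that contrasts the two strategies on a single Korean syllable. It is not the converter's code: the program name, the ToHex helper and the test bytes are only for illustration, and it assumes the LConvEncoding unit from LazUtils is available.

```pascal
program ChunkVsByte;
{$mode objfpc}{$H+}

uses
  SysUtils, LConvEncoding; // LConvEncoding is part of the LazUtils package

// Hypothetical helper, for display only: hex dump of a byte string.
function ToHex(const s: string): string;
var
  i: Integer;
begin
  Result := '';
  for i := 1 to Length(s) do
    Result := Result + IntToHex(Ord(s[i]), 2) + ' ';
end;

const
  // One Korean syllable in CP949: lead byte $C7, trail byte $D1.
  Raw: string = #$C7#$D1;
var
  i: Integer;
  perByte, buffered: string;
begin
  // Strategy 1 (what DoText appears to do): convert each byte in isolation.
  perByte := '';
  for i := 1 to Length(Raw) do
    perByte := perByte + ConvertEncoding(Raw[i], EncodingCP949, EncodingUTF8);

  // Strategy 2 (the suggestion): collect the raw bytes first,
  // then convert the whole run in a single call.
  buffered := '';
  for i := 1 to Length(Raw) do
    buffered := buffered + Raw[i];
  buffered := ConvertEncoding(buffered, EncodingCP949, EncodingUTF8);

  WriteLn('per byte: ', ToHex(perByte));
  WriteLn('buffered: ', ToHex(buffered));
end.
```

The UTF-8 form of this syllable is the three bytes ED 95 9C, which is what the buffered line should show if the conversion works. What the per-byte line shows depends on how the converter treats a lone lead byte, and if it drops it, that would match the disappearing characters I see.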
I'm not sure whether I may modify this unit myself (unless that is prohibited), but please at least comment on the cause I suggested.
Another approach that works is the following.
First, I force UTF-8 from the beginning.
procedure TRtf2HtmlConverter.DoCharSet;
begin
  case FParser.rtfMinor of
    rtfAnsiCharSet:
      FCodePage := encodingUTF8; // <== set utf8 here
    ....
  end; // case
end; // procedure
The generated HTML file still does not display correctly, but now it does contain the Korean characters. The content looks correct in Notepad, but web browsers show broken characters.
So I open the HTML file in Notepad and save it again with the encoding set to UTF-8 instead of ANSI. After that, web browsers show the content correctly.
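The manual Notepad step could probably be replaced by a small program like the sketch below. It assumes that the generated file really contains raw CP949 bytes (which is what the Notepad behaviour suggests, but I have not confirmed it) and that CP949ToUTF8 from LConvEncoding is available. The program name and the ReadRawFile / WriteRawFile helpers are hypothetical.

```pascal
program Cp949FileToUtf8;
{$mode objfpc}{$H+}

uses
  Classes, SysUtils, LConvEncoding; // LConvEncoding is part of LazUtils

// Reads the file as raw bytes, without any encoding interpretation.
function ReadRawFile(const AFileName: string): string;
var
  fs: TFileStream;
begin
  Result := '';
  fs := TFileStream.Create(AFileName, fmOpenRead or fmShareDenyWrite);
  try
    SetLength(Result, fs.Size);
    if Length(Result) > 0 then
      fs.ReadBuffer(Result[1], Length(Result));
  finally
    fs.Free;
  end;
end;

// Writes the bytes of AData unchanged (no byte order mark is added).
procedure WriteRawFile(const AFileName, AData: string);
var
  fs: TFileStream;
begin
  fs := TFileStream.Create(AFileName, fmCreate);
  try
    if Length(AData) > 0 then
      fs.WriteBuffer(AData[1], Length(AData));
  finally
    fs.Free;
  end;
end;

begin
  if ParamCount <> 2 then
  begin
    WriteLn('Usage: Cp949FileToUtf8 <input.html> <output.html>');
    Halt(1);
  end;
  // Convert the whole file content in one call, not byte by byte.
  WriteRawFile(ParamStr(2), CP949ToUTF8(ReadRawFile(ParamStr(1))));
end.
```

This is only a workaround after the fact. If whole-chunk conversion fixes the display, doing the same conversion inside the converter before the text is written would be the cleaner solution.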