Lazarus

Title: TStringStream encoding
Post by: BubikolRamios on June 05, 2019, 02:07:30 am
Code: Pascal
  var
    globalDoc: thtmldocument;
    AStrStr: TStringStream;
  begin
    AStrStr := TStringStream.Create('');

    // thtmldocument to string
    htmwrite.WriteHTML(globalDoc.documentElement,AStrStr);
    globalDocHTML := AStrStr.DataString;

    //does not work
    //checked by viewing saved html file in browser.
    // I guess something is to change at previous 2 lines ?
    convertEncoding(globalDocHTML, guessEncoding(globalDocHTML), encodingUTF8);

Title: Re: TStringStream encoding
Post by: engkin on June 05, 2019, 02:14:59 am
Is your signature correct?

Are you still using Lazarus-1.6.0/FPC-3.0.0?
Title: Re: TStringStream encoding
Post by: BubikolRamios on June 05, 2019, 07:14:32 am
Upgraded to latest version, updated profile,  same thing.

BTW: a string obtained this way converts to UTF-8 just fine:
Code: Pascal
  readhtmlfile(globalDoc,HTTPSender.Document);
  //convert to string
  SetString(globalDocHTML, PAnsiChar(HTTPSender.Document.Memory), HTTPSender.Document.Size);

But that is of no use to me: I want to modify the thtmldocument (globalDoc) first and then convert it to a string.
Title: Re: TStringStream encoding
Post by: wp on June 05, 2019, 10:45:39 am
Code: Pascal
  var
    globalDoc: thtmldocument;
    AStrStr: TStringStream;
  begin
    AStrStr := TStringStream.Create('');

    // thtmldocument to string
    htmwrite.WriteHTML(globalDoc.documentElement,AStrStr);
    globalDocHTML := AStrStr.DataString;

    //does not work
    //checked by viewing saved html file in browser.
    // I guess something is to change at previous 2 lines ?
    convertEncoding(globalDocHTML, guessEncoding(globalDocHTML), encodingUTF8);

Calling "convertEncoding" this way cannot work: it is a function, and you never assign its result to a variable. The input string (globalDocHTML here) is not a var parameter, so it is left unchanged.
Title: Re: TStringStream encoding
Post by: BubikolRamios on June 05, 2019, 11:36:40 am
My bad, for some reason I pasted only a piece of the code. Anyway, the result is the same: it does not convert.

Code: Pascal
  htmlStr := convertEncoding(globalDocHTML, guessEncoding(globalDocHTML), encodingUTF8);

Title: Re: TStringStream encoding
Post by: lucamar on June 05, 2019, 11:48:22 am
[...] the result is the same: it does not convert.
Code: Pascal
  aStr := convertEncoding(globalDocHTML, guessEncoding(globalDocHTML), encodingUTF8);

The next step, then, is to check the result of:
Code: Pascal
  guessEncoding(globalDocHTML)
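
A minimal sketch of that check, assuming the LConvEncoding unit from the LazUtils package is on the unit path. The Sample string is a hypothetical stand-in for globalDocHTML (the bytes of the word Süßgräser in CP1252), not data from this thread. The program only prints what GuessEncoding reports; if the guess is already utf8, ConvertEncoding to UTF-8 has nothing to convert and returns the string unchanged.

```pascal
program GuessDemo;
{$mode objfpc}{$H+}

uses
  LConvEncoding; // part of LazUtils (assumed to be available)

var
  Sample, Guessed: string;
begin
  // Hypothetical sample: "Süßgräser" written as raw CP1252 bytes
  Sample := 'S'#$FC#$DF'gr'#$E4'ser';

  // GuessEncoding returns an encoding name such as 'utf8' or 'cp1252'
  Guessed := GuessEncoding(Sample);
  WriteLn('Guessed encoding: ', Guessed);

  // Same source and target encoding means there is nothing to convert
  if Guessed = EncodingUTF8 then
    WriteLn('ConvertEncoding(..., Guessed, EncodingUTF8) would return the input unchanged.');
end.
```

In the real program, pass globalDocHTML instead of Sample, for example with ShowMessage(GuessEncoding(globalDocHTML)).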
Title: Re: TStringStream encoding
Post by: wp on June 05, 2019, 11:50:37 am
Post a small demo showing the issue.
Title: Re: TStringStream encoding
Post by: BubikolRamios on June 05, 2019, 04:24:48 pm
I will pick a specific German-language page (windows-1252) and save its HTML to the app folder ('test.html').
Opening that file in a browser:
OPTION 1: OK
OPTION 2: Not OK

I need OPTION 2 to work (conversion to UTF-8).


Code: Pascal
  var
    HTTPSender: THTTPSend;
    HTTPGetResult: Boolean;
    doc: thtmldocument;
    AStrStr:TStringStream;
    htmlStr: String;
    F:TextFile;
  begin
    AStrStr := TStringStream.Create('');
    HTTPSender := THTTPSend.Create;
    try
      HTTPGetResult := HTTPSender.HTTPMethod('GET', 'http://blumeninschwaben.de/Einkeimblaettrige/Suessgraeser/suessgraeser.htm');

      if (HTTPSender.ResultCode >= 100) and (HTTPSender.ResultCode<=299) then begin

        readhtmlfile(doc,HTTPSender.Document);
        //OPTION 1 getting htmlStr
        //SetString(htmlStr, PAnsiChar(HTTPSender.Document.Memory), HTTPSender.Document.Size);

        //OPTION 2 getting htmlStr
        htmwrite.WriteHTML(doc.documentElement,AStrStr);
        htmlStr := AStrStr.DataString;

        //modify htmlStr
        htmlStr := convertEncoding(htmlStr, guessEncoding(htmlStr), encodingUTF8);
        htmlStr := StringReplace(htmlStr, 'charset=windows-1252', 'charset=UTF-8',[rfReplaceAll, rfIgnoreCase]);

        //save htmlStr to file
        AssignFile(F, 'test.html');
        try
          ReWrite(F);
          Write(F, htmlStr);
        finally
          CloseFile(F);
        end;

      end;
    finally
      HTTPSender.Free;
    end;
    showmessage('done');
  end;

Title: Re: TStringStream encoding
Post by: engkin on June 05, 2019, 05:34:25 pm
You still did not post a small project.  IIRC, WriteHTML gives you WideString. Change your code based on that.
Title: Re: TStringStream encoding
Post by: wp on June 05, 2019, 08:00:19 pm
In such cases, try to determine the encoding immediately after getting the text. In your case, the freshly downloaded text is in HTTPSender.Document, which is a TMemoryStream. Temporarily save this stream to a file and open that file in Notepad++: the status bar shows that it is an ANSI file, and the HTML header shows that it is CP1252 encoded (but you already seem to know that).
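
Saving the untouched download for inspection could look like the sketch below. The in-memory sample bytes and the file name raw_download.html are made up for illustration; in the actual program the stream would be HTTPSender.Document, and a single call to its SaveToFile method is all that is needed.

```pascal
program DumpRaw;
{$mode objfpc}{$H+}

uses
  Classes;

var
  MS: TMemoryStream;
  Bytes: AnsiString;
begin
  MS := TMemoryStream.Create;
  try
    // Hypothetical stand-in for the downloaded document: CP1252 bytes
    Bytes := '<meta charset="windows-1252">S'#$FC#$DF'gr'#$E4'ser';
    MS.WriteBuffer(Bytes[1], Length(Bytes));

    // Write the raw bytes to disk without any conversion, so that an
    // editor such as Notepad++ can show the encoding actually received
    MS.SaveToFile('raw_download.html');
  finally
    MS.Free;
  end;
end.
```

Because SaveToFile writes the bytes exactly as they are held in memory, the file reflects what the server sent, before any string handling has touched it.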

This should ring an alarm bell: with any further processing the encoding information is lost, and you will no longer be able to convert the special characters of this code page correctly.

So, the first thing to do immediately after reading the file from the internet is to convert the encoding to UTF8. In your Option 2 code, you do this at the end, and that is too late!

I don't know of a ready-made way to convert the encoding of data stored in a memory stream on the fly, so you must take intermediate steps. One possibility is to copy the memory stream into a string variable and convert the encoding of that string; this is your Option 1. Since you seem to require further processing by the HTML writer, you can create the string stream from this converted string and pass it to the ReadHTMLFile procedure:

Code: Pascal
  uses
    httpsend, DOM_Html, SAX_HTML, htmwrite, lconvencoding;

  { TForm1 }

  procedure TForm1.Button1Click(Sender: TObject);
  var
    HTTPSender: THTTPSend;
    doc: thtmldocument;
    AStrStr:TStringStream;
    htmlStr: String;
  begin
    HTTPSender := THTTPSend.Create;
    try
      HTTPSender.HTTPMethod('GET', 'http://blumeninschwaben.de/Einkeimblaettrige/Suessgraeser/suessgraeser.htm');
      if (HTTPSender.ResultCode >= 100) and (HTTPSender.ResultCode <= 299) then begin
        // Copy the raw downloaded bytes (CP1252) into a string
        SetString(htmlStr, PAnsiChar(HTTPSender.Document.Memory), HTTPSender.Document.Size);
        // Convert to UTF-8 first, before any HTML processing
        AStrStr := TStringStream.Create(ConvertEncoding(htmlStr, EncodingCP1252, EncodingUTF8));
        doc := nil;  // keeps doc.Free safe if ReadHtmlFile raises early
        try
          ReadHtmlFile(doc, AStrStr);
          WriteHTML(doc.DocumentElement, 'test.html');
        finally
          doc.Free;
          AStrStr.Free;
        end;
      end;
    finally
      HTTPSender.Free;
    end;
  end;
Title: Re: TStringStream encoding
Post by: BubikolRamios on June 13, 2019, 06:22:11 am
Thanks. That woks.
Title: Re: TStringStream encoding
Post by: alaa123456789 on February 17, 2021, 09:49:09 pm
Hi,

I have tried this code, but I still do not get any result and it always shows errors. Could anyone please have a look at the attached project?

Thanks
Title: Re: TStringStream encoding
Post by: winni on February 17, 2021, 10:34:31 pm
Thanks. That woks.

"That woks" says the chinese  man cooking a fine meal in his wok.

SICNR

Winni