
Author Topic: TStringStream encoding  (Read 4237 times)

BubikolRamios

TStringStream encoding
« on: June 05, 2019, 02:07:30 am »
Code: Pascal
  var
    globalDoc: thtmldocument;
    AStrStr: TStringStream;
  begin
    AStrStr := TStringStream.Create('');

    // thtmldocument to string
    htmwrite.WriteHTML(globalDoc.documentElement, AStrStr);
    globalDocHTML := AStrStr.DataString;

    // does not work
    // checked by viewing saved html file in browser.
    // I guess something is to change at previous 2 lines ?
    convertEncoding(globalDocHTML, guessEncoding(globalDocHTML), encodingUTF8);
lazarus 3.2-fpc-3.2.2-win32/win64

engkin

Re: TStringStream encoding
« Reply #1 on: June 05, 2019, 02:14:59 am »
Is your signature correct?

Are you still using Lazarus-1.6.0/FPC-3.0.0?

BubikolRamios

Re: TStringStream encoding
« Reply #2 on: June 05, 2019, 07:14:32 am »
Upgraded to the latest version and updated my profile; same result.

BTW: a string obtained this way converts to UTF-8 correctly:
Code: Pascal
readhtmlfile(globalDoc, HTTPSender.Document);
// convert to string
SetString(globalDocHTML, PAnsiChar(HTTPSender.Document.Memory), HTTPSender.Document.Size);

But that is of no use to me: I want to modify the thtmldocument (globalDoc) first and then convert it to a string.
« Last Edit: June 05, 2019, 07:35:50 am by BubikolRamios »

wp

Re: TStringStream encoding
« Reply #3 on: June 05, 2019, 10:45:39 am »
Code: Pascal
  var
    globalDoc: thtmldocument;
    AStrStr: TStringStream;
  begin
    AStrStr := TStringStream.Create('');

    // thtmldocument to string
    htmwrite.WriteHTML(globalDoc.documentElement, AStrStr);
    globalDocHTML := AStrStr.DataString;

    // does not work
    // checked by viewing saved html file in browser.
    // I guess something is to change at previous 2 lines ?
    convertEncoding(globalDocHTML, guessEncoding(globalDocHTML), encodingUTF8);
Calling "convertEncoding" this way clearly cannot work. It is a function, and you do not assign its result to a variable. The input string, globalDocHTML here, is not a var parameter, so it is left unchanged.
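
To illustrate the point about the discarded function result, here is a minimal sketch. It assumes the LazUtils package (which provides the LConvEncoding unit) is among the project dependencies; the program name, the variable names and the sample bytes are made up for the illustration and are not taken from the original code.

```pascal
program DiscardedResultDemo;

{$mode objfpc}{$H+}

uses
  LConvEncoding; // assumption: LazUtils is available to the project

var
  Cp1252Text, Utf8Text: string;
begin
  // "Süß" written as raw CP1252 bytes: S, $FC (ü), $DF (ß)
  Cp1252Text := 'S'#$FC#$DF;

  // The function result is thrown away here. The input is passed as a
  // const parameter, so Cp1252Text itself is not modified by the call.
  ConvertEncoding(Cp1252Text, EncodingCP1252, EncodingUTF8);
  WriteLn('Length after discarded call: ', Length(Cp1252Text));

  // The result is assigned, so Utf8Text receives the converted string.
  Utf8Text := ConvertEncoding(Cp1252Text, EncodingCP1252, EncodingUTF8);
  WriteLn('Length of assigned result:   ', Length(Utf8Text));
end.
```

Comparing the two lengths shows which call had any effect: the input keeps its original bytes, while the assigned result should be longer because ü and ß each need two bytes in UTF-8.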

BubikolRamios

Re: TStringStream encoding
« Reply #4 on: June 05, 2019, 11:36:40 am »
My bad, for some reason I pasted only a piece of the code. Anyway, the result is the same: it does not convert.

Code: Pascal
htmlStr := convertEncoding(globalDocHTML, guessEncoding(globalDocHTML), encodingUTF8);
« Last Edit: June 05, 2019, 11:48:06 am by BubikolRamios »

lucamar

Re: TStringStream encoding
« Reply #5 on: June 05, 2019, 11:48:22 am »
[...] the result is the same: it does not convert.
Code: Pascal
aStr := convertEncoding(globalDocHTML, guessEncoding(globalDocHTML), encodingUTF8);

The next step, then, is to check the result of:
Code: Pascal
guessEncoding(globalDocHTML)
Turbo Pascal 3 CP/M - Amstrad PCW 8256 (512 KB !!!) :P
Lazarus/FPC 2.0.8/3.0.4 & 2.0.12/3.2.0 - 32/64 bits on:
(K|L|X)Ubuntu 12..18, Windows XP, 7, 10 and various DOSes.

wp

Re: TStringStream encoding
« Reply #6 on: June 05, 2019, 11:50:37 am »
Post a small demo showing the issue.

BubikolRamios

Re: TStringStream encoding
« Reply #7 on: June 05, 2019, 04:24:48 pm »
This fetches a specific German-language page (windows-1252) and saves its HTML to the app folder ('test.html').
Opening that file in a browser:
OPTION 1: OK
OPTION 2: not OK

I need OPTION 2 to work (conversion to UTF-8).


Code: Pascal
var
  HTTPSender: THTTPSend;
  HTTPGetResult: Boolean;
  doc: thtmldocument;
  AStrStr: TStringStream;
  htmlStr: String;
  F: TextFile;
begin
  AStrStr := TStringStream.Create('');
  HTTPSender := THTTPSend.Create;
  try
    HTTPGetResult := HTTPSender.HTTPMethod('GET', 'http://blumeninschwaben.de/Einkeimblaettrige/Suessgraeser/suessgraeser.htm');

    if (HTTPSender.ResultCode >= 100) and (HTTPSender.ResultCode <= 299) then begin

      readhtmlfile(doc, HTTPSender.Document);
      // OPTION 1 getting htmlStr
      //SetString(htmlStr, PAnsiChar(HTTPSender.Document.Memory), HTTPSender.Document.Size);

      // OPTION 2 getting htmlStr
      htmwrite.WriteHTML(doc.documentElement, AStrStr);
      htmlStr := AStrStr.DataString;

      // modify htmlStr
      htmlStr := convertEncoding(htmlStr, guessEncoding(htmlStr), encodingUTF8);
      htmlStr := StringReplace(htmlStr, 'charset=windows-1252', 'charset=UTF-8', [rfReplaceAll, rfIgnoreCase]);

      // save htmlStr to file
      AssignFile(F, 'test.html');
      try
        ReWrite(F);
        Write(F, htmlStr);
      finally
        CloseFile(F);
      end;
    end;
  finally
    HTTPSender.Free;
  end;
  showmessage('done');
end;

engkin

Re: TStringStream encoding
« Reply #8 on: June 05, 2019, 05:34:25 pm »
You still did not post a small project.  IIRC, WriteHTML gives you WideString. Change your code based on that.

wp

Re: TStringStream encoding
« Reply #9 on: June 05, 2019, 08:00:19 pm »
In such cases, try to determine the encoding immediately after getting the text. In your case, the text immediately after downloading is in HTTPSender.Document, which is a TMemoryStream. Temporarily save this stream to a file and open it in Notepad++: the status bar shows that this is an ANSI file, and the HTML header shows that it is CP1252 encoded (but you already seem to know that).

This should ring an alarm bell: once you do any further processing, the encoding information is lost, and you can no longer convert the special characters of this code page correctly.

So the first thing to do, immediately after reading the file from the internet, is to convert the encoding to UTF-8. In your Option 2 code you do this at the end, which is too late!

I don't know of a ready-made way to convert the encoding of data stored in a memory stream on the fly, so you must take intermediate steps. One possibility is to copy the memory stream into a string variable and convert its encoding; this is your Option 1. Since you seem to require further processing by the HTML writer, you can create the string stream from this converted string and pass it to the ReadHTMLFile procedure:

Code: Pascal
uses
  httpsend, DOM_Html, SAX_HTML, htmwrite, lconvencoding;

{ TForm1 }

procedure TForm1.Button1Click(Sender: TObject);
var
  HTTPSender: THTTPSend;
  doc: thtmldocument;
  AStrStr: TStringStream;
  htmlStr: String;
begin
  HTTPSender := THTTPSend.Create;
  try
    HTTPSender.HTTPMethod('GET', 'http://blumeninschwaben.de/Einkeimblaettrige/Suessgraeser/suessgraeser.htm');
    if (HTTPSender.ResultCode >= 100) and (HTTPSender.ResultCode <= 299) then begin
      // Copy the raw downloaded bytes (CP1252) into a string
      SetString(htmlStr, PAnsiChar(HTTPSender.Document.Memory), HTTPSender.Document.Size);
      // Convert to UTF-8 before any parsing takes place
      AStrStr := TStringStream.Create(ConvertEncoding(htmlStr, EncodingCP1252, EncodingUTF8));
      doc := nil;  // so that doc.Free is safe if ReadHtmlFile raises
      try
        ReadHtmlFile(doc, AStrStr);
        WriteHTML(doc.DocumentElement, 'test.html');
      finally
        doc.Free;
        AStrStr.Free;
      end;
    end;
  finally
    HTTPSender.Free;
  end;
end;
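
The intermediate steps described above (memory stream to string, string to UTF-8, UTF-8 string to string stream) can also be tried out without a network connection. The following is only a sketch under assumptions: it needs the LazUtils package for LConvEncoding, and the helper MemoryStreamToUtf8, the program name and the sample bytes are hypothetical, introduced here purely for illustration.

```pascal
program StreamToUtf8Demo;

{$mode objfpc}{$H+}

uses
  Classes, SysUtils,
  LConvEncoding; // assumption: LazUtils is available to the project

// Hypothetical helper: copies the raw bytes of Source into a string and
// converts them from SourceEncoding to UTF-8.
function MemoryStreamToUtf8(Source: TMemoryStream;
  const SourceEncoding: string): string;
var
  Raw: string;
begin
  Raw := '';
  SetString(Raw, PAnsiChar(Source.Memory), Source.Size);
  Result := ConvertEncoding(Raw, SourceEncoding, EncodingUTF8);
end;

var
  Downloaded: TMemoryStream;
  Utf8Stream: TStringStream;
  Sample: string;
begin
  // Stand-in for HTTPSender.Document: "Gräser" as CP1252 bytes ($E4 = ä)
  Sample := 'Gr'#$E4'ser';
  Downloaded := TMemoryStream.Create;
  try
    Downloaded.WriteBuffer(Sample[1], Length(Sample));

    // Convert first, then wrap the UTF-8 text in a stream that could be
    // handed to ReadHTMLFile.
    Utf8Stream := TStringStream.Create(
      MemoryStreamToUtf8(Downloaded, EncodingCP1252));
    try
      WriteLn('Bytes before conversion: ', Downloaded.Size);
      WriteLn('Bytes after conversion:  ', Utf8Stream.Size);
    finally
      Utf8Stream.Free;
    end;
  finally
    Downloaded.Free;
  end;
end.
```

Keeping the conversion in one small function makes it harder to end up parsing the document before its encoding has been fixed, which was the problem with Option 2.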

BubikolRamios

Re: TStringStream encoding
« Reply #10 on: June 13, 2019, 06:22:11 am »
Thanks. That woks.

alaa123456789

Re: TStringStream encoding
« Reply #11 on: February 17, 2021, 09:49:09 pm »
Hi,

I have tried this code, but I still get no result and it always shows errors. Could someone please have a look at the attached project?

Thanks

winni

Re: TStringStream encoding
« Reply #12 on: February 17, 2021, 10:34:31 pm »
Thanks. That woks.

"That woks" says the chinese  man cooking a fine meal in his wok.

SICNR

Winni

 
