Author Topic: TStringStream encoding  (Read 974 times)

BubikolRamios

  • Full Member
  • ***
  • Posts: 190
TStringStream encoding
« on: June 05, 2019, 02:07:30 am »
Code: Pascal
  var
    globalDoc: thtmldocument;
    AStrStr: TStringStream;
  begin
    AStrStr := TStringStream.Create('');

    // thtmldocument to string
    htmwrite.WriteHTML(globalDoc.documentElement, AStrStr);
    globalDocHTML := AStrStr.DataString;

    //does not work
    //checked by viewing saved html file in browser.
    // I guess something is to change at previous 2 lines ?
    convertEncoding(globalDocHTML, guessEncoding(globalDocHTML), encodingUTF8);

lazarus-2.0.2-fpc-3.0.4-win32

engkin

  • Hero Member
  • *****
  • Posts: 2513
Re: TStringStream encoding
« Reply #1 on: June 05, 2019, 02:14:59 am »
Is your signature correct?

Are you still using Lazarus-1.6.0/FPC-3.0.0?

BubikolRamios

Re: TStringStream encoding
« Reply #2 on: June 05, 2019, 07:14:32 am »
Upgraded to the latest version and updated my profile; same thing.

BTW: the string obtained this way converts to UTF-8 fine:
Code: Pascal
  readhtmlfile(globalDoc, HTTPSender.Document);
  // convert to string
  SetString(globalDocHTML, PAnsiChar(HTTPSender.Document.Memory), HTTPSender.Document.Size);

But that is of no use to me: I want to modify the thtmldocument (globalDoc) first and then convert it to a string.
« Last Edit: June 05, 2019, 07:35:50 am by BubikolRamios »

wp

  • Hero Member
  • *****
  • Posts: 6499
Re: TStringStream encoding
« Reply #3 on: June 05, 2019, 10:45:39 am »
Code: Pascal
  var
    globalDoc: thtmldocument;
    AStrStr: TStringStream;
  begin
    AStrStr := TStringStream.Create('');

    // thtmldocument to string
    htmwrite.WriteHTML(globalDoc.documentElement, AStrStr);
    globalDocHTML := AStrStr.DataString;

    //does not work
    //checked by viewing saved html file in browser.
    // I guess something is to change at previous 2 lines ?
    convertEncoding(globalDocHTML, guessEncoding(globalDocHTML), encodingUTF8);

It is clear that calling "convertEncoding" this way does not work. It is a function and you do not assign the result to a variable. The input string, globalDocHTML here, is not a var parameter.
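
To illustrate the point about the discarded return value, here is a minimal sketch. It assumes a console program with the LazUtils package (which provides the LConvEncoding unit) available to the compiler; the program name, sample bytes and variable names are made up for this example.

```pascal
program AssignResultDemo;

{$mode objfpc}{$H+}

uses
  LConvEncoding;  // assumption: LazUtils is on the unit path

var
  src, converted: string;
begin
  // Hypothetical sample: "Süss" as CP1252 bytes ($FC is 'ü' in CP1252)
  src := 'S' + Chr($FC) + 'ss';

  // The result is thrown away here; src itself is NOT modified,
  // because the input string is not a var parameter.
  ConvertEncoding(src, EncodingCP1252, EncodingUTF8);
  WriteLn('Length after discarded call: ', Length(src));

  // Assign the function result to keep the converted text.
  converted := ConvertEncoding(src, EncodingCP1252, EncodingUTF8);
  WriteLn('Length of converted string:  ', Length(converted));
end.
```

Since 'ü' takes two bytes in UTF-8, the second length should come out one larger than the first.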
Lazarus trunk / fpc 3.0.4 / all 32-bit on Win-10

BubikolRamios

Re: TStringStream encoding
« Reply #4 on: June 05, 2019, 11:36:40 am »
My bad, for some reason I pasted only a piece of the code. Anyway, the result is the same: it does not convert.

Code: Pascal
  htmlStr := convertEncoding(globalDocHTML, guessEncoding(globalDocHTML), encodingUTF8);
« Last Edit: June 05, 2019, 11:48:06 am by BubikolRamios »

lucamar

  • Hero Member
  • *****
  • Posts: 2148
Re: TStringStream encoding
« Reply #5 on: June 05, 2019, 11:48:22 am »
[...] results is the same, it does not convert.
Code: Pascal
  aStr := convertEncoding(globalDocHTML, guessEncoding(globalDocHTML), encodingUTF8);

The next step, then, is to check the result of
Code: Pascal
  guessEncoding(globalDocHTML)
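
A minimal sketch of that check, assuming a console program with LazUtils (LConvEncoding) available; the sample string stands in for globalDocHTML and is invented for the example. In a GUI application, ShowMessage could be used instead of WriteLn.

```pascal
program GuessEncodingDemo;

{$mode objfpc}{$H+}

uses
  LConvEncoding;  // assumption: LazUtils is on the unit path

var
  sample: string;
begin
  // Hypothetical stand-in for globalDocHTML: "Süss" as CP1252 bytes
  sample := 'S' + Chr($FC) + 'ss';

  // Print the encoding name that would be passed to ConvertEncoding
  // as the source encoding.
  WriteLn('Guessed encoding: ', GuessEncoding(sample));
end.
```

If the printed name is already the UTF-8 one, ConvertEncoding has nothing to convert, which would explain a result that looks unchanged.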
Turbo Pascal 3 CP/M - Amstrad PCW 8256 (512 KB !!!) :P
Lazarus 2.0.4/2.0.6  - FPC 3.0.4 on:
(K|L)Ubuntu 12..16, Windows XP SP3, various DOSes.

wp

Re: TStringStream encoding
« Reply #6 on: June 05, 2019, 11:50:37 am »
Post a small demo showing the issue.

BubikolRamios

Re: TStringStream encoding
« Reply #7 on: June 05, 2019, 04:24:48 pm »
The demo below fetches a specific German-language page (windows-1252) and saves its HTML to the application folder ('test.html').
Opening that file in a browser:
OPTION 1: OK
OPTION 2: Not OK

I need OPTION 2 to work (conversion to UTF-8).


Code: Pascal
var
  HTTPSender: THTTPSend;
  HTTPGetResult: Boolean;
  doc: thtmldocument;
  AStrStr: TStringStream;
  htmlStr: String;
  F: TextFile;
begin
  AStrStr := TStringStream.Create('');
  HTTPSender := THTTPSend.Create;
  try
    HTTPGetResult := HTTPSender.HTTPMethod('GET', 'http://blumeninschwaben.de/Einkeimblaettrige/Suessgraeser/suessgraeser.htm');

    if (HTTPSender.ResultCode >= 100) and (HTTPSender.ResultCode <= 299) then begin

      readhtmlfile(doc, HTTPSender.Document);
      //OPTION 1 getting htmlStr
      //SetString(htmlStr, PAnsiChar(HTTPSender.Document.Memory), HTTPSender.Document.Size);

      //OPTION 2 getting htmlStr
      htmwrite.WriteHTML(doc.documentElement, AStrStr);
      htmlStr := AStrStr.DataString;

      //modify htmlStr
      htmlStr := convertEncoding(htmlStr, guessEncoding(htmlStr), encodingUTF8);
      htmlStr := StringReplace(htmlStr, 'charset=windows-1252', 'charset=UTF-8', [rfReplaceAll, rfIgnoreCase]);

      //save htmlStr to file
      AssignFile(F, 'test.html');
      try
        ReWrite(F);
        Write(F, htmlStr);
      finally
        CloseFile(F);
      end;

    end;
  finally
    HTTPSender.Free;
  end;
  showmessage('done');
end;

engkin

Re: TStringStream encoding
« Reply #8 on: June 05, 2019, 05:34:25 pm »
You still did not post a small project.  IIRC, WriteHTML gives you WideString. Change your code based on that.

wp

Re: TStringStream encoding
« Reply #9 on: June 05, 2019, 08:00:19 pm »
In such cases, try to determine the encoding immediately after getting the text. In your case, the text right after downloading is in HTTPSender.Document, which is a TMemoryStream. Temporarily save this stream to a file and open that file in Notepad++: the status bar shows that it is an ANSI file, and the HTML header shows that it is CP1252-encoded (but you already seem to know that).
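
Dumping the raw bytes could look like the sketch below. It is self-contained, so a plain TMemoryStream filled with a few hypothetical CP1252 bytes stands in for HTTPSender.Document, and the file name is invented; in the real program the single SaveToFile call on HTTPSender.Document would be enough.

```pascal
program DumpRawDemo;

{$mode objfpc}{$H+}

uses
  Classes;

const
  // Hypothetical sample data: "Süss" as CP1252 bytes
  Sample: array[0..3] of Byte = ($53, $FC, $73, $73);

var
  raw: TMemoryStream;  // stand-in for HTTPSender.Document
begin
  raw := TMemoryStream.Create;
  try
    raw.WriteBuffer(Sample, SizeOf(Sample));
    // Write the untouched bytes to disk so the encoding can be
    // inspected in an external editor before any processing.
    raw.SaveToFile('raw_download.html');
  finally
    raw.Free;
  end;
end.
```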

This must ring an alarm bell: once you do any further processing, the encoding information will be lost and you will no longer be able to convert the special characters of this codepage correctly.

So the first thing to do, immediately after reading the file from the internet, is to convert the encoding to UTF-8. In your Option 2 code you do this at the end, and that is too late!

I don't know of a ready-made way to convert the encoding of data stored in a memory stream on the fly. So you must take intermediate steps. One possibility is to store the memory stream to a string variable and convert the encoding - this is your Option 1. Since you seem to require further processing by the html writer you can create the stringstream with this variable and pass this to the ReadHTMLFile procedure:

Code: Pascal
uses
  httpsend, DOM_Html, SAX_HTML, htmwrite, lconvencoding;

{ TForm1 }

procedure TForm1.Button1Click(Sender: TObject);
var
  HTTPSender: THTTPSend;
  doc: thtmldocument;
  AStrStr: TStringStream;
  htmlStr: String;
begin
  HTTPSender := THTTPSend.Create;
  try
    HTTPSender.HTTPMethod('GET', 'http://blumeninschwaben.de/Einkeimblaettrige/Suessgraeser/suessgraeser.htm');
    if (HTTPSender.ResultCode >= 100) and (HTTPSender.ResultCode <= 299) then begin
      // Copy the raw downloaded bytes (CP1252) into a string
      SetString(htmlStr, PAnsiChar(HTTPSender.Document.Memory), HTTPSender.Document.Size);
      // Convert to UTF-8 first, then hand the converted text to the HTML reader
      AStrStr := TStringStream.Create(ConvertEncoding(htmlStr, EncodingCP1252, EncodingUTF8));
      try
        ReadHtmlFile(doc, AStrStr);
        // The document can be modified here before it is written out
        WriteHTML(doc.DocumentElement, 'test.html');
      finally
        doc.Free;
        AStrStr.Free;
      end;
    end;
  finally
    HTTPSender.Free;
  end;
end;
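
The intermediate steps (memory stream to string, then convert) can also be wrapped in a small helper. This is only a sketch: MemoryStreamToUTF8 is a hypothetical function, not part of any library, the sample bytes are invented, and LazUtils (LConvEncoding) is assumed to be available.

```pascal
program StreamToUtf8Demo;

{$mode objfpc}{$H+}

uses
  Classes, LConvEncoding;  // assumption: LazUtils is on the unit path

// Hypothetical helper: copies the raw bytes of a memory stream into a
// string and converts them from ASourceEncoding to UTF-8.
function MemoryStreamToUTF8(AStream: TMemoryStream;
  const ASourceEncoding: string): string;
var
  Raw: string;
begin
  Raw := '';
  if AStream.Size > 0 then
    SetString(Raw, PAnsiChar(AStream.Memory), AStream.Size);
  Result := ConvertEncoding(Raw, ASourceEncoding, EncodingUTF8);
end;

const
  // Hypothetical sample data: "Süss" as CP1252 bytes
  Sample: array[0..3] of Byte = ($53, $FC, $73, $73);

var
  ms: TMemoryStream;
  utf8Text: string;
begin
  ms := TMemoryStream.Create;
  try
    ms.WriteBuffer(Sample, SizeOf(Sample));
    utf8Text := MemoryStreamToUTF8(ms, EncodingCP1252);
    WriteLn('Bytes in stream: ', ms.Size);
    WriteLn('Bytes as UTF-8:  ', Length(utf8Text));
  finally
    ms.Free;
  end;
end.
```

In the button handler above, the result of such a helper would be passed to TStringStream.Create in place of the inline SetString and ConvertEncoding calls. The source encoding is still given explicitly, as in the original code.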

BubikolRamios

Re: TStringStream encoding
« Reply #10 on: June 13, 2019, 06:22:11 am »
Thanks. That works.