Author Topic: Problems reading the HTML of a page encoded in ISO 8859-1 with TidHTTP  (Read 9485 times)

DelphiLazus

  • New Member
  • *
  • Posts: 34
I am new to Indy and do not know how to solve my problem.
I'm not sure whether to post my question in the Lazarus forum or to contact the Indy forum.

I need to read the contents of an HTML page (the request takes 3 parameters) to extract some data, and I am using TIdHTTP for it.

The code I have is like this:

Code: Pascal
  param := TStringList.Create;
  param.Add('Param1=value1');
  param.Add('Param2=value2');
  param.Add('Param3=value3');

  Memo1.Text := idHTTP1.Post(MYURL, param);

  FreeAndNil(param);

For security reasons I cannot post either the real parameters or the real URL.

The code works: I get a response and receive the HTML. But in some tests I noticed something strange. Some of the data I need contains Latin characters such as Ñ, but in the HTML a question mark (?) appears instead of the Ñ. I am Argentine, so it is important for me to recover Ñ and the other "special" characters used in Spanish.

For example:

Code: Text
  <td align="center" bgColor="ghostwhite" colspan="5"><b><font face="Arial, Helvetica, sans-serif" size="1" > VILLAFA?E BLANCA                                    </font> </b></td>


It should read VILLAFAÑE BLANCA.

This data is extracted from a database, and its encoding is unknown to me.

In other parts of the HTML, where the text is fixed and not generated dynamically, the HTML entities come through perfectly. For example:

Code: Text
  <td align="center" bgColor="gray"><font face="Arial, Helvetica, sans-serif" size="1" color="#FFFFFF">Tipo Tr&aacute;mite  </font></td>

In the META section of the page I read this:

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />

A colleague from another forum suggests that the problem may be on the web site side: the database uses UTF-8 while the page uses ISO 8859-1, so the data I read would be doubly encoded.

The strange thing is that I also ran a test that tries to convert the returned HTML to UTF-8, but it reports that the HTML is already encoded in UTF-8 and needs no conversion:

Code: Pascal
// GuessEncoding, ConvertEncoding and EncodingUTF8 come from the LConvEncoding unit
var param, html1, html2: TStringList;
    encode: string;
begin
  param := TStringList.Create;
  html1 := TStringList.Create;

  param.Add(xxx1);
  param.Add(xxx2);
  param.Add(xxx3);

  html1.Text := idHTTP1.Post(MYURL, param);
  encode := GuessEncoding(html1.Text);
  if encode <> EncodingUTF8
     then begin
            html2 := TStringList.Create;
            html2.Text := ConvertEncoding(html1.Text, encode, EncodingUTF8);
            html2.SaveToFile(MYTEST);
            ShowMessage('Convert to a UTF8');

            FreeAndNil(html2);
          end
     else begin
            html1.SaveToFile(MYTEST);
            ShowMessage('Already is UTF8');
          end;

  FreeAndNil(param);
  FreeAndNil(html1);
end;


I've tried converting from Windows-1252 to UTF-8 and from UTF-8 to ISO 8859-1, to no avail.

If I open the page in Firefox or Chrome and view the HTML source, the letter Ñ appears! I do not understand what is happening in TIdHTTP: whether it is a bug or I am doing something wrong.

I currently use CodeTyphon version 5.1, and the Indy version is dated 24/10/2014 (dd/mm/yyyy), SVN Rev 5201, on Windows 8.1.


Could it also be a problem with the OS encoding? The code:
Code: Pascal
  ShowMessage(GetDefaultTextEncoding);

shows that I use cp1252.

I hope this information is enough to understand my problem.
If I need to explain anything or give further details, please let me know; I would be grateful.

Regards,

ChrisF

  • Hero Member
  • *****
  • Posts: 542
Re: Problems reading the HTML of a page encoded in ISO 8859-1 with TidHTTP
« Reply #1 on: October 15, 2015, 04:20:28 pm »
You could try this (with the usual try/except/finally block):

Code: Pascal
var MyTSStream: TStringStream;
...

  param := TStringList.Create;
  param.Add('Param1=value1');
  param.Add('Param2=value2');
  param.Add('Param3=value3');

  MyTSStream := TStringStream.Create('');
  idHTTP1.Post(MYURL, param, MyTSStream);
  Memo1.Text := MyTSStream.DataString;
  MyTSStream.Free;

  FreeAndNil(param);

DelphiLazus

  • New Member
  • *
  • Posts: 34
Re: Problems reading the HTML of a page encoded in ISO 8859-1 with TidHTTP
« Reply #2 on: October 15, 2015, 07:15:38 pm »
Thank you, ChrisF, for the help.
I will test your proposal in a few minutes. As I am still running tests, I am not too worried about try/finally yet; for the final application I will write safe code.

EDIT:
Unfortunately the problem persists. A ? still appears instead of Ñ, as in the example I gave in my initial post.
The same error also shows up in other cases: "Direcci?on" instead of "Dirección", or "P?agina" when it should read "Página".  :'(

Regards,
« Last Edit: October 15, 2015, 07:48:45 pm by DelphiLazus »

ChrisF

  • Hero Member
  • *****
  • Posts: 542
Re: Problems reading the HTML of a page encoded in ISO 8859-1 with TidHTTP
« Reply #3 on: October 15, 2015, 08:29:48 pm »
It's not necessarily the same error.

For instance, a quick test (URL taken from the Google news Argentina for today):

Sample 1:
Code: Pascal
procedure TForm1.Button1Click(Sender: TObject);
begin
  Memo1.Text:=IdHTTP1.Get('http://www.clarin.com/politica/pami_0_1449455331.html');
end;

Sample 2:
Code: Pascal
procedure TForm1.Button2Click(Sender: TObject);
var MyTSStream: TStringStream;
begin
  MyTSStream := TStringStream.Create('');
  IdHTTP1.Get('http://www.clarin.com/politica/pami_0_1449455331.html', MyTSStream);
  Memo1.Text := MyTSStream.DataString;
  MyTSStream.Free;
end;

The second sample is OK, while the first is not (look at the title of the article, for instance).

In the first case, "unknown" characters have already been replaced with a '?' character (code $3F). So, it's too late to convert them; the original characters are lost.

There is probably a way to configure Indy to get the correct encoding, though I don't know it (I never use this method).

In the second case, these same characters are UTF-8 encoded for my sample (at least, for me): so no conversion is needed here for this sample (as the LCL also uses natively UTF-8).

The advantage here is that you've got exactly the data sent by the web site: no intermediate conversion.

So I guess that you're getting non-UTF-8 data in your own case. With the second approach you should now be able to convert it.

To know exactly what kind of encoding you're receiving, you might:
  • use a TFileStream instead of a TStringStream, and look at the produced file with a hexadecimal tool,
  • or look directly at the binary data in the TStringStream variable,
  • or possibly use any other Lazarus tool (like GuessEncoding, ...).

Once you know the encoding of the data, you will be able to convert it (provided it has not been altered first, for instance by Indy, as in the first case).
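As an illustration of the "look at the binary data" option, here is a minimal sketch that prints the bytes of a string as hexadecimal. The helper name BytesToHex is hypothetical and not part of Indy or the LCL. It relies only on the fact that Ñ is the single byte $D1 in ISO 8859-1 and the two bytes $C3 $91 in UTF-8.

```pascal
program HexDumpSketch;
{$mode objfpc}{$H+}
uses SysUtils;

// Hypothetical helper: returns the bytes of S as space-separated hex pairs.
function BytesToHex(const S: string): string;
var
  i: Integer;
begin
  Result := '';
  for i := 1 to Length(S) do
  begin
    if i > 1 then
      Result := Result + ' ';
    Result := Result + IntToHex(Ord(S[i]), 2);
  end;
end;

begin
  // 'Ñ' as a single ISO 8859-1 byte: prints D1
  WriteLn(BytesToHex(#$D1));
  // 'Ñ' as a UTF-8 byte pair: prints C3 91
  WriteLn(BytesToHex(#$C3#$91));
end.
```

In a real test you would pass MyTSStream.DataString (or the contents of the saved file) to the helper and look at the bytes around the character that shows up as '?'.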

*** Edit ***
BTW, which version of Free Pascal is your CodeTyphon version based on? If you're using Free Pascal 2.7+/3.x, my earlier remarks might need to be amended.
« Last Edit: October 15, 2015, 09:25:08 pm by ChrisF »

DelphiLazus

  • New Member
  • *
  • Posts: 34
Re: Problems reading the HTML of a page encoded in ISO 8859-1 with TidHTTP
« Reply #4 on: October 15, 2015, 09:27:35 pm »
Thanks for helping me so quickly!

From what I understand, it boils down to determining which encoding I am receiving and then converting from that encoding to UTF-8. Is that right?

I think part of the problem is that the web site is dynamic (built in ASP) and that the data I need to extract from the HTML comes from a database, and I am not sure what encoding the database uses.
My friend from another forum told me that the problem lies between the encoding of the database and the encoding of the web site.
That is: DB encoding -> web page encoding -> Indy encoding

Is it possible that the error also comes from this?

If you don't mind, how should I analyze the binary content? Or, looking at it in a hexadecimal editor, how could I infer which encoding it is?

I use:
CodeTyphon 5.1. Date: 2014-12-09
Version FPC: 2.7.1
Revision SVN: 46696

Regards,

ChrisF

  • Hero Member
  • *****
  • Posts: 542
Re: Problems reading the HTML of a page encoded in ISO 8859-1 with TidHTTP
« Reply #5 on: October 15, 2015, 09:58:50 pm »
Quote
[...]
I think part of the problem is that the web site is dynamic (built in ASP) and that the data I need to extract from the HTML comes from a database, and I am not sure what encoding the database uses.
My friend from another forum told me that the problem lies between the encoding of the database and the encoding of the web site.
That is: DB encoding -> web page encoding -> Indy encoding

Is it possible that the error also comes from this?
[...]

It's not impossible, indeed. I've seen so many web sites that don't respect the rules...

First of all, you need to know exactly what kind of data the web site is sending you. For that, you'll have to look at the "raw" data received (i.e. without any Indy conversion).

For instance, still with my URL sample:
Code: Pascal
procedure TForm1.Button3Click(Sender: TObject);
var MyTFStream: TFileStream;
begin
  MyTFStream := TFileStream.Create('.\test.dat', fmCreate);
  IdHTTP1.Get('http://www.clarin.com/politica/pami_0_1449455331.html', MyTFStream);
  MyTFStream.Free;
end;

If you look at the raw data with a hexadecimal tool (see attached image), this data is clearly UTF-8 encoded, which is confirmed by the content type tag:
Quote
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />


But yours seems to be 'iso-8859-1' (according to your content type tag). So try converting the string data received in the TStringStream (see my earlier sample code) from ISO 8859-1 to UTF-8. This time, it should work.
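To make the hexadecimal inspection less manual, a rough structural check can tell whether the raw bytes are plausible UTF-8. The sketch below is an assumption-laden illustration: LooksLikeUTF8 is a hypothetical helper, it only checks lead and continuation byte patterns (it does not reject overlong forms or surrogates), and a False result only means "not UTF-8", not which single-byte encoding is in use.

```pascal
program Utf8CheckSketch;
{$mode objfpc}{$H+}

// Hypothetical helper: True if S consists only of structurally valid
// UTF-8 sequences (lead byte followed by the right number of
// continuation bytes). Simplified; not a full validator.
function LooksLikeUTF8(const S: string): Boolean;
var
  i, Need: Integer;
  B: Byte;
begin
  Result := False;
  i := 1;
  while i <= Length(S) do
  begin
    B := Ord(S[i]);
    if B < $80 then
      Need := 0                       // plain ASCII
    else if (B and $E0) = $C0 then
      Need := 1                       // lead byte of a 2-byte sequence
    else if (B and $F0) = $E0 then
      Need := 2                       // lead byte of a 3-byte sequence
    else if (B and $F8) = $F0 then
      Need := 3                       // lead byte of a 4-byte sequence
    else
      Exit;                           // stray continuation or invalid byte
    Inc(i);
    while Need > 0 do
    begin
      // every continuation byte must match the bit pattern 10xxxxxx
      if (i > Length(S)) or ((Ord(S[i]) and $C0) <> $80) then
        Exit;
      Inc(i);
      Dec(Need);
    end;
  end;
  Result := True;
end;

begin
  WriteLn(LooksLikeUTF8('VILLAFA'#$C3#$91'E'));  // Ñ as UTF-8: TRUE
  WriteLn(LooksLikeUTF8('VILLAFA'#$D1'E'));      // Ñ as ISO 8859-1: FALSE
end.
```

Pure ASCII text passes the check too, so the result is only informative when the data contains bytes of $80 and above.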


DelphiLazus

  • New Member
  • *
  • Posts: 34
Re: Problems reading the HTML of a page encoded in ISO 8859-1 with TidHTTP
« Reply #6 on: October 15, 2015, 10:14:23 pm »
Many thanks! Your assumption was quite accurate. Apparently it is enough to convert the DataString of the TStringStream from ISO 8859-1 to UTF-8.
I did a quick test:

Code: Pascal
  // ISO_8859_1ToUTF8 comes from the LConvEncoding unit
  param := TStringList.Create;
  param.Add(xx1);
  param.Add(xx2);
  param.Add(xx3);

  stream := TStringStream.Create('');
  idHTTP1.Post(MYURL, param, stream);

  memo1.Text := ISO_8859_1ToUTF8(stream.DataString);

  FreeAndNil(param);
  FreeAndNil(stream);

With that, the characters are now correct.
I will have to run more tests, but at first glance everything looks fine.
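For the final application, the same idea with the try/finally protection mentioned earlier in the thread could look roughly like the sketch below. It is not the code from the test above: the unit name, the function name PostAndDecodeLatin1 and its parameters are hypothetical, and it reads the response into a TMemoryStream and copies the bytes into a string by hand, on the assumption that this avoids depending on how TStringStream.DataString handles code pages in a given FPC version. It also assumes the server really sends ISO 8859-1.

```pascal
unit FetchLatin1Sketch;
{$mode objfpc}{$H+}

interface

uses
  Classes, SysUtils, IdHTTP, LConvEncoding;

// Hypothetical helper: posts AParams to AURL and returns the response
// body converted from ISO 8859-1 to UTF-8.
function PostAndDecodeLatin1(const AURL: string; AParams: TStrings): string;

implementation

function PostAndDecodeLatin1(const AURL: string; AParams: TStrings): string;
var
  Http: TIdHTTP;
  Raw: TMemoryStream;
  Bytes: string;
  N: Integer;
begin
  Result := '';
  Http := TIdHTTP.Create(nil);
  try
    Raw := TMemoryStream.Create;
    try
      // Receive the body as raw bytes, with no text conversion by Indy
      Http.Post(AURL, AParams, Raw);
      N := Raw.Size;
      SetLength(Bytes, N);
      if N > 0 then
        Move(Raw.Memory^, Bytes[1], N);
      // Assumption: the payload is ISO 8859-1, as the page's meta tag says
      Result := ISO_8859_1ToUTF8(Bytes);
    finally
      Raw.Free;
    end;
  finally
    Http.Free;
  end;
end;

end.
```

A call such as Memo1.Text := PostAndDecodeLatin1(MYURL, param); would then replace the inline code, with param still created and freed by the caller.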

Regards,

 
