* * *

Author Topic: How to get a string data from a web page??  (Read 4181 times)

wp

  • Hero Member
  • *****
  • Posts: 3741
Re: How to get a string data from a web page??
« Reply #15 on: May 09, 2017, 05:58:56 pm »
What is your OS? I am assuming that it is Windows. If you use the 32-bit version of Lazarus (and don't cross-compile) then you must use the 32-bit versions of the two dlls (https://indy.fulgan.com/SSL/openssl-1.0.2k-i386-win32.zip). If you use Lazarus-64bit then you must use the 64-bit-versions of the dlls as well (https://indy.fulgan.com/SSL/openssl-1.0.2k-x64_86-win64.zip).

My program was tested with 32bit on Windows 10.
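
If you are not sure whether your Lazarus builds 32-bit or 64-bit executables, a tiny console program can report it. This is only a sketch added for illustration (the program name is made up); it relies on `SizeOf(Pointer)` and on the `CPU64` define that FPC sets for 64-bit targets.

```pascal
program CheckBitness;
{$mode objfpc}{$H+}
begin
  // SizeOf(Pointer) is 4 in a 32-bit build and 8 in a 64-bit build
  WriteLn('This executable is ', SizeOf(Pointer) * 8, '-bit.');
  {$IFDEF CPU64}
  WriteLn('Use the 64-bit OpenSSL DLLs.');
  {$ELSE}
  WriteLn('Use the 32-bit OpenSSL DLLs.');
  {$ENDIF}
end.
```

The DLLs must match the bitness of the compiled program, not that of Windows itself.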

Alternatively, download the html file from the Forex site with the browser (in Firefox: right-click on the loaded page and pick "Save page as", or similar, from the context menu). Then modify the code of BtnExecuteClick in my program as follows to read and analyze the downloaded file instead of downloading it:

Code: Pascal
  procedure TForm1.BtnExecuteClick(Sender: TObject);
  var
    stream: TMemoryStream;
    err: String;
  begin
    Screen.Cursor := crHourGlass;
    stream := TMemoryStream.Create;
    try
      stream.LoadFromFile('D:\Download\Forex Technical Analysis - Investing.com.htm');  // YOUR FILENAME HERE!
      {
      if not DownloadHTTP(EdURL.Text, stream, err) then
      begin
        MessageDlg(err, mtError, [mbOK], 0);
        exit;
      end;
      }
      stream.Position := 0;
      SynEdit1.Lines.LoadFromStream(stream);

      stream.Position := 0;
      SynEdit2.Lines.Clear;
      ExtractFromHtml(stream);

    finally
      stream.Free;
      Screen.Cursor := crDefault;
    end;
  end;
Lazarus trunk / fpc 3.0.0 / Win32

lestroso

  • Jr. Member
  • **
  • Posts: 53
Re: (Solved)How to get a string data from a web page??
« Reply #16 on: May 09, 2017, 06:22:32 pm »
Dear wp,

Thanks again for the time you have dedicated to me!!!

Yes, I'm working with Lazarus 64-bit, I think, on Windows 10... but I've tried your software with both the 32-bit and the 64-bit OpenSSL DLLs... it worked fine...

Now I must learn a lot from your code... I only need to fetch the summary string for EUR/USD on the 5-minute timeframe: neutral, buy, strong buy, strong sell, sell... that's all...
Best regards,

lestroso :D

valdir.marcos

  • Full Member
  • ***
  • Posts: 245
Re: (Solved)How to get a string data from a web page??
« Reply #17 on: May 09, 2017, 06:47:22 pm »
Quote from: lestroso
I only need to fetch the summary string for EUR/USD on the 5-minute timeframe: neutral, buy, strong buy, strong sell, sell... that's all...

Try regular expressions:
http://www.regular-expressions.info
http://wiki.freepascal.org/Regexpr
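
As a rough illustration of the regular-expression route, the sketch below uses TRegExpr from FPC's RegExpr unit to pull the first rating word out of a piece of text. The function name `ExtractSignal` and the sample fragment are made up for this example; the real page markup is not shown in this thread, and finding the right table cell (EUR/USD, 5 minutes) is a separate problem that this does not solve.

```pascal
program SignalRegex;
{$mode objfpc}{$H+}
uses
  RegExpr;

// Returns the first rating word found in Text, or '' if there is none.
function ExtractSignal(const Text: string): string;
var
  re: TRegExpr;
begin
  Result := '';
  re := TRegExpr.Create;
  try
    re.ModifierI := True;  // match case-insensitively
    re.Expression := 'strong buy|strong sell|neutral|buy|sell';
    if re.Exec(Text) then
      Result := re.Match[0];  // Match[0] is the whole matched text
  finally
    re.Free;
  end;
end;

begin
  // Hypothetical fragment, not the real page markup
  WriteLn(ExtractSignal('<td class="x">Strong Sell</td>'));  // prints: Strong Sell
end.
```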

avra

  • Hero Member
  • *****
  • Posts: 1085
ct2laz - Easily convert components and projects between Lazarus and CodeTyphon

wp

  • Hero Member
  • *****
  • Posts: 3741
Re: (Solved)How to get a string data from a web page??
« Reply #19 on: May 10, 2017, 10:59:02 am »
Quote from: valdir.marcos
Try regular expressions

The syntax of regular expressions has always been one of the great mysteries of programming to me...

How would you formulate a regular expression which extracts from an html table the contents of the cell which has the text "5 Minutes" in the column header and "EUR/USD" in the row header, when the row header is in a row-spanned cell two rows above the row of interest? The html is at https://www.investing.com/technical/technical-summary

Leledumbo

  • Hero Member
  • *****
  • Posts: 7651
  • Programming + Glam Metal + Tae Kwon Do = Me
Re: (Solved)How to get a string data from a web page??
« Reply #20 on: May 10, 2017, 12:20:28 pm »
Quote from: wp
How would you formulate a regular expression which extracts from a html table the contents of a cell which has the text "5 Minutes" in the column header and "EUR/USD" in the row header, but the row header is in a row-spanned cell two rows above the row of interest? The html is at https://www.investing.com/technical/technical-summary

A direct regex would be too complex (though I believe it's possible). It is better to parse the HTML table into a DOM and then just traverse the DOM, since you know the row and column numbers of interest.
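
A minimal sketch of the DOM approach, assuming the FCL units DOM, DOM_HTML and SAX_HTML and their `ReadHTMLFile` procedure. The table below is invented for the example and is much simpler than the real page (no row-spanned cells); `CollectElements` is a hypothetical helper, not part of any library.

```pascal
program TableCellDemo;
{$mode objfpc}{$H+}
uses
  Classes, SysUtils, DOM, DOM_HTML, SAX_HTML;

// Collects, in document order, all elements below Node whose name
// equals AName (compared case-insensitively).
procedure CollectElements(Node: TDOMNode; const AName: string; List: TFPList);
var
  child: TDOMNode;
begin
  child := Node.FirstChild;
  while child <> nil do
  begin
    if (child.NodeType = ELEMENT_NODE) and
       SameText(string(child.NodeName), AName) then
      List.Add(child);
    CollectElements(child, AName, List);
    child := child.NextSibling;
  end;
end;

const
  // Hypothetical table; not the markup of the real page
  SampleHtml = '<html><body><table>' +
    '<tr><th>Pair</th><th>5 Minutes</th><th>15 Minutes</th></tr>' +
    '<tr><td>EUR/USD</td><td>Strong Buy</td><td>Neutral</td></tr>' +
    '</table></body></html>';
var
  doc: THTMLDocument;
  src: TStringStream;
  rows, cells: TFPList;
begin
  doc := nil;
  src := TStringStream.Create(SampleHtml);
  rows := TFPList.Create;
  cells := TFPList.Create;
  try
    ReadHTMLFile(doc, src);           // parse the stream into a DOM tree
    CollectElements(doc, 'tr', rows);
    if rows.Count > 1 then
    begin
      // Row index 1 is the first data row, cell index 1 the "5 Minutes" column
      CollectElements(TDOMNode(rows[1]), 'td', cells);
      if cells.Count > 1 then
        WriteLn(Trim(string(TDOMNode(cells[1]).TextContent)));
    end;
  finally
    cells.Free;
    rows.Free;
    doc.Free;
    src.Free;
  end;
end.
```

For the real page the row and column indices would have to be adjusted by hand, including the offset caused by the row-spanned header cell.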

z505

  • New member
  • *
  • Posts: 38
  • think first, code after
Re: How to get a string data from a web page??
« Reply #21 on: May 11, 2017, 06:02:42 pm »
Use the fasthtmlparser which comes with fpc. See http://forum.lazarus.freepascal.org/index.php/topic,35980.msg239199.html#msg239199 for an example of its usage which you certainly can generalize to your application.

And once you have the tags, you can analyze them further with another unit that ships alongside fasthtmlparser: htmlutil.pas
Code: Pascal
  function GetVal(tag, attribname_ci: string): string;
  function GetTagName(Tag: string): string;

  function GetUpTagName(tag: string): string;
  function GetNameValPair(tag, attribname_ci: string): string;
  function GetValFromNameVal(namevalpair: string): string;

Or these units:
https://github.com/z505/powtils/blob/master/dev/main/pwhtmtils.pas
https://github.com/z505/powtils/blob/master/dev/main/pwhtmtool.pas

Which include these additional functions to analyze html tags:
Code: Pascal
  function IsTag(TagType: string; Tag: string): boolean;
  function IsCloseTag(TagType: string; Tag: string): boolean;
  function Substr(sub, s: string): boolean;
  function StripTabs(s: string): string;
  function ReturnsToSpaces(s: string): string;
  function LessenSpaces(s: string): string;
  function CleanHtm1(s: string): string;

Some people use fasthtmlparser without realizing that there are more tools to analyze the tags further, instead of just using Pos() all the time to search the raw tags.
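
A small sketch of how the two units can be combined, assuming the `THTMLParser` class of fasthtmlparser with its `OnFoundTag` event (two string parameters: the upper-cased tag and the tag as written) and the `GetUpTagName` and `GetVal` functions listed above. The class `TLinkPrinter` and the sample HTML are made up for the example.

```pascal
program LinkLister;
{$mode objfpc}{$H+}
uses
  FastHTMLParser, HTMLUtil;

type
  // Event handlers must be methods, so a small helper class is needed
  TLinkPrinter = class
    procedure FoundTag(NoCaseTag, ActualTag: string);
  end;

procedure TLinkPrinter.FoundTag(NoCaseTag, ActualTag: string);
begin
  // NoCaseTag is the upper-cased tag, ActualTag keeps the original case
  if GetUpTagName(NoCaseTag) = 'A' then
    WriteLn(GetVal(ActualTag, 'href'));  // attribute name is case-insensitive
end;

var
  parser: THTMLParser;
  printer: TLinkPrinter;
  src: string;
begin
  // Hypothetical input
  src := '<p><a HREF="http://wiki.freepascal.org/Regexpr">wiki</a></p>';
  printer := TLinkPrinter.Create;
  parser := THTMLParser.Create(src);
  try
    parser.OnFoundTag := @printer.FoundTag;
    parser.Exec;  // walks the string and fires the event for every tag
  finally
    parser.Free;
    printer.Free;
  end;
end.
```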

Thaddy

  • Hero Member
  • *****
  • Posts: 4439
Re: (Solved)How to get a string data from a web page??
« Reply #22 on: May 11, 2017, 10:16:29 pm »
Quote from: Leledumbo
A direct regex would be too complex (though I believe it's possible), better parse the HTML table into DOM then just traverse the DOM (since you know the row and column number of interest).

How complex is this?  8-) O:-)
Code: Pascal
  program retest;
  {$apptype console}
  {$mode delphi}{$H+}
  uses
    Classes, RegExpr;
  var
    List: TStrings;
  begin
    List := TStringList.Create;
    try
      List.LoadFromFile('freepascal.html');
      // Drop <script> and <style> blocks together with their content
      List.Text := ReplaceRegExpr('<(script|style).*?</\1>', List.Text, '', False);
      // Drop all remaining tags, leaving only the text
      List.Text := ReplaceRegExpr('<.*?>', List.Text, '', False);
      WriteLn(List.Text);
    finally
      List.Free;
    end;
    ReadLn;
  end.

Maybe needs some tidying, but it's half the job done.
« Last Edit: May 11, 2017, 10:43:19 pm by Thaddy »
"Logically, no number of positive outcomes at the level of experimental testing can confirm a scientific theory, but a single counterexample is logically decisive."

z505

  • New member
  • *
  • Posts: 38
  • think first, code after
Re: (Solved)How to get a string data from a web page??
« Reply #23 on: May 12, 2017, 12:49:03 pm »
Well, at least you didn't have to create and free the regexpr object. Is that using Free Pascal's own regexpr unit rather than the other TRegExpr?


Thaddy

  • Hero Member
  • *****
  • Posts: 4439
Re: (Solved)How to get a string data from a web page??
« Reply #24 on: May 12, 2017, 01:21:21 pm »
Quote from: z505
well at least you didn't have to free and create the regexpr object, is that using freepascal's own regexpr unit rather than the other TRegexpr?

In case you are interested, see how ReplaceRegExpr in the RegExpr unit wraps TRegExpr  O:-)
Also note that this code is much faster (a factor of 100, not 2) than using fasthtmlparser. (But it needs work for e.g. &nbsp;)
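
The missing entity handling could be sketched with plain `StringReplace` calls from SysUtils. The function name `DecodeBasicEntities` is made up, and only a handful of common entities is covered; numeric entities and the long list of named ones are left out.

```pascal
program EntityDemo;
{$mode objfpc}{$H+}
uses
  SysUtils;

// Replaces a few common HTML entities by the characters they stand for.
function DecodeBasicEntities(const S: string): string;
begin
  Result := StringReplace(S, '&nbsp;', ' ', [rfReplaceAll]);
  Result := StringReplace(Result, '&lt;', '<', [rfReplaceAll]);
  Result := StringReplace(Result, '&gt;', '>', [rfReplaceAll]);
  Result := StringReplace(Result, '&quot;', '"', [rfReplaceAll]);
  // &amp; goes last so that '&amp;lt;' becomes '&lt;' rather than '<'
  Result := StringReplace(Result, '&amp;', '&', [rfReplaceAll]);
end;

begin
  WriteLn(DecodeBasicEntities('EUR/USD&nbsp;&amp;&nbsp;GBP/USD'));  // prints: EUR/USD & GBP/USD
end.
```

It would be applied to the text after the tags have been stripped, otherwise a decoded `<` could be mistaken for the start of a tag.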
« Last Edit: May 12, 2017, 01:24:54 pm by Thaddy »
"Logically, no number of positive outcomes at the level of experimental testing can confirm a scientific theory, but a single counterexample is logically decisive."

wp

  • Hero Member
  • *****
  • Posts: 3741
Re: (Solved)How to get a string data from a web page??
« Reply #25 on: May 14, 2017, 05:23:49 pm »
Quote from: Thaddy
Also note this code is much faster (factor 100 not 2) than using fasthtmlparser. (But it needs work for e.g. &nbsp)

Factor 100? Thaddy, this is cheating...  ;D

I assume that you are referring to my old demo from http://forum.lazarus.freepascal.org/index.php/topic,35980.msg239199.html#msg239199. I did not write it with a speed test in mind. You should know that adding strings to Memo.Lines.Text is extremely expensive. If I modify the demo to write the found text nodes to a memory stream instead, there is a tremendous speed increase, by a factor of 50. The remaining speed disadvantage compared to regexpr is probably due to the fact that fasthtmlparser unnecessarily converts all tags to upper case.

See the modified demo in the attachment. It also contains a comparison with the htmlutil functions proposed by z505. Since these functions do more than just compare two strings, this version is slower, by almost a factor of 2, than the dumb-old "pos" method used by "some people".

z505

  • New member
  • *
  • Posts: 38
  • think first, code after
Re: (Solved)How to get a string data from a web page??
« Reply #26 on: May 17, 2017, 09:14:38 am »
Well, the functions in htmlutil could be optimized. I never worked on optimizing them because they were fast enough for my needs (parsing thousands of eBay pages and Yahoo stock pages)... The htmlutil unit could even use Pos() internally itself, instead of other things.

But indeed, with a Memo you can, I believe, speed it up with a trick... Lazarus does this trick when compiling. It might be BeginUpdate/EndUpdate or something like it, or, like you say, a memory stream. If the code is not using Memo.Lines.Add but actually modifying the memo text itself as a string, then the solution is a CapString algorithm of mine, similar to what GoLang does with cap() for its arrays/slices... It grows the string in chunks instead of making thousands of small memory allocations, similar to a buffered writeln.
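
To illustrate the chunked-growth idea, here is a minimal sketch. It is not z505's actual CapString code; the record `TCapString` and the three routines are made up for this example. The buffer is doubled whenever it runs out of room, so many appends cause only a few reallocations.

```pascal
program CapStringDemo;
{$mode objfpc}{$H+}

type
  // Buf may be longer than the text stored in it; Len is the used part
  TCapString = record
    Buf: string;
    Len: Integer;
  end;

procedure CapInit(out C: TCapString);
begin
  C.Buf := '';
  C.Len := 0;
end;

procedure CapAppend(var C: TCapString; const S: string);
var
  needed, cap: Integer;
begin
  if S = '' then Exit;
  needed := C.Len + Length(S);
  cap := Length(C.Buf);
  if needed > cap then
  begin
    // Grow in doubling steps instead of one allocation per append
    if cap = 0 then cap := 64;
    while cap < needed do cap := cap * 2;
    SetLength(C.Buf, cap);
  end;
  Move(S[1], C.Buf[C.Len + 1], Length(S));
  C.Len := needed;
end;

function CapToString(const C: TCapString): string;
begin
  Result := Copy(C.Buf, 1, C.Len);  // only the used part
end;

var
  text: TCapString;
  i: Integer;
begin
  CapInit(text);
  for i := 1 to 1000 do
    CapAppend(text, 'x');
  WriteLn(Length(CapToString(text)));  // prints: 1000
  WriteLn(Length(text.Buf));           // prints: 1024 (the capacity)
end.
```

In this run the buffer is resized only five times (64, 128, 256, 512, 1024) for 1000 appends.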

The uppercasing is for ease of use. Case sensitivity in html tags is annoying to the end user of the library. If he is parsing for <strong> and in the source it is actually <STRONG>, then his whole parser is broken because of case sensitivity, so he has to pollute his own source code with upcase() calls instead of having them in the library itself.

An uppercase boolean could be used so that the function does not call it automatically, but then the person may end up calling it himself anyway, thousands of times in a loop in his application code.

If you don't upcase each and every tag, how do you know for sure that a page doesn't have a <stRanGe> case tag? The code you write may be relying on case sensitivity in html, which is a bad thing, IMO, unless you know the website will never be modified and you are dealing with permanently fixed html that will not change in the future (some developer could modify the html and make a <miStake>, or write one tag in <UPPERCASE> but not the others). Reliability vs. performance in the parser ;-)

« Last Edit: May 17, 2017, 09:24:26 am by z505 »

wp

  • Hero Member
  • *****
  • Posts: 3741
Re: (Solved)How to get a string data from a web page??
« Reply #27 on: May 17, 2017, 10:28:20 am »
fasthtmlparser being a very basic class, I would have preferred the UpperCase to be optional. Suppose you search for '<a ...' tags. Then it will be much faster to initially compare the second character of the tag against 'a' or 'A' and the third against ' ' than to uppercase the entire string.
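
The character comparison described here could look like the following sketch. The function name `IsAnchorTag` is made up, and besides the space it also accepts a tab or '>' as the third character, which goes slightly beyond what is said above.

```pascal
program AnchorCheck;
{$mode objfpc}{$H+}

// True for tags such as '<a href=...>' or '<A HREF=...>'.
// Only three characters are inspected; no copy and no UpperCase call.
function IsAnchorTag(const Tag: string): Boolean;
begin
  Result := (Length(Tag) >= 3) and (Tag[1] = '<') and
            (Tag[2] in ['a', 'A']) and (Tag[3] in [' ', #9, '>']);
end;

begin
  WriteLn(IsAnchorTag('<A HREF="x">'));      // prints: TRUE
  WriteLn(IsAnchorTag('<abbr title="y">'));  // prints: FALSE
  WriteLn(IsAnchorTag('<a>'));               // prints: TRUE
end.
```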

GetUpTagName of the htmlutil unit gets close to what I mean, but it also does an initial UpperCase of the entire tag string. This version, which uppercases only the extracted tag name, is faster:

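Code: Pascal
  function MyGetUpTagName(tag: string): string;
  var
    P : PChar;
    S : PChar;
  begin
    // Original GetUpTagName: P := PChar(UpperCase(tag));
    P := PChar(tag);  // start at the raw tag, without uppercasing all of it
    // Skip the opening '<' and any leading whitespace
    while P^ in ['<',' ',#9] do inc(P);
    S := P;
    // Advance to the end of the tag name
    while not (P^ in [' ','>',#0]) do inc(P);
    if P > S then
      // Uppercase only the tag name; CopyBuffer is the same helper
      // that GetUpTagName in htmlutil uses
      Result := UpperCase(CopyBuffer(S, P-S))
      //Result := CopyBuffer(S, P-S)
    else
      Result := '';
  end;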
« Last Edit: May 17, 2017, 10:33:47 am by wp »

z505

  • New member
  • *
  • Posts: 38
  • think first, code after
Re: (Solved)How to get a string data from a web page??
« Reply #28 on: May 17, 2017, 10:59:01 am »
I can add it to GitHub, where fasthtmlparser is stored and updated right now, and it will have to be added to the FPC project as well. But the problem becomes what happens if you have tags like this:

<a HREF

and

<A href

and

<a href

Not just the tag name can vary in case, but also the attribute name/value pairs (mostly the name).

There are so many variations... not just the case of the tag name, but also that of the attribute NAMES... The values should probably be preserved in some cases, where values are case sensitive. So what should be done with the name=value pairs in the html tags?

<a href="" style="Should This be Case Sensitive?"

Or javascript

onClick vs ONCLICK vs OnClick, etc. and the function itself

<button onClick="caseSensitive()" 


That's why I sometimes just gave up and upcased it all; I was sick of it...

« Last Edit: May 17, 2017, 11:03:05 am by z505 »

wp

  • Hero Member
  • *****
  • Posts: 3741
Re: (Solved)How to get a string data from a web page??
« Reply #29 on: May 17, 2017, 11:49:04 am »
I can imagine... But the correct decision would have been to let the user decide: he still can apply Uppercase on his own if he needs to (e.g. for pure html), or skip it if he does not (e.g. if attribute names are case sensitive).

The reason why I am dwelling on avoiding Uppercase is that I tried to parse xml from huge xlsx files in a sax-like way using fasthtmlparser and found a significant speed loss due to the unnecessary uppercasing.

 
