How to get a string data from a web page??

wp

Hero Member
Posts: 11916

Re: How to get a string data from a web page??

« Reply #15 on: May 09, 2017, 05:58:56 pm »

What is your OS? I assume it is Windows. If you use the 32-bit version of Lazarus (and do not cross-compile), you must use the 32-bit versions of the two DLLs (https://indy.fulgan.com/SSL/openssl-1.0.2k-i386-win32.zip). If you use 64-bit Lazarus, you must use the 64-bit versions of the DLLs as well (https://indy.fulgan.com/SSL/openssl-1.0.2k-x64_86-win64.zip).

My program was tested as a 32-bit build on Windows 10.
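
If it is unclear which set of DLLs a given build needs, the program can report its own bitness. This is only a small sketch (the program name is made up for illustration); it relies on the pointer size and on the CPU64 define that FPC sets for 64-bit targets:

```pascal
program ShowBitness;
{$mode objfpc}{$H+}
begin
  // SizeOf(Pointer) is 4 in a 32-bit executable and 8 in a 64-bit one.
  WriteLn('This executable is ', SizeOf(Pointer) * 8, '-bit.');
  {$IFDEF CPU64}
  WriteLn('It needs the 64-bit OpenSSL DLLs.');
  {$ELSE}
  WriteLn('It needs the 32-bit OpenSSL DLLs.');
  {$ENDIF}
end.
```

The bitness that matters is that of the compiled executable, not that of the IDE, which is why cross-compiling changes the answer.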

Alternatively, download the HTML file from the Forex site with your browser (in Firefox: right-click on the loaded page and choose "Save Page As", or similar, from the context menu). Then modify the code of BtnExecuteClick in my program as follows to read and analyze the downloaded file:

Code: Pascal

procedure TForm1.BtnExecuteClick(Sender: TObject);
var
  stream: TMemoryStream;
  err: String;  // only needed when the download block below is re-enabled
begin
  Screen.Cursor := crHourGlass;
  stream := TMemoryStream.Create;
  try
    // Read the page saved by the browser instead of downloading it.
    stream.LoadFromFile('D:\Download\Forex Technical Analysis - Investing.com.htm');  // YOUR FILENAME HERE!
    {
    if not DownloadHTTP(EdURL.Text, stream, err) then
    begin
      MessageDlg(err, mtError, [mbOK], 0);
      exit;
    end;
    }
    // Show the raw HTML in the first editor.
    stream.Position := 0;
    SynEdit1.Lines.LoadFromStream(stream);

    // Rewind, clear the output editor and analyze the HTML.
    stream.Position := 0;
    SynEdit2.Lines.Clear;
    ExtractFromHtml(stream);

  finally
    stream.Free;
    Screen.Cursor := crDefault;
  end;
end;

Logged

lestroso

Full Member
Posts: 134

Re: (Solved)How to get a string data from a web page??

« Reply #16 on: May 09, 2017, 06:22:32 pm »

Dear wp,

thanks again for the time you have dedicated to me!

Yes, I think I'm working with 64-bit Lazarus on Windows 10, but I've tried your software with both the 32-bit and the 64-bit OpenSSL DLLs, and it worked fine.

Now I must learn a lot from your code. I only need to fetch the string in the 5-minute summary for EUR/USD: neutral, buy, strong buy, strong sell, or sell. That's all.
Best regards,

lestroso

Logged

valdir.marcos

Hero Member
Posts: 1106

Re: (Solved)How to get a string data from a web page??

« Reply #17 on: May 09, 2017, 06:47:22 pm »

Quote from: lestroso on May 09, 2017, 06:22:32 pm

I only need to fetch the string in the 5-minute summary for EUR/USD: neutral, buy, strong buy, strong sell, or sell. That's all.

Try regular expressions:
http://www.regular-expressions.info
http://wiki.freepascal.org/Regexpr
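
As a rough idea of what this could look like with the RegExpr unit that ships with FPC: the sketch below returns the first signal word found in a string. The function name FirstSignal and the HTML fragment are made up for illustration; on the real page the fragment would first have to be narrowed down to the cell of interest, otherwise the first signal word anywhere on the page is returned.

```pascal
program FindSignal;
{$mode objfpc}{$H+}
uses
  RegExpr;

// Returns the first signal word found in s, or '' if there is none.
function FirstSignal(const s: string): string;
var
  re: TRegExpr;
begin
  Result := '';
  re := TRegExpr.Create;
  try
    re.Expression := 'strong buy|strong sell|neutral|buy|sell';
    re.ModifierI := True;  // match case-insensitively
    if re.Exec(s) then
      Result := re.Match[0];
  finally
    re.Free;
  end;
end;

begin
  // Hypothetical fragment; the markup of the real page will differ.
  WriteLn(FirstSignal('<td class="left">Summary</td><td>Strong Buy</td>'));
end.
```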

Logged

avra

Hero Member
Posts: 2514

Re: (Solved)How to get a string data from a web page??

« Reply #18 on: May 10, 2017, 10:34:41 am »

Alternatively, XPath or an HTML parser can be used:
http://www.benibela.de/documentation/internettools/
http://www.benibela.de/documentation/internettools/extendedhtmlparser.THtmlTemplateParser.html

Logged

wp

Hero Member
Posts: 11916

Re: (Solved)How to get a string data from a web page??

« Reply #19 on: May 10, 2017, 10:59:02 am »

Quote from: valdir.marcos on May 09, 2017, 06:47:22 pm

Try regular expressions

The syntax of regular expressions has always been one of the great mysteries of programming to me...

How would you formulate a regular expression that extracts from an HTML table the contents of the cell which has the text "5 Minutes" in the column header and "EUR/USD" in the row header, when the row header sits in a row-spanned cell two rows above the row of interest? The HTML is at https://www.investing.com/technical/technical-summary

Logged

Leledumbo

Hero Member
Posts: 8757
Programming + Glam Metal + Tae Kwon Do = Me

Re: (Solved)How to get a string data from a web page??

« Reply #20 on: May 10, 2017, 12:20:28 pm »

Quote from: wp on May 10, 2017, 10:59:02 am

How would you formulate a regular expression that extracts from an HTML table the contents of the cell which has the text "5 Minutes" in the column header and "EUR/USD" in the row header, when the row header sits in a row-spanned cell two rows above the row of interest? The HTML is at https://www.investing.com/technical/technical-summary

A direct regex would be too complex (though I believe it's possible). It is better to parse the HTML table into a DOM and then just traverse the DOM, since you know the row and column numbers of interest.
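
The sketch below is not a real DOM, but it shows the same idea on a small scale: the event-based fasthtmlparser unit from FPC collects the table cells into a grid, which is then addressed by row and column. TGridBuilder, StartsTag, CellText and the sample table are made up for illustration. The sketch ignores rowspan and colspan, so in rows that follow a row-spanned cell the column indexes shift and have to be adjusted by hand.

```pascal
program TableGrid;
{$mode objfpc}{$H+}
uses
  SysUtils, FastHTMLParser;

type
  // Collects the text of every table cell: Cells[row][col].
  TGridBuilder = class
  public
    Cells: array of array of string;
    InCell: Boolean;
    procedure FoundTag(NoCaseTag, ActualTag: string);
    procedure FoundText(Text: string);
  end;

// True if the upper-cased tag starts with '<' + Name followed by ' ' or '>'.
function StartsTag(const UpTag, Name: string): Boolean;
var
  n: Integer;
begin
  n := Length(Name) + 1;
  Result := (Copy(UpTag, 1, n) = '<' + Name) and
            (Length(UpTag) > n) and (UpTag[n + 1] in [' ', '>']);
end;

procedure TGridBuilder.FoundTag(NoCaseTag, ActualTag: string);
var
  r: Integer;
begin
  if StartsTag(NoCaseTag, 'TR') then
  begin
    SetLength(Cells, Length(Cells) + 1);        // start a new, empty row
    InCell := False;
  end
  else if StartsTag(NoCaseTag, 'TD') or StartsTag(NoCaseTag, 'TH') then
  begin
    r := High(Cells);
    if r < 0 then Exit;                         // cell outside of any row
    SetLength(Cells[r], Length(Cells[r]) + 1);  // start a new, empty cell
    InCell := True;
  end
  else if StartsTag(NoCaseTag, '/TD') or StartsTag(NoCaseTag, '/TH') then
    InCell := False;
end;

procedure TGridBuilder.FoundText(Text: string);
var
  r, c: Integer;
begin
  if not InCell then Exit;
  r := High(Cells);
  c := High(Cells[r]);
  Cells[r][c] := Cells[r][c] + Text;            // append to the current cell
end;

// Returns the trimmed text of the cell at (Row, Col), both zero-based.
function CellText(const Html: string; Row, Col: Integer): string;
var
  builder: TGridBuilder;
  parser: THTMLParser;
begin
  Result := '';
  if Html = '' then Exit;
  builder := TGridBuilder.Create;
  parser := THTMLParser.Create(Html);
  try
    parser.OnFoundTag := @builder.FoundTag;
    parser.OnFoundText := @builder.FoundText;
    parser.Exec;
    if (Row >= 0) and (Row <= High(builder.Cells)) and
       (Col >= 0) and (Col <= High(builder.Cells[Row])) then
      Result := Trim(builder.Cells[Row][Col]);
  finally
    parser.Free;
    builder.Free;
  end;
end;

const
  // Hypothetical table; the real markup is more complex.
  Sample =
    '<table>' +
    '<tr><th>Pair</th><th>5 Minutes</th><th>Hourly</th></tr>' +
    '<tr><td>EUR/USD</td><td>Strong Buy</td><td>Neutral</td></tr>' +
    '</table>';
begin
  // Row 1, column 1: the "5 Minutes" cell of the EUR/USD row.
  WriteLn(CellText(Sample, 1, 1));
end.
```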

Logged

z505

New Member
Posts: 38
think first, code after

Re: How to get a string data from a web page??

« Reply #21 on: May 11, 2017, 06:02:42 pm »

Quote from: wp on May 05, 2017, 06:42:38 pm

Use the fasthtmlparser which comes with fpc. See http://forum.lazarus.freepascal.org/index.php/topic,35980.msg239199.html#msg239199 for an example of its usage which you can certainly generalize to your application.

And once you have the tags, you can analyze them further with another unit that is included alongside fasthtmlparser: htmlutil.pas.

Code: Pascal

function GetVal(tag, attribname_ci: string): string;
function GetTagName(Tag: string): string;
 
function GetUpTagName(tag: string): string;
function GetNameValPair(tag, attribname_ci: string): string;
function GetValFromNameVal(namevalpair: string): string;
 

Or these units:
https://github.com/z505/powtils/blob/master/dev/main/pwhtmtils.pas
https://github.com/z505/powtils/blob/master/dev/main/pwhtmtool.pas

Which include these additional functions to analyze html tags:

Code: Pascal

function IsTag(TagType: string; Tag: string): boolean;
function IsCloseTag(TagType: string; Tag: string): boolean;
function Substr(sub, s: string): boolean;
function StripTabs(s: string): string;
function ReturnsToSpaces(s: string): string;
function LessenSpaces(s: string): string;
function CleanHtm1(s: string): string;

Some people use fasthtmlparser without realizing that there are more tools for analyzing the tags further, instead of just calling Pos() all the time to search the raw tags.
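
A minimal sketch of how the htmlutil functions listed further up might be called on a single tag, assuming the unit is available as HTMLUtil in the FPC installation. The tag is made up for illustration; in practice it would be a tag handed over by fasthtmlparser.

```pascal
program TagInfo;
{$mode objfpc}{$H+}
uses
  HTMLUtil;
const
  // Hypothetical tag, as it could arrive in an OnFoundTag handler.
  Tag = '<a href="https://www.investing.com/technical/technical-summary" class="nav">';
begin
  WriteLn('Tag name : ', GetUpTagName(Tag));       // upper-cased tag name
  WriteLn('href     : ', GetVal(Tag, 'href'));     // attribute name is case-insensitive
  WriteLn('class    : ', GetVal(Tag, 'class'));
end.
```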

Logged

Thaddy

Hero Member
Posts: 14373
Sensorship about opinions does not belong here.

Re: (Solved)How to get a string data from a web page??

« Reply #22 on: May 11, 2017, 10:16:29 pm »

Quote from: Leledumbo on May 10, 2017, 12:20:28 pm

A direct regex would be too complex (though I believe it's possible). It is better to parse the HTML table into a DOM and then just traverse the DOM, since you know the row and column numbers of interest.

How complex is this?

Code: Pascal

program retest;
{$apptype console}
{$mode delphi}{$H+}
uses
  Classes, RegExpr;
var
  List: TStrings;
begin
  List := TStringList.Create;
  try
    List.LoadFromFile('freepascal.html');
    // Remove <script> and <style> elements including their content.
    List.Text := ReplaceRegExpr('<(script|style).*?</\1>', List.Text, '', False);
    // Remove all remaining tags; only the text nodes are left.
    List.Text := ReplaceRegExpr('<.*?>', List.Text, '', False);
    WriteLn(List.Text);
  finally
    List.Free;
  end;
  ReadLn;  // keep the console window open
end.

Maybe needs some tidying, but it's half the job done.

« Last Edit: May 11, 2017, 10:43:19 pm by Thaddy »

Logged

Object Pascal programmers should get rid of their "component fetish" especially with the non-visuals.

z505

New Member
Posts: 38
think first, code after

Re: (Solved)How to get a string data from a web page??

« Reply #23 on: May 12, 2017, 12:49:03 pm »

Well, at least you didn't have to create and free the regexpr object. Is that using Free Pascal's own regexpr unit rather than the other TRegExpr?

Logged

Thaddy

Hero Member
Posts: 14373
Sensorship about opinions does not belong here.

Re: (Solved)How to get a string data from a web page??

« Reply #24 on: May 12, 2017, 01:21:21 pm »

Quote from: z505 on May 12, 2017, 12:49:03 pm

Well, at least you didn't have to create and free the regexpr object. Is that using Free Pascal's own regexpr unit rather than the other TRegExpr?

In case you are interested, see how ReplaceRegExpr in the RegExpr unit wraps TRegExpr.

Also note that this code is much faster (by a factor of 100, not 2) than using fasthtmlparser. (But it needs more work, e.g. for entities such as &nbsp;.)
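
The missing entity handling could be sketched roughly like this. DecodeBasicEntities is a made-up helper that covers only a handful of named entities and, as a simplification, turns &nbsp; into a plain space; it is not a complete decoder.

```pascal
program DecodeEntities;
{$mode objfpc}{$H+}
uses
  SysUtils;

// Replaces a few common named entities in text that has already been
// stripped of its tags.
function DecodeBasicEntities(const s: string): string;
begin
  Result := StringReplace(s,      '&nbsp;', ' ', [rfReplaceAll]);
  Result := StringReplace(Result, '&lt;',   '<', [rfReplaceAll]);
  Result := StringReplace(Result, '&gt;',   '>', [rfReplaceAll]);
  Result := StringReplace(Result, '&quot;', '"', [rfReplaceAll]);
  // '&amp;' goes last so that '&amp;lt;' becomes '&lt;' and not '<'.
  Result := StringReplace(Result, '&amp;',  '&', [rfReplaceAll]);
end;

begin
  // prints: EUR/USD & GBP/USD <5 min>
  WriteLn(DecodeBasicEntities('EUR/USD&nbsp;&amp;&nbsp;GBP/USD &lt;5 min&gt;'));
end.
```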

« Last Edit: May 12, 2017, 01:24:54 pm by Thaddy »

Logged

wp

Hero Member
Posts: 11916

Re: (Solved)How to get a string data from a web page??

« Reply #25 on: May 14, 2017, 05:23:49 pm »

Quote from: Thaddy on May 12, 2017, 01:21:21 pm

Also note that this code is much faster (by a factor of 100, not 2) than using fasthtmlparser. (But it needs more work, e.g. for entities such as &nbsp;.)

Factor 100? Thaddy, this is cheating...

I assume that you are referring to my old demo from http://forum.lazarus.freepascal.org/index.php/topic,35980.msg239199.html#msg239199. I did not write it with a speed test in mind. You should know that adding strings to Memo.Lines.Text is extremely expensive. If I modify the demo to write the found text nodes to a memory stream instead, the speed increases tremendously, by a factor of 50. The remaining speed disadvantage compared to regexpr is probably due to the fact that fasthtmlparser unnecessarily converts all tags to upper case.

See the modified demo in the attachment. It also contains a comparison with the htmlutil functions proposed by z505. Since these functions do more than just compare two strings, this version is almost a factor of 2 slower than the dumb old "pos" method used by "some people".

htmlextractor_speed.png (1.62 kB, 623x37 - viewed 343 times.)

HTMLExtractor_SpeedTest.zip (5.52 kB - downloaded 145 times.)

Logged

z505

New Member
Posts: 38
think first, code after

Re: (Solved)How to get a string data from a web page??

« Reply #26 on: May 17, 2017, 09:14:38 am »

Quote from: wp on May 14, 2017, 05:23:49 pm

Quote from: Thaddy on May 12, 2017, 01:21:21 pm
Also note that this code is much faster (by a factor of 100, not 2) than using fasthtmlparser. (But it needs more work, e.g. for entities such as &nbsp;.)

Factor 100? Thaddy, this is cheating...

I assume that you are referring to my old demo from http://forum.lazarus.freepascal.org/index.php/topic,35980.msg239199.html#msg239199. I did not write it with a speed test in mind. You should know that adding strings to Memo.Lines.Text is extremely expensive. If I modify the demo to write the found text nodes to a memory stream instead, the speed increases tremendously, by a factor of 50. The remaining speed disadvantage compared to regexpr is probably due to the fact that fasthtmlparser unnecessarily converts all tags to upper case.

See the modified demo in the attachment. It also contains a comparison with the htmlutil functions proposed by z505. Since these functions do more than just compare two strings, this version is almost a factor of 2 slower than the dumb old "pos" method used by "some people".

Well, the functions in htmlutil could be optimized. I never worked on optimizing them because they were fast enough for my needs (parsing thousands of eBay pages and Yahoo stock pages). The htmlutil unit could even use Pos() internally itself instead of other things.

But indeed, with a Memo I believe you can speed it up with a trick, which Lazarus also uses when compiling. It might be BeginUpdate/EndUpdate or something like it, or, as you say, a memory stream. If the code is not using Memo.Lines.Add but is actually modifying the memo text itself as a string, then a solution is a CapString algorithm, also by myself, similar to what Go does with cap() for its arrays/slices: it grows the string in chunks instead of making thousands of small memory allocations, similar to a buffered writeln.
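
The idea of growing a string in chunks can be sketched as follows. This is not the CapString code mentioned above, just a minimal illustration with made-up names (TGrowBuffer, BufAppend, BufToString) and an arbitrarily chosen minimum capacity of 1024 characters.

```pascal
program ChunkedAppend;
{$mode objfpc}{$H+}

type
  // Data may be longer than the Len characters that are actually in use.
  TGrowBuffer = record
    Data: string;
    Len: SizeInt;
  end;

procedure BufAppend(var b: TGrowBuffer; const s: string);
var
  needed, cap: SizeInt;
begin
  if s = '' then Exit;
  needed := b.Len + Length(s);
  if needed > Length(b.Data) then
  begin
    // Double the capacity so that many small appends cause few reallocations.
    cap := Length(b.Data) * 2;
    if cap < 1024 then cap := 1024;
    if cap < needed then cap := needed;
    SetLength(b.Data, cap);
  end;
  Move(s[1], b.Data[b.Len + 1], Length(s));  // copy into the unused tail
  Inc(b.Len, Length(s));
end;

// Returns only the part of the buffer that is in use.
function BufToString(const b: TGrowBuffer): string;
begin
  Result := Copy(b.Data, 1, b.Len);
end;

var
  buf: TGrowBuffer;
  i: Integer;
begin
  buf.Data := '';
  buf.Len := 0;
  for i := 1 to 10000 do
    BufAppend(buf, 'text node' + LineEnding);
  WriteLn('Characters collected: ', Length(BufToString(buf)));
end.
```

The final string would be assigned to the memo once, at the end, instead of after every text node.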

The uppercasing is for ease of use. Case sensitivity in HTML tags is annoying for the end user of the library: if he is parsing for <strong> and the source actually contains <STRONG>, his whole parser breaks because of case sensitivity, so he has to pollute his own source code with upcase() calls instead of having them in the library itself.

An uppercase boolean could be added so that the function does not uppercase automatically, but then the user may end up calling it himself anyway, thousands of times in a loop in his application code.

If you don't upcase each and every tag, how do you know for sure that a page doesn't contain a <stRanGe>-cased tag? The code you write may be relying on case sensitivity in HTML, which is a bad thing, IMO, unless you know the website will never be modified and you are dealing with permanently fixed HTML that will not change in the future. Some developer could modify the HTML and write a <miStake>, or <UPPERCASE> one tag but not others. Reliability vs. performance in the parser ;-)

« Last Edit: May 17, 2017, 09:24:26 am by z505 »

Logged

wp

Hero Member
Posts: 11916

Re: (Solved)How to get a string data from a web page??

« Reply #27 on: May 17, 2017, 10:28:20 am »

Since fasthtmlparser is a very basic class, I would have preferred the UpperCase to be optional. Suppose you search for '<a ...' tags. Then it is much faster to first compare the second character of the tag against 'a' or 'A' and the third against ' ' than to uppercase the entire string.
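
Written out, such a check could look like the sketch below (IsAnchorTag is a made-up name). It only inspects the first three characters and never copies or uppercases the tag.

```pascal
program AnchorCheck;
{$mode objfpc}{$H+}

// True for '<a ...' and '<A ...' without upper-casing the whole tag.
function IsAnchorTag(const Tag: string): Boolean;
begin
  Result := (Length(Tag) >= 3) and (Tag[1] = '<') and
            (Tag[2] in ['a', 'A']) and (Tag[3] = ' ');
end;

begin
  WriteLn(IsAnchorTag('<a href="x">'));      // TRUE
  WriteLn(IsAnchorTag('<A HREF="x">'));      // TRUE
  WriteLn(IsAnchorTag('<abbr title="x">'));  // FALSE: third character is 'b'
  WriteLn(IsAnchorTag('<b>'));               // FALSE
end.
```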

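GetUpTagName of the htmlutil unit comes close to what I mean, but it also does an initial UpperCase of the entire tag string. This is faster:

Code: Pascal

function MyGetUpTagName(tag: string): string;
var
  P : Pchar;
  S : Pchar;
begin
//  P := Pchar(uppercase(Tag));   // original: uppercases the entire tag first
  P := Pchar(tag);                // scan the tag as it is
  // skip the opening bracket and leading whitespace
  while P^ in ['<',' ',#9] do inc(P);
  S := P;
  // advance to the end of the tag name
  while Not (P^ in [' ','>',#0]) do inc(P);
  if P > S then
    // uppercase only the tag name; CopyBuffer is assumed to be the helper from htmlutil
    Result := Uppercase(CopyBuffer( S, P-S))
    //Result := CopyBuffer( S, P-S)
  else
   Result := '';
end;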

« Last Edit: May 17, 2017, 10:33:47 am by wp »

Logged

z505

New Member
Posts: 38
think first, code after

Re: (Solved)How to get a string data from a web page??

« Reply #28 on: May 17, 2017, 10:59:01 am »

I can add it on GitHub, where fasthtmlparser is currently stored and updated, and it will have to be added to the FPC project as well. But the problem becomes what happens if you have tags like these:

<a HREF

and

<A href

and

<a href

Not just the tag name could be case-insensitive, but also the names in the tag's name/value pairs.

There are so many variations: the case problem affects not just the tag name but also the attribute NAMES. The values should likely be preserved in some cases, since they can be case-sensitive. So what should be done with the name=value pairs in the HTML tags?

<a href="" style="Should This be Case Sensitive?"

Or javascript

onClick vs ONCLICK vs OnClick, etc. and the function itself

<button onClick="caseSensitive()"

That's why I sometimes just gave up, sick of it all, and upcased everything...
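
One possible middle ground, sketched here with a made-up function AttrValue: look up the attribute name case-insensitively in a lower-cased copy of the tag, but copy the value from the original tag so that its case is preserved. This is only an illustration; for instance, it would be fooled by an attribute value that itself contains ' name='.

```pascal
program AttrLookup;
{$mode objfpc}{$H+}
uses
  SysUtils;

// Returns the value of attribute Name (matched case-insensitively),
// with the value's original case preserved; '' if not found.
function AttrValue(const Tag, Name: string): string;
var
  p, q: SizeInt;
  quote: Char;
begin
  Result := '';
  // Search in a lower-cased copy; positions are the same as in Tag.
  p := Pos(' ' + LowerCase(Name) + '=', LowerCase(Tag));
  if p = 0 then Exit;
  p := p + Length(Name) + 2;           // first character after '='
  if p > Length(Tag) then Exit;
  if Tag[p] in ['"', ''''] then
  begin
    // Quoted value: read up to the matching quote.
    quote := Tag[p];
    Inc(p);
    q := p;
    while (q <= Length(Tag)) and (Tag[q] <> quote) do Inc(q);
  end
  else
  begin
    // Unquoted value: read up to the next space or '>'.
    q := p;
    while (q <= Length(Tag)) and not (Tag[q] in [' ', '>']) do Inc(q);
  end;
  Result := Copy(Tag, p, q - p);
end;

begin
  WriteLn(AttrValue('<button ONCLICK="caseSensitive()">', 'onClick'));  // caseSensitive()
  WriteLn(AttrValue('<A HREF="Page.html">', 'href'));                   // Page.html
end.
```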

« Last Edit: May 17, 2017, 11:03:05 am by z505 »

Logged

wp

Hero Member
Posts: 11916

Re: (Solved)How to get a string data from a web page??

« Reply #29 on: May 17, 2017, 11:49:04 am »

I can imagine... But the correct decision would have been to let the user decide: he can still apply Uppercase on his own if he needs to (e.g. for pure HTML), or skip it if he does not (e.g. if attribute names are case-sensitive).

The reason why I am dwelling on avoiding Uppercase is that I tried to parse XML from huge xlsx files in a SAX-like way using fasthtmlparser and found a significant speed loss due to the unnecessary uppercasing.

Logged


Author Topic: How to get a string data from a web page?? (Read 18814 times)
