
Author Topic: HTML files get values  (Read 6048 times)

alaa123456789

  • Jr. Member
  • **
  • Posts: 67
HTML files get values
« on: May 23, 2020, 08:42:26 am »
Hey guys,
I hope everyone is okay. I am struggling to find information about Lazarus units (the uses clause); there is very little documentation.
I am trying to read the HTML file below and get the value 20 (the last page number) from it.

Please help me.

Regards

Code: HTML5
<div class="pagination">
<div class="Zebra_Pagination">
<ul>
<li>
<a rel="nofollow" href="javascript:void(0)" class="navigation previous disabled">Previous
</a>
</li>
<li>
<a rel="nofollow" href="search?q=python&amp;pagecount=&amp;pubyear=&amp;searchin=" class="current">1
</a>
</li>
<li>
<a rel="nofollow" href="search?q=python&amp;pagecount=&amp;pubyear=&amp;searchin=&amp;page=2">2
</a>
</li>
<li>
<a rel="nofollow" href="search?q=python&amp;pagecount=&amp;pubyear=&amp;searchin=&amp;page=3">3
</a>
</li>
<li>
<a rel="nofollow" href="search?q=python&amp;pagecount=&amp;pubyear=&amp;searchin=&amp;page=4">4
</a>
</li>
<li>
<a rel="nofollow" href="search?q=python&amp;pagecount=&amp;pubyear=&amp;searchin=&amp;page=5">5
</a>
</li>
<li>
<a rel="nofollow" href="search?q=python&amp;pagecount=&amp;pubyear=&amp;searchin=&amp;page=6">6
</a>
</li>
<li>
<span></span>
</li>
<li>
<a rel="nofollow" href="search?q=python&amp;pagecount=&amp;pubyear=&amp;searchin=&amp;page=20">20
</a>
</li>
<li>
<a rel="nofollow" href="search?q=python&amp;pagecount=&amp;pubyear=&amp;searchin=&amp;page=2" class="navigation next">Next
</a>
</li>
</ul>
</div>
</div>

trev

  • Hero Member
  • *****
  • Posts: 714
  • Former Delphi 1-7 and 10.2 User
Re: HTML files get values
« Reply #1 on: May 23, 2020, 09:25:37 am »
Show us your code for this so far and tell us what happens.

[You might benefit from reading How to use the Forum too.]
o Lazarus v2.1.0 r63272, FPC v3.3.1 r45525, macOS 10.14.6 (with sup update), Xcode 11.3.1
o Lazarus v2.1.0 r61574, FPC v3.3.1 r42318, FreeBSD 12.1 amd64 (Parallels VM)
o FPC 3.0.4, FreeBSD 12-STABLE r361007 amd64
o Lazarus v2.1.0 r61574, FPC v3.0.4, Ubuntu 18.04 (Parallels VM)

alaa123456789

  • Jr. Member
  • **
  • Posts: 67
Re: HTML files get values
« Reply #2 on: May 23, 2020, 09:53:37 am »

Here is the code. I tried many times but couldn't get any results.

Code: Pascal
uses
  Classes, SysUtils, SAX_HTML, DOM, DOM_HTML, fphttpclient, Forms, Controls,
  Graphics, Dialogs, StdCtrls;
var
  doc: THTMLDocument;
  els: TDOMNodeList;
  f: THTMLElement;

Code: Pascal
ReadHTMLFile(doc, TStringStream.Create(s)); // s is the HTML content string
//f:=THTMLElement(doc.GetElementsByTagName('Zebra_Pagination'));
//Memo1.Lines.add(f.FirstChild.NodeValue);
els := doc.GetElementsByTagName('div');
if (els.Count) > 0 then begin
  Memo1.Lines.add(tdomelement(els[0]).getattribute('class'));
  Memo1.Lines.add(tdomelement(els[0]).getattribute('Zebra_Pagination'));
  Memo1.Lines.add((tdomelement(els[0]).FirstChild.TextContent));
end;

kqha

  • New Member
  • *
  • Posts: 12
Re: HTML files get values
« Reply #3 on: May 26, 2020, 02:33:30 pm »
I'm not really sure about TDOMNodeList, but if GetElementsByTagName works the way it does in JavaScript and the PHP DOM, your logic for using it to find the "20" text content looks wrong. Regardless, if I were you I would call GetElementsByTagName('a') instead and loop through the result to find the element that has "navigation next" in its class attribute, then take the element before it. Some pseudo code:

Code: Pascal
els := doc.GetElementsByTagName('a');
n := -1;  // index in els of the link holding the last page number; -1 means not found
for a := 0 to els.Count - 1 do
begin
  // Pos returns 0 when the substring is absent, so test for > 0
  if Pos('next', TDOMElement(els[a]).GetAttribute('class')) > 0 then
  begin
    n := a - 1;  // the link just before "Next"
    Break;
  end;
end;
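A DOM-based variation on the idea above, as a minimal sketch rather than a definitive implementation: instead of locating the "Next" link, it keeps the largest link text that parses as a number. The program name, the function name LastPageFromLinks and the shortened SampleHtml constant are hypothetical, and the sketch assumes the SAX_HTML reader accepts a fragment without an html/body wrapper.

```pascal
program DomLastPage;

{$mode objfpc}{$H+}

uses
  Classes, SysUtils, DOM, DOM_HTML, SAX_HTML;

const
  // Shortened, hypothetical sample modelled on the markup in the first post.
  SampleHtml =
    '<div class="Zebra_Pagination"><ul>' +
    '<li><a href="search?q=python" class="current">1</a></li>' +
    '<li><a href="search?q=python&amp;page=2">2</a></li>' +
    '<li><a href="search?q=python&amp;page=20">20</a></li>' +
    '<li><a href="search?q=python&amp;page=2" class="navigation next">Next</a></li>' +
    '</ul></div>';

// Returns the largest numeric link text, or -1 if no link text is a number.
function LastPageFromLinks(const Html: string): Integer;
var
  Doc: THTMLDocument;
  Src: TStringStream;
  Links: TDOMNodeList;
  i, n: Integer;
begin
  Result := -1;
  Doc := nil;
  Src := TStringStream.Create(Html);
  try
    ReadHTMLFile(Doc, Src);
    // The list is not freed here; this assumes a recent FPC in which the
    // owning document keeps track of the node lists it hands out.
    Links := Doc.GetElementsByTagName('a');
    for i := 0 to Integer(Links.Count) - 1 do
    begin
      // "Previous" and "Next" do not parse as numbers and yield -1.
      n := StrToIntDef(Trim(string(Links[i].TextContent)), -1);
      if n > Result then
        Result := n;
    end;
  finally
    Doc.Free;
    Src.Free;
  end;
end;

begin
  WriteLn('Last page number is ', LastPageFromLinks(SampleHtml));
end.
```

Taking the maximum avoids depending on the position of the "Next" link, but it would also pick up any unrelated link whose text happens to be a larger number.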

howardpc

  • Hero Member
  • *****
  • Posts: 3443
Re: HTML files get values
« Reply #4 on: May 26, 2020, 03:19:09 pm »
When parsing text files, and only one or two values are of interest, it is sometimes easier just to brute force a hacked solution, rather than spend time trying to understand how to use a complex library you did not write and are unfamiliar with, particularly if it has limited documentation.
Say, something along these lines:
Code: Pascal
program Project1;

{$mode objfpc}{$H+}

const
  HTMLText = '    <div class="pagination">'+
    '<div class="Zebra_Pagination">'+
    '<ul>'+
    '<li>'+
    '<a rel="nofollow" href="javascript:void(0)" class="navigation previous disabled">Previous'+
    '</a>'+
    '</li>'+
    '<li>'+
    '<a rel="nofollow" href="search?q=python&amp;pagecount=&amp;pubyear=&amp;searchin=" class="current">1'+
    '</a>'+
    '</li>'+
    '<li>'+
    '<a rel="nofollow" href="search?q=python&amp;pagecount=&amp;pubyear=&amp;searchin=&amp;page=2">2'+
    '</a>'+
    '</li>'+
    '<li>'+
    '<a rel="nofollow" href="search?q=python&amp;pagecount=&amp;pubyear=&amp;searchin=&amp;page=3">3'+
    '</a>'+
    '</li>'+
    '<li>'+
    '<a rel="nofollow" href="search?q=python&amp;pagecount=&amp;pubyear=&amp;searchin=&amp;page=4">4'+
    '</a>'+
    '</li>'+
    '<li>'+
    '<a rel="nofollow" href="search?q=python&amp;pagecount=&amp;pubyear=&amp;searchin=&amp;page=5">5'+
    '</a>'+
    '</li>'+
    '<li>'+
    '<a rel="nofollow" href="search?q=python&amp;pagecount=&amp;pubyear=&amp;searchin=&amp;page=6">6'+
    '</a>'+
    '</li>'+
    '<li>'+
    '<span>…</span>'+
    '</li>'+
    '<li>'+
    '<a rel="nofollow" href="search?q=python&amp;pagecount=&amp;pubyear=&amp;searchin=&amp;page=20">20'+
    '</a>'+
    '</li>'+
    '<li>'+
    '<a rel="nofollow" href="search?q=python&amp;pagecount=&amp;pubyear=&amp;searchin=&amp;page=2" class="navigation next">Next'+
    '</a>'+
    '</li>'+
    '</ul>'+
    '</div>'+
    '</div>';

  function GetLastPageNo(const anHTMLText: String): Integer;
  var
    p, pStart, pEnd: PChar;
    arr: array of LongInt = Nil;
    s: String;
    idx: Integer = -1;
  begin
    Result := -1;
    p := PChar(anHTMLText);
    pEnd := p;
    // room for the page numbers found; unused slots stay 0
    SetLength(arr, Length(anHTMLText) shr 4);
    Inc(pEnd, Length(anHTMLText));
    while p < pEnd do
      begin
        Inc(p);
        // look for "page=" followed by a digit
        if (p^ = 'p') and (p[1] = 'a') and (p[2] = 'g') and (p[3] = 'e') and
          (p[4] = '=') and (p[5] in ['0'..'9']) then
            begin
              Inc(p, 5);
              pStart := p;
              while p[1] in ['0'..'9'] do
                Inc(p);
              // copy the digits and store them as a number
              SetString(s, pStart, Succ(p - pStart));
              Inc(idx);
              ReadStr(s, arr[idx]);
            end;
      end;

    // the last page number is the largest value found
    for idx in arr do
      if idx > Result then
        Result := idx;
  end;

begin
  WriteLn('Last page number is ',GetLastPageNo(HTMLText));
  ReadLn;
end.

rvk

  • Hero Member
  • *****
  • Posts: 4143
Re: HTML files get values
« Reply #5 on: May 26, 2020, 04:39:28 pm »
Quote
When parsing text files, and only one or two values are of interest, it is sometimes easier just to brute force a hacked solution, rather than spend time trying to understand how to use a complex library you did not write and are unfamiliar with, particularly if it has limited documentation.

Indeed. And in that case using regexpr would be even more simple  :D

(I know, I know.... never ever parse HTML with regexpr  :P )

And this only works if there are no other page=x expressions on the page (at least none with a number higher than the one you want).

Code: Pascal
program Project1;

{$mode objfpc}{$H+}
uses
  regexpr, sysutils;

const
  HTMLText = '    <div class="pagination">' + '<div class="Zebra_Pagination">' +
    '<ul>' + '<li>' +
    '<a rel="nofollow" href="javascript:void(0)" class="navigation previous disabled">Previous'
    + '</a>' + '</li>' + '<li>' +
    '<a rel="nofollow" href="search?q=python&amp;pagecount=&amp;pubyear=&amp;searchin=" class="current">1'
    + '</a>' + '</li>' + '<li>' +
    '<a rel="nofollow" href="search?q=python&amp;pagecount=&amp;pubyear=&amp;searchin=&amp;page=2">2'
    + '</a>' + '</li>' + '<li>' +
    '<a rel="nofollow" href="search?q=python&amp;pagecount=&amp;pubyear=&amp;searchin=&amp;page=3">3'
    + '</a>' + '</li>' + '<li>' +
    '<a rel="nofollow" href="search?q=python&amp;pagecount=&amp;pubyear=&amp;searchin=&amp;page=4">4'
    + '</a>' + '</li>' + '<li>' +
    '<a rel="nofollow" href="search?q=python&amp;pagecount=&amp;pubyear=&amp;searchin=&amp;page=5">5'
    + '</a>' + '</li>' + '<li>' +
    '<a rel="nofollow" href="search?q=python&amp;pagecount=&amp;pubyear=&amp;searchin=&amp;page=6">6'
    + '</a>' + '</li>' + '<li>' + '<span>…</span>' + '</li>' +
    '<li>' +
    '<a rel="nofollow" href="search?q=python&amp;pagecount=&amp;pubyear=&amp;searchin=&amp;page=20">20'
    +
    '</a>' + '</li>' + '<li>' +
    '<a rel="nofollow" href="search?q=python&amp;pagecount=&amp;pubyear=&amp;searchin=&amp;page=2" class="navigation next">Next'
    +
    '</a>' + '</li>' + '</ul>' + '</div>' + '</div>';

  function GetLastPageNo(const anHTMLText: string): integer;
  var
    re: TRegExpr;
    n: integer;
  begin
    Result := -1;
    re := TRegExpr.Create('&amp;page=(.*?)">');
    try
      // Exec finds the first match; check it too before moving on with ExecNext
      if re.Exec(anHTMLText) then
        repeat
          n := StrToIntDef(re.Match[1], 0);
          if n > Result then Result := n;
        until not re.ExecNext;
    finally
      re.Free;
    end;
  end;

begin
  WriteLn('Last page number is ', GetLastPageNo(HTMLText));
  ReadLn;
end.

trev

  • Hero Member
  • *****
  • Posts: 714
  • Former Delphi 1-7 and 10.2 User
Re: HTML files get values
« Reply #6 on: May 27, 2020, 07:45:51 am »
Quote
Indeed. And in that case using regexpr would be even more simple  :D

(I know, I know.... never ever parse HTML with regexpr  :P )

You beat me to it - a regex would be my suggestion too :)

Except my regex would have been:
Code: Pascal
re := TRegExpr.Create('&amp;page=([0-9]*)">');

alaa123456789

  • Jr. Member
  • **
  • Posts: 67
Re: HTML files get values
« Reply #7 on: May 29, 2020, 09:07:36 am »
 :D :D :D Thanks, all, for the replies.
It was useful information. I tried to call the function:
Code: Pascal
memo1.lines.add(getlastpageno(s));
but I got this error: "getlastpageno" is not an identifier.

Another thing: I have text like this
/python-data-analytics-data-analysis-and-science-using-pandas-matplotlib-and-the-python-e158003322.html
How do I get only the text, without the '-' characters and without 'e158003322.html'?
Second, how do I replace the 'e' in 'e158003322.html' with 'd'?

thanks
Alaa

howardpc

  • Hero Member
  • *****
  • Posts: 3443
Re: HTML files get values
« Reply #8 on: May 29, 2020, 09:12:43 am »
GetLastPageNo returns an integer. You can't add that to a TStrings instance without using a conversion function.
But the error you got suggests you have not implemented the function in your code, or you spelled it differently.
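
Both points can be sketched in a small console program: the function has to be declared before the code that calls it, and the integer has to go through IntToStr before it is added to a TStrings. The stub body and the TStringList standing in for Memo1.Lines are assumptions for illustration only.

```pascal
program AddPageNo;

{$mode objfpc}{$H+}

uses
  Classes, SysUtils;

// Hypothetical stub standing in for one of the GetLastPageNo versions above.
// It is declared before the code that calls it, so the compiler can
// resolve the identifier.
function GetLastPageNo(const anHTMLText: string): Integer;
begin
  Result := Length(anHTMLText);  // placeholder logic only
end;

var
  Lines: TStrings;  // stands in for Memo1.Lines
begin
  Lines := TStringList.Create;
  try
    // TStrings.Add expects a string, so convert the integer first.
    Lines.Add(IntToStr(GetLastPageNo('<html></html>')));
    WriteLn(Lines[0]);
  finally
    Lines.Free;
  end;
end.
```

In a Lazarus form unit the same rule applies: the function needs to sit above the event handler in the implementation section, or be declared in the interface section.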

alaa123456789

  • Jr. Member
  • **
  • Posts: 67
Re: HTML files get values
« Reply #9 on: May 29, 2020, 09:17:48 am »
I tried to use IntToStr with it, but that didn't work either. I copied the code itself and it worked when used without the function:
Code: Pascal
procedure TForm1.Button1Click(Sender: TObject);
var
  s1: String;
  re: TRegExpr;
  result: Integer;
begin
  with TFPHttpClient.Create(Nil) do
    try
      S1 := Get('link');
      Result := -1;
      re := TRegExpr.Create('&amp;page=([0-9]*)">');
      if re.Exec(s1) then
        while re.ExecNext do
          if StrToIntDef(re.Match[1], 0) > Result then Result := StrToIntDef(re.Match[1], 0);
      re.Free;
      Memo1.Lines.add(inttostr(result));
    finally
      Free;
    end;
end;

Can you help me with the other question in my previous post?

Quote
Another thing: I have text like this
/python-data-analytics-data-analysis-and-science-using-pandas-matplotlib-and-the-python-e158003322.html
How do I get only the text, without the '-' characters and without 'e158003322.html'?
Second, how do I replace the 'e' in 'e158003322.html' with 'd'?

thanks

trev

  • Hero Member
  • *****
  • Posts: 714
  • Former Delphi 1-7 and 10.2 User
Re: HTML files get values
« Reply #10 on: May 29, 2020, 09:53:21 am »
Second question regex:

Code: Pascal
re := TRegExpr.Create('\(.*\)-e[0-9]*.html/\1d/');

Should work, but only tested with sed.

alaa123456789

  • Jr. Member
  • **
  • Posts: 67
Re: HTML files get values
« Reply #11 on: May 29, 2020, 02:54:04 pm »
Thanks, but how do I use this
Quote
re := TRegExpr.Create('\(.*\)-e[0-9]*.html/\1d/');
in this
Code: Pascal
re := TRegExpr.Create('<a href="(.*?)"'); // this one original

//re := TRegExpr.Create('li <a href="([/w]+)/"');
try
  if re.Exec(page) then begin
    bookname := re.Match[1];
    memo1.Append(bookname);
    //listbox1.items.add(bookname);
    while re.ExecNext do begin
      bookname := re.Match[1];
      // keep only links that end in "html" and begin with "/"
      if (RightStr(bookname,4)='html') and (LeftStr(bookname,1)='/') then
        memo1.Append(bookname);

      //listbox1.items.add(bookname);
      Application.ProcessMessages;
    end;
  end;//Memo1.Append('');
finally
  re.Free;
end;
as I am trying to extract only the links that have "/" at the start and ".html" at the end.

Actually my code works well for me, but I don't understand why I get each link twice.

Regards

trev

  • Hero Member
  • *****
  • Posts: 714
  • Former Delphi 1-7 and 10.2 User
Re: HTML files get values
« Reply #12 on: May 29, 2020, 04:08:22 pm »
I doubt you showed us all your code for the repeated book name issue! Anyway, here's an FPC program to do what you said you wanted to do.

Code: Pascal
Program regex;

uses
   RegExpr;

var
   re   : TRegExpr;
   page : String = '<a href="/python-data-analytics-data-analysis-and-science-using-pandas-matplotlib-and-the-python-e158003322.html">' +
                   LineEnding + '<a href="/python-analytics-data-analysis-and-science-using-pandas-matplotlib-and-the-c-e158002111.html">';
   bookname : string;

begin
   re := TRegExpr.Create('<a href="/(.*?)-e[0-9]*\.html"');

   if re.Exec(page) then
     begin
       bookname := re.Match[1];
       writeLn(bookname + 'd');

       while re.ExecNext do
         begin
           bookname := re.Match[1];
           writeLn(bookname + 'd');
         end;
     end;

   re.free;
end.

Outputs:

Quote
python-data-analytics-data-analysis-and-science-using-pandas-matplotlib-and-the-pythond
python-analytics-data-analysis-and-science-using-pandas-matplotlib-and-the-cd
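
For the two things asked earlier in the thread (the title without hyphens, and the same link with 'd' in place of 'e' before the number), one possible sketch captures the slug and the numeric id as separate groups. The program name and variable names are hypothetical, and the pattern assumes every link has the shape /<slug>-e<digits>.html.

```pascal
program SlugDemo;

{$mode objfpc}{$H+}

uses
  SysUtils, RegExpr;

const
  Link = '/python-data-analytics-data-analysis-and-science-using-pandas-matplotlib-and-the-python-e158003322.html';

var
  re: TRegExpr;
  Title, NewLink: string;
begin
  // Group 1 captures the slug, group 2 the digits after "-e".
  re := TRegExpr.Create('^/(.*)-e([0-9]+)\.html$');
  try
    if re.Exec(Link) then
    begin
      // Title: the slug with every hyphen turned into a space.
      Title := StringReplace(re.Match[1], '-', ' ', [rfReplaceAll]);
      // Same link rebuilt with 'd' instead of 'e' in front of the id.
      NewLink := '/' + re.Match[1] + '-d' + re.Match[2] + '.html';
      WriteLn(Title);
      WriteLn(NewLink);
    end;
  finally
    re.Free;
  end;
end.
```

Capturing the id separately keeps the digits available, so the rewritten link can be built by plain concatenation instead of a second search and replace.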

alaa123456789

  • Jr. Member
  • **
  • Posts: 67
Re: HTML files get values
« Reply #13 on: May 29, 2020, 05:17:51 pm »
Thanks, trev, for your support.
What I meant by duplicated names is that the page itself contains duplicated href links.
I found a way around it: I load them into a list box and then remove the duplicated items.
If you have an easier way, please share it.

I am trying to solve my problems one by one. Thanks to everyone who is supporting me.

Regards
Alaa

rvk

  • Hero Member
  • *****
  • Posts: 4143
Re: HTML files get values
« Reply #14 on: May 29, 2020, 05:22:05 pm »
Quote
what i meant with duplicated name , that page have href links -duplicated
but i found a way to load them to list box then removed duplicated items
if you have easier way you could share

If you don't need the list to be visible, it's better to just use a TStringList.
Set Sorted to True and Duplicates to dupIgnore (the Duplicates setting only takes effect on a sorted list), then add the links to the list, and the duplicates are gone.
https://www.freepascal.org/docs-html/rtl/classes/tstringlist.duplicates.html
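
A minimal sketch of that approach, with made-up link values for illustration. Note that TStringList only honours Duplicates when Sorted is True, so the resulting list comes out in sorted order rather than in page order.

```pascal
program DedupDemo;

{$mode objfpc}{$H+}

uses
  Classes, Types;

var
  Links: TStringList;
  i: Integer;
begin
  Links := TStringList.Create;
  try
    Links.Sorted := True;           // required for Duplicates to take effect
    Links.Duplicates := dupIgnore;  // silently skip strings already in the list
    // Hypothetical sample links; the first one is added twice.
    Links.Add('/book-a-e1.html');
    Links.Add('/book-b-e2.html');
    Links.Add('/book-a-e1.html');
    for i := 0 to Links.Count - 1 do
      WriteLn(Links[i]);
  finally
    Links.Free;
  end;
end.
```

If the original page order matters, an alternative is to keep an unsorted list and check IndexOf before each Add, at the cost of a linear search per link.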

 
