Author Topic: HTML files get values  (Read 20087 times)

alaa123456789

  • Sr. Member
  • ****
  • Posts: 260
  • Try your Best to learn & help others
HTML files get values
« on: May 23, 2020, 08:42:26 am »
Hey guys,
I hope everyone is okay. I am struggling to find information about Lazarus units (the ones listed in the uses clause); there is very little documentation.
I am trying to read the HTML file below and get the value 20 (the last page number) from it.

Please help me.

Regards

Code: HTML5
<div class="pagination">
<div class="Zebra_Pagination">
<ul>
<li>
<a rel="nofollow" href="javascript:void(0)" class="navigation previous disabled">Previous
</a>
</li>
<li>
<a rel="nofollow" href="search?q=python&amp;pagecount=&amp;pubyear=&amp;searchin=" class="current">1
</a>
</li>
<li>
<a rel="nofollow" href="search?q=python&amp;pagecount=&amp;pubyear=&amp;searchin=&amp;page=2">2
</a>
</li>
<li>
<a rel="nofollow" href="search?q=python&amp;pagecount=&amp;pubyear=&amp;searchin=&amp;page=3">3
</a>
</li>
<li>
<a rel="nofollow" href="search?q=python&amp;pagecount=&amp;pubyear=&amp;searchin=&amp;page=4">4
</a>
</li>
<li>
<a rel="nofollow" href="search?q=python&amp;pagecount=&amp;pubyear=&amp;searchin=&amp;page=5">5
</a>
</li>
<li>
<a rel="nofollow" href="search?q=python&amp;pagecount=&amp;pubyear=&amp;searchin=&amp;page=6">6
</a>
</li>
<li>
<span></span>
</li>
<li>
<a rel="nofollow" href="search?q=python&amp;pagecount=&amp;pubyear=&amp;searchin=&amp;page=20">20
</a>
</li>
<li>
<a rel="nofollow" href="search?q=python&amp;pagecount=&amp;pubyear=&amp;searchin=&amp;page=2" class="navigation next">Next
</a>
</li>
</ul>
</div>
</div>

trev

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 2020
  • Former Delphi 1-7, 10.2 user
Re: HTML files get values
« Reply #1 on: May 23, 2020, 09:25:37 am »
Show us your code for this so far and tell us what happens.

[You might benefit from reading How to use the Forum too.]

alaa123456789

Re: HTML files get values
« Reply #2 on: May 23, 2020, 09:53:37 am »

Here is the code. I tried many times but couldn't get any results.

Code: Pascal
uses
  Classes, SysUtils, SAX_HTML, DOM, DOM_HTML, fphttpclient, Forms, Controls,
  Graphics, Dialogs, StdCtrls;

var
  doc: THTMLDocument;
  els: TDOMNodeList;
  f: THTMLElement;

Code: Pascal
ReadHTMLFile(doc, TStringStream.Create(s));  // s is the HTML content string
//f := THTMLElement(doc.GetElementsByTagName('Zebra_Pagination'));
//Memo1.Lines.Add(f.FirstChild.NodeValue);
els := doc.GetElementsByTagName('div');
if els.Count > 0 then
begin
  Memo1.Lines.Add(TDOMElement(els[0]).GetAttribute('class'));
  Memo1.Lines.Add(TDOMElement(els[0]).GetAttribute('Zebra_Pagination'));
  Memo1.Lines.Add(TDOMElement(els[0]).FirstChild.TextContent);
end;

kqha

  • New Member
  • *
  • Posts: 23
Re: HTML files get values
« Reply #3 on: May 26, 2020, 02:33:30 pm »
I'm not really sure about TDOMNodeList, but if GetElementsByTagName works the way it does in JavaScript and the PHP DOM, your logic for finding the "20" text content looks wrong. If I were you, I would call GetElementsByTagName('a') instead, loop through the result to find the element that has "navigation next" in its class attribute, and take the element before it. Some pseudocode:

Code: Pascal
els := doc.GetElementsByTagName('a');
n := -1;  // index in els of the link with the last page number; -1 means not found
for a := 0 to els.Count - 1 do
begin
  // Pos returns 0 when the substring is absent, so test for > 0
  if Pos('next', TDOMElement(els[a]).GetAttribute('class')) > 0 then
  begin
    n := a - 1;  // the link just before "Next" holds the last page number
    Break;
  end;
end;
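
Fleshed out, the idea above might look like the sketch below. It assumes the fcl-xml units SAX_HTML, DOM_HTML and DOM are available. LastPageFromDom and SampleHTML are hypothetical names, and the markup is a shortened stand-in for the real page, so treat it as a starting point rather than a tested solution.

```pascal
program LastPageDom;

{$mode objfpc}{$H+}

uses
  Classes, SysUtils, DOM, DOM_HTML, SAX_HTML;

const
  // Shortened, made-up version of the pagination markup from the question
  SampleHTML =
    '<html><body><div class="Zebra_Pagination"><ul>' +
    '<li><a href="search?q=python" class="current">1</a></li>' +
    '<li><a href="search?q=python&amp;page=2">2</a></li>' +
    '<li><a href="search?q=python&amp;page=20">20</a></li>' +
    '<li><a href="search?q=python&amp;page=2" class="navigation next">Next</a></li>' +
    '</ul></div></body></html>';

// Hypothetical helper: returns the number shown in the link just before
// the "next" link, or -1 if no such link is found.
function LastPageFromDom(const AHTML: string): Integer;
var
  Doc: THTMLDocument;
  Src: TStringStream;
  Links: TDOMNodeList;
  i: Integer;
begin
  Result := -1;
  Doc := nil;
  Src := TStringStream.Create(AHTML);
  try
    ReadHTMLFile(Doc, Src);
    Links := Doc.GetElementsByTagName('a');
    try
      // start at 1 so that Links[i - 1] always exists
      for i := 1 to Integer(Links.Count) - 1 do
        if Pos('next', string(TDOMElement(Links[i]).GetAttribute('class'))) > 0 then
        begin
          Result := StrToIntDef(Trim(string(Links[i - 1].TextContent)), -1);
          Break;
        end;
    finally
      Links.Free;  // the caller owns the node list
    end;
  finally
    Doc.Free;
    Src.Free;
  end;
end;

begin
  WriteLn('Last page number is ', LastPageFromDom(SampleHTML));
end.
```

Unlike the original attempt, this frees the stream and the node list and casts each node to TDOMElement before reading its class attribute.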

howardpc

  • Hero Member
  • *****
  • Posts: 4144
Re: HTML files get values
« Reply #4 on: May 26, 2020, 03:19:09 pm »
When parsing text files, and only one or two values are of interest, it is sometimes easier just to brute force a hacked solution, rather than spend time trying to understand how to use a complex library you did not write and are unfamiliar with, particularly if it has limited documentation.
Say, something along these lines:
Code: Pascal
program Project1;

{$mode objfpc}{$H+}

const
  HTMLText = '    <div class="pagination">'+
    '<div class="Zebra_Pagination">'+
    '<ul>'+
    '<li>'+
    '<a rel="nofollow" href="javascript:void(0)" class="navigation previous disabled">Previous'+
    '</a>'+
    '</li>'+
    '<li>'+
    '<a rel="nofollow" href="search?q=python&amp;pagecount=&amp;pubyear=&amp;searchin=" class="current">1'+
    '</a>'+
    '</li>'+
    '<li>'+
    '<a rel="nofollow" href="search?q=python&amp;pagecount=&amp;pubyear=&amp;searchin=&amp;page=2">2'+
    '</a>'+
    '</li>'+
    '<li>'+
    '<a rel="nofollow" href="search?q=python&amp;pagecount=&amp;pubyear=&amp;searchin=&amp;page=3">3'+
    '</a>'+
    '</li>'+
    '<li>'+
    '<a rel="nofollow" href="search?q=python&amp;pagecount=&amp;pubyear=&amp;searchin=&amp;page=4">4'+
    '</a>'+
    '</li>'+
    '<li>'+
    '<a rel="nofollow" href="search?q=python&amp;pagecount=&amp;pubyear=&amp;searchin=&amp;page=5">5'+
    '</a>'+
    '</li>'+
    '<li>'+
    '<a rel="nofollow" href="search?q=python&amp;pagecount=&amp;pubyear=&amp;searchin=&amp;page=6">6'+
    '</a>'+
    '</li>'+
    '<li>'+
    '<span>…</span>'+
    '</li>'+
    '<li>'+
    '<a rel="nofollow" href="search?q=python&amp;pagecount=&amp;pubyear=&amp;searchin=&amp;page=20">20'+
    '</a>'+
    '</li>'+
    '<li>'+
    '<a rel="nofollow" href="search?q=python&amp;pagecount=&amp;pubyear=&amp;searchin=&amp;page=2" class="navigation next">Next'+
    '</a>'+
    '</li>'+
    '</ul>'+
    '</div>'+
    '</div>';

  function GetLastPageNo(const anHTMLText: String): Integer;
  var
    p, pStart, pEnd: PChar;
    s: String;
    pageNo: LongInt;
  begin
    Result := -1;  // -1 means no "page=<digits>" was found
    p := PChar(anHTMLText);
    pEnd := p;
    Inc(pEnd, Length(anHTMLText));
    while p < pEnd do
      begin
        // look for the literal "page=" followed by at least one digit
        if (p^ = 'p') and (p[1] = 'a') and (p[2] = 'g') and (p[3] = 'e') and
          (p[4] = '=') and (p[5] in ['0'..'9']) then
            begin
              Inc(p, 5);
              pStart := p;
              while p[1] in ['0'..'9'] do
                Inc(p);
              SetString(s, pStart, Succ(p - pStart));
              ReadStr(s, pageNo);
              // keep a running maximum instead of storing every match
              if pageNo > Result then
                Result := pageNo;
            end;
        Inc(p);
      end;
  end;

begin
  WriteLn('Last page number is ',GetLastPageNo(HTMLText));
  ReadLn;
end.

rvk

  • Hero Member
  • *****
  • Posts: 6162
Re: HTML files get values
« Reply #5 on: May 26, 2020, 04:39:28 pm »
Quote
When parsing text files, and only one or two values are of interest, it is sometimes easier just to brute force a hacked solution, rather than spend time trying to understand how to use a complex library you did not write and are unfamiliar with, particularly if it has limited documentation.
Indeed. And in that case using regexpr would be even simpler  :D

(I know, I know.... never ever parse HTML with regexpr  :P )

Note that this only works if the page contains no other page=x expressions, or at least none with a number higher than the one you want.

Code: Pascal
program Project1;

{$mode objfpc}{$H+}
uses
  regexpr, sysutils;

const
  HTMLText = '    <div class="pagination">' + '<div class="Zebra_Pagination">' +
    '<ul>' + '<li>' +
    '<a rel="nofollow" href="javascript:void(0)" class="navigation previous disabled">Previous'
    + '</a>' + '</li>' + '<li>' +
    '<a rel="nofollow" href="search?q=python&amp;pagecount=&amp;pubyear=&amp;searchin=" class="current">1'
    + '</a>' + '</li>' + '<li>' +
    '<a rel="nofollow" href="search?q=python&amp;pagecount=&amp;pubyear=&amp;searchin=&amp;page=2">2'
    + '</a>' + '</li>' + '<li>' +
    '<a rel="nofollow" href="search?q=python&amp;pagecount=&amp;pubyear=&amp;searchin=&amp;page=3">3'
    + '</a>' + '</li>' + '<li>' +
    '<a rel="nofollow" href="search?q=python&amp;pagecount=&amp;pubyear=&amp;searchin=&amp;page=4">4'
    + '</a>' + '</li>' + '<li>' +
    '<a rel="nofollow" href="search?q=python&amp;pagecount=&amp;pubyear=&amp;searchin=&amp;page=5">5'
    + '</a>' + '</li>' + '<li>' +
    '<a rel="nofollow" href="search?q=python&amp;pagecount=&amp;pubyear=&amp;searchin=&amp;page=6">6'
    + '</a>' + '</li>' + '<li>' + '<span>…</span>' + '</li>' +
    '<li>' +
    '<a rel="nofollow" href="search?q=python&amp;pagecount=&amp;pubyear=&amp;searchin=&amp;page=20">20'
    + '</a>' + '</li>' + '<li>' +
    '<a rel="nofollow" href="search?q=python&amp;pagecount=&amp;pubyear=&amp;searchin=&amp;page=2" class="navigation next">Next'
    + '</a>' + '</li>' + '</ul>' + '</div>' + '</div>';

  function GetLastPageNo(const anHTMLText: string): integer;
  var
    re: TRegExpr;
    n: integer;
  begin
    Result := -1;
    re := TRegExpr.Create('&amp;page=(.*?)">');
    try
      // repeat..until also checks the first match, which Exec finds
      if re.Exec(anHTMLText) then
        repeat
          // non-numeric captures convert to -1 and are ignored
          n := StrToIntDef(re.Match[1], -1);
          if n > Result then Result := n;
        until not re.ExecNext;
    finally
      re.Free;
    end;
  end;

begin
  WriteLn('Last page number is ', GetLastPageNo(HTMLText));
  ReadLn;
end.

trev

Re: HTML files get values
« Reply #6 on: May 27, 2020, 07:45:51 am »
Quote
Indeed. And in that case using regexpr would be even simpler  :D

(I know, I know.... never ever parse HTML with regexpr  :P )

You beat me to it - a regex would be my suggestion too :)

Except my regex would have been:
Code: Pascal
re := TRegExpr.Create('&amp;page=([0-9]*)">');

alaa123456789

Re: HTML files get values
« Reply #7 on: May 29, 2020, 09:07:36 am »
 :D :D :D Thanks, everyone, for the replies.
It was useful information. I tried to call the function:
Code: Pascal
memo1.lines.add(getlastpageno(s));
but I got an error saying that getlastpageno is not a known identifier.

Another thing: I have text like this:
/python-data-analytics-data-analysis-and-science-using-pandas-matplotlib-and-the-python-e158003322.html
How do I get only the text, without the '-' characters and without 'e158003322.html'?
Second, how do I replace the 'e' in 'e158003322.html' with 'd'?

Thanks
Alaa

howardpc

Re: HTML files get values
« Reply #8 on: May 29, 2020, 09:12:43 am »
GetLastPageNo returns an integer. You can't add that to a TStrings instance without using a conversion function.
But the error you got suggests you have not implemented the function in your code, or you spelled it differently.
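
A minimal sketch of the conversion, assuming a console program in place of the form: a TStringList stands in for Memo1.Lines, and GetLastPageNo is a stub returning a fixed value so the example stays self-contained. In a form unit the real function would also have to be declared above the event handler that calls it, which may be what the "identifier" error is about.

```pascal
program AddIntDemo;

{$mode objfpc}{$H+}

uses
  Classes, SysUtils;

// Stand-in for the GetLastPageNo function from the earlier replies;
// it returns a fixed value here so the sketch needs no HTML input.
function GetLastPageNo(const anHTMLText: string): Integer;
begin
  Result := 20;
end;

var
  Lines: TStringList;  // plays the role of Memo1.Lines
begin
  Lines := TStringList.Create;
  try
    // TStrings.Add expects a string, so convert the integer first
    Lines.Add(IntToStr(GetLastPageNo('')));
    WriteLn(Lines[0]);
  finally
    Lines.Free;
  end;
end.
```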

alaa123456789

Re: HTML files get values
« Reply #9 on: May 29, 2020, 09:17:48 am »
I tried to use IntToStr with it, but that didn't work either. I copied the code itself into my handler, and it worked when used without the function:
Code: Pascal
procedure TForm1.Button1Click(Sender: TObject);
var
  s1: String;
  re: TRegExpr;
  n, lastPage: Integer;
begin
  with TFPHttpClient.Create(nil) do
    try
      s1 := Get('link');
      lastPage := -1;
      re := TRegExpr.Create('&amp;page=([0-9]*)">');
      try
        // check every match, including the first one found by Exec
        if re.Exec(s1) then
          repeat
            n := StrToIntDef(re.Match[1], 0);
            if n > lastPage then lastPage := n;
          until not re.ExecNext;
      finally
        re.Free;
      end;
      Memo1.Lines.Add(IntToStr(lastPage));
    finally
      Free;  // frees the TFPHttpClient created in the with statement
    end;
end;
  21.  

Can you help me with the other question from my previous post?

Quote
Another thing: I have text like this:
/python-data-analytics-data-analysis-and-science-using-pandas-matplotlib-and-the-python-e158003322.html
How do I get only the text, without the '-' characters and without 'e158003322.html'?
Second, how do I replace the 'e' in 'e158003322.html' with 'd'?

Thanks

trev

Re: HTML files get values
« Reply #10 on: May 29, 2020, 09:53:21 am »
Second question regex:

Code: Pascal
re := TRegExpr.Create('\(.*\)-e[0-9]*.html/\1d/');

Should work, but only tested with sed.
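
The expression above is written in sed's substitution syntax, so it would likely need translating before TRegExpr accepts it. One possible translation, as a sketch only, uses ReplaceRegExpr from the regexpr unit with substitution enabled, so that $1 in the replacement refers to the first group. TitleFromPath and DownloadPath are hypothetical helper names, and they reflect one reading of the two questions (spaces instead of dashes for the title, and 'e158003322.html' becoming 'd158003322.html').

```pascal
program SlugDemo;

{$mode objfpc}{$H+}

uses
  SysUtils, RegExpr;

const
  // Sample path taken from the question above
  SamplePath = '/python-data-analytics-data-analysis-and-science-using-pandas-matplotlib-and-the-python-e158003322.html';

// Hypothetical helper: drops the leading '/' and the trailing
// '-e<digits>.html', then turns the remaining dashes into spaces.
function TitleFromPath(const APath: string): string;
begin
  Result := ReplaceRegExpr('^/|-e[0-9]+\.html$', APath, '', False);
  Result := StringReplace(Result, '-', ' ', [rfReplaceAll]);
end;

// Hypothetical helper: rewrites the trailing '-e<digits>.html'
// as '-d<digits>.html' and leaves the rest of the path alone.
function DownloadPath(const APath: string): string;
begin
  // $1 is replaced by the captured digits because substitution is enabled
  Result := ReplaceRegExpr('-e([0-9]+)\.html$', APath, '-d$1.html', True);
end;

begin
  WriteLn(TitleFromPath(SamplePath));
  WriteLn(DownloadPath(SamplePath));
end.
```

Anchoring the pattern with $ keeps an earlier "-e" inside the title from being touched.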

alaa123456789

Re: HTML files get values
« Reply #11 on: May 29, 2020, 02:54:04 pm »
Thanks, but how do I use this
Quote
re := TRegExpr.Create('\(.*\)-e[0-9]*.html/\1d/');
in this?
Code: Pascal
re := TRegExpr.Create('<a href="(.*?)"');  // this is the original expression
//re := TRegExpr.Create('li <a href="([/w]+)/"');
try
  if re.Exec(page) then
    repeat
      bookname := re.Match[1];
      // keep only links that start with "/" and end with "html"
      if (RightStr(bookname, 4) = 'html') and (LeftStr(bookname, 1) = '/') then
        Memo1.Append(bookname);  // or: ListBox1.Items.Add(bookname);
      Application.ProcessMessages;
    until not re.ExecNext;
finally
  re.Free;
end;
I am trying to extract only the links that have "/" at the start and ".html" at the end.

Actually, my code works well for me, but I don't understand why I get each link twice.

Regards

trev

Re: HTML files get values
« Reply #12 on: May 29, 2020, 04:08:22 pm »
I doubt you showed us all your code for the repeated book name issue! Anyway, here's an FPC program to do what you said you wanted to do.

Code: Pascal
program regex;

{$mode objfpc}{$H+}

uses
  RegExpr;

var
  re: TRegExpr;
  page: String = '<a href="/python-data-analytics-data-analysis-and-science-using-pandas-matplotlib-and-the-python-e158003322.html">' +
                 LineEnding + '<a href="/python-analytics-data-analysis-and-science-using-pandas-matplotlib-and-the-c-e158002111.html">';
  bookname: String;

begin
  // capture everything between the leading "/" and the "-e<digits>.html" suffix
  re := TRegExpr.Create('<a href="/(.*?)-e[0-9]*\.html"');
  try
    if re.Exec(page) then
      repeat
        bookname := re.Match[1];
        WriteLn(bookname + 'd');
      until not re.ExecNext;
  finally
    re.Free;
  end;
end.

Outputs:

Quote
python-data-analytics-data-analysis-and-science-using-pandas-matplotlib-and-the-pythond
python-analytics-data-analysis-and-science-using-pandas-matplotlib-and-the-cd

alaa123456789

Re: HTML files get values
« Reply #13 on: May 29, 2020, 05:17:51 pm »
Thanks, trev, for your support.
What I meant by duplicated names is that the page itself contains duplicated href links.
I found a way around it: I load them into a list box and then remove the duplicate items.
If you have an easier way, please share it.

I am trying to solve my problems one by one, and thanks to everyone who is supporting me.

Regards
alaa

rvk

Re: HTML files get values
« Reply #14 on: May 29, 2020, 05:22:05 pm »
Quote
What I meant by duplicated names is that the page itself contains duplicated href links.
I found a way around it: I load them into a list box and then remove the duplicate items.
If you have an easier way, please share it.
If you don't need the list to be visible, it's better to just use a TStringList.
Set Duplicates to dupIgnore (it only takes effect when Sorted is True) and add the links to the list; the duplicates are then dropped.
https://www.freepascal.org/docs-html/rtl/classes/tstringlist.duplicates.html
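
A small sketch of that suggestion, with made-up sample links. The Types unit is listed alongside Classes only as a precaution, in case the dupIgnore value is declared there in your FPC version.

```pascal
program DedupDemo;

{$mode objfpc}{$H+}

uses
  Classes, Types;

var
  Links: TStringList;
  i: Integer;
begin
  Links := TStringList.Create;
  try
    // Duplicates is only honoured on a sorted list
    Links.Sorted := True;
    Links.Duplicates := dupIgnore;
    Links.Add('/book-b-e2.html');  // made-up sample links
    Links.Add('/book-a-e1.html');
    Links.Add('/book-b-e2.html');  // silently dropped as a duplicate
    for i := 0 to Links.Count - 1 do
      WriteLn(Links[i]);
  finally
    Links.Free;
  end;
end.
```

The trade-off is that the list ends up in sorted order rather than in the order the links appeared on the page.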

 
