HTML files get values

alaa123456789

Sr. Member
Posts: 260
Try your Best to learn & help others

Hey guys,
I hope everyone is okay. I am struggling with Lazarus to find information about which units to use;
there is very little documentation.
I am trying to read the HTML file below and get the value 20 (the last page number) from it.

Please help me.

Regards

Code: HTML5

<div class="pagination">
<div class="Zebra_Pagination">
<ul>
<li>
<a rel="nofollow" href="javascript:void(0)" class="navigation previous disabled">Previous
</a>
</li>
<li>
<a rel="nofollow" href="search?q=python&amp;pagecount=&amp;pubyear=&amp;searchin=" class="current">1
</a>
</li>
<li>
<a rel="nofollow" href="search?q=python&amp;pagecount=&amp;pubyear=&amp;searchin=&amp;page=2">2
</a>
</li>
<li>
<a rel="nofollow" href="search?q=python&amp;pagecount=&amp;pubyear=&amp;searchin=&amp;page=3">3
</a>
</li>
<li>
<a rel="nofollow" href="search?q=python&amp;pagecount=&amp;pubyear=&amp;searchin=&amp;page=4">4
</a>
</li>
<li>
<a rel="nofollow" href="search?q=python&amp;pagecount=&amp;pubyear=&amp;searchin=&amp;page=5">5
</a>
</li>
<li>
<a rel="nofollow" href="search?q=python&amp;pagecount=&amp;pubyear=&amp;searchin=&amp;page=6">6
</a>
</li>
<li>
<span>…</span>
</li>
<li>
<a rel="nofollow" href="search?q=python&amp;pagecount=&amp;pubyear=&amp;searchin=&amp;page=20">20
</a>
</li>
<li>
<a rel="nofollow" href="search?q=python&amp;pagecount=&amp;pubyear=&amp;searchin=&amp;page=2" class="navigation next">Next
</a>
</li>
</ul>
</div>
 </div>


https://www.youtube.com/user/178alaa/videos

trev

Global Moderator
Hero Member
Posts: 2020
Former Delphi 1-7, 10.2 user

Re: HTML files get values

« Reply #1 on: May 23, 2020, 09:25:37 am »

Show us your code for this so far and tell us what happens.

[You might benefit from reading How to use the Forum too.]


alaa123456789

Sr. Member
Posts: 260
Try your Best to learn & help others

Re: HTML files get values

« Reply #2 on: May 23, 2020, 09:53:37 am »

Here is the code. I tried many times but couldn't get any results.

Code: Pascal

uses
  Classes, SysUtils, SAX_HTML, DOM, DOM_HTML, fphttpclient, Forms, Controls,
  Graphics, Dialogs, StdCtrls;
var
  doc: THTMLDocument;
  els: TDOMNodeList;
  f: THTMLElement;

Code: Pascal

ReadHTMLFile(doc, TStringStream.Create(s)); // s is the HTML content string
//f := THTMLElement(doc.GetElementsByTagName('Zebra_Pagination'));
//Memo1.Lines.Add(f.FirstChild.NodeValue);
els := doc.GetElementsByTagName('div');
if els.Count > 0 then
begin
  Memo1.Lines.Add(TDOMElement(els[0]).GetAttribute('class'));
  Memo1.Lines.Add(TDOMElement(els[0]).GetAttribute('Zebra_Pagination'));
  Memo1.Lines.Add(TDOMElement(els[0]).FirstChild.TextContent);
end;


kqha

New Member
Posts: 23

Re: HTML files get values

« Reply #3 on: May 26, 2020, 02:33:30 pm »

I'm not really sure about TDOMNodeList, but if GetElementsByTagName works the way it does in JavaScript and the PHP DOM, your logic for finding the "20" text content looks wrong. Regardless, if I were you I would call GetElementsByTagName('a') instead and loop through the result to find the element that has "navigation next" in its class attribute, then take the element before it. Some pseudo code:

Code: Pascal

els := doc.GetElementsByTagName('a');
n := -1;  // index in els of the link holding the last page number; -1 means not found
for a := 0 to els.Count - 1 do
begin
  // Pos returns 0 when the substring is absent, so test for > 0;
  // the node must be cast to TDOMElement before GetAttribute can be called
  if Pos('next', TDOMElement(els[a]).GetAttribute('class')) > 0 then
  begin
    n := a - 1;
    Break;
  end;
end;
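
For completeness, here is a minimal sketch of a DOM-based variant, assuming the fcl-xml units SAX_HTML and DOM behave as in current FPC 3.x releases. Instead of locating the "Next" link, it takes the largest numeric link text. The names LastPageFromHtml and SampleHtml are made up for this illustration, and the sample markup is a shortened stand-in for the posted page.

```pascal
program LastPageFromDom;

{$mode objfpc}{$H+}

uses
  Classes, SysUtils, DOM, DOM_HTML, SAX_HTML;

const
  // Shortened, hypothetical sample with the same structure as the posted markup
  SampleHtml =
    '<div class="pagination"><ul>' +
    '<li><a href="javascript:void(0)" class="navigation previous disabled">Previous</a></li>' +
    '<li><a href="search?q=python" class="current">1</a></li>' +
    '<li><a href="search?q=python&amp;page=2">2</a></li>' +
    '<li><a href="search?q=python&amp;page=20">20</a></li>' +
    '<li><a href="search?q=python&amp;page=2" class="navigation next">Next</a></li>' +
    '</ul></div>';

// Hypothetical helper: returns the largest numeric link text, or -1 if none
function LastPageFromHtml(const AHtml: string): Integer;
var
  src: TStringStream;
  doc: THTMLDocument;
  links: TDOMNodeList;
  i, n: Integer;
begin
  Result := -1;
  doc := nil;
  src := TStringStream.Create(AHtml);
  try
    ReadHTMLFile(doc, src);            // SAX_HTML builds the DOM tree from the stream
    links := doc.GetElementsByTagName('a');
    for i := 0 to links.Count - 1 do
    begin
      // "Previous" and "Next" are not numbers, so StrToIntDef yields -1 for them
      n := StrToIntDef(Trim(UTF8Encode(links[i].TextContent)), -1);
      if n > Result then
        Result := n;
    end;
  finally
    doc.Free;                          // assumption: the document owns the node list (FPC 3.x)
    src.Free;
  end;
end;

begin
  WriteLn('Last page number is ', LastPageFromHtml(SampleHtml));
end.
```

Taking the maximum avoids depending on the position of the "Next" link, but it assumes that no other link on the page has a larger number as its text.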
 


howardpc

Hero Member
Posts: 4144

Re: HTML files get values

« Reply #4 on: May 26, 2020, 03:19:09 pm »

When parsing text files, and only one or two values are of interest, it is sometimes easier just to brute force a hacked solution, rather than spend time trying to understand how to use a complex library you did not write and are unfamiliar with, particularly if it has limited documentation.
Say, something along these lines:

Code: Pascal

program Project1;
 
{$mode objfpc}{$H+}
 
const
  HTMLText = '    <div class="pagination">'+
    '<div class="Zebra_Pagination">'+
    '<ul>'+
    '<li>'+
    '<a rel="nofollow" href="javascript:void(0)" class="navigation previous disabled">Previous'+
    '</a>'+
    '</li>'+
    '<li>'+
    '<a rel="nofollow" href="search?q=python&amp;pagecount=&amp;pubyear=&amp;searchin=" class="current">1'+
    '</a>'+
    '</li>'+
    '<li>'+
    '<a rel="nofollow" href="search?q=python&amp;pagecount=&amp;pubyear=&amp;searchin=&amp;page=2">2'+
    '</a>'+
    '</li>'+
    '<li>'+
    '<a rel="nofollow" href="search?q=python&amp;pagecount=&amp;pubyear=&amp;searchin=&amp;page=3">3'+
    '</a>'+
    '</li>'+
    '<li>'+
    '<a rel="nofollow" href="search?q=python&amp;pagecount=&amp;pubyear=&amp;searchin=&amp;page=4">4'+
    '</a>'+
    '</li>'+
    '<li>'+
    '<a rel="nofollow" href="search?q=python&amp;pagecount=&amp;pubyear=&amp;searchin=&amp;page=5">5'+
    '</a>'+
    '</li>'+
    '<li>'+
    '<a rel="nofollow" href="search?q=python&amp;pagecount=&amp;pubyear=&amp;searchin=&amp;page=6">6'+
    '</a>'+
    '</li>'+
    '<li>'+
    '<span>…</span>'+
    '</li>'+
    '<li>'+
    '<a rel="nofollow" href="search?q=python&amp;pagecount=&amp;pubyear=&amp;searchin=&amp;page=20">20'+
    '</a>'+
    '</li>'+
    '<li>'+
    '<a rel="nofollow" href="search?q=python&amp;pagecount=&amp;pubyear=&amp;searchin=&amp;page=2" class="navigation next">Next'+
    '</a>'+
    '</li>'+
    '</ul>'+
    '</div>'+
     '</div>';
 
  function GetLastPageNo(const anHTMLText: String): Integer;
  var
    p, pStart, pEnd: PChar;
    s: String;
    n, code: Integer;
  begin
    Result := -1;                       // -1 means no "page=<digits>" was found
    p := PChar(anHTMLText);
    pEnd := p + Length(anHTMLText);
    while p < pEnd do
      begin
        // Short-circuit evaluation and the string's trailing #0 keep this
        // look-ahead from reading past the end of the text
        if (p^ = 'p') and (p[1] = 'a') and (p[2] = 'g') and (p[3] = 'e') and
          (p[4] = '=') and (p[5] in ['0'..'9']) then
            begin
              Inc(p, 5);                // move to the first digit
              pStart := p;
              while p^ in ['0'..'9'] do
                Inc(p);                 // stop on the first non-digit
              SetString(s, pStart, p - pStart);
              Val(s, n, code);          // code = 0 means the conversion succeeded
              if (code = 0) and (n > Result) then
                Result := n;            // keep the running maximum, no array needed
            end
        else
          Inc(p);
      end;
  end;
 
begin
  WriteLn('Last page number is ',GetLastPageNo(HTMLText));
  ReadLn;
end.


rvk

Hero Member
Posts: 6162

Re: HTML files get values

« Reply #5 on: May 26, 2020, 04:39:28 pm »

Quote from: howardpc on May 26, 2020, 03:19:09 pm

When parsing text files, and only one or two values are of interest, it is sometimes easier just to brute force a hacked solution, rather than spend time trying to understand how to use a complex library you did not write and are unfamiliar with, particularly if it has limited documentation.

Indeed. And in that case using regexpr would be even simpler.

(I know, I know... never ever parse HTML with regexpr.)

And this only works if there are no other page=x expressions on the page (at least none with a number higher than the one you want).

Code: Pascal

program Project1;
 
{$mode objfpc}{$H+}
uses
  regexpr, sysutils;
 
const
  HTMLText = '    <div class="pagination">' + '<div class="Zebra_Pagination">' +
    '<ul>' + '<li>' +
    '<a rel="nofollow" href="javascript:void(0)" class="navigation previous disabled">Previous'
    + '</a>' + '</li>' + '<li>' +
    '<a rel="nofollow" href="search?q=python&amp;pagecount=&amp;pubyear=&amp;searchin=" class="current">1'
    + '</a>' + '</li>' + '<li>' +
    '<a rel="nofollow" href="search?q=python&amp;pagecount=&amp;pubyear=&amp;searchin=&amp;page=2">2'
    + '</a>' + '</li>' + '<li>' +
    '<a rel="nofollow" href="search?q=python&amp;pagecount=&amp;pubyear=&amp;searchin=&amp;page=3">3'
    + '</a>' + '</li>' + '<li>' +
    '<a rel="nofollow" href="search?q=python&amp;pagecount=&amp;pubyear=&amp;searchin=&amp;page=4">4'
    + '</a>' + '</li>' + '<li>' +
    '<a rel="nofollow" href="search?q=python&amp;pagecount=&amp;pubyear=&amp;searchin=&amp;page=5">5'
    + '</a>' + '</li>' + '<li>' +
    '<a rel="nofollow" href="search?q=python&amp;pagecount=&amp;pubyear=&amp;searchin=&amp;page=6">6'
    + '</a>' + '</li>' + '<li>' + '<span>…</span>' + '</li>' +
    '<li>' +
    '<a rel="nofollow" href="search?q=python&amp;pagecount=&amp;pubyear=&amp;searchin=&amp;page=20">20'
    +
    '</a>' + '</li>' + '<li>' +
    '<a rel="nofollow" href="search?q=python&amp;pagecount=&amp;pubyear=&amp;searchin=&amp;page=2" class="navigation next">Next'
    +
    '</a>' + '</li>' + '</ul>' + '</div>' + '</div>';
 
  function GetLastPageNo(const anHTMLText: string): integer;
  var
    re: TRegExpr;
    n: integer;
  begin
    Result := -1;
    re := TRegExpr.Create('&amp;page=(.*?)">');
    try
      if re.Exec(anHTMLText) then
        repeat
          // Check the first match as well; non-numeric captures
          // (e.g. from the "Next" link) fall back to 0
          n := StrToIntDef(re.Match[1], 0);
          if n > Result then Result := n;
        until not re.ExecNext;
    finally
      re.Free;
    end;
  end;
 
begin
  WriteLn('Last page number is ', GetLastPageNo(HTMLText));
  ReadLn;
end.


trev

Global Moderator
Hero Member
Posts: 2020
Former Delphi 1-7, 10.2 user

Re: HTML files get values

« Reply #6 on: May 27, 2020, 07:45:51 am »

Quote from: rvk on May 26, 2020, 04:39:28 pm

Indeed. And in that case using regexpr would be even more simple

(I know, I know.... never ever parse HTML with regexpr )

You beat me to it - a regex would be my suggestion too

Except my regex would have been:

Code: Pascal

re := TRegExpr.Create('&amp;page=([0-9]*)">');
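
To see how the two patterns differ, here is a small sketch that runs both over a shortened, made-up sample and prints what each one captures. It uses `[0-9]+` rather than `[0-9]*` so that an empty capture cannot match; that change, the Sample constant, and the ShowMatches procedure are assumptions of this sketch, not part of the posts above.

```pascal
program ComparePatterns;

{$mode objfpc}{$H+}

uses
  RegExpr;

const
  // Hypothetical two-link sample: a plain page link and the "Next" link
  Sample =
    '<a href="search?q=python&amp;page=20">20</a>' +
    '<a href="search?q=python&amp;page=2" class="navigation next">Next</a>';

// Prints every capture of group 1 for the given pattern
procedure ShowMatches(const APattern: string);
var
  re: TRegExpr;
begin
  WriteLn('Pattern: ', APattern);
  re := TRegExpr.Create(APattern);
  try
    if re.Exec(Sample) then
      repeat
        WriteLn('  captured: [', re.Match[1], ']');
      until not re.ExecNext;
  finally
    re.Free;
  end;
end;

begin
  ShowMatches('&amp;page=(.*?)">');     // lazy "anything" can run on into other attributes
  ShowMatches('&amp;page=([0-9]+)">');  // digits only, directly followed by ">
end.
```

With the digits-only pattern the capture is always numeric, so it can be converted without relying on StrToIntDef's fallback value.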


alaa123456789

Sr. Member
Posts: 260
Try your Best to learn & help others

Re: HTML files get values

« Reply #7 on: May 29, 2020, 09:07:36 am »

Thanks, all, for the replies.
It was useful information. I tried to call the function:

Code: Pascal

memo1.lines.add(getlastpageno(s));

I got an error saying that getlastpageno is not an identifier.

Another thing: I have text like this:
/python-data-analytics-data-analysis-and-science-using-pandas-matplotlib-and-the-python-e158003322.html
How do I get only the text, without the '-' characters and without 'e158003322.html'?
Second, how do I replace the 'e' in 'e158003322.html' with 'd'?

thanks
Alaa


howardpc

Hero Member
Posts: 4144

Re: HTML files get values

« Reply #8 on: May 29, 2020, 09:12:43 am »

GetLastPageNo returns an integer. You can't add that to a TStrings instance without using a conversion function.
But the error you got suggests you have not implemented the function in your code, or you spelled it differently.
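
Both points can be sketched in a few lines. The function body below is only a placeholder standing in for the GetLastPageNo posted earlier in the thread; what matters is that the function is declared above the code that calls it and that its integer result goes through IntToStr before being added to a TStrings.

```pascal
program AddIntegerToStrings;

{$mode objfpc}{$H+}

uses
  Classes, SysUtils;

// Hypothetical stand-in for GetLastPageNo; it must be declared above
// the code that calls it, otherwise the compiler cannot find the identifier
function GetLastPageNo(const anHTMLText: string): Integer;
begin
  Result := Length(anHTMLText);   // placeholder logic, for illustration only
end;

var
  lines: TStringList;
begin
  lines := TStringList.Create;
  try
    // TStrings.Add expects a string, so convert the integer result first
    lines.Add(IntToStr(GetLastPageNo('abc')));
    WriteLn(lines[0]);            // prints 3 with the placeholder logic
  finally
    lines.Free;
  end;
end.
```

In a form unit the same rule applies: the function has to appear in the implementation section before the event handler that uses it, and the unit needs RegExpr in its uses clause if the regex version is copied in.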


alaa123456789

Sr. Member
Posts: 260
Try your Best to learn & help others

Re: HTML files get values

« Reply #9 on: May 29, 2020, 09:17:48 am »

I tried to use IntToStr with it, but that didn't work either. I copied the code itself, and it worked when used without the function:

Code: Pascal

procedure TForm1.Button1Click(Sender: TObject);
var
  s1: String;
  re: TRegExpr;
  lastPage, n: Integer;
begin
  // Download the page, then release the HTTP client
  with TFPHttpClient.Create(nil) do
    try
      s1 := Get('link');
    finally
      Free;
    end;

  lastPage := -1;
  re := TRegExpr.Create('&amp;page=([0-9]*)">');
  try
    if re.Exec(s1) then
      repeat   // check the first match too, then every following one
        n := StrToIntDef(re.Match[1], 0);
        if n > lastPage then lastPage := n;
      until not re.ExecNext;
  finally
    re.Free;
  end;
  Memo1.Lines.Add(IntToStr(lastPage));
end;
 

Can you help me with the other question from my previous post?

Quote

Another thing: I have text like this:
/python-data-analytics-data-analysis-and-science-using-pandas-matplotlib-and-the-python-e158003322.html
How do I get only the text, without the '-' characters and without 'e158003322.html'?
Second, how do I replace the 'e' in 'e158003322.html' with 'd'?

thanks


trev

Global Moderator
Hero Member
Posts: 2020
Former Delphi 1-7, 10.2 user

Re: HTML files get values

« Reply #10 on: May 29, 2020, 09:53:21 am »

Second question regex:

Code: Pascal

re := TRegExpr.Create('\(.*\)-e[0-9]*.html/\1d/');

Should work, but only tested with sed.
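
The expression above is in sed's substitution syntax, which TRegExpr does not accept as a pattern: TRegExpr groups use unescaped parentheses, and the replacement text is passed separately. A rough translation, assuming the ReplaceRegExpr helper from the RegExpr unit with substitution enabled, might look like this (the Link constant is just the sample text from the question):

```pascal
program SedToRegExpr;

{$mode objfpc}{$H+}

uses
  RegExpr;

const
  Link = '/python-data-analytics-data-analysis-and-science-using-pandas-' +
         'matplotlib-and-the-python-e158003322.html';

begin
  // sed equivalent: s/\(.*\)-e[0-9]*\.html/\1d/
  // $1 in the replacement refers to the first captured group;
  // the final True argument switches substitution on
  WriteLn(ReplaceRegExpr('(.*)-e[0-9]*\.html', Link, '$1d', True));
end.
```

The dot before html is escaped here so that it only matches a literal period.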


alaa123456789

Sr. Member
Posts: 260
Try your Best to learn & help others

Re: HTML files get values

« Reply #11 on: May 29, 2020, 02:54:04 pm »

Thanks, but how do I use this

Quote

re := TRegExpr.Create('\(.*\)-e[0-9]*.html/\1d/');

in this?

Code: Pascal

re := TRegExpr.Create('<a href="(.*?)"'); // this one is the original
//re := TRegExpr.Create('li <a href="([/w]+)/"');
try
  if re.Exec(page) then
    repeat
      bookname := re.Match[1];
      // keep only the links that start with "/" and end with "html"
      if (RightStr(bookname, 4) = 'html') and (LeftStr(bookname, 1) = '/') then
        Memo1.Append(bookname);
      //ListBox1.Items.Add(bookname);
      Application.ProcessMessages;
    until not re.ExecNext;
finally
  re.Free;
end;

I am trying to extract only the links that have "/" at the start and ".html" at the end.

Actually my code worked well for me, but I don't understand why I got each link twice.

regards

Capture.PNG (15.23 kB, 710x165 - viewed 176 times.)


trev

Global Moderator
Hero Member
Posts: 2020
Former Delphi 1-7, 10.2 user

Re: HTML files get values

« Reply #12 on: May 29, 2020, 04:08:22 pm »

I doubt you showed us all your code for the repeated book name issue! Anyway, here's an FPC program to do what you said you wanted to do.

Code: Pascal

Program regex;
 
uses
   RegExpr;
 
var
   re   : TRegExpr;
   page : String = '<a href="/python-data-analytics-data-analysis-and-science-using-pandas-matplotlib-and-the-python-e158003322.html">' +
                   LineEnding + '<a href="/python-analytics-data-analysis-and-science-using-pandas-matplotlib-and-the-c-e158002111.html">';
   bookname : string;
 
begin
   re:=TRegExpr.Create('<a href="/(.*?)-e[0-9]*\.html"');    
 
   if re.Exec(page) then
     begin
       bookname := re.Match[1];
         writeLn(bookname + 'd');
 
       while re.ExecNext do
         begin
           bookname := re.Match[1];
             writeLn(bookname + 'd');
         end;
     end;
 
     re.free;
end.

Outputs:

Quote

python-data-analytics-data-analysis-and-science-using-pandas-matplotlib-and-the-pythond
python-analytics-data-analysis-and-science-using-pandas-matplotlib-and-the-cd
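
The original question can also be read differently: get the title with the hyphens removed, and separately get the file name with its leading 'e' changed to 'd'. Under that reading (an assumption, since the request is ambiguous), plain SysUtils string functions are enough. The variable names below are made up for this sketch.

```pascal
program SplitLink;

{$mode objfpc}{$H+}

uses
  SysUtils;

const
  Link = '/python-data-analytics-data-analysis-and-science-using-pandas-' +
         'matplotlib-and-the-python-e158003322.html';

var
  p: Integer;
  title, fileId: string;
begin
  p := LastDelimiter('-', Link);        // position of the last hyphen
  title := Copy(Link, 2, p - 2);        // drop the leading '/' and the '-e...html' tail
  title := StringReplace(title, '-', ' ', [rfReplaceAll]);

  fileId := Copy(Link, p + 1, MaxInt);  // 'e158003322.html'
  if (fileId <> '') and (fileId[1] = 'e') then
    fileId[1] := 'd';                   // replace only the leading 'e'

  WriteLn(title);   // python data analytics data analysis and science using pandas matplotlib and the python
  WriteLn(fileId);  // d158003322.html
end.
```

This relies on the id part always following the last hyphen in the link, as it does in the sample.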


alaa123456789

Sr. Member
Posts: 260
Try your Best to learn & help others

Re: HTML files get values

« Reply #13 on: May 29, 2020, 05:17:51 pm »

Thanks, trev, for your support.
What I meant by duplicated names is that the page itself contains duplicated href links,
but I found a way: load them into a list box, then remove the duplicated items.
If you have an easier way, please share it.

I am trying to solve the problems one by one. Thanks to everyone who is supporting me.

regards
alaa


rvk

Hero Member
Posts: 6162

Re: HTML files get values

« Reply #14 on: May 29, 2020, 05:22:05 pm »

Quote from: alaa123456789 on May 29, 2020, 05:17:51 pm

what i meant with duplicated name , that page have href links -duplicated
but i found a way to load them to list box then removed duplicated items
if you have easier way you could share

If you don't need the list to be visible, it's better to just use a TStringList.
Set Sorted to True and Duplicates to dupIgnore, then add the links to the list (and gone are the duplicates). Duplicates only takes effect on a sorted list.
https://www.freepascal.org/docs-html/rtl/classes/tstringlist.duplicates.html
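
A minimal sketch of that approach, using made-up link values:

```pascal
program UniqueLinks;

{$mode objfpc}{$H+}

uses
  Classes;

var
  links: TStringList;
  i: Integer;
begin
  links := TStringList.Create;
  try
    links.Sorted := True;             // Duplicates is ignored on unsorted lists
    links.Duplicates := dupIgnore;    // silently drop strings already present

    // Hypothetical links, with one of them added twice
    links.Add('/book-b-e2.html');
    links.Add('/book-a-e1.html');
    links.Add('/book-b-e2.html');

    for i := 0 to links.Count - 1 do
      WriteLn(links[i]);              // each link appears once, in sorted order
  finally
    links.Free;
  end;
end.
```

The trade-off is that a sorted list no longer keeps the links in the order they appeared on the page. If that order matters, checking IndexOf before each Add on an unsorted list is an alternative.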
