
Author Topic: HTML starter  (Read 489 times)

pcurtis

  • Sr. Member
  • ****
  • Posts: 383
HTML starter
« on: October 07, 2020, 07:00:22 am »
Hi All,

Can anyone point me in the right direction? I want to learn about parsing an HTML document.

In particular, I want to find a node "<div class="float-left pl-2 d-none d-lg-block">" and parse all siblings of this node. See below.

Quote
<div class="float-left pl-2 d-none d-lg-block">
    <div class="card">
        <div class="card-header font-weight-bold">
            Countries
        </div>
        <div class="card-body small">
            <span class="flag flag-af"></span>
            <a href="http://online-radio.eu/country/Afghanistan">Afghanistan (18)</a><br/>
            <span class="flag flag-al"></span>
            <a href="http://online-radio.eu/country/Albania">Albania (92)</a><br/>
            <span class="flag flag-dz"></span>
            <a href="http://online-radio.eu/country/Algeria">Algeria (46)</a><br/>
            <span class="flag flag-ad"></span>
            <a href="http://online-radio.eu/country/Andorra">Andorra (10)</a><br/>
            <span class="flag flag-ao"></span>
            ...

Thanks in advance.
Windows 10 / Linux Mint 20
Laz 2.10.0
FPC 3.2.0

trev

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 1209
  • Former Delphi 1-7, 10.2 User
Re: HTML starter
« Reply #1 on: October 07, 2020, 11:43:06 am »
See fcl-xml. I have no idea whether the comments from 2012 are still up to date or accurate.
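For illustration, here is a minimal sketch of how the fcl-xml route could look for the markup quoted in the first post. It assumes the page has already been saved locally as countries.html (a hypothetical file name) and that the HTML reader in the SAX_HTML unit copes with the page; real-world HTML can trip it up. The helper names FindDivByClass and PrintLinks are made up for this example.

```pascal
program ListCountryLinks;

{$mode objfpc}{$H+}

uses
  Classes, SysUtils, DOM, DOM_HTML, SAX_HTML;

const
  // Class string copied from the question; adjust it if the page changes.
  TargetClass = 'float-left pl-2 d-none d-lg-block';
  // Assumption: the page was downloaded beforehand under this name.
  InputFile = 'countries.html';

// True if ANode is an element with the given tag name (case-insensitive).
function IsElement(ANode: TDOMNode; const ATag: string): Boolean;
begin
  Result := (ANode.NodeType = ELEMENT_NODE) and
            SameText(string(ANode.NodeName), ATag);
end;

// Depth-first search for the first <div> whose class attribute equals AClass.
function FindDivByClass(ANode: TDOMNode; const AClass: DOMString): TDOMElement;
var
  Child: TDOMNode;
begin
  Result := nil;
  if ANode = nil then
    Exit;
  if IsElement(ANode, 'div') and
     (TDOMElement(ANode).GetAttribute('class') = AClass) then
    Exit(TDOMElement(ANode));
  Child := ANode.FirstChild;
  while (Child <> nil) and (Result = nil) do
  begin
    Result := FindDivByClass(Child, AClass);
    Child := Child.NextSibling;
  end;
end;

// Prints "URL | link text" for every <a> element below ANode.
procedure PrintLinks(ANode: TDOMNode);
var
  Child: TDOMNode;
begin
  if IsElement(ANode, 'a') then
    WriteLn(UTF8Encode(TDOMElement(ANode).GetAttribute('href')), ' | ',
            UTF8Encode(ANode.TextContent));
  Child := ANode.FirstChild;
  while Child <> nil do
  begin
    PrintLinks(Child);
    Child := Child.NextSibling;
  end;
end;

var
  Doc: THTMLDocument;
  Target: TDOMElement;
begin
  ReadHTMLFile(Doc, InputFile);
  try
    Target := FindDivByClass(Doc.DocumentElement, TargetClass);
    if Target = nil then
      WriteLn('Target div not found')
    else
      PrintLinks(Target);
  finally
    Doc.Free;
  end;
end.
```

The walk uses only FirstChild and NextSibling, so the same pattern can be pointed at the siblings of the target node instead of its descendants if that is what is needed.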
o Lazarus 2.1.0 r64368, FPC 3.3.1 r48100, macOS 10.14.6, Xcode 11.3.1
o Lazarus 2.1.0 r64392, FPC 3.3.1 Jan 13 21:24, macOS 11.1 (aarch64), Xcode 12.3
o Lazarus 2.1.0 r61574, FPC 3.3.1 r42318, FreeBSD 12.1 amd64 (VMware VM)
o Lazarus 2.1.0 r61574, FPC 3.0.4, Ubuntu 20.04 (Parallels VM)

krolikbest

  • Full Member
  • ***
  • Posts: 145
Re: HTML starter
« Reply #2 on: October 07, 2020, 02:45:22 pm »
Or fasthtmlparser. I use it to parse data from HTML tables.
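A rough sketch of the event-driven style, assuming the FastHTMLParser unit that ships with FPC and its THTMLParser class with the OnFoundTag and OnFoundText events. The input file name countries.html and the TLinkCollector helper are hypothetical. Unlike a DOM approach, this simply lists every link on the page and does not restrict itself to one div.

```pascal
program ListLinksFast;

{$mode objfpc}{$H+}

uses
  Classes, SysUtils, FastHTMLParser;

type
  // The parser events are methods, so a small helper class holds the state.
  TLinkCollector = class
  private
    FPendingHref: string;  // href of the <a> tag seen last, if any
  public
    procedure FoundTag(NoCaseTag, ActualTag: string);
    procedure FoundText(Text: string);
  end;

procedure TLinkCollector.FoundTag(NoCaseTag, ActualTag: string);
const
  Key = 'HREF="';
var
  StartPos, EndPos: Integer;
  Rest: string;
begin
  // Any new tag (including </a>) ends the previous link.
  FPendingHref := '';
  if Copy(NoCaseTag, 1, 3) <> '<A ' then
    Exit;
  // Locate the attribute in the upper-cased tag, then read the value
  // from the original tag so the URL keeps its case.
  StartPos := Pos(Key, NoCaseTag);
  if StartPos = 0 then
    Exit;
  Rest := Copy(ActualTag, StartPos + Length(Key), MaxInt);
  EndPos := Pos('"', Rest);
  if EndPos > 0 then
    FPendingHref := Copy(Rest, 1, EndPos - 1);
end;

procedure TLinkCollector.FoundText(Text: string);
begin
  // The first non-blank text after an <a> tag is taken as the link text.
  if (FPendingHref <> '') and (Trim(Text) <> '') then
  begin
    WriteLn(FPendingHref, ' | ', Trim(Text));
    FPendingHref := '';
  end;
end;

var
  Source: TStringList;
  Html: string;
  Collector: TLinkCollector;
  Parser: THTMLParser;
begin
  Source := TStringList.Create;
  Collector := TLinkCollector.Create;
  try
    // Assumption: the page was saved locally beforehand.
    Source.LoadFromFile('countries.html');
    Html := Source.Text;
    Parser := THTMLParser.Create(Html);
    try
      Parser.OnFoundTag := @Collector.FoundTag;
      Parser.OnFoundText := @Collector.FoundText;
      Parser.Exec;
    finally
      Parser.Free;
    end;
  finally
    Collector.Free;
    Source.Free;
  end;
end.
```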

pcurtis

  • Sr. Member
  • ****
  • Posts: 383
Re: HTML starter
« Reply #3 on: October 07, 2020, 03:03:07 pm »
Thanks All.

BobDog

  • Jr. Member
  • **
  • Posts: 69
Re: HTML starter
« Reply #4 on: October 07, 2020, 03:13:36 pm »

Here is a little HTML extractor. It fetches the HTML code of a web page of your choice, along with the equivalent plain text.
It is Windows only and uses the VBScript engine.
Code: Pascal
program WebPageToText;

{$mode objfpc}{$H+}
{$COPERATORS ON}  // enables the += operator used to build the script

// C runtime system() call, used to launch cscript.exe (Windows only).
function CSystem(s: PChar): Integer; cdecl; external 'msvcrt.dll' name 'system';

var
  g: AnsiString = '';

// Writes Text to the file FName, replacing any existing content.
procedure SaveFile(const FName: string; const Text: AnsiString);
var
  T: TextFile;
begin
  AssignFile(T, FName);
  Rewrite(T);
  try
    Writeln(T, Text);
  finally
    CloseFile(T);
  end;
end;

// Deletes the file FName. The file variable must be assigned before Erase.
procedure RemoveFile(const FName: string);
var
  T: TextFile;
begin
  AssignFile(T, FName);
  Erase(T);
end;

// Runs a VBScript file through the Windows Script Host.
procedure RunScript(const FileName: AnsiString);
begin
  CSystem(PChar('cscript.exe /Nologo ' + FileName));
end;

begin
  // Build the VBScript source line by line.
  g += 'Const TriStateTrue = -1 '+chr(10);
  g += 'URL = InputBox("Enter (or paste) the URL to extract the Code "&vbcr&vbcr&_'+chr(10);
  g += '"Example ""https://www.freebasic.net""","Extraction of Source text and html  ","https://forum.lazarus.freepascal.org/index.php?action=forum")'+chr(10);
  g += 'If URL = "" Then WScript.Quit'+chr(10);
  g += 'Titre = "Source code extraction for " & URL'+chr(10);
  g += 'Set ie = CreateObject("InternetExplorer.Application")'+chr(10);
  g += 'Set objFSO = CreateObject("Scripting.FileSystemObject")'+chr(10);
  g += 'ie.Navigate(URL)'+chr(10);
  g += 'ie.Visible=false'+chr(10);
  g += 'DO WHILE ie.busy'+chr(10);
  g += 'LOOP'+chr(10);
  g += 'DataHTML = ie.document.documentElement.innerHTML'+chr(10);
  g += 'DataTxt = ie.document.documentElement.innerText'+chr(10);
  g += 'strFileHTML = "CodeSourceHTML.txt"'+chr(10);
  g += 'strFileTxt = "CodeSourceTxt.txt"'+chr(10);
  g += 'Set objHTMLFile = objFSO.OpenTextFile(strFileHTML,2,True, TriStateTrue)'+chr(10);
  g += 'objHTMLFile.WriteLine(DataHTML)'+chr(10);
  g += 'objHTMLFile.Close'+chr(10);
  g += 'Set objTxtFile = objFSO.OpenTextFile(strFileTxt,2,True, TriStateTrue)'+chr(10);
  g += 'objTxtFile.WriteLine(DataTxt)'+chr(10);
  g += 'objTxtFile.Close'+chr(10);
  g += 'ie.Quit'+chr(10);
  g += 'Set ie=Nothing'+chr(10);
  g += ' Ouvrir(strFileHTML)'+chr(10);
  g += ' Ouvrir(strFileTxt)'+chr(10);
  g += 'wscript.Quit'+chr(10);
  g += 'Function Ouvrir(File)'+chr(10);
  g += '    Set ws=CreateObject("wscript.shell")'+chr(10);
  g += '    ws.run "Notepad.exe "& File,1,False'+chr(10);
  g += 'end Function'+chr(10);

  SaveFile('script.vbs', g);
  RunScript('script.vbs');

  Writeln('Press enter to end . . .');
  Readln;
  // Clean up the temporary script file.
  RemoveFile('script.vbs');
end.
Originally written in FreeBASIC.
It might be useful -- who knows?
Auld Lang Syne


wp

  • Hero Member
  • *****
  • Posts: 7957
Re: HTML starter
« Reply #5 on: October 07, 2020, 04:41:49 pm »
Quote
In particular, I want to find a node "<div class="float-left pl-2 d-none d-lg-block">" and parse all siblings of this node.
What exactly do you want to get? Do you want a list of the country names and the corresponding URLs?
Mainly Lazarus trunk / fpc 3.2.0 / all 32-bit on Win-10, but many more...

 
