Topic: Trying not to use Regular Expression...

Gustavo 'Gus' Carreno

« on: September 17, 2021, 05:22:40 pm »
Hey Y'all,

Let's say I have this string:
Code: Pascal
  var
    line: String;
  begin
    line:= '<tr><th colspan=''5'' class=''bighead''><a href=''#smileys_&amp;_emotion'' name=''smileys_&amp;_emotion''>Smileys &amp; Emotion</a></th></tr>';
  end;

Quote
Note:
Yes, I know, the author of the HTML opted for single quotes, not double quotes, the biatch!! ;)
So inconsiderate to us Pascal peeps! :P

As the maxim goes: I have an HTML parsing problem and I want to solve it with RegExps... Now you have 2 problems!!

So I want to extract the text Smileys &amp; Emotion that sits between the <a>..</a> tag.

How would I go about it without using Regular Expressions?

The best I can come up with is:
Code: Pascal
  var
    left, right: Integer;
    line, extract: String;
  begin
    line:= '<tr><th colspan=''5'' class=''bighead''><a href=''#smileys_&amp;_emotion'' name=''smileys_&amp;_emotion''>Smileys &amp; Emotion</a></th></tr>';
    left:= Pos('''>', line);
    Delete(line, 1, left+1); // Deletes everything up to and including the TH tag
    left:= Pos('''>', line);
    Delete(line, 1, left+1); // Deletes the A tag
    right:= Pos('</a>', line);
    extract:= Copy(line, 1, right-1); // Text before the closing </a>
  end;

But this is messy and I'm prone to mess up those indexes and counts on both the Delete and Copy.

Can anyone suggest a better way than mine but with the caveat that we don't use RegExps?
Is there another pattern matching thingamaboob that can manage this, hidden inside an obscure String manipulation unit that I'm unaware of?

Many, MANY thanks in advance!!

Cheers,
Gus
Lazarus 2.3.0(trunk) FPC 3.3.1(trunk) Ubuntu 21.04 64b Dark Theme
Lazarus 2.0.12(stable) FPC 3.2.2(stable) Ubuntu 21.04 64b Dark Theme
http://github.com/gcarreno
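
One way to keep the index arithmetic in a single place, without touching the original string, is a small helper built on the three-argument form of Pos. This is only a minimal sketch, assuming FPC 3.0 or later (where Pos accepts a start offset). TextBetween and its parameter names are hypothetical, not something from the thread or the RTL, and it does no real HTML parsing: an opening marker of '<a' would also match tags such as '<abbr'.

```pascal
program TextBetweenDemo;

{$mode objfpc}{$H+}

// Hypothetical helper: returns the text between the first tag that starts
// with OpenTag (searching from StartAt) and the next CloseTag.
// Attributes of the opening tag are skipped up to its closing '>'.
// Returns '' when any of the three markers is missing.
function TextBetween(const S, OpenTag, CloseTag: String; StartAt: SizeInt = 1): String;
var
  TagStart, TextStart, TextEnd: SizeInt;
begin
  Result := '';
  TagStart := Pos(OpenTag, S, StartAt);
  if TagStart = 0 then Exit;
  TextStart := Pos('>', S, TagStart + Length(OpenTag));
  if TextStart = 0 then Exit;
  Inc(TextStart); // first character after the opening tag's '>'
  TextEnd := Pos(CloseTag, S, TextStart);
  if TextEnd = 0 then Exit;
  Result := Copy(S, TextStart, TextEnd - TextStart);
end;

const
  Line = '<tr><th colspan=''5'' class=''bighead''>' +
         '<a href=''#smileys_&amp;_emotion'' name=''smileys_&amp;_emotion''>' +
         'Smileys &amp; Emotion</a></th></tr>';
begin
  WriteLn(TextBetween(Line, '<a', '</a>')); // prints: Smileys &amp; Emotion
end.
```

Because nothing is deleted from the input, the same call can be repeated with a larger StartAt to pick up further matches on the same line.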

winni

« Reply #1 on: September 17, 2021, 05:40:27 pm »
Hi!

Very simple code without error checking:

Code: Pascal
  var p,q : Integer;
      sl : TStringList;
  begin
    sl := TStringList.Create;
    repeat
      p := Pos('<', line);
      if p > 0 then
      begin
        Delete(line, 1, p);
        q := Pos('>', line);
        if q > 0 then sl.Add(Copy(line, 1, q-1));
      end;
    until (p=0) or (q=0);
    ShowMessage(sl.Text);
    sl.Free;
  end;

Winni
« Last Edit: September 17, 2021, 05:48:13 pm by winni »

winni

« Reply #2 on: September 17, 2021, 06:15:23 pm »
Hi!

Ooops - that was the wrong answer.
You want the text between the tags.

Here we go:

Code: Pascal
  procedure TForm1.Button3Click(Sender: TObject);
  var line : String;
      sl: TStringList;
      p,q,r : integer;
  begin
    line:= '<tr><th colspan=''5'' class=''bighead''><a href=''#smileys_&amp;_emotion'' name=''smileys_&amp;_emotion''>Smileys &amp; Emotion</a></th></tr>';
    sl := TStringList.Create;
    repeat
      p := Pos('<a', line);
      if p > 0 then
      begin
        Delete(line, 1, p+1);   // drop everything up to and including '<a'
        q := Pos('>', line);
        if q > 0 then
        begin
          Delete(line, 1, q);   // drop the rest of the opening tag
          r := Pos('</a>', line);
          if r > 0 then sl.Add(Copy(line, 1, r-1));
        end; // q
      end; // p
    until (p=0) or (q=0);
    ShowMessage(sl.Text);
    sl.Free;
  end;


Winni

Gustavo 'Gus' Carreno

« Reply #3 on: September 17, 2021, 06:17:32 pm »
Hey Winni,

From just looking at the code, I'm guessing that this will return tr, which is the first tag.

I guess I can use a modified approach to eliminate all the tags and be left with only the string I'm looking for.

Not a complete solution but it does point me to another solution I haven't thought of. For that alone, MANY thanks Winni!!!

Cheers,
Gus

Gustavo 'Gus' Carreno

« Reply #4 on: September 17, 2021, 06:22:37 pm »
Hey Winni,

Yup, this makes a bit more sense, THANKS!!!

I'll have a better look at this and generalize it for the other 3 lines I have to parse.

And I'll also wait for more people to chime in with some other approach. I like variety ;)

Cheers,
Gus

wp

« Reply #5 on: September 17, 2021, 06:40:35 pm »
My standard solution in such cases is the fasthtmlparser of FCL. Search the forum, I presented worked-out solutions here several times already. The basic idea is that this parser runs through the text and fires an event OnFoundTag for every tag (part between '<' and '>') and OnFoundText for every text (part between '>' and '<').  Super easy to extract text or table content, but maybe a bit overkill for such a short text phrase as requested here...
Mainly Lazarus trunk / fpc 3.2.0 / all 32-bit on Win-10, but many more...
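
The event-driven approach described above could look roughly like the sketch below. It assumes the fasthtmlparser unit as shipped with FPC 3.2.x, where THTMLParser exposes OnFoundTag and OnFoundText as method-pointer events taking plain String parameters; the declarations in FPC trunk may differ, so check the unit before relying on it. TAnchorText and its members are hypothetical names used only for illustration.

```pascal
program FastParserDemo;

{$mode objfpc}{$H+}

uses
  fasthtmlparser;

type
  // Hypothetical collector: the parser events are "of object", so the
  // handlers have to be methods of some class.
  TAnchorText = class
  private
    FInAnchor: Boolean;
  public
    procedure FoundTag(NoCaseTag, ActualTag: String);
    procedure FoundText(Text: String);
  end;

procedure TAnchorText.FoundTag(NoCaseTag, ActualTag: String);
begin
  // NoCaseTag is assumed to be the upper-cased tag, brackets included.
  if Pos('</A>', NoCaseTag) = 1 then
    FInAnchor := False
  else if (Pos('<A ', NoCaseTag) = 1) or (NoCaseTag = '<A>') then
    FInAnchor := True;
end;

procedure TAnchorText.FoundText(Text: String);
begin
  // Only report text that sits between <a ...> and </a>.
  if FInAnchor then
    WriteLn(Text);
end;

var
  Html: String;
  Parser: THTMLParser;
  Collector: TAnchorText;
begin
  Html := '<tr><th colspan=''5'' class=''bighead''>' +
          '<a href=''#smileys_&amp;_emotion'' name=''smileys_&amp;_emotion''>' +
          'Smileys &amp; Emotion</a></th></tr>';
  Collector := TAnchorText.Create;
  Parser := THTMLParser.Create(Html);
  try
    Parser.OnFoundTag := @Collector.FoundTag;
    Parser.OnFoundText := @Collector.FoundText;
    Parser.Exec; // walks the text once, firing the two events
  finally
    Parser.Free;
    Collector.Free;
  end;
end.
```

The parser only splits the input into tags and text, so keeping track of which tag is currently open (here a single Boolean) is left to the handlers.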

Gustavo 'Gus' Carreno

« Reply #6 on: September 17, 2021, 06:57:54 pm »
Hey WP,

Ooooooohhhhh, I'm always on the lookout for a fine HTML parser lib. I don't think I've stumbled on this one!!!

Thanks WP, I'll have a search for your posts!!!

And yeah, a bit overkill since one file is ~6MB and the other file is ~33MB. Having events fired for each tag and text, OUCH :)

These are the Emoji lists that I mentioned in another post.
I'm trying to extract info from them and I'm gonna put them in a JSON container, maybe...
Still need to suss out how I'm gonna include such big datasets in my EmojiMap application.
Will need to see how phat the JSON that I'll extract turns out to be and, if it's not that big, maybe just add it as a Resource.

Nonetheless, many thanks for another approach!!

C'mon my fellow Pascalites, keep'em coming :)

Cheers,
Gus

BobDog

« Reply #7 on: September 18, 2021, 12:45:52 am »

A no-frills way:
Code: Pascal
  {$mode objfpc}{$H+}
  {$goto on} // label/goto need this unless the compiler is run with -Sg

  Type
    intArray = Array of int32;
    AOS = array of ansistring;

  // ========= number of partstring in somestring and fill array =============//
  // arr[0] receives the tally, arr[1..tally] the 1-based start positions.
  function tally(somestring: pchar; partstring: pchar; var arr: intarray): integer;
  var
    i, j, ln, lnp, count, num: integer;
    filler: boolean;
  label
    skip, start, return;
  begin
    setlength(arr, 0);
    ln := length(somestring);
    lnp := length(partstring);
    filler := false;
  start:
    count := 0;
    i := -1;
    repeat
      i := i+1;
      if somestring[i] <> partstring[0] then goto skip;
      for j := 0 to lnp-1 do
      begin
        if somestring[j+i] <> partstring[j] then goto skip;
      end;
      count := count+1;
      if filler = true then arr[count] := i+1;
      i := i+lnp-1;
    skip:
    until i >= ln-1;
    SetLength(arr, count+1); // size is now known (slot 0 holds the tally), repeat the operation to fill arr
    arr[0] := count;         // save tally in [0]
    num := count;
    if filler = true then goto return;
    filler := true;
    goto start;
  return:
    result := num;
  end;

  // Pairs the i-th chars1 marker with the i-th chars2 marker and stores the text between them.
  procedure getstrings(a: ansistring; chars1: ansistring; chars2: ansistring; var ans: AOS);
  var
    f, s: intarray;
    i, first, count: int32;
  begin
    count := 0;
    if (tally(pchar(a), pchar(chars1), f) = 0) then exit;
    if (tally(pchar(a), pchar(chars2), s) = 0) then exit;
    for i := 1 to f[0] do
    begin
      if i > s[0] then break; // no closing marker left to pair with
      first := f[i]+length(chars1); // first character after the opening marker
      count := count+1;
      setlength(ans, count);
      ans[count-1] := copy(a, first, s[i]-first); // up to, not including, the closing marker
    end;
  end;

  var
    a: ansistring;
    ans: AOS;
    i: int32;

  begin
    a := 'starters <a first bit a> qwerty <a second bit a> tree <a third bit a> <a fourth bit a> tail end ';
    a := a+'<a second last bit a><a last bita>  <aOUT, Press return to end . . .a>';

    getstrings(a, '<a', 'a>', ans);
    for i := low(ans) to high(ans) do writeln(ans[i]);
    readln;
  end.

Gustavo 'Gus' Carreno

« Reply #8 on: September 18, 2021, 12:50:50 am »
Hey BobDog,

WOW, that's a big hunk'a code to digest, LOL!!

Let me have some time to have a good look at it :)

Many thanks for the contribution, nonetheless!!!

I'm quite grateful that I have so many options to choose from!!!

Cheers,
Gus

Jurassic Pork

« Reply #9 on: September 18, 2021, 02:32:46 am »
Hello,
you can also use the units sax_html and dom_html from FPC.
Here is an example of how to use them (adapted from Leledumbo's source code):
Code: Pascal
  Program ParseHtml;
  {$mode objfpc}{$H+}
  uses
    classes, sax_html, dom_html, dom;
  const
    testdata = '<tr><th colspan=''5'' class=''bighead''>' +
    '<a href=''#smileys_&amp;_emotion'' name=''smileys_&amp;_emotion''>Smileys &amp; Emotion</a>' +
    '</th></tr>';
  var
    doc: thtmldocument;
    els: tdomnodelist;
    src: tstringstream;
  begin
    src := tstringstream.create(testdata);
    try
      readhtmlfile(doc, src);
      try
        els := doc.GetElementsByTagName('a');
        // display the text of the first <a> element found
        if els.Count > 0 then writeln(tdomelement(els[0]).textcontent);
      finally
        doc.free;
      end;
    finally
      src.free; // the stream is not owned by the document
    end;
    readln;
  end.


Friendly, J.P
Jurassic computer: Sinclair ZX81 - Zilog Z80A at 3.25 MHz - 1 KB RAM - 8 KB ROM
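
For inputs the size Gus mentions in reply #6 (~6MB and ~33MB), building a full DOM may be more than is needed when only a handful of rows matter. The sketch below reads the file one line at a time and only looks at rows that carry the marker from the sample. It rests on assumptions the thread does not confirm: that each table row sits on its own line, and that the interesting rows contain class='bighead'. RowAnchorText is a hypothetical helper name, and the three-argument Pos needs FPC 3.0 or later.

```pascal
program ScanBigFile;

{$mode objfpc}{$H+}

// Hypothetical helper: returns the <a> text of a heading row, or '' for any
// other row. Assumes heading rows carry class='bighead' as in the sample line.
function RowAnchorText(const Row: String): String;
var
  TagStart, TextStart, TextEnd: SizeInt;
begin
  Result := '';
  if Pos('class=''bighead''', Row) = 0 then Exit;
  TagStart := Pos('<a', Row);
  if TagStart = 0 then Exit;
  TextStart := Pos('>', Row, TagStart);
  if TextStart = 0 then Exit;
  TextEnd := Pos('</a>', Row, TextStart);
  if TextEnd > 0 then
    Result := Copy(Row, TextStart + 1, TextEnd - TextStart - 1);
end;

var
  Source: TextFile;
  Row, Found: String;
begin
  if ParamCount < 1 then
  begin
    WriteLn('Usage: ScanBigFile <file.html>');
    Halt(1);
  end;
  AssignFile(Source, ParamStr(1));
  Reset(Source);
  try
    while not EOF(Source) do
    begin
      ReadLn(Source, Row); // only one row is held in memory at a time
      Found := RowAnchorText(Row);
      if Found <> '' then
        WriteLn(Found);
    end;
  finally
    CloseFile(Source);
  end;
end.
```

If the rows turn out to span several lines, this per-line filter no longer applies and one of the parsers suggested in the thread is the safer choice.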

Gustavo 'Gus' Carreno

« Reply #10 on: September 18, 2021, 03:29:04 am »
Hey J.P.,

Many thanks for another HTML parser suggestion!!

I'll have a look at this one also!!

But for the sizes that I'm looking at: ~6MB and ~33MB, having the whole DOM in mem is a bit too much :)

Nonetheless I welcome another parser to my list. Like I said, I'm always looking for these gems!!

Cheers,
Gus

 
