### Topic: Trying not to use Regular Expression...

#### Gustavo 'Gus' Carreno

• Hero Member
• Posts: 793
• Professional amateur ;-P
##### Trying not to use Regular Expression...
« on: September 17, 2021, 05:22:40 pm »
Hey Y'all,

Let's say I have this string:
```pascal
var
  line: String;
begin
  line := '<tr><th colspan=''5'' class=''bighead''><a href=''#smileys_&amp;_emotion'' name=''smileys_&amp;_emotion''>Smileys &amp; Emotion</a></th></tr>';
end;
```

Quote
Note:
Yes, I know, the author of the HTML opted for single quotes, not double quotes, the biatch!!
So inconsiderate to us Pascal peeps!

As the maxim goes: I have an HTML parsing problem and I want to solve it with RegExps... Now you have 2 problems!!

So I want to extract the text "Smileys &amp; Emotion" that sits between the <a> and </a> tags.

How would I go about it without using Regular Expressions?

The best I can come up with is:
```pascal
var
  left, right: Integer;
  line, extract: String;
begin
  line := '<tr><th colspan=''5'' class=''bighead''><a href=''#smileys_&amp;_emotion'' name=''smileys_&amp;_emotion''>Smileys &amp; Emotion</a></th></tr>';
  left := Pos('''>', line);             // quote that closes the last TH attribute
  Delete(line, 1, left + 1);            // deletes everything up to and including the TH tag
  left := Pos('''>', line);             // quote that closes the last A attribute
  Delete(line, 1, left + 1);            // deletes the opening A tag
  right := Pos('</a>', line);
  extract := Copy(line, 1, right - 1);  // the text in front of </a>
end;
```

But this is messy and I'm prone to mess up those indexes and counts on both the Delete and Copy.

Can anyone suggest a better way than mine but with the caveat that we don't use RegExps?
Is there another pattern matching thingamaboob that can manage this, hidden inside an obscure String manipulation unit that I'm unaware of?

Cheers,
Gus
Lazarus 2.3.0(trunk) FPC 3.3.1(trunk) Ubuntu 21.10 64b Dark Theme
Lazarus 2.0.12(stable) FPC 3.2.2(stable) Ubuntu 21.10 64b Dark Theme
http://github.com/gcarreno

#### winni

• Hero Member
• Posts: 2793
##### Re: Trying not to use Regular Expression...
« Reply #1 on: September 17, 2021, 05:40:27 pm »
Hi!

Very simple code without error checking:

```pascal
// Collects the content of every tag (the part between '<' and '>').
// Assumes line holds the HTML string from the first post.
var
  p, q: Integer;
  sl: TStringList;
begin
  sl := TStringList.Create;
  repeat
    p := Pos('<', line);
    if p > 0 then
    begin
      Delete(line, 1, p);
      q := Pos('>', line);
      if q > 0 then
        sl.Add(Copy(line, 1, q - 1));
    end;
  until (p = 0) or (q = 0);
  ShowMessage(sl.Text);
  sl.Free;
end;
```

Winni
« Last Edit: September 17, 2021, 05:48:13 pm by winni »

#### winni

• Hero Member
• Posts: 2793
##### Re: Trying not to use Regular Expression...
« Reply #2 on: September 17, 2021, 06:15:23 pm »
Hi!

Ooops - that was the wrong answer.
You want the text between the tags.

Here we go:

```pascal
procedure TForm1.Button3Click(Sender: TObject);
var
  line: String;
  sl: TStringList;
  p, q, r: Integer;
begin
  line := '<tr><th colspan=''5'' class=''bighead''><a href=''#smileys_&amp;_emotion'' name=''smileys_&amp;_emotion''>Smileys &amp; Emotion</a></th></tr>';
  sl := TStringList.Create;
  repeat
    p := Pos('<a', line);
    if p > 0 then
    begin
      Delete(line, 1, p + 1);            // remove everything up to and including '<a'
      q := Pos('>', line);
      if q > 0 then
      begin
        Delete(line, 1, q);              // remove the rest of the opening tag
        r := Pos('</a>', line);
        if r > 0 then
          sl.Add(Copy(line, 1, r - 1));  // text in front of the closing tag
      end; // q
    end; // p
  until (p = 0) or (q = 0);
  ShowMessage(sl.Text);
  sl.Free;
end;
```

Winni

#### Gustavo 'Gus' Carreno

• Hero Member
• Posts: 793
• Professional amateur ;-P
##### Re: Trying not to use Regular Expression...
« Reply #3 on: September 17, 2021, 06:17:32 pm »
Hey Winni,

From just looking at the code, I'm guessing that this will return tr, which is the first tag.

I guess I can use a modified approach to eliminate all the tags and be left with only the string I'm looking for.
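
A minimal sketch of that "eliminate all the tags" idea, assuming a single line of well-formed markup with no stray '<' or '>' inside the text. The name StripTags is hypothetical and not taken from any FPC unit:

```pascal
program StripTagsDemo;
{$mode objfpc}{$H+}

// Hypothetical helper: copies every character that is not inside a <...> tag.
// Assumption: '<' and '>' only ever appear as tag delimiters.
function StripTags(const S: String): String;
var
  i: Integer;
  insideTag: Boolean;
begin
  Result := '';
  insideTag := False;
  for i := 1 to Length(S) do
    case S[i] of
      '<': insideTag := True;
      '>': insideTag := False;
    else
      if not insideTag then
        Result := Result + S[i];
    end;
end;

const
  Line = '<tr><th colspan=''5'' class=''bighead''><a href=''#smileys_&amp;_emotion'' name=''smileys_&amp;_emotion''>Smileys &amp; Emotion</a></th></tr>';
begin
  WriteLn(StripTags(Line));
end.
```

On a line with several text nodes this would glue all of them together, so it only fits lines like the one above where a single piece of text is wrapped in tags.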

Not a complete solution but it does point me to another solution I haven't thought of. For that alone, MANY thanks Winni!!!

Cheers,
Gus

#### Gustavo 'Gus' Carreno

• Hero Member
• Posts: 793
• Professional amateur ;-P
##### Re: Trying not to use Regular Expression...
« Reply #4 on: September 17, 2021, 06:22:37 pm »
Hey Winni,

Yup, this makes a bit more sense, THANKS!!!

I'll have a better look at this and generalize it for the other 3 lines I have to parse.
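
One way such a generalization might look, as a sketch only: it uses Pos and PosEx (from the StrUtils unit) to find positions instead of deleting from the string, so there is less index arithmetic to get wrong. The function name TextOfFirstElement is hypothetical, and the match is case sensitive and naive (searching for 'a' would also match a tag such as '<abbr>'):

```pascal
program ExtractDemo;
{$mode objfpc}{$H+}

uses
  StrUtils; // PosEx

// Hypothetical helper: returns the text between the first "<TagName ...>"
// and the next "</TagName>" in S, or '' when either one is missing.
function TextOfFirstElement(const S, TagName: String): String;
var
  openPos, gtPos, closePos: SizeInt;
begin
  Result := '';
  openPos := Pos('<' + TagName, S);
  if openPos = 0 then Exit;
  gtPos := PosEx('>', S, openPos);           // end of the opening tag
  if gtPos = 0 then Exit;
  closePos := PosEx('</' + TagName + '>', S, gtPos + 1);
  if closePos = 0 then Exit;
  Result := Copy(S, gtPos + 1, closePos - gtPos - 1);
end;

const
  Line = '<tr><th colspan=''5'' class=''bighead''><a href=''#smileys_&amp;_emotion'' name=''smileys_&amp;_emotion''>Smileys &amp; Emotion</a></th></tr>';
begin
  WriteLn(TextOfFirstElement(Line, 'a'));
end.
```

Because the source string is left untouched, the same helper can be called again with another tag name on the same line.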

And I'll also wait for more people to chime in with some other approach. I like variety

Cheers,
Gus

#### wp

• Hero Member
• Posts: 9017
##### Re: Trying not to use Regular Expression...
« Reply #5 on: September 17, 2021, 06:40:35 pm »
My standard solution in such cases is the fasthtmlparser of FCL. Search the forum, I presented worked-out solutions here several times already. The basic idea is that this parser runs through the text and fires an event OnFoundTag for every tag (part between '<' and '>') and OnFoundText for every text (part between '>' and '<').  Super easy to extract text or table content, but maybe a bit overkill for such a short text phrase as requested here...
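
A rough sketch of the event-based approach described above, assuming the fasthtmlparser unit as shipped with FPC 3.2.x: a THTMLParser created from a string, with the OnFoundTag and OnFoundText events and the Exec method. The class TAnchorCollector and its members are hypothetical names, and the assumption that NoCaseTag arrives upper-cased with its angle brackets should be checked against the installed unit:

```pascal
program FastParserDemo;
{$mode objfpc}{$H+}

uses
  fasthtmlparser;

type
  // Hypothetical helper class: the parser events are methods ("of object"),
  // so the handlers have to live in a class.
  TAnchorCollector = class
  private
    FInAnchor: Boolean;
  public
    AnchorText: String;
    procedure FoundTag(NoCaseTag, ActualTag: String);
    procedure FoundText(Text: String);
  end;

procedure TAnchorCollector.FoundTag(NoCaseTag, ActualTag: String);
begin
  // Assumption: NoCaseTag is the upper-cased tag including '<' and '>'.
  if (Pos('<A ', NoCaseTag) = 1) or (NoCaseTag = '<A>') then
    FInAnchor := True
  else if NoCaseTag = '</A>' then
    FInAnchor := False;
end;

procedure TAnchorCollector.FoundText(Text: String);
begin
  // Keep only the text that appears between <a ...> and </a>.
  if FInAnchor then
    AnchorText := AnchorText + Text;
end;

const
  Line = '<tr><th colspan=''5'' class=''bighead''><a href=''#smileys_&amp;_emotion'' name=''smileys_&amp;_emotion''>Smileys &amp; Emotion</a></th></tr>';
var
  collector: TAnchorCollector;
  parser: THTMLParser;
begin
  collector := TAnchorCollector.Create;
  parser := THTMLParser.Create(Line);
  try
    parser.OnFoundTag := @collector.FoundTag;
    parser.OnFoundText := @collector.FoundText;
    parser.Exec;
    WriteLn(collector.AnchorText);
  finally
    parser.Free;
    collector.Free;
  end;
end.
```

The handlers only flip a flag and append text, so the per-event cost stays small even though one event fires for every tag and every text run.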
Mainly Lazarus trunk / fpc 3.2.0 / all 32-bit on Win-10, but many more...

#### Gustavo 'Gus' Carreno

• Hero Member
• Posts: 793
• Professional amateur ;-P
##### Re: Trying not to use Regular Expression...
« Reply #6 on: September 17, 2021, 06:57:54 pm »
Hey WP,

Ooooooohhhhh, I'm always on the lookout for a fine HTML parser lib. I don't think I've stumbled on this one!!!

Thanks WP, I'll have a search for your posts!!!

And yeah, a bit overkill since one file is ~6MB and the other file is ~33MB. Having events fired for each tag and text, OUCH

These are the Emoji lists that I mentioned in another post.
I'm trying to extract info from them and I'm gonna put them in a JSON container, maybe...
Still need to suss out how I'm gonna include such big datasets in my EmojiMap application.
Will need to see how phat is the JSON that I'll extract and if not that big, maybe just add it as a Resource.

Nonetheless, many thanks for another approach!!

C'mon my fellow Pascalites, keep'em coming

Cheers,
Gus

#### BobDog

• Full Member
• Posts: 168
##### Re: Trying not to use Regular Expression...
« Reply #7 on: September 18, 2021, 12:45:52 am »

A no frills way:
```pascal
program NoFrills;
{$mode objfpc}{$H+}

type
  intArray = array of Int32;
  AOS = array of AnsiString;

// Counts the non-overlapping occurrences of partstring in somestring.
// On return arr[0] holds the count and arr[1..count] the 1-based start positions.
function tally(const somestring, partstring: AnsiString; var arr: intArray): Integer;
var
  i, j, lnp, count: Integer;
  found: Boolean;
begin
  lnp := Length(partstring);
  count := 0;
  SetLength(arr, 1);
  i := 1;
  while (lnp > 0) and (i <= Length(somestring) - lnp + 1) do
  begin
    found := True;
    for j := 1 to lnp do
      if somestring[i + j - 1] <> partstring[j] then
      begin
        found := False;
        Break;
      end;
    if found then
    begin
      Inc(count);
      SetLength(arr, count + 1);
      arr[count] := i;
      Inc(i, lnp); // continue after the match
    end
    else
      Inc(i);
  end;
  arr[0] := count; // save tally in [0]
  Result := count;
end;

// Pairs the n-th occurrence of chars1 with the n-th occurrence of chars2
// and stores the text between each pair in ans.
procedure getstrings(const a, chars1, chars2: AnsiString; var ans: AOS);
var
  f, s: intArray;
  i, n, start: Int32;
begin
  SetLength(ans, 0);
  if tally(a, chars1, f) = 0 then Exit;
  if tally(a, chars2, s) = 0 then Exit;
  n := f[0];
  if s[0] < n then n := s[0];
  SetLength(ans, n);
  for i := 1 to n do
  begin
    start := f[i] + Length(chars1);
    ans[i - 1] := Copy(a, start, s[i] - start);
  end;
end;

var
  a: AnsiString;
  ans: AOS;
  i: Int32;
begin
  a := 'starters <a first bit a> qwerty <a second bit a> tree <a third bit a> <a fourth bit a> tail end ';
  a := a + '<a second last bit a><a last bita>  <aOUT, Press return to end . . .a>';

  getstrings(a, '<a', 'a>', ans);
  for i := Low(ans) to High(ans) do
    WriteLn(ans[i]);
end.
```

#### Gustavo 'Gus' Carreno

• Hero Member
• Posts: 793
• Professional amateur ;-P
##### Re: Trying not to use Regular Expression...
« Reply #8 on: September 18, 2021, 12:50:50 am »
Hey BobDog,

A no frills way:

WOW, that's a big hunk'a code to digest, LOL!!

Let me have some time to have a good look at it

Many thanks for the contribution, nonetheless!!!

I'm quite grateful that I have so many options to choose from!!!

Cheers,
Gus

#### Jurassic Pork

• Hero Member
• Posts: 1063
##### Re: Trying not to use Regular Expression...
« Reply #9 on: September 18, 2021, 02:32:46 am »
Hello,
you can also use the units sax_html and dom_html from FPC.
Here is an example of how to use them (based on Leledumbo's source code):

```pascal
program ParseHtml;
{$mode objfpc}{$H+}
uses
  Classes, sax_html, dom_html, dom;
const
  testdata = '<tr><th colspan=''5'' class=''bighead''>' +
    '<a href=''#smileys_&amp;_emotion'' name=''smileys_&amp;_emotion''>Smileys &amp; Emotion</a>' +
    '</th></tr>';
var
  doc: THTMLDocument;
  els: TDOMNodeList;
  src: TStringStream;
begin
  src := TStringStream.Create(testdata);
  try
    ReadHTMLFile(doc, src); // parses the stream and creates doc
  finally
    src.Free;
  end;
  try
    els := doc.GetElementsByTagName('a');
    // display the text of the first tag found
    if els.Count > 0 then
      WriteLn(TDOMElement(els[0]).TextContent);
  finally
    doc.Free;
  end;
end.
```

Friendly, J.P
Jurassic computer : Sinclair ZX81 - Zilog Z80A à 3,25 MHz - RAM 1 Ko - ROM 8 Ko

#### Gustavo 'Gus' Carreno

• Hero Member
• Posts: 793
• Professional amateur ;-P
##### Re: Trying not to use Regular Expression...
« Reply #10 on: September 18, 2021, 03:29:04 am »
Hey J.P.,

you can also use the units sax_html, dom_html from fpc.

Many thanks for another HTML parser suggestion!!

I'll have a look at this one also!!

But for the sizes that I'm looking at: ~6MB and ~33MB, having the whole DOM in mem is a bit too much

Nonetheless I welcome another parser to my list. Like I said, I'm always looking for these gems!!

Cheers,
Gus