
Author Topic: strip text  (Read 3675 times)

Hansvb

  • Hero Member
  • *****
  • Posts: 860
strip text
« on: September 20, 2015, 11:47:15 am »
What is the best way to strip text between tags?

<tag> remove this text<tag>

result: <tag ><tag>

wp

  • Hero Member
  • *****
  • Posts: 13199
Re: strip text
« Reply #1 on: September 20, 2015, 12:07:24 pm »
Use fasthtmlparser. Write an event handler for OnFoundTag which concatenates all tags provided as a parameter to a single string. Something like this, not tested:

Code: [Select]
uses
  fasthtmlparser;

type
  TTextStripper = class(THTMLParser)
  private
    FTags: String;
    procedure TagFoundHandler(NoCaseTag, ActualTag: string);
  public
    constructor Create(AText: String);
    property AllTags: String read FTags;
  end;

constructor TTextStripper.Create(AText: String);
begin
  inherited Create(AText);
  FTags := '';
  OnFoundTag := @TagFoundHandler;
end;

procedure TTextStripper.TagFoundHandler(NoCaseTag, ActualTag: String);
begin
  FTags := FTags + ActualTag;
end;

-----------

var
  textStripper: TTextStripper;

begin
  textStripper := TTextStripper.Create(text_with_html_tags);
  try
    textStripper.Exec;  // THTMLParser runs the parse with Exec
    stripped_tags := textStripper.AllTags;
  finally
    textStripper.Free;
  end;
end;

Roland57

  • Hero Member
  • *****
  • Posts: 527
    • msegui.net
Re: strip text
« Reply #2 on: September 20, 2015, 12:27:57 pm »
Hello!

Maybe something like this?

Code: [Select]
uses
  SysUtils, RegExpr;

function MyFunction(s: string): string;
const
  PATTERN = '(<.+>).*(</.+>)';
var
  r: TRegExpr;
begin
  r := TRegExpr.Create;
  try
    r.Expression := PATTERN;
    // Keep only the opening and closing tag, dropping the text between them.
    if r.Exec(s) then
      result := r.Match[1] + r.Match[2]
    else
      result := '';
  finally
    r.Free;
  end;
end;

const
  SAMPLE: array[0..1] of string = (
    '<h1>heading</h1>',
    '<p>paragraph</p>'
  );

var
  s: string;
 
begin
  for s in SAMPLE do
    WriteLn(MyFunction(s));
  ReadLn;
end.

Quote
<h1></h1>
<p></p>
My projects are on Codeberg.

Hansvb

  • Hero Member
  • *****
  • Posts: 860
Re: strip text
« Reply #3 on: September 20, 2015, 09:18:00 pm »
My question was not clear enough.
I do not want to clean up all tags, only one kind of tag.

So if there are 2 types of tags, like in your example:
  '<h1>heading</h1>',
    '<p>paragraph</p>'

I only want to clean up the H1 line.
The result must be:

  '<h1></h1>',
    '<p>paragraph</p>'


I changed the pattern to:
Code: [Select]
PATTERN = '(<tab_nr>).*(<tab_nr>)';

That did not work.
My actual tag looks like this: <tab_nr>delete this text...<tab_nr>

Roland57

  • Hero Member
  • *****
  • Posts: 527
    • msegui.net
Re: strip text
« Reply #4 on: September 20, 2015, 09:40:00 pm »
Is this better ?

Code: [Select]
uses
  SysUtils, RegExpr;

function MyFunction(s: string): string;
const
  PATTERN = '(<tab_nr>).*(<tab_nr>)';
var
  r: TRegExpr;
begin
  r := TRegExpr.Create;
  try
    r.Expression := PATTERN;
    if r.Exec(s) then
      result := r.Match[1] + r.Match[2]
    else
      result := s; // <--- no match: return the input unchanged
  finally
    r.Free;
  end;
end;

const
  SAMPLE: array[0..1] of string = (
    '<xxx>bla<xxx>',
    '<tab_nr>bla<tab_nr>'
  );

var
  s: string;
 
begin
  for s in SAMPLE do
    WriteLn(MyFunction(s));
end.

Quote
<xxx>bla<xxx>
<tab_nr><tab_nr>
My projects are on Codeberg.
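
A note on the pattern above: `.*` is greedy and the function rebuilds the result from a single match, so a string holding two <tab_nr> pairs would lose everything between the first and the last marker, including text that sits outside the pairs. A minimal sketch of one way around that, assuming the same RegExpr unit, is to use the non-greedy `.*?` together with TRegExpr.Replace so that every pair is emptied and the rest of the string is kept. StripBetween is a hypothetical helper name, not something from the posts above, and it assumes the tag name contains no regex metacharacters.

```pascal
program StripTagText;

{$mode objfpc}{$H+}

uses
  RegExpr;

// Hypothetical helper: empties every <ATag>...<ATag> pair in S.
// Assumption: ATag contains no regex metacharacters.
function StripBetween(const S, ATag: string): string;
var
  r: TRegExpr;
begin
  r := TRegExpr.Create;
  try
    // .*? is non-greedy, so each match stops at the nearest closing marker.
    r.Expression := '(<' + ATag + '>).*?(<' + ATag + '>)';
    // The third argument enables $1/$2 substitution in the replacement.
    Result := r.Replace(S, '$1$2', True);
  finally
    r.Free;
  end;
end;

begin
  // Two pairs in one string: both are emptied, ' keep ' survives.
  WriteLn(StripBetween('<tab_nr>one<tab_nr> keep <tab_nr>two<tab_nr>', 'tab_nr'));
  // No <tab_nr> in the input: the string comes back unchanged.
  WriteLn(StripBetween('<xxx>bla<xxx>', 'tab_nr'));
end.
```

Because Replace leaves unmatched text alone, no separate "else result := s" branch is needed here.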

BeniBela

  • Hero Member
  • *****
  • Posts: 947
    • homepage
Re: strip text
« Reply #5 on: September 21, 2015, 11:46:51 pm »
Recently I added a function to map HTML elements through a lambda function to my Internet Tools. That would do it in one line:

Code: [Select]
process('/tmp/test.html', 'outer-html(transform(/, function ($e) { typeswitch($e) case element(h1) return <h1/> default return $e}))').toString

It is slow, though. As a side effect, it normalizes the HTML.

Jurassic Pork

  • Hero Member
  • *****
  • Posts: 1290
Re: strip text
« Reply #6 on: September 22, 2015, 01:38:31 am »
Hello,
if your input string is multiline, like this:
Code: [Select]
const StringTestA =   '<first>first<first>' + #13#10 +
                     '<second>second<second>' +  #13#10 +
                     '<tag_nr>blabla<tag_nr>' +   #13#10 +
                     '<third>third<third>'  +   #13#10 +
                     '<tag_nr>dummy text<tag_nr>';

you can use the Replace function of BeRo's regex engine FLRE:
Code: [Select]
var
  ResultStr: String;
  regexFLRE: TFLRE;
  pattern: TFLRERawByteString;
begin
  WriteLn('=========   Test Hansvb   ==========');
  // In multiline mode ^ and $ anchor at every line, not only at the string ends.
  pattern := '^(<tag_nr>).*(<tag_nr>)$';
  regexFLRE := TFLRE.Create(pattern, [rfMULTILINE]);
  try
    ResultStr := regexFLRE.Replace(StringTestA, '$1$2');
    WriteLn(ResultStr);
  finally
    regexFLRE.Free;
  end;
end;

Result :
Quote
=========   Test Hansvb   ==========
<first>first<first>
<second>second<second>
<tag_nr><tag_nr>
<third>third<third>
<tag_nr><tag_nr>
   ;)

Friendly, J.P
« Last Edit: September 22, 2015, 01:54:51 am by Jurassic Pork »
Jurassic computer : Sinclair ZX81 - Zilog Z80A à 3,25 MHz - RAM 1 Ko - ROM 8 Ko

 
