Topic: [Solved] Trouble with .xml file - Invalid xml characters...

Robert.Thompson

  • Jr. Member
  • **
  • Posts: 56
  • "A very bad coder."
    • Google Voice for Canadians
[Solved] Trouble with .xml file - Invalid xml characters...
« on: August 01, 2015, 08:26:59 pm »
Hello:

I have an XML file that is 'polluted' with invalid characters: ASCII characters that are not allowed in XML.

To remove these characters, I do this:

<snip>

procedure TForm1.Button1Click(Sender: TObject);
begin
  AssignFile(InFile, 'test.xml');
  AssignFile(OutFile, 'NewTest.xml');
  Reset(Infile);
  Rewrite(OutFile);
  while not EOF(InFile) do
  begin
    readln(InFile, LineRead);
    // list of invalid characters
    LineRead := StringReplace(LineRead, #0, '', [rfReplaceAll, rfIgnoreCase]);
    LineRead := StringReplace(LineRead, #1, '', [rfReplaceAll, rfIgnoreCase]);
    LineRead := StringReplace(LineRead, #2, '', [rfReplaceAll, rfIgnoreCase]);
    LineRead := StringReplace(LineRead, #3, '', [rfReplaceAll, rfIgnoreCase]);
    LineRead := StringReplace(LineRead, #4, '', [rfReplaceAll, rfIgnoreCase]);
    LineRead := StringReplace(LineRead, #5, '', [rfReplaceAll, rfIgnoreCase]);
    LineRead := StringReplace(LineRead, #6, '', [rfReplaceAll, rfIgnoreCase]);
    LineRead := StringReplace(LineRead, #7, '', [rfReplaceAll, rfIgnoreCase]);
    LineRead := StringReplace(LineRead, #8, '', [rfReplaceAll, rfIgnoreCase]);
    LineRead := StringReplace(LineRead, #11, '', [rfReplaceAll, rfIgnoreCase]);
    LineRead := StringReplace(LineRead, #12, '', [rfReplaceAll, rfIgnoreCase]);
    LineRead := StringReplace(LineRead, #14, '', [rfReplaceAll, rfIgnoreCase]);
    LineRead := StringReplace(LineRead, #15, '', [rfReplaceAll, rfIgnoreCase]);
    LineRead := StringReplace(LineRead, #16, '', [rfReplaceAll, rfIgnoreCase]);
    LineRead := StringReplace(LineRead, #17, '', [rfReplaceAll, rfIgnoreCase]);
    LineRead := StringReplace(LineRead, #18, '', [rfReplaceAll, rfIgnoreCase]);
    LineRead := StringReplace(LineRead, #19, '', [rfReplaceAll, rfIgnoreCase]);
    LineRead := StringReplace(LineRead, #20, '', [rfReplaceAll, rfIgnoreCase]);
    LineRead := StringReplace(LineRead, #21, '', [rfReplaceAll, rfIgnoreCase]);
    LineRead := StringReplace(LineRead, #22, '', [rfReplaceAll, rfIgnoreCase]);
    LineRead := StringReplace(LineRead, #23, '', [rfReplaceAll, rfIgnoreCase]);
    LineRead := StringReplace(LineRead, #24, '', [rfReplaceAll, rfIgnoreCase]);
    LineRead := StringReplace(LineRead, #25, '', [rfReplaceAll, rfIgnoreCase]);
    LineRead := StringReplace(LineRead, #26, '', [rfReplaceAll, rfIgnoreCase]);
    LineRead := StringReplace(LineRead, #27, '', [rfReplaceAll, rfIgnoreCase]);
    LineRead := StringReplace(LineRead, #28, '', [rfReplaceAll, rfIgnoreCase]);
    LineRead := StringReplace(LineRead, #29, '', [rfReplaceAll, rfIgnoreCase]);
    LineRead := StringReplace(LineRead, #30, '', [rfReplaceAll, rfIgnoreCase]);
    LineRead := StringReplace(LineRead, #223, '', [rfReplaceAll, rfIgnoreCase]);
    writeln(OutFile, LineRead);
  end;
  closefile(InFile);
  CloseFile(OutFile);

  Edit1.Text := 'Done!';

end;                         

<snip>

The problem is that if a new, different character slips in, the program will not handle it, because it will not be in my "list of invalid characters".

Does anyone know of a way to get rid of all possible invalid characters?

Thanks,
« Last Edit: August 12, 2015, 05:27:28 pm by Robert.Thompson »
Lazarus:  1.8.4  2018-11-17
FPC:   3.0.4 x86_64-linux-gtk2
System:   Kernel: 4.15.0-39-generic x86_64 bits: 64 gcc: 7.3.0 Cinnamon 3.8.9 Linux Mint 19 Tara
              Phoenix v: 11JB.M044.20100622.hkk date: 06/22/2010
Intel Core i5 M 460 (-MT-MCP-) arch: Nehalem rev.5 cache: 3072 KB
NVIDIA GeForce 310M

Blestan

  • Sr. Member
  • ****
  • Posts: 461
Re: Trouble with .xml file - Invalid xml characters...
« Reply #1 on: August 01, 2015, 09:40:00 pm »
hi!
this is very bad code.
my approach would be to find all the "valid" chars and replace the bad ones with a space.
if you are not using any UTF-8, then the valid ones are between #32 and 'z'.
read into a buffer, replace, and write back.
Speak postscript or die!
Translate to pdf and live!

Robert.Thompson

Re: Trouble with .xml file - Invalid xml characters...
« Reply #2 on: August 01, 2015, 09:56:51 pm »

Quote from Blestan: "this is very bad code."


Yes!

I changed my signature to reflect your observation!  :)
« Last Edit: August 02, 2015, 12:43:27 am by Robert.Thompson »

Blestan

Re: Trouble with .xml file - Invalid xml characters...
« Reply #3 on: August 01, 2015, 10:01:36 pm »
????

Blestan

Re: Trouble with .xml file - Invalid xml characters...
« Reply #4 on: August 01, 2015, 10:08:52 pm »
Here is a sample of what to do:
open infile
open outfile
alloc buffer
while not eof infile
  read buffer
  for i := 0 to sizeof(buffer) - 1 do if not (buffer[i] in [#32..'z']) then buffer[i] := #32;
  write buffer
end while
This will give you at least 29x faster code :))
because your code parses the same line (buffer) 29 times.
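
The pseudocode above could be written out roughly as follows. This is only a sketch under assumptions: the names FilterFile and BlankOutOfRange and the 64 KB chunk size are made up for illustration, and the file names are the ones from the first post. Note that the range #32..'z', taken literally, also blanks tabs, line breaks and the characters { | } ~, so the set may need widening.

```pascal
program FilterBuffer;
{$mode objfpc}{$H+}

uses
  Classes, SysUtils;

const
  BufSize = 64 * 1024; // hypothetical chunk size

var
  Buf: array[0..BufSize - 1] of Byte;

// Replace every byte outside #32..'z' with a space, in place.
procedure BlankOutOfRange(var B: array of Byte; Count: Integer);
var
  i: Integer;
begin
  for i := 0 to Count - 1 do
    if not (Chr(B[i]) in [#32..'z']) then
      B[i] := 32;
end;

// Copy InName to OutName in chunks, scanning each chunk exactly once.
procedure FilterFile(const InName, OutName: string);
var
  InStream, OutStream: TFileStream;
  BytesRead: Integer;
begin
  InStream := TFileStream.Create(InName, fmOpenRead or fmShareDenyWrite);
  try
    OutStream := TFileStream.Create(OutName, fmCreate);
    try
      repeat
        BytesRead := InStream.Read(Buf, SizeOf(Buf));
        if BytesRead > 0 then
        begin
          BlankOutOfRange(Buf, BytesRead);
          OutStream.WriteBuffer(Buf, BytesRead);
        end;
      until BytesRead = 0;
    finally
      OutStream.Free;
    end;
  finally
    InStream.Free;
  end;
end;

begin
  if FileExists('test.xml') then
    FilterFile('test.xml', 'NewTest.xml');
end.
```

Because it works on raw bytes, this version never splits the input into lines, which is where the single pass per buffer comes from.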
« Last Edit: August 01, 2015, 10:12:14 pm by Blestan »

Robert.Thompson

Re: Trouble with .xml file - Invalid xml characters...
« Reply #5 on: August 02, 2015, 12:41:57 am »
Thanks Blestan! :)

I'll give it a try. I am not too worried about the speed, but I am very worried about removing all the *** non utf8 *** characters that are living in the .xml file.

Thanks again for your time,

Robert.Thompson

Re: Trouble with .xml file - Invalid xml characters...
« Reply #6 on: August 12, 2015, 05:26:57 pm »
Here is what I finally did:

<snip>
while not EOF(InFile) do
  begin
    readln(InFile, LineRead);

    for i := 0 to 12 do
    begin
      LineRead := StringReplace(LineRead, chr(i), '', [rfReplaceAll, rfIgnoreCase]);
    end;

    for i := 14 to 31 do
    begin
      LineRead := StringReplace(LineRead, chr(i), '', [rfReplaceAll, rfIgnoreCase]);
    end;

    for i := 128 to 255 do
    begin
      LineRead := StringReplace(LineRead, chr(i), '', [rfReplaceAll, rfIgnoreCase]);
    end;

    writeln(OutFile, LineRead);
  end;
<snip>

Rob.
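
For anyone finding this thread later, the three loops can probably be folded into a single pass over each line. The sketch below is an assumption-laden alternative, not the code used above: StripInvalidXmlChars and its KeepHighBytes parameter are hypothetical names. It follows the XML 1.0 rule that, within the single-byte range, #9, #10, #13 and everything from #32 upward are allowed, so unlike the loops above it keeps tabs. Passing False for KeepHighBytes drops #128..#255 as the loops do; passing True leaves them alone, which matters if the file is UTF-8 and those bytes belong to multi-byte characters.

```pascal
program StripXmlChars;
{$mode objfpc}{$H+}

uses
  SysUtils;

// Hypothetical helper: returns S without the bytes that XML 1.0 forbids.
// Allowed here: #9, #10, #13 and #32..#127; #128..#255 only if KeepHighBytes.
function StripInvalidXmlChars(const S: string; KeepHighBytes: Boolean): string;
var
  i, n: Integer;
  c: Char;
begin
  Result := '';
  SetLength(Result, Length(S)); // upper bound, trimmed after the loop
  n := 0;
  for i := 1 to Length(S) do
  begin
    c := S[i];
    if (c in [#9, #10, #13, #32..#127]) or (KeepHighBytes and (c >= #128)) then
    begin
      Inc(n);
      Result[n] := c; // copy the allowed byte, skip everything else
    end;
  end;
  SetLength(Result, n);
end;

begin
  // Small demonstration: #1 and #31 are dropped, the tab (#9) is kept.
  WriteLn(StripInvalidXmlChars('<a>'#1'x'#9'y'#31'</a>', False));
end.
```

Inside the existing read loop this would be used as LineRead := StripInvalidXmlChars(LineRead, False); before the writeln, so each line is scanned once no matter how many characters are on the forbidden list.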
