Author Topic: Best way to parse file.  (Read 3867 times)

BSaidus

  • Sr. Member
  • ****
  • Posts: 453
  • lazarus 1.8.4 Win8.1 / cross FreeBSD
Best way to parse file.
« on: January 26, 2023, 12:21:20 pm »
Hello.
Given this simple file content (the output of the Unix command netstat -an -f inet):
Code: Text
Name    Mtu   Network     Address              Ipkts Ifail    Opkts Ofail Colls
lo0     32768 <Link>                               0     0        0     0     0
lo0     32768 ::1/128     ::1                      0     0        0     0     0
lo0     32768 fe80::%lo0/ fe80::1%lo0              0     0        0     0     0
lo0     32768 127/8       127.0.0.1                0     0        0     0     0
pcn0    1500  <Link>      08:00:27:01:c1:ab        0     0        1     0     0
pcn0    1500  192.168.1/2 192.168.1.254            0     0        1     0     0
pcn1*   1500  <Link>      08:00:27:99:f5:80        0     0        0     0     0
em0*    1500  <Link>      08:00:27:17:f0:0f        0     0        0     0     0
em1     1500  <Link>      08:00:27:00:0c:c8        4     0        9     0     0
em1     1500  10.0.5/24   10.0.5.15                4     0        9     0     0
enc0*   0     <Link>                               0     0        0     0     0
pflog0  33136 <Link>                               0     0        0     0     0

I wonder what the best way is to parse this file into a well-organized array of records of this type:

Code: Pascal
type
  if_net = record
    Name,
    Mtu,
    Network,
    Address,
    Ipkts,
    Ifail,
    Opkts,
    Ofail,
    Colls : String;
  end;

Thank you.
« Last Edit: January 26, 2023, 12:25:07 pm by BSaidus »
lazarus 1.8.4 Win8.1 / cross FreeBSD
dhukmucmur vernadh!

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 10584
  • FPC developer.
Re: Best way to parse file.
« Reply #1 on: January 26, 2023, 12:34:23 pm »
That command doesn't work for me. (on debian 11 and 12, https://bugs.launchpad.net/ubuntu/+source/net-tools/+bug/1915903)

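But I would probably do something like:

Code:
{$mode delphi}
uses classes, sysutils;

type
  if_net = record  // the record type from the first post
    Name, Mtu, Network, Address,
    Ipkts, Ifail, Opkts, Ofail, Colls : string;
  end;

var n : TStringArray;  // the array type that Split returns
    s : string;
 anet : if_net;
begin
  // one line of the netstat output
  s := 'pcn0    1500  <Link>      08:00:27:01:c1:ab        0     0        1     0     0';
  n := s.split([' ', #9]);
  if length(n) > 1 then
  begin
    anet.Name := n[0];
    anet.Mtu := n[1];  // etc etc.; runs of spaces give empty entries, see below
  end;
end.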

It might need some fiddling with the split() options (see manual) to get the behaviour for empty fields right, but I can't test due to the bug.

Alternately, you could make the record an array of string, and use properties like

Code:
  property name : string read strarr[0] write strarr[0];

etc to have human readable names for the fields.

wp

  • Hero Member
  • *****
  • Posts: 10663
Re: Best way to parse file.
« Reply #2 on: January 26, 2023, 02:50:30 pm »
Probably not the best, but the first one which came to my mind: Read the file into a stringlist and then split the lines at the fixed positions given by the start of the header columns by means of the good-old Copy command.
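
A minimal sketch of this fixed-position idea, assuming the column positions can be taken from where the header words start and that no value is wider than its column. HeaderStarts, Column and ParseLine are made-up helper names, and the sample lines are copied from the first post; a real program would load the saved output with LoadFromFile instead.

```pascal
program FixedColumns;
{$mode delphi}

uses
  Classes, SysUtils;

type
  if_net = record
    Name, Mtu, Network, Address,
    Ipkts, Ifail, Opkts, Ofail, Colls: string;
  end;
  TColStarts = array[0..8] of Integer;

// 1-based start position of each of the nine header words
function HeaderStarts(const Header: string): TColStarts;
var
  i, Col: Integer;
begin
  for Col := 0 to 8 do
    Result[Col] := Length(Header) + 1;
  Col := 0;
  for i := 1 to Length(Header) do
    if (Col <= 8) and (Header[i] <> ' ') and
       ((i = 1) or (Header[i - 1] = ' ')) then
    begin
      Result[Col] := i;
      Inc(Col);
    end;
end;

// cut column Idx out of Line with Copy and trim the padding
function Column(const Line: string; const Starts: TColStarts; Idx: Integer): string;
var
  Count: Integer;
begin
  if Idx < 8 then
    Count := Starts[Idx + 1] - Starts[Idx]
  else
    Count := Length(Line);  // last column runs to the end of the line
  Result := Trim(Copy(Line, Starts[Idx], Count));
end;

function ParseLine(const Line: string; const Starts: TColStarts): if_net;
begin
  Result.Name    := Column(Line, Starts, 0);
  Result.Mtu     := Column(Line, Starts, 1);
  Result.Network := Column(Line, Starts, 2);
  Result.Address := Column(Line, Starts, 3);
  Result.Ipkts   := Column(Line, Starts, 4);
  Result.Ifail   := Column(Line, Starts, 5);
  Result.Opkts   := Column(Line, Starts, 6);
  Result.Ofail   := Column(Line, Starts, 7);
  Result.Colls   := Column(Line, Starts, 8);
end;

var
  Lines: TStringList;
  Starts: TColStarts;
  Nets: array of if_net;
  i: Integer;
begin
  Lines := TStringList.Create;
  try
    Lines.Add('Name    Mtu   Network     Address              Ipkts Ifail    Opkts Ofail Colls');
    Lines.Add('lo0     32768 <Link>                               0     0        0     0     0');
    Lines.Add('pcn0    1500  192.168.1/2 192.168.1.254            0     0        1     0     0');
    Starts := HeaderStarts(Lines[0]);
    SetLength(Nets, Lines.Count - 1);
    for i := 1 to Lines.Count - 1 do      // line 0 is the header
      Nets[i - 1] := ParseLine(Lines[i], Starts);
    for i := 0 to High(Nets) do
      WriteLn(Nets[i].Name, ' | ', Nets[i].Address, ' | ', Nets[i].Opkts);
  finally
    Lines.Free;
  end;
end.
```

An empty Address simply comes out as an empty string here, because the slice for that column contains only spaces. The weak spot is a value that overflows its column (a long IPv6 address or a large packet counter), which this sketch does not handle.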

Warfley

  • Hero Member
  • *****
  • Posts: 1075
Re: Best way to parse file.
« Reply #3 on: January 26, 2023, 03:05:26 pm »
You can just split the string:
Code: Pascal
// needs SysUtils (Split) and StrUtils (IfThen) in the uses clause
function read_config(const line: String): if_net;
var
  parts: TStringArray;
  has_address: boolean;
begin
  parts := line.split([' ', #9], TStringSplitOptions.ExcludeEmpty);
  // 9 parts means the optional Address column is present
  has_address := length(parts) = 9;
  With Result do
  begin
    Name := parts[0];
    Mtu := parts[1];
    Network := parts[2];
    // very lazy way to only load address if the boolean is true
    Address := ifthen(has_address, parts[3], '');
    // ord(boolean) = 1 if true, 0 if false, so it will be offset by 1 if there is the address
    Ipkts := parts[3 + ord(has_address)];
    Ifail := parts[4 + ord(has_address)];
    Opkts := parts[5 + ord(has_address)];
    Ofail := parts[6 + ord(has_address)];
    Colls := parts[7 + ord(has_address)];
  end;
end;
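
For completeness, a small sketch of how such a function could be used to fill the array of records asked for in the first post. It repeats read_config with the same logic so that the program is self-contained. The sample lines are copied from the first post, and the check that skips blank lines is an addition that is not in the function above.

```pascal
program SplitDemo;
{$mode delphi}

uses
  Classes, SysUtils, StrUtils;

type
  if_net = record
    Name, Mtu, Network, Address,
    Ipkts, Ifail, Opkts, Ofail, Colls: string;
  end;

// split on blanks; 9 parts means the optional Address column is present
function read_config(const line: string): if_net;
var
  parts: TStringArray;
  has_address: Boolean;
begin
  parts := line.Split([' ', #9], TStringSplitOptions.ExcludeEmpty);
  has_address := Length(parts) = 9;
  Result.Name    := parts[0];
  Result.Mtu     := parts[1];
  Result.Network := parts[2];
  Result.Address := IfThen(has_address, parts[3], '');
  Result.Ipkts   := parts[3 + Ord(has_address)];
  Result.Ifail   := parts[4 + Ord(has_address)];
  Result.Opkts   := parts[5 + Ord(has_address)];
  Result.Ofail   := parts[6 + Ord(has_address)];
  Result.Colls   := parts[7 + Ord(has_address)];
end;

var
  Lines: TStringList;
  Nets: array of if_net;
  i: Integer;
begin
  Lines := TStringList.Create;
  try
    // a real program would use Lines.LoadFromFile on the saved output
    Lines.Add('Name    Mtu   Network     Address              Ipkts Ifail    Opkts Ofail Colls');
    Lines.Add('lo0     32768 <Link>                               0     0        0     0     0');
    Lines.Add('em1     1500  10.0.5/24   10.0.5.15                4     0        9     0     0');
    Lines.Add('');
    for i := 1 to Lines.Count - 1 do       // line 0 is the header
    begin
      if Trim(Lines[i]) = '' then
        Continue;                          // blank lines would break read_config
      SetLength(Nets, Length(Nets) + 1);
      Nets[High(Nets)] := read_config(Lines[i]);
    end;
    for i := 0 to High(Nets) do
      WriteLn(Nets[i].Name, ' | ', Nets[i].Address, ' | ', Nets[i].Ipkts);
  finally
    Lines.Free;
  end;
end.
```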

Zvoni

  • Hero Member
  • *****
  • Posts: 1684
Re: Best way to parse file.
« Reply #4 on: January 26, 2023, 03:50:06 pm »
Quote
parts := line.split([' ', #9], TStringSplitOptions.ExcludeEmpty);
Eh? In this Case: NO!
Column Address can be empty, it would mix up the count of columns if you Exclude Empty
Though I see that you adjusted for that. But IMO it's way too convoluted.

I'm with wp's approach with the StringList
« Last Edit: January 26, 2023, 03:51:44 pm by Zvoni »
One System to rule them all, One Code to find them,
One IDE to bring them all, and to the Framework bind them,
in the Land of Redmond, where the Windows lie
---------------------------------------------------------------------
Code is like a joke: If you have to explain it, it's bad

Warfley

Re: Best way to parse file.
« Reply #5 on: January 26, 2023, 03:53:06 pm »
Quote
parts := line.split([' ', #9], TStringSplitOptions.ExcludeEmpty);
Eh? In this Case: NO!
Column Address can be empty, it would mix up the count of columns if you Exclude Empty

I'm with wp's approach with the StringList
Yes, this is accounted for in the code:
Code: Pascal
  has_address := length(parts) = 9;
...
    // very lazy way to only load address if the boolean is true
    Address := ifthen(has_address, parts[3], '');
    // ord(boolean) = 1 if true, 0 if false, so it will be offset by 1 if there is the address
    Ipkts := parts[3 + ord(has_address)];
If the address is there, it is loaded into Address and all the following lookups are offset by one. If it isn't there, Address is set to the empty string ('') and the following fields are read from index 3 onwards.

As long as only Address can be missing, this is probably the easiest way to parse this data, and it takes only about 1/3 of the lines of code that are needed with the stringlist approach.
« Last Edit: January 26, 2023, 03:57:31 pm by Warfley »

BSaidus

Re: Best way to parse file.
« Reply #6 on: January 26, 2023, 04:32:52 pm »
What do you think about using RegExpr?
I'm at work now, I will give you feedback soon.
(Anyone who can help with the RegEx is welcome.)
  For the 1st line this regex works well.
Code: Pascal
^(\w+)\s+(\d+)\s+(\W\w+\W)(\w+|\s+)(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)
// works well for this line
lo0     32768 <Link>                               0     0        0     0     0
// but not for this line :
lo0     32768 fe80::%lo0/ fe80::1%lo0              0     0        0     0     0

Warfley

Re: Best way to parse file.
« Reply #7 on: January 26, 2023, 04:44:35 pm »
Quote
What do you think about using RegExpr?
I'm at work now, I will give you feedback soon.
(Anyone who can help with the RegEx is welcome.)
  For the 1st line this regex works well.
Code: Pascal
^(\w+)\s+(\d+)\s+(\W\w+\W)(\w+|\s+)(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)
// works well for this line
lo0     32768 <Link>                               0     0        0     0     0
// but not for this line :
lo0     32768 fe80::%lo0/ fe80::1%lo0              0     0        0     0     0
\w is not what you think it is. \w only matches "word" characters (letters, digits and the underscore), and some characters in your data are not word characters (/ for example).
But an alternative is much simpler, just match spaces vs non-spaces:
Code: Pascal
^([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)?\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)
You have a sequence of non-spaces followed by a sequence of spaces, repeated 4 times, where the last non-space sequence is optional, and then a sequence of spaces followed by a sequence of digits, repeated 5 times.

You can even make this shorter with the repeat syntax:
Code: Pascal
^(([^\s]+)\s+){3,4}(\s+(\d+)){5}
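
One caveat on the short form: a capture group inside a repeat only keeps its last repetition, so it is fine for checking that a line matches but not for pulling out every field. For filling the record, the long pattern is the one to use. A minimal sketch with the TRegExpr class from FPC's RegExpr unit follows; ParseLine is a made-up helper name, and this has not been compiled against every FPC version.

```pascal
program RegexDemo;
{$mode delphi}

uses
  SysUtils, RegExpr;

type
  if_net = record
    Name, Mtu, Network, Address,
    Ipkts, Ifail, Opkts, Ofail, Colls: string;
  end;

const
  // the long pattern: four text columns (the 4th optional), then five numbers
  Pattern = '^([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)?\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)';

// returns False when the line does not look like a table row
function ParseLine(const Line: string; out Rec: if_net): Boolean;
var
  Re: TRegExpr;
begin
  Re := TRegExpr.Create;
  try
    Re.Expression := Pattern;
    Result := Re.Exec(Line);
    if Result then
    begin
      Rec.Name    := Re.Match[1];
      Rec.Mtu     := Re.Match[2];
      Rec.Network := Re.Match[3];
      Rec.Address := Re.Match[4];  // empty when the optional group did not match
      Rec.Ipkts   := Re.Match[5];
      Rec.Ifail   := Re.Match[6];
      Rec.Opkts   := Re.Match[7];
      Rec.Ofail   := Re.Match[8];
      Rec.Colls   := Re.Match[9];
    end;
  finally
    Re.Free;
  end;
end;

var
  R: if_net;
begin
  if ParseLine('lo0     32768 fe80::%lo0/ fe80::1%lo0              0     0        0     0     0', R) then
    WriteLn(R.Name, ' | ', R.Network, ' | ', R.Address);
  if ParseLine('lo0     32768 <Link>                               0     0        0     0     0', R) then
    WriteLn(R.Name, ' | ', R.Network, ' | [', R.Address, ']');
  // the header line has no numeric columns, so it is rejected
  if not ParseLine('Name    Mtu   Network     Address              Ipkts Ifail    Opkts Ofail Colls', R) then
    WriteLn('header skipped');
end.
```

Creating the TRegExpr object once and reusing it for all lines would avoid recompiling the pattern for every row; it is created per call here only to keep the sketch short.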

BSaidus

Re: Best way to parse file.
« Reply #8 on: January 26, 2023, 04:54:00 pm »
@Warfley Thank you,  ;D you're the man.
Thank you all for the help.
This is good for me, I'll proceed with RegEx.
Thanks.

Warfley

Re: Best way to parse file.
« Reply #9 on: January 26, 2023, 09:14:30 pm »
Just for fun, and because I am currently working with GOLD and wanted to try a few things out, I've made a GOLD grammar for building an LALR parser for reading this file:
Code: C
"Start Symbol" = <Table>

{ItemChar} = {Printable} - {Whitespace}
{NoNu} = {ItemChar} - {Number}

{Whitespace Ch} = {Whitespace} - {CR} - {LF}

Whitespace = {Whitespace Ch}+
Newline = {CR}{LF} | {CR} | {LF}
DataField = {ItemChar}*{NoNu}{ItemChar}*
NumberField = {Number}+

<Table> ::= <Header> NewLine <TableEntries>

<Header> ::= 'Name' 'Mtu' 'Network' 'Address' 'Ipkts' 'Ifail' 'Opkts' 'Ofail' 'Colls'

<TableEntries> ::= <TableEntry> NewLine <TableEntries>
                |

<TableEntry> ::= <Name> <MTU> <Network> <Address> <IPkts> <IFail> <OPkts> <Ofail> <Colls>

<Name> ::= DataField

<MTU> ::= NumberField

<Network> ::= DataField

<Address> ::= DataField
           |

<IPkts> ::= NumberField

<IFail> ::= NumberField

<OPkts> ::= NumberField

<Ofail> ::= NumberField

<Colls> ::= NumberField

So if you want to completely overengineer this thing, you can use this :P

Kays

  • Sr. Member
  • ****
  • Posts: 494
  • Whasup!?
    • KaiBurghardt.de
Re: Best way to parse file.
« Reply #10 on: January 27, 2023, 12:56:50 am »
Quote
I wonder which is the best way to parse this file in order to get in organized array of record type:
If it is current information retrieved from the same host machine your program is running:

The best way is not to parse it at all. The data you want as a record structure is already there in memory – albeit as numbers, not strings. netstat “simply” converts the numbers into human-readable form, i.e. (primarily) for consumption by humans.

Calling truss reveals what’s going on there:
Code: Bash
truss netstat -an -f inet
In particular there is the system call
Code: Text
__sysctlbyname("net.inet.tcp.pcblist" […]
I recommend retrieving the net.inet.*.pcblist system control values yourself and reading the data as they are already present.

Now, I’ll need to check out the netstat source code, too, and I admit it’s not as easy as reading a bunch of strings, but it’s definitely the best way to obtain said data.
Yours Sincerely
Kai Burghardt

440bx

  • Hero Member
  • *****
  • Posts: 3326
Re: Best way to parse file.
« Reply #11 on: January 27, 2023, 08:32:41 am »
if you don't mind writing code that actually parses the input, it could be done by parsing only the title line and saving the offsets to the start/end of the field (depending on the field.)

In the example you posted, the 0-based offsets are 0, 8, 14, 26 for the first 4 fields (start of the field) and 51, 57, 66, 72, 79 for the last 5 (end of the field). The following presumes those field offsets do not change between lines.

Save those offsets and the pointer to the first field (which is what follows the title's line terminator.) Note also that each line seems to be constant length and appears to be the title length. 

Now you have the offsets to each field.

The first 4 fields end at the first space found on or after the offset, replace that location with a null and now you have a null terminated string whose pointer you can save. The last 5 fields end at the next character (replace that space with a null) and start one character after the first space that precedes them (scan backwards for the first space and save the address of the next character).

IOW, instead of having a record of string, you have a record of pointers to null terminated strings.  It may sound a bit complicated but it's actually quite simple and it will be as fast as possible because parsing is only done once (at least for the first 4 fields) and values are not moved around in memory (pointer to existing values are saved instead.)

Lastly, a comment stating that this is the approach being used makes the "algorithm" quite obvious. I don't know if that is the best way to parse the file, but it definitely is one of the fastest (if not the fastest).
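
A rough sketch of the steps above, under these assumptions: the offsets are computed from the header line (start of the word for the first 4 fields, end of the word for the last 5), every data line is at least as long as the header, and no field overflows into its neighbour. The names if_net_p, HeaderOffsets and SplitInPlace are invented for the example.

```pascal
program OffsetParse;
{$mode delphi}

uses
  SysUtils;

type
  // a record of pointers into the line buffer; Fields overlays the named pointers
  if_net_p = record
    case Boolean of
      False: (Name, Mtu, Network, Address,
              Ipkts, Ifail, Opkts, Ofail, Colls: PChar);
      True:  (Fields: array[0..8] of PChar);
  end;
  TOffsets = array[0..8] of Integer;  // 0-based: start for 0..3, end for 4..8

function HeaderOffsets(const Header: string): TOffsets;
var
  i, Col: Integer;
begin
  for Col := 0 to 8 do
    Result[Col] := 0;
  Col := 0;
  i := 1;
  while (i <= Length(Header)) and (Col <= 8) do
  begin
    if Header[i] <> ' ' then
    begin
      Result[Col] := i - 1;                 // start of the header word
      while (i < Length(Header)) and (Header[i + 1] <> ' ') do
        Inc(i);
      if Col >= 4 then
        Result[Col] := i - 1;               // right-aligned column: keep the end
      Inc(Col);
    end;
    Inc(i);
  end;
end;

// writes #0 terminators into Buf and stores pointers to the fields in Rec
procedure SplitInPlace(Buf: PChar; Len: Integer; const Ofs: TOffsets;
  out Rec: if_net_p);
var
  f, p: Integer;
begin
  for f := 0 to 3 do                        // left-aligned: scan forward
  begin
    p := Ofs[f];
    Rec.Fields[f] := Buf + p;
    while (p < Len) and (Buf[p] <> ' ') do
      Inc(p);
    if p < Len then
      Buf[p] := #0;                         // an empty field becomes ''
  end;
  for f := 4 to 8 do                        // right-aligned: scan backward
  begin
    p := Ofs[f];
    if p + 1 < Len then
      Buf[p + 1] := #0;
    while (p > 0) and (Buf[p - 1] <> ' ') and (Buf[p - 1] <> #0) do
      Dec(p);
    Rec.Fields[f] := Buf + p;
  end;
end;

var
  Lines: array[0..1] of string;
  Recs: array[0..1] of if_net_p;
  Ofs: TOffsets;
  i: Integer;
begin
  Ofs := HeaderOffsets('Name    Mtu   Network     Address              Ipkts Ifail    Opkts Ofail Colls');
  Lines[0] := 'lo0     32768 <Link>                               0     0        0     0     0';
  Lines[1] := 'pcn0    1500  192.168.1/2 192.168.1.254            0     0        1     0     0';
  for i := 0 to High(Lines) do
    if Length(Lines[i]) > Ofs[8] then
    begin
      UniqueString(Lines[i]);               // make sure the buffer is writable
      SplitInPlace(PChar(Lines[i]), Length(Lines[i]), Ofs, Recs[i]);
      WriteLn(Recs[i].Name, ' | ', Recs[i].Address, ' | ', Recs[i].Opkts);
    end;
end.
```

The pointers stay valid only as long as the line strings they point into are neither changed nor freed, which is the price for not copying any field values.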

HTH.


FPC v3.0.4 and Lazarus 1.8.2 on Windows 7 SP1 64bit.

BSaidus

Re: Best way to parse file.
« Reply #12 on: January 27, 2023, 03:49:49 pm »
Quote
That command doesn't work for me. (on debian 11 and 12, https://bugs.launchpad.net/ubuntu/+source/net-tools/+bug/1915903)
Hi, I executed the command on OpenBSD.


KodeZwerg

  • Hero Member
  • *****
  • Posts: 1215
  • Fifty shades of code.
    • Delphi & FreePascal
Re: Best way to parse file.
« Reply #13 on: January 27, 2023, 03:55:05 pm »
I think your way of doing this is just wrong.
When you cannot gather that information by code, what purpose does it have?
In my thinking, when I am not able to get information by code, I would simply display the console output somewhere in my app.
« Last Edit: Tomorrow at 31:76:97 by KodeZwerg »

Warfley

Re: Best way to parse file.
« Reply #14 on: January 27, 2023, 04:21:47 pm »
Quote
I think your way of doing this is just wrong.
When you cannot gather that information by code, what purpose does it have?
In my thinking, when I am not able to get information by code, I would simply display the console output somewhere in my app.
That's the Unix philosophy: instead of one app with all the functionality, you have multiple specialized programs that each do one thing very well and whose output can be reused by other programs. Just for comparison, the "ifconfig" program has 1000 lines of code. So when you need that functionality in your program, you can decide whether you want a few lines that call ifconfig, or whether you want to implement all of this yourself.

This is also very important for rights management. For example, to send ICMP requests you need root rights, but you might not want to give root rights to every application that needs this. So they call the system applications "ping", "traceroute", etc., which have the required rights but are much easier to audit for security issues, because these applications basically do just one thing. No user input or anything else.

Another example where this is useful is reading out hardware information, which can be quite annoying. Linux systems usually provide pseudofiles for this, but every distro might choose to put the pseudofiles into a different directory. Using the system tools provided (like ip addr, netstat, ifconfig, etc.) gives you the information in a uniform manner.

And the main advantage is that it is very simple to debug: as all of these programs give the data in both human- and machine-readable form, you can debug your APIs by simply looking at the program output.

So there are a lot of reasons to do this. It's one of the Windows diseases that Microsoft thought everything must be accessible through code APIs and DLLs whose calls must be implemented in each program that tries to use them. By having different programs provide the data in both human- and machine-readable form, it is much easier to get access to that data and to learn how to use it.

 
