@Handoko, many thanks for the very pretty and sophisticated piece of code! I will try it; it will no doubt prove to be a great source of good ideas and inspiration.
I have no doubt that, as both @MarkML and @440bx have correctly pointed out, a full-fledged parser is the best way of dealing with this matter. But for immediate purposes, a fast hack as initially provided by @Winni (or perhaps even the more sophisticated, but still not awfully complex, routines provided by @Handoko) will have to suffice.
Perhaps this is the time to mention that the file of descriptors I first presented when I opened this thread is part of a format called DELTA ("DEscription Language for TAxonomy"), developed in Australia around 1980 for the representation and processing of biological taxonomic data (see www.delta-intkey.com) and still widely used today by the academic community of zoologists and botanists. The DELTA format is much more complex (but also more flexible!) than just the descriptor file I provided. To begin with, a complete DELTA dataset consists of three separate files. Besides the descriptors file I have already provided, there is also a "data matrix" file, scoring objects for each descriptor present in the descriptors file, as follows:
# person 1/
1,1-2 2,5 3,- 4,- 5,3 6<an> 7,1 8,7
# person 2/
1,1-2 2,5 3,400 4,- 5,3 6<an> 7,1 8,7
# person 3/
1,2-4 2,1 3,23 4,23-45 5,3 6<adsf> 7,1 8,7
# person 4/
1,2 2,2 3,67 4,12.23-23 5,50 6<COBOL> 7,1 8,7
# person 5/
1,2-6 2,4 3,20-70 4,1.23-3 5,21 6<COBOL> 7,1 8,7
# person 6/
1,2/3 2,3 3,50-100 4,4-5 5,50 6<COBOL> 7,1 8,7
# person 7/
1,2-3 2,1 3,60 4,100-1223 5,5 6<COBOL> 7,1 8,7
# person 8/
1,2-3 2,4 3,23 4,1400 5,50 6<COBOL> 7,1 8,7
and a third file containing metadata, as follows:
*SHOW ~ Dataset specifications.
*DATA BUFFER SIZE 4000
*NUMBER OF CHARACTERS 9
*MAXIMUM NUMBER OF STATES 10
*MAXIMUM NUMBER OF ITEMS 8
*CHARACTER TYPES 2,OM 3,IN 4,RN 5,IN 6,TE 9,OM
*NUMBERS OF STATES 1,8 2,5 8,10
As can be seen in this last file, the DELTA format also includes a command language, with hundreds of commands organized in a hierarchy of precedence of execution. And there is much, much more...
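For the handful of simple directives shown above, a quick reader could be sketched as follows. This is only an illustration in Python, not a real DELTA parser: the function names (parse_directive, parse_pairs, load_specs) are made up for this example, and the rule used to separate a directive name from its arguments (the leading all-uppercase words form the name) is an assumption that happens to fit this sample. It does not attempt to cover the full command language.

```python
# Sample taken from the metadata file quoted above.
SPECS = """\
*SHOW ~ Dataset specifications.
*DATA BUFFER SIZE 4000
*NUMBER OF CHARACTERS 9
*MAXIMUM NUMBER OF STATES 10
*MAXIMUM NUMBER OF ITEMS 8
*CHARACTER TYPES 2,OM 3,IN 4,RN 5,IN 6,TE 9,OM
*NUMBERS OF STATES 1,8 2,5 8,10
"""


def parse_directive(line):
    """Split '*NAME WORDS args...' into (name, args).

    Assumption: the name is the leading run of purely alphabetic,
    all-uppercase words; everything after that is the argument string.
    """
    if not line.startswith("*"):
        raise ValueError("not a directive: %r" % line)
    tokens = line[1:].split()
    n = 0
    while n < len(tokens) and tokens[n].isalpha() and tokens[n].isupper():
        n += 1
    return " ".join(tokens[:n]), " ".join(tokens[n:])


def parse_pairs(args):
    """Turn '2,OM 3,IN' into {2: 'OM', 3: 'IN'}.

    Only the plain 'number,value' form seen in the sample is handled.
    """
    result = {}
    for token in args.split():
        number, _, value = token.partition(",")
        result[int(number)] = value
    return result


def load_specs(text):
    """Return a dict mapping each directive name to its raw argument string."""
    specs = {}
    for line in text.splitlines():
        line = line.strip()
        if line:
            name, args = parse_directive(line)
            specs[name] = args
    return specs


specs = load_specs(SPECS)
character_types = parse_pairs(specs["CHARACTER TYPES"])
states = parse_pairs(specs["NUMBERS OF STATES"])
```

Keeping the arguments as raw strings until a specific directive is needed (as with parse_pairs here) keeps the hack small: each directive can get its own tiny interpreter only when a prototype actually requires it.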
It happens that writing a decent parser for the DELTA format has proven to be very hard and time-consuming. The original programs for handling the DELTA format were written in FORTRAN (and are still in use today), with the most recent versions being written in Java. Over the years, some have attempted to develop general-purpose parsers for use as software libraries, with attempts being made in C++, Python, and Pascal/Delphi (see freedelta.sourceforge.net). I myself wrote a large library (with thousands of lines of code) for reading DELTA files in Pascal/Delphi around 1996-1998. This was back before this forum (and StackOverflow!), and therefore my implementation was not necessarily the best one. I then gave up and spent the last ten years using Python, but for many reasons outside the scope of this discussion, I have recently decided to return to the good old world of Pascal (especially given the maturity of such a superb free, cross-platform development tool as Lazarus). But instead of simply using my old library "as-is", I would like to both "modernize" and simplify it, adopting at the same time a more "minimalist" approach (perhaps this is a side effect of spending ten years with Python!), if possible.
So far, the best existing DELTA parser is the C++ version. It would be tempting to translate it into Pascal, or perhaps to turn the current static library into a dynamic one (that is, a Windows DLL or Linux shared library) for use in FPC/Lazarus applications.
That is all to say that properly working along the lines suggested by @MarkML and @440bx would be a huge task, one that is unfortunately beyond my available time and resources. The advantage of such fast hacks as that provided by @Winni is that they can be used for the development of prototypes and "proof-of-concept" applications which hopefully may, over time, evolve into more mature software.
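To give an idea of what such a fast hack could look like for the data matrix file shown earlier, here is a minimal sketch in Python. The names (ATTRIBUTE, parse_items, SAMPLE) are invented for this example, and it assumes only the layout visible in my sample: an item line of the form "# name/" followed by attributes written as "number,value" or "number<text>". Values are kept as raw strings, since how to read something like 2-4 or 2/3 depends on the character type.

```python
import re

# Two items copied from the data matrix sample above.
SAMPLE = """\
# person 3/
1,2-4 2,1 3,23 4,23-45 5,3 6<adsf> 7,1 8,7
# person 6/
1,2/3 2,3 3,50-100 4,4-5 5,50 6<COBOL> 7,1 8,7
"""

# One attribute: a character number, an optional <text>, an optional ,value.
ATTRIBUTE = re.compile(r"(\d+)(?:<([^>]*)>)?(?:,(\S+))?")


def parse_items(text):
    """Return {item name: {character number: raw value string}}."""
    items = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # '# person 3/' -> 'person 3'
            current = line[1:].rstrip("/").strip()
            items[current] = {}
        elif current is not None:
            for match in ATTRIBUTE.finditer(line):
                number, text_value, value = match.groups()
                # Store the <text> only when no ,value is present;
                # if both occur, this hack keeps the value and drops the text.
                items[current][int(number)] = (
                    text_value if value is None else value
                )
    return items


items = parse_items(SAMPLE)
```

With this, items["person 3"][4] holds the string "23-45" and items["person 6"][6] holds "COBOL". Turning those strings into ranges, alternatives or numbers is exactly the part where a proper parser, driven by the character types in the metadata file, would have to take over.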
Thank you all very much!