Topic: read(input) skips first char & Questions about stdin, and reading stdin to array

inko

  • New member
  • *
  • Posts: 8
Questions about stdin, and reading stdin to arrays and types.

Code: Text
[TL;DR] How to read blocks of chars from standard input,
more than one char at a time, cast into byte, dword, qword?

I am using a Linux system. When reading stdin of char to array, the first char is never read. The script below demonstrates. It reads a stream of char from stdin into array, then writes the array. In every test output matches input except the first character of input is absent from output.

1. Why is this happening and what ways are there to fix it?

2. I would rather read the stdin in arrays with 4-8 chars per index converted to dword or qword. Reading and processing one stdin char at a time is EXTREMELY slow and is not acceptable performance.

3. I prefer to use standard pascal rather than complicated streaming functions or anyone else's units, unless the unit is very simple and small. I prefer the barebones parts of the language cobbled together to accomplish this rather than a sophisticated unit.

4. If there is a way to point blockread at stdin instead of a diskfile I'd like to know about it.

5. How do I speed up the reading of stdin? I tried it one char at a time and it was horribly, awfully, infuriatingly slow for large files piped to pascal stdin. I have python programs that could read and process the files faster than just reading this in and adding the byte values.

6. Writeln to string does not appear to be usable for this. All input will be converted to byte, dword, qword for hash and arithmetic operations.


Code: Pascal
program forum(input, output);

(* use this command to test

a=$(head -c 64 /dev/urandom | xxd -p); echo "$a"; echo ""; echo "$a" | ./forum *)

var
    r: char;
    s: array[1..64] of char;

begin

fillchar(s, sizeOf(s), #0);
reset(input); (* this makes no difference for missing first char *)
read(r, s);
writeln(s);

repeat

    while not EOF do
    begin

        read(r, s);
        writeln(s);
        fillchar(s, sizeOf(s), #0);
        if EOF then
            writeln('* DONE: end of file!');
    end;

until (EOF);

end.

I compiled as shown:

Code: Bash
fpc -O4 forum.pas


cli test output shows that first input char is missing from output:

Code: Bash
a=$(head -c 64 /dev/urandom | xxd -p); echo "$a"; echo ""; echo "$a" | ./forum

17a7cefd9535477a2c35063004eae027830822cd798fb7cbc60ededa53fc
eae2bcf380491ea07f9cd8fbb7d498325432de5ce9f71e908c75f70b8a8d
f57b8d8a

7a7cefd9535477a2c35063004eae027830822cd798fb7cbc60ededa53fc
eae2bcf380491ea07f9cd8fbb7d498325432de5ce9f71e908c75f70b8a8d
f57b8d8a
« Last Edit: October 26, 2020, 08:03:19 am by inko »

bytebites

  • Hero Member
  • *****
  • Posts: 625
The first char is not missing since you have read it into variable r.

Code: Pascal
read(r, s);

inko

  • New member
  • *
  • Posts: 8
Quote
The first char is not missing since you have read it into variable r.

Code: Pascal
read(r, s);

Why does the first char not output with the test output?

UPDATE: I found that for some reason the first read requires me to write out both r,s and that reveals the first character in r (which I thought should be array).

Code: Pascal
fillchar(s, sizeOf(s), #0);
reset(input); (* this makes no difference for missing first char *)
read(r, s); (* this fixes it *)
writeln(r, s);
« Last Edit: October 26, 2020, 08:13:31 am by inko »

bytebites

  • Hero Member
  • *****
  • Posts: 625
Code: Pascal
writeln(r,s);

inko

  • New member
  • *
  • Posts: 8
Thank you. I swear I just figured it out right as you were posting the solution ;)

The question is: why is r a single character there, and why does it only behave as an array later? What am I missing?

bytebites

  • Hero Member
  • *****
  • Posts: 625
Where does it behave as an array later?

inko

  • New member
  • *
  • Posts: 8
On the first read(r, s) it reads the first char into var r, and the other 63 chars into array s. After that every read of var r populates all 64 indexes of array s. Or at least that's what appears to be happening.

inko

  • New member
  • *
  • Posts: 8
After a mind-numbing amount of searching, browsing Stack Exchange and other help sites I have turned up virtually nothing about how the raw binary input of a pascal program works and how to manipulate the reading of it more than one input char at a time.

IMHO this should not be. This is one of the most basic things of programming, dealing with an input buffer to a program, yet I can't find any real useful information.

I have written, tested, and analyzed an infuriating number of small programs, hexdumped their outputs for analysis, and depending on the weather it seems that pascal's behavior changes randomly based on how I try to read the input, causing a spiral of more analyzing.

I have a hash algorithm that takes arrays of DWORD or QWORD and iterates through the arrays to work on them. It needs to read all the remaining stream data to the array, including line breaks, until there is no more data to read. I want to write a pascal version of the algorithm. What needs to happen here is the input from 'cat' or 'pipe' needs to feed to the pascal program in blocks to the array.

Then the pascal program needs to take that incoming input, and grab it one array (block?) at a time and feed it to the hash machine. This is trivial with python and bash. What am I missing about pascal?

I have already completed the pascal version of the hash machine. It works flawlessly on files on disk. All its test vectors and file copy tests check out and it is faster than any hash algorithm I can find in linux repos. It is faster than md5sum, all the sha hashes, b2sum, and many others, thanks to pascal compiler optimizing the loops. But I can't cat pipe data to the machine and hash the incoming data, which is necessary to have since lots of programmers have situations in which they do just that procedure with a hash program.

I know how to read one input char at a time and convert it to bytes and words and build the array that way. And that is useless unless we want to take ten times as long to hash a 4 gb file.

What can I try? What am I missing?

lucamar

  • Hero Member
  • *****
  • Posts: 4219
Your construction is entirely equivalent to doing:

Code: Pascal
read(r); {read first char}
read(s); {read next chars}

What you're probably not taking into account is that read() stops at (but doesn't transfer) the line end, so the second (and further) read(r) will read that line end (which by a lucky coincidence is a single char in *nixen), so you then read into s the full next 64 chars, instead of 63 like the first time.

This almost totally equivalent program:
Code: Pascal
program project1;
var
  r: char;
  s: array[1..64] of char;
begin
  while not EOF do
  begin
    fillchar(s, sizeOf(s), #0);
    read(r, s);
    if r in [#00..#31] then
      writeln('r=#', Ord(r), ' s=', s)
    else
      writeln('r=', r, ' s=', s);
  end;
  writeln('* DONE: end of file!');
end.

produces this output which demonstrates it:

Code:
lucamar@Diana$ a=$(head -c 64 /dev/urandom | xxd -p); echo "$a"; echo ""; echo "$a" | ./project1
4d771918afc5d672a44fba18c5c0728f8693c5831e3cf640b5976ae96c36
f6ac78f1c8bedeef6860ff25495269780a424c2c70abe702b194fe31c271
21a9b63e

r=4   s=d771918afc5d672a44fba18c5c0728f8693c5831e3cf640b5976ae96c36
r=#10 s=f6ac78f1c8bedeef6860ff25495269780a424c2c70abe702b194fe31c271
r=#10 s=21a9b63e
r=#10 s=
* DONE: end of file!

All your program really needs, then, is this:

Code: Pascal
program project1;
var
  s: array[1..64] of char;
begin
  while not EOF do begin
    fillchar(s, sizeOf(s), #0);
    read(s);
    readln;
    { Do whatever with s }
  end;
  writeln('* DONE: end of file!');
end.

ETA: You should always remember that both Input and Output (and ErrorOut) are Text files, so they adhere to the conventions of that type.
« Last Edit: October 27, 2020, 12:05:58 pm by lucamar »

inko

  • New member
  • *
  • Posts: 8
Quote
You should always remember that both Input and Output (and ErrorOut) are Text files, so they adhere to the conventions of that type.

Thanks for the clarification. The example code and this statement spurred me to look more under the hood. I have some observations and more questions.

read() on stdin is geared to read up to a LineBreak character. This is expected behavior for a console application reading user input. (duh) This is very, very bad for reading streams from piped standard input. The way read() interprets and drops control, tab, and line feed characters is totally incompatible with reading piped standard input, unless your only objective is human-readable text. But I'm looking at reading untyped binary input, where all the bytes are crucial.

Because read() only reads up to LineBreak, casting it to an array leaves arrays with lots of null bytes after the slot that LineBreak char would be located, if it were reading binary without interpreting breaks. This behavior makes it useless for my particular case. I need EVERY BYTE (all of them, no exceptions) to be castable to arrays, as losing a single byte in a stream breaks a hash algorithm.

LineBreak is set in system files of the compiler somewhere, is it not? There's also a command to change the way a read or write operation interprets which characters are LineBreak, correct?

The simplest solution I see at this point is, to instruct the text stream input, to stop interpreting LineBreak characters so that every byte in the buffer will be cast to the char array. So when read() sees the LineBreak (#10, #13) chars, it just reads them like any other chars, filling up the entire array buffer.

A good idea is to turn off interpretation of LineBreak completely in the program. How would I do this?

Better: how would I hack some way to use blockread on the stdin stream?


PascalDragon

  • Hero Member
  • *****
  • Posts: 5444
  • Compiler Developer
Quote
read() on stdin is geared to read up to a LineBreak character. This is expected behavior for a console application reading user input. (duh) This is very, very bad for reading streams from piped standard input. The way read() interprets and drops control, tab, and line feed characters is totally incompatible with reading piped standard input, unless your only objective is human-readable text. But I'm looking at reading untyped binary input, where all the bytes are crucial.

Because read() only reads up to LineBreak, casting it to an array leaves arrays with lots of null bytes after the slot that LineBreak char would be located, if it were reading binary without interpreting breaks. This behavior makes it useless for my particular case. I need EVERY BYTE (all of them, no exceptions) to be castable to arrays, as losing a single byte in a stream breaks a hash algorithm.

This is simply how Text type variables are defined: they are line based. And the Input and Output variables are defined as such types.

Quote
LineBreak is set in system files of the compiler somewhere, is it not? There's also a command to change the way a read or write operation interprets which characters are LineBreak, correct?

The routine for this is SetTextLineEnding.
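
For reference, a minimal sketch of calling it. The program name and the strings are made up for illustration, and it rests on the assumption that the setting governs the line ending Writeln emits on a Text file; it should not be read as a switch that makes Read on Input treat line breaks as ordinary data.

```pascal
program lineending;
begin
  // Assumption: SetTextLineEnding selects the terminator that Writeln
  // appends on the given Text file (here the standard Output).
  SetTextLineEnding(Output, #13#10);
  Writeln('first line');    // written with CR LF after it

  SetTextLineEnding(Output, #10);
  Writeln('second line');   // written with a single LF after it
end.
```

Piping the output through hexdump -C should show which terminator follows each line.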

Quote
The simplest solution I see at this point is, to instruct the text stream input, to stop interpreting LineBreak characters so that every byte in the buffer will be cast to the char array. So when read() sees the LineBreak (#10, #13) chars, it just reads them like any other chars, filling up the entire array buffer.

No. If you want to read a file as a binary file you need to open a suitable file:

Code: Pascal
program tfiletest;

var
  f: file of char;
  arr: array[0..15] of Char;
  c: Char;
  i: Word;
begin
  // by default FileMode is 2 (read/write); set it to 0 (read-only) so that Reset can open StdIn
  FileMode := 0;
  // an empty file name makes Reset open StdIn
  Assign(f, '');
  // set the block size correctly
  Reset(f, SizeOf(Char));
  {$I-}
  BlockRead(f, arr[0], Length(arr));
  i := IOResult;
  if i <> 0 then begin
    Writeln('Error: ', i);
    Exit;
  end;
  {$I+}
  for c in arr do
    Write(HexStr(Ord(c), 2), ' ');
  Writeln;
  Close(f);
end.

If I pass in a binary file like the following:

Code:
sb@Merkur:~$ hexdump -C Exchange/test.bin
00000000  01 02 03 04 05 06 07 08  09 10 11 12 13 14 15 16  |................|
00000010

Then the program will print this:

Code:
D:\fpc\git>.\testoutput\tfiletest.exe < x:\test.bin
01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16
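
The same idea can probably be stretched to cover the original request: reading all of stdin in fixed-size blocks that can be viewed as DWords. What follows is only a sketch under some assumptions: it uses an untyped file with a record size of 1, the 64-byte block size is arbitrary, the names buf, w, filled, got and acc are invented for the illustration, and the xor line is a stand-in for the real hash step. The inner loop is there because a read from a pipe may deliver fewer bytes than requested before the stream has actually ended.

```pascal
program stdinblocks;

var
  f: file;                                // untyped file: bytes pass through unchanged
  buf: array[0..63] of Byte;              // one block; 64 bytes is an arbitrary choice
  w: array[0..15] of DWord absolute buf;  // the same memory viewed as 16 DWords
  filled, got: LongInt;
  total: QWord;
  acc: DWord;
  i: Integer;
begin
  FileMode := 0;    // read-only
  Assign(f, '');    // an empty name makes Reset open standard input
  Reset(f, 1);      // record size 1, so all counts below are in bytes
  total := 0;
  acc := 0;
  repeat
    FillChar(buf, SizeOf(buf), 0);   // a short final block stays zero-padded
    filled := 0;
    // Keep reading until the block is full or the input is exhausted.
    repeat
      BlockRead(f, buf[filled], SizeOf(buf) - filled, got);
      Inc(filled, got);
    until (got = 0) or (filled = SizeOf(buf));
    if filled > 0 then
    begin
      Inc(total, filled);
      // Placeholder for the real work: fold the block's DWords together.
      for i := 0 to High(w) do
        acc := acc xor w[i];
    end;
  until got = 0;
  Close(f);
  Writeln('bytes read: ', total);
  Writeln('xor of dwords: ', HexStr(acc, 8));
end.
```

It would be used the same way as the earlier test, e.g. head -c 1000000 /dev/urandom | ./stdinblocks, and the byte count it prints should match the number of bytes piped in. The four-parameter form of BlockRead is used so that a short read is reported in got instead of raising a run-time error. A larger buffer (several KB) is likely to be faster than 64 bytes per call.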

 
