
Author Topic: Random errors in TensorFlow  (Read 1056 times)

jollytall

  • Sr. Member
  • ****
  • Posts: 376
Random errors in TensorFlow
« on: December 10, 2024, 08:52:12 pm »
I know this is a bit off-topic, but maybe somebody can help.

I have a program (from https://github.com/zsoltszakaly/tensorflowforpascal/tree/master/examples) that runs a number of examples on TensorFlow. It works perfectly on a CPU.
On the GPU I get errors at totally random points. Most of the time it is a Div0, but I have also seen an FPE. What is strange is that the error occurs at different places even when the very same program is run twice in a row.
When there is an error, it dumps the call stack and exits. I added debug information, but it crashes somewhere deep inside TensorFlow, or maybe even CUDA, so I see only raw program pointers.
The examples.pas is a single-threaded program, but as TF is multi-threaded, I tried adding cthreads to it. That made things even more interesting: sometimes I got two errors. First it dumped a Div0, then it somehow kept running (?), produced a second error dump, and only then crashed. What is really strange is that between the two errors the program completely leaves TF, returns to the single-threaded part, and then starts a new TF session.
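For reference, on Unix cthreads has to be the first unit in the uses clause, which is how I added it; a minimal sketch of the structure:
Code: Pascal
program examples;

uses
  {$ifdef UNIX}cthreads,{$endif}  // must come first so the threading manager is installed early
  SysUtils;

begin
  // ... TensorFlow example runs as before ...
end.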
I also noticed that if I write the results to the screen (slow output), the error typically appears earlier than when the output is redirected to a file, so it might be some timing issue.
As it is on a High Performance Computer, and nobody there uses TF through the C API (let alone from Pascal!), it is difficult to get any help with this.

So, if anyone has an idea how to debug such a dynamically linked library (not TensorFlow specific, I guess), or an idea what the root cause could be, I would be happy to hear about it.
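The most I can do on the Pascal side right now is wrap the runs in a try/except and print the faulting address; a sketch (RunAllExamples is just a placeholder for the example driver, and it only helps when the signal is raised on a Pascal thread):
Code: Pascal
uses
  SysUtils;

procedure RunAllExamples;
begin
  // placeholder: the TensorFlow example runs from examples.pas go here
end;

begin
  try
    RunAllExamples;
  except
    on E: Exception do
    begin
      Writeln('Caught ', E.ClassName, ': ', E.Message);
      // ExceptAddr is where the exception was raised; BackTraceStrFunc
      // resolves it to unit/line only for code built with debug info
      Writeln('At address: ', BackTraceStrFunc(ExceptAddr));
    end;
  end;
end.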

TRon

  • Hero Member
  • *****
  • Posts: 3778
Re: Random errors in TensorFlow
« Reply #1 on: December 10, 2024, 09:02:49 pm »
I currently do not have the time to play with this myself, but one thing I noticed is that the API header does not explicitly state the record packing. This might result in record structures not being aligned properly.
Code: Pascal
{$packrecords c}
Can't hurt and perhaps worth a try.
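For illustration, applied to one of the API records; TF_Buffer's field layout is taken from the TensorFlow c_api.h, and the directive just tells FPC to lay it out the way a C compiler would:
Code: Pascal
{$packrecords c}

type
  // mirrors struct TF_Buffer from the TensorFlow C API
  TF_Buffer = record
    data: Pointer;
    length: NativeUInt;  // size_t in C
    data_deallocator: procedure(data: Pointer; length: NativeUInt); cdecl;
  end;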

jollytall

  • Sr. Member
  • ****
  • Posts: 376
Re: Random errors in TensorFlow
« Reply #2 on: December 10, 2024, 10:14:30 pm »
Thanks for the thought, but it did not help. I added it to all units and to the main program, but the result is still the same.

TRon

  • Hero Member
  • *****
  • Posts: 3778
Re: Random errors in TensorFlow
« Reply #3 on: December 10, 2024, 11:02:08 pm »
What version of the libraries are you using? As far as I can tell, the Pascal headers were last updated to support up to 2.11.

It would probably be better to mention your issue(s) at the GitHub repository.
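You can also check the runtime version directly through the C API's TF_Version; a quick sketch (Linux library naming assumed):
Code: Pascal
program tfversion;

// TF_Version is declared in the TensorFlow C API header c_api.h
function TF_Version: PAnsiChar; cdecl; external 'tensorflow';

begin
  Writeln('TensorFlow reports version ', TF_Version);
end.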

jollytall

  • Sr. Member
  • ****
  • Posts: 376
Re: Random errors in TensorFlow
« Reply #4 on: December 11, 2024, 08:29:13 am »
I think the Pascal header versions should not matter: the 2.x releases should be backward compatible, and as the program itself uses no newer features, there should be no compatibility issues. Once I have time, I will regenerate the Pascal wrapper anyway.
Regarding the earlier comment about packrecords: besides the fact that it did not help, it is again unlikely to be the root cause. When TensorFlow is used in CPU mode, there is never a problem. So I would assume, without knowing too much about TF internals, that building the graph across the Pascal / C API / TF boundary is fine. The problem seems to be somewhere between TF, CUDA and the GPU.
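One more thing I want to rule out: FPC unmasks several floating-point exceptions at startup, while C and CUDA code is normally built assuming they are masked, so a benign FP condition inside the library could raise exactly the signals I am seeing. Masking them before the TF calls is cheap to try; a sketch using the RTL Math unit:
Code: Pascal
uses
  Math;

begin
  // C code runs with FP exceptions masked; FPC unmasks some of them,
  // so a harmless division inside libtensorflow/CUDA raises a signal
  SetExceptionMask([exInvalidOp, exDenormalized, exZeroDivide,
                    exOverflow, exUnderflow, exPrecision]);
  // ... TensorFlow session setup and example runs ...
end.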
Anyway, I will raise the issue on the TF GitHub as well, but as the setup is so complicated (about ten CUDA libraries, each with its own version, plus the complex hardware architecture), I am not sure the error can be reproduced. If someone with access to an Nvidia GPU could test this program and reproduce the error, that would help.
Also, some magic debugging tool would help, but I am not too familiar with gdb or fpDebug. I tried to change the debugger in Lazarus, but as far as I can see in Project Options / Compiler Commands / Show Options (this is where I get the options for the fpc used on the HPC), there is no difference, so the binary is the same and I can use gdb on the HPC. My worry is that the TF and CUDA libraries are highly optimized and contain no debug information at all. Maybe recompiling TF with debug info could help, but that is beyond my capabilities.
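Even without debug info in the libraries, I could at least map the crash address to the shared object it falls in; a sketch using dladdr from the dl unit (Unix only, untested on the HPC):
Code: Pascal
uses
  SysUtils, dl;

procedure ReportAddress(Addr: Pointer);
var
  Info: dl_info;
begin
  // dladdr names the loaded shared object containing the address,
  // which works even for libraries shipped without debug information
  if dladdr(Addr, @Info) <> 0 then
    Writeln('Address $', HexStr(Addr), ' lies in ', Info.dli_fname)
  else
    Writeln('Address $', HexStr(Addr), ' is not in any loaded object');
end;

begin
  try
    // ... TensorFlow example runs ...
  except
    ReportAddress(ExceptAddr);
  end;
end.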

 
