is it possible to somehow improve the situation I described above.
The answer to that is _very likely_ to be, yes. The performance gain in most cases will be minimal to negligible and, the code to implement an iterative solution will be noticeably more complex. It's simply not worth it.
Just FYI, Sedgewick, in the second edition of his book "Algorithms" presents both, a recursive and an iterative implementation (Pascal pseudocode) of a DFS (neither is optimized.) As expected, the iterative implementation is more complex than the recursive one and, as presented in the book, it's unclear which one would actually perform better.
I _believe_ that, with careful optimization of both algorithms, the iterative algorithm would be a smidgen faster (most likely not humanly perceptible in the great majority of cases.) I believe that because, in a good iterative implementation, a push of the return address on the stack will no longer be necessary (in most cases, that isn't much of a performance gain.)
It seems to me that you agree.
In general, yes, I agree with what you've stated.
However, there is one statement you've made that I disagree with, which is:
These types of operations are very expensive for the CPU.
There is obviously a cost associated with pushing "items" on a stack but, they really aren't "very expensive".