Are you using "modern" OpenGL (shaders, vertex buffer objects/arrays, framebuffers, glDrawElements)? Or are you using the old fixed-function pipeline (glBegin, glEnd, glVertex3f, glCallList, etc.)? I'm asking because these are very different approaches, with the former being vastly more flexible than the latter.
In my (modern) OpenGL project, for example, I render the scene a second time with a shader that outputs (among other things) the world-space position of each fragment; interpreted as RGB it may look something like the attached image. Then I call glReadPixels at the mouse coordinates to read back the position under the cursor.
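Here's a minimal sketch of that pass, assuming an OpenGL 3.3+ context and a framebuffer object with a GL_RGB32F color attachment already set up (the names pick_fs, posFBO and read_position_under_mouse are just illustrative):

```c
#include <GL/glew.h>   /* or whichever GL loader you use */

/* Fragment shader for the picking pass: write the interpolated
   world-space position of each fragment into a float color target. */
const char *pick_fs =
    "#version 330 core\n"
    "in vec3 worldPos;   /* passed through from the vertex shader */\n"
    "layout(location = 0) out vec3 outPos;\n"
    "void main() { outPos = worldPos; }\n";

/* After rendering the pass into posFBO, read back the single pixel
   under the mouse. Note: OpenGL's window origin is the bottom-left
   corner, so the mouse y coordinate has to be flipped. */
void read_position_under_mouse(GLuint posFBO, int mouseX, int mouseY,
                               int windowHeight, float outPos[3])
{
    glBindFramebuffer(GL_READ_FRAMEBUFFER, posFBO);
    glReadBuffer(GL_COLOR_ATTACHMENT0);
    glReadPixels(mouseX, windowHeight - 1 - mouseY, 1, 1,
                 GL_RGB, GL_FLOAT, outPos);
    glBindFramebuffer(GL_READ_FRAMEBUFFER, 0);
}
```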
The same thing should be possible by sampling the Z-buffer instead (which is pretty much always present), although that isn't ideal: the values are nonlinear, precision degrades for objects far from the camera, and you need the camera's view/projection matrices to unproject the depth back into a position.
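A sketch of the depth-buffer variant, under the same assumptions; invVP is assumed to be the inverse of projection * view from whatever math library you use, and mat4_mul_vec4 is a hypothetical helper for a 4x4 matrix-vector multiply:

```c
/* Read the depth under the mouse and unproject it back to world space.
   Requires the inverse of the projection*view matrix (invVP), which is
   exactly the "more camera information" mentioned above. */
float depth;
glReadPixels(mouseX, windowHeight - 1 - mouseY, 1, 1,
             GL_DEPTH_COMPONENT, GL_FLOAT, &depth);

/* Convert window coordinates + depth to normalized device coordinates
   in [-1, 1] (assuming the default glDepthRange of [0, 1]). */
float ndc[4] = {
    2.0f * mouseX / windowWidth  - 1.0f,
    1.0f - 2.0f * mouseY / windowHeight,
    2.0f * depth - 1.0f,
    1.0f
};

/* world = invVP * ndc, then divide by w (perspective divide). */
float world[4];
mat4_mul_vec4(world, invVP, ndc);   /* hypothetical math helper */
for (int i = 0; i < 3; ++i)
    world[i] /= world[3];
```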
Unfortunately this technique can't be implemented in the fixed-function pipeline, since it depends on a custom fragment shader. glReadPixels can still be useful there, though: for example, render every object in a unique flat color and read back the pixel under the mouse to at least identify which object is there (color picking; see the sketch below). Beyond that, I'm afraid you'll have to fall back to CPU-side ray casting. In any case, the fixed-function pipeline makes anything slightly advanced much, much harder, if not outright impossible, and it has been deprecated since 2008; shaders have been available for some 20 years now.
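A sketch of that color-picking fallback for the fixed-function pipeline; draw_object is a stand-in for your existing per-object draw code, and the pass is assumed to render into the back buffer without swapping, so the user never sees the flat colors:

```c
/* Render every object in a unique flat color encoding its ID, then read
   back the pixel under the mouse. Lighting, texturing, dithering, fog
   and blending must be off, or the encoded color gets altered on the
   way to the framebuffer. */
glDisable(GL_LIGHTING);
glDisable(GL_TEXTURE_2D);
glDisable(GL_DITHER);
glClearColor(0.0f, 0.0f, 0.0f, 1.0f);  /* black = "nothing picked" */
glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);

for (int id = 0; id < numObjects; ++id) {
    int c = id + 1;                     /* offset so 0 stays "background" */
    glColor3ub((c >> 16) & 0xFF, (c >> 8) & 0xFF, c & 0xFF);
    draw_object(id);                    /* hypothetical: your draw code */
}

unsigned char pixel[3];
glReadPixels(mouseX, windowHeight - 1 - mouseY, 1, 1,
             GL_RGB, GL_UNSIGNED_BYTE, pixel);
int picked = ((pixel[0] << 16) | (pixel[1] << 8) | pixel[2]) - 1;
/* picked == -1 means the mouse was over the background. */
```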