Getmem: my code is much the same in principle, but I keep an ordered (path-sorted) list. That makes it easier to dump a listing showing which files in the archive belong together (same dir).
One more crucial difference, performance-wise: I only hash (also MD5) when there are multiple files with the same size.
In my case, one of the two dirs is always the same, so the MD5 info for files already hashed in that dir is kept persistent.
The tool is mostly used to deduplicate decommissioned hard disks, to save time before checking what actually needs to be saved; the archive dir of about a million files (+/- 500 GB) contains all known files.
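The approach above (group by size first, hash only on size collisions, keep a persistent hash cache for the fixed archive dir) can be sketched roughly like this. This is a minimal illustration, not the actual tool; the function names, the JSON cache file, and the `startswith` check for "is this file in the archive" are all my own assumptions.

```python
import hashlib
import json
import os
from collections import defaultdict


def md5sum(path, chunk=1 << 20):
    """MD5 of a file, read in chunks so big files don't fill memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()


def find_duplicates(archive_dir, new_dir, cache_file="md5cache.json"):
    # Persistent cache of hashes for the archive dir, which never changes.
    try:
        with open(cache_file) as f:
            cache = json.load(f)
    except FileNotFoundError:
        cache = {}

    # Pass 1: bucket every file by size; unique sizes never get hashed.
    by_size = defaultdict(list)
    for root in (archive_dir, new_dir):
        for dirpath, _, names in os.walk(root):
            for name in names:
                p = os.path.join(dirpath, name)
                by_size[os.path.getsize(p)].append(p)

    # Pass 2: hash only the files whose size collides with another file.
    dupes = defaultdict(list)
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue
        for p in paths:
            # Reuse cached hashes for archive files (simplistic prefix test).
            digest = cache.get(p) if p.startswith(archive_dir) else None
            if digest is None:
                digest = md5sum(p)
                if p.startswith(archive_dir):
                    cache[p] = digest
            dupes[(size, digest)].append(p)

    with open(cache_file, "w") as f:
        json.dump(cache, f)

    # Path-sorted output makes files from the same dir end up adjacent.
    return {k: sorted(v) for k, v in dupes.items() if len(v) > 1}
```

On a second run against the same archive dir, only the new dir's colliding files get hashed, which is where most of the time saving comes from.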