Well, the alternative is to read up on Unicode and implement your own parsing.
You may hear the advice to use UTF-16 (if there is conversion code, since the files you read appear to be in UTF-8).
But UTF-16 does not solve the issue. It may well hide it, but that is not the same as solving it. E.g. you then get surrogate pairs: any code point above U+FFFF takes two 16-bit units. You can avoid those by going to UTF-32, but then you may still have to deal with combining code points (and those exist in each and every encoding).
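To make that concrete, here is a small sketch in Python (not the language of your project, just convenient for showing the numbers). The helper names `utf16_units` and `utf32_units` are made up for this illustration. It counts how many units one emoji and one "e + combining acute accent" take in each encoding.

```python
def utf16_units(s: str) -> int:
    """Number of 16-bit code units needed to store s as UTF-16."""
    return len(s.encode("utf-16-le")) // 2


def utf32_units(s: str) -> int:
    """Number of 32-bit code units needed to store s as UTF-32 (= code points)."""
    return len(s.encode("utf-32-le")) // 4


emoji = "\U0001F600"   # one code point outside the Basic Multilingual Plane
combined = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT, shown as one character

for label, s in (("emoji", emoji), ("e + accent", combined)):
    print(label,
          "utf8 bytes:", len(s.encode("utf-8")),
          "utf16 units:", utf16_units(s),
          "utf32 units:", utf32_units(s))
```

The emoji needs a surrogate pair in UTF-16 (2 units) but only 1 unit in UTF-32. The accented "e" still takes 2 units even in UTF-32, so a wider encoding alone does not give you "one unit per visible character".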
Btw, even UtfCopy does not take care of combining code points. So some characters that exist only as a combining sequence will get broken even then.
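A rough illustration of that breakage, again as a Python sketch. `copy_code_points` is only a hypothetical stand-in for a copy routine that counts code points, not the actual UtfCopy. `copy_clusters` is a deliberately simplified alternative that keeps combining marks attached to their base character. Real grapheme segmentation (Unicode UAX #29) has more rules than this.

```python
import unicodedata


def copy_code_points(s: str, start: int, count: int) -> str:
    """Hypothetical stand-in for a copy that counts code points."""
    return s[start:start + count]


def copy_clusters(s: str, start: int, count: int) -> str:
    """Simplified copy that keeps combining marks with their base character.

    Assumption: a mark is any code point with a non-zero canonical
    combining class. This is not full grapheme cluster segmentation.
    """
    clusters = []
    for ch in s:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch      # attach the mark to the previous base
        else:
            clusters.append(ch)     # start a new cluster
    return "".join(clusters[start:start + count])


# 'q' + COMBINING TILDE has no precomposed form, so NFC leaves it as two code points.
text = "q\u0303a"
assert unicodedata.normalize("NFC", text) == text

broken = copy_code_points(text, 0, 1)   # only the bare 'q', the tilde is lost
intact = copy_clusters(text, 0, 1)      # 'q' together with its tilde
print(len(broken), len(intact))
```

Copying "the first character" by code point drops the tilde, while the cluster-based version keeps both code points together. Normalizing to NFC first does not help for sequences like this one, because there is no single code point to compose them into.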