Hi all,
I'm currently working on my own scanning/document management system for incoming letters etc.
https://bitbucket.org/reiniero/papertiger/overview
Currently on Linux; GUI client may be done on Linux and/or Windows (or web based).
I can currently scan using SANE, generate a TIFF, do OCR with Tesseract, and combine the OCR text and the TIFF image into a PDF with ExactImage's hocr2pdf. I'm going to add support for a metadata database, and probably a full text index. Then I'd need a GUI viewer/searcher, desktop and/or web based.
Once things are working, I would like to (semi)automate some tasks, such as detecting company logos on incoming letters.
Depending on my Tesseract hOCR output (or e.g. Cuneiform's, if I also add support for that), I might already be able to narrow down which parts of the TIFF contain graphics.
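To illustrate the idea of narrowing things down with the hOCR output, here is a rough sketch in Python (not code from the project). It relies on the hOCR convention that each element carries its box as "bbox x0 y0 x1 y1" in its title attribute. The helper names, the choice of the ocr_line class, and the min_gap threshold are all assumptions made for illustration. It only finds horizontal bands without recognized text, so a logo sitting next to a text line, or a logo that the OCR engine reads as text, would be missed.

```python
import re
from html.parser import HTMLParser

# hOCR keeps the bounding box in the title attribute, e.g.
# title="bbox 100 400 900 440; baseline 0 -5"
BBOX = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class TextBoxCollector(HTMLParser):
    """Collects (x0, y0, x1, y1) boxes of elements with a given hOCR class."""

    def __init__(self, css_class="ocr_line"):
        super().__init__()
        self.css_class = css_class
        self.boxes = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if self.css_class in classes:
            match = BBOX.search(attributes.get("title") or "")
            if match:
                self.boxes.append(tuple(int(v) for v in match.groups()))


def text_boxes(hocr, css_class="ocr_line"):
    """Return the bounding boxes of all text elements of the given class."""
    collector = TextBoxCollector(css_class)
    collector.feed(hocr)
    collector.close()
    return collector.boxes


def non_text_bands(page_height, boxes, min_gap=50):
    """Return (top, bottom) bands at least min_gap pixels tall with no text box.

    min_gap is an arbitrary illustrative threshold, not a tuned value.
    """
    bands = []
    cursor = 0  # lowest y coordinate already covered by text
    for y0, y1 in sorted((box[1], box[3]) for box in boxes):
        if y0 - cursor >= min_gap:
            bands.append((cursor, y0))
        cursor = max(cursor, y1)
    if page_height - cursor >= min_gap:
        bands.append((cursor, page_height))
    return bands
```

The bands returned this way would be the candidate regions to crop from the TIFF and pass on to whatever logo comparison is used.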
In any case, I'd like to be able to detect what company sent me a letter, presumably based on having an example scanned logo stored in the database and running some kind of tool/algorithm on it.
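As a strawman for the kind of comparison meant here, this Python sketch uses an average hash plus Hamming distance, which is one simple perceptual-hash technique and not necessarily the best fit. It assumes the candidate region and every stored example logo have already been cropped, converted to grayscale, and scaled to the same small grid (for example 8x8); the function names, the dictionary of known logos, and the max_distance threshold are hypothetical. This approach is sensitive to cropping and skew, so something like template matching or feature matching in an image library may well turn out to be more robust.

```python
def average_hash(pixels):
    """Return a list of bits: 1 where a pixel is brighter than the image mean.

    pixels is a 2D list of grayscale values (0-255) on a small fixed grid.
    """
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return [1 if value > mean else 0 for value in flat]


def hamming(a, b):
    """Count the positions at which two equal-length bit lists differ."""
    if len(a) != len(b):
        raise ValueError("hashes must have the same length")
    return sum(x != y for x, y in zip(a, b))


def best_match(candidate_pixels, known_logos, max_distance=10):
    """Return (company, distance) for the closest stored logo hash, or None.

    known_logos maps a company name to the average hash of its example logo.
    max_distance is an illustrative cut-off, not a tuned value.
    """
    candidate = average_hash(candidate_pixels)
    best = None
    for company, reference in known_logos.items():
        distance = hamming(candidate, reference)
        if best is None or distance < best[1]:
            best = (company, distance)
    if best is not None and best[1] <= max_distance:
        return best
    return None
```

In a setup like the one described, the hashes of the example logos would be stored in the metadata database, and each non-text region of a new scan would be hashed and compared against them.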
Any hints on the best way to do this?
Thanks!