If you have an older manuscript that was published a few years ago and you no longer have the original text file, what can you do to resurrect the text so you can make a new edition or offer it as an eBook? Or maybe your computer crashed (this is where backups are handy) but you still have your work in printed form.
In the last few years OCR (optical character recognition) technology has made great strides, and it continues to advance. Not too long ago you would be lucky to correctly retrieve maybe 80% of a fairly clean hardcopy text. Now, scanned properly, the same text can come through at upward of 98%. Of the many manuscripts we have converted, I’ve never seen a 100% correct conversion, but I’ve seen a couple that were very close.
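To put those percentages in perspective, here is a rough sketch in Python. It assumes the accuracy figure is counted per character and that a page holds about 1,800 characters. Both are illustrative assumptions rather than measurements, and the function name expected_errors is hypothetical.

```python
def expected_errors(accuracy: float, chars_per_page: int = 1800, pages: int = 1) -> int:
    """Estimate how many characters will need correcting after OCR.

    accuracy is a fraction between 0 and 1 (0.98 means 98%).
    chars_per_page is an assumed average, not a measured value.
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return round((1.0 - accuracy) * chars_per_page * pages)


if __name__ == "__main__":
    print(expected_errors(0.80))  # 360 corrections on one page
    print(expected_errors(0.98))  # 36 corrections on one page
```

Even at 98%, a long manuscript still leaves plenty of characters to check by hand under these assumptions.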
Some text will convert better than others. For instance, characters in a serif font such as Times New Roman, Courier, or Garamond will be recognized better than those in a sans-serif font like Arial or Helvetica. The uniform strokes of a sans-serif font can make letters blend together, so “la” may be interpreted as “h” or “in” may end up as “m.” A serif font, with its details on the ends of some strokes and the varying thickness of its letters, helps to define the characters. Note the change in thickness in the “o” in Times New Roman versus the one in Arial.
Most of the OCR software we work with can get excellent results with either. But if you try to get a good conversion from very ornate or decorative fonts like Monotype Corsiva, Edwardian Script, or Old English, you will most likely end up retyping that part of the text. It is just not going to happen. Also, since OCR is for recognizing text, the conversion does not include pictures or graphics.
If you start with good, clean, quality text that is not torn, wrinkled, or discolored, that will help considerably. When scanning, the higher the resolution, the better. We scan at a minimum of 300 dpi for very clean text, or 600 dpi if the lettering is small or faded.
Depending on the typeface, size, and quality of the text, we have developed a number of tricks that can help the process. Just about every scanner I’ve seen gives different results. The scanners that are part of an all-in-one printer usually don’t give the best results. Also, if you have 300 pages of material, a feeder is almost a necessity. We use a 13 x 19 flatbed scanner with an autofeeder that is part of a larger document processing system.
Once you get the text into a usable form, you will notice one thing right away: there is a carriage return at the end of each line. Generally, the OCR conversion software does not know whether it has reached the end of a line, a sentence, or a paragraph, so it adds a carriage return. Now you have a problem: how do you eliminate all of those unwanted returns? The two methods we use are to get in there and take them out manually, or to use search and replace.
While you can do a search and replace for the returns, you may want to first go through the text and identify the start of each new paragraph, because taking out every return will turn the whole document into one big paragraph. What you can do is add an additional return at each paragraph break. Then search for all double returns and replace them with a placeholder such as $. Now you can take out all the remaining returns (replace each with a space if the words at the line ends would otherwise run together), and then replace every $ with a return. The result will honor the original paragraphs. I used $ as an example; if your manuscript contains any, you would get a return in the middle of the line wherever they occur. Simply use an odd character that appears nowhere else in the text, e.g. #, ^, +, or =.
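For anyone who would rather script this than use a word processor, the placeholder trick above can be sketched in Python roughly as follows. This is a minimal illustration, not the exact procedure we run: the function name unwrap_ocr_text and the default marker # are hypothetical, and it assumes paragraph breaks have already been marked with a blank line and that each removed return should become a single space.

```python
import re


def unwrap_ocr_text(text: str, marker: str = "#") -> str:
    """Remove line-end returns while keeping paragraph breaks.

    Assumes paragraphs are already separated by a blank line
    (two or more returns in a row).
    """
    # Treat Windows (\r\n) and old Mac (\r) line endings like \n.
    text = text.replace("\r\n", "\n").replace("\r", "\n")

    # The marker must not occur in the manuscript, or it would
    # create a paragraph break in the middle of a line.
    if marker in text:
        raise ValueError("marker already appears in the text; pick another")

    # Step 1: turn every paragraph break into the marker.
    text = re.sub(r"\n{2,}", marker, text)
    # Step 2: replace the remaining line-end returns with a space
    # (an assumption, so that words at line ends do not run together).
    text = text.replace("\n", " ")
    # Step 3: turn each marker back into a single return.
    text = text.replace(marker, "\n")

    # Tidy up: collapse repeated spaces and trim each paragraph.
    text = re.sub(r"[ \t]+", " ", text)
    return "\n".join(line.strip() for line in text.split("\n"))


if __name__ == "__main__":
    sample = "First line of\nparagraph one.\n\nSecond paragraph\nstarts here.\n"
    print(unwrap_ocr_text(sample))
```

If the manuscript already contains the marker, the function stops with an error instead of guessing, so you can pass a different character, for example unwrap_ocr_text(text, marker="^").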
We’ve done many hardcopy-to-electronic file conversions using OCR. For the best results you do need quality hardware and software for scanning the text, converting it, and identifying and selecting the text. The process is not difficult in most cases, but correcting misidentified characters can be time consuming.
If you have something you would like to convert and want to know your options and the approximate cost, just send us a few pages of the text. Either make a copy and send it via regular mail, or scan the pages at a minimum of 300 dpi and save them as a PDF. In either case, let us know how many pages there are in total.