If you wanted a digital copy of a magazine article, printed book or newspaper clipping, you could always put it through a scanner for a facsimile image of it. Sometimes, though, an image turns out to be unsuitable for the purpose at hand.
To begin with, images aren't searchable. If you scan a business card onto your computer, for instance, a search of your contacts won't find it, because the device sees only a picture and not the name and number printed on the card. Printed matter captured as an image isn't editable, either. The question is - how do you get a digital copy of printed matter without spending time typing it out by hand?
The answer is optical character recognition (OCR) - the technology that allows a computer to take printed words captured as images, work out which characters and words those images represent and produce machine-recognizable text. This text, just like content typed into word processing software, is searchable and editable.
How does OCR software turn images of words into machine-recognizable words?
More than one kind of OCR technology exists.
At its most basic, optical character recognition does exactly what the name suggests - it tries to recognize words one letter or character at a time.
Optical word recognition is more advanced - it tries to recognize whole words at a time.
OCR technology works best on printed words - not handwriting. Intelligent character recognition and intelligent word recognition are attempts at reliably recognizing handwritten text, including cursive. For now, this more advanced technology works reliably only on certain kinds of handwriting.
Traditionally, one of the greatest challenges facing the designers of OCR software has been to make sure that the software isn't confused by print artifacts - lines on the page, spots, pages scanned at an angle, colored text and so on. Modern OCR software is built with special algorithms to overcome these problems - de-skewing algorithms to straighten tilted images, binarization algorithms to turn a color or grayscale image into pure black and white, and de-speckling algorithms to clean up spots and other artifacts.
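Two of the clean-up steps described above can be sketched in a few lines of Python. This is only an illustration, not how any particular OCR product works: it assumes the image is a small grid of grayscale values (0 is black, 255 is white), uses a fixed threshold of 128 that real software would normally choose adaptively, and treats a "speck" as a single ink pixel with no ink around it. The function names binarize and despeckle are made up for this example, and de-skewing is left out.

```python
def binarize(gray, threshold=128):
    """Turn grayscale pixels (0-255) into 1 (ink) or 0 (background).

    The fixed threshold is an assumption made for this sketch.
    """
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def despeckle(binary):
    """Clear every ink pixel that has no ink among its 8 neighbours."""
    height = len(binary)
    width = len(binary[0]) if height else 0
    cleaned = [row[:] for row in binary]  # copy so the input is not modified
    for y in range(height):
        for x in range(width):
            if not binary[y][x]:
                continue
            neighbours = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    if dy == 0 and dx == 0:
                        continue
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < height and 0 <= nx < width:
                        neighbours += binary[ny][nx]
            if neighbours == 0:
                cleaned[y][x] = 0  # isolated dot: treat it as a speck
    return cleaned


# A tiny made-up scan: a three-pixel stroke plus one stray dark dot.
scan = [
    [255, 255, 255, 255, 255],
    [255,  10,  20, 255, 255],
    [255,  15, 255, 255, 255],
    [255, 255, 255, 255,  30],
    [255, 255, 255, 255, 255],
]

for row in despeckle(binarize(scan)):
    print("".join("#" if px else "." for px in row))
```

The stroke survives because its pixels touch one another, while the lone dot in the fourth row is removed.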
Once an image is cleaned up, the software uses various techniques to compare each character in the image to a stored set of character shapes and settles on the closest match.
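The simplest version of this comparison is to lay the scanned character over each stored shape and count how many pixels agree. The Python sketch below shows that idea under heavy assumptions: the three 3-by-5 shapes in TEMPLATES are invented for the example, every character is assumed to be already cut out and scaled to the same size, and real OCR engines use far more robust features than raw pixel agreement.

```python
# Hypothetical stored shapes: '1' is ink, '0' is background.
TEMPLATES = {
    "I": ["010", "010", "010", "010", "010"],
    "L": ["100", "100", "100", "100", "111"],
    "T": ["111", "010", "010", "010", "010"],
}


def similarity(glyph, template):
    """Return the fraction of pixels on which glyph and template agree."""
    total = 0
    matches = 0
    for glyph_row, template_row in zip(glyph, template):
        for g, t in zip(glyph_row, template_row):
            total += 1
            if g == t:
                matches += 1
    return matches / total


def recognize(glyph, templates=TEMPLATES):
    """Return (label, score) for the stored shape closest to glyph."""
    scored = [(label, similarity(glyph, shape)) for label, shape in templates.items()]
    return max(scored, key=lambda pair: pair[1])


# An 'L' with one stray ink pixel in the middle row.
noisy_l = ["100", "100", "110", "100", "111"]
label, score = recognize(noisy_l)
print(label, round(score, 2))
```

Here the noisy glyph still matches the stored "L" on 14 of its 15 pixels, well ahead of the other two shapes, so "L" is the result.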
Accurate OCR offers great benefits
To improve recognition accuracy, software vendors often create special versions of their software for specific applications. Software that's used by lawyers, for instance, comes with a database of legal expressions and terminology. Access to such a database helps OCR software judge how likely a particular word or phrase is to appear, and so choose between similar-looking candidates. Special software exists for the medical industry, too. Several document processing businesses that specialize in OCR technology exist today to deliver professional scanning services to these industries.
When applied to large-scale projects such as Google's plan to digitize the world's libraries, OCR technology can bring knowledge to the masses at low cost. Businesses and individuals use OCR in imaginative and empowering ways, too. OCR software helps the visually impaired find their way around the world. The banking industry uses it to process checks quickly, and Internet security businesses build CAPTCHA systems, which protect websites and individuals online by showing distorted text that OCR software struggles to read.
Janifar is a computer scientist and researcher. She enjoys passing on her insights through blogging. Visit the Scanning Services Vancouver link to learn more about scanning in that area.
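As a rough illustration of how a terminology database can help, the Python sketch below snaps a doubtful OCR result to the closest entry in a short word list. The five-word LEGAL_TERMS list, the correct_word function and the 0.8 similarity cutoff are all assumptions made for this example; commercial products rely on much larger vocabularies and on statistics about how often phrases occur. The matching itself uses get_close_matches from Python's standard difflib module.

```python
import difflib

# Hypothetical mini-lexicon standing in for a legal terminology database.
LEGAL_TERMS = ["affidavit", "plaintiff", "defendant", "subpoena", "tort"]


def correct_word(word, lexicon=LEGAL_TERMS, cutoff=0.8):
    """Return the lexicon entry closest to word, or word itself if none is close.

    Matched words come back in the lexicon's lowercase form.
    """
    lowered = word.lower()
    if lowered in lexicon:
        return lowered
    matches = difflib.get_close_matches(lowered, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else word


# 'i' misread as 'l' is a typical OCR slip.
print(correct_word("plaintlff"))
print(correct_word("contract"))
```

The first call returns "plaintiff" because only one letter differs, while "contract" resembles nothing in the list and is passed through unchanged.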