OCR and Resources

Some of the information you want to store for future reference may be graphics. Personally, I like comics such as Dilbert, Non Sequitur and PhD for their often brutally honest view of office life, life in general, or life as a doctoral student (I have many lives ;-)). I send specific comics to other people when I remember one that suits the situation, and I use some as inspiration for creative projects, so I often need to find a specific comic. I once thought about typing the text of each comic into a database and linking the comic to the entry. Then I laughed and postponed the issue until I could find a way to do this more intelligently and without wasting hours of my time. I think I have found a solution now: OCR for comics.

OCR (Optical Character Recognition) has been around for years, but I usually associated it with scanned books. However, there is no reason why OCR should not be able to digitize the text in comics. After all, the text is easily readable, and I do not need a 100% correct recognition rate, just one good enough to find a specific comic strip when I need it. What got me thinking about this was the Evernote website, where OCR is used to make scanned or photographed notes findable. While I recommended (and still recommend) typing your notes whenever possible (typed text takes up less space, is found faster, and can be copied and pasted into other apps), OCR makes sense for scans or pictures where the pictorial information itself is important.

Unfortunately, iPhoto (which I use to archive comics) has no OCR (perhaps this will come in a future release; after all, iPhoto ’09 can recognize faces), but the following approach should work:

  1. Create a PDF and use Acrobat or another program to recognize the text in the comics.
    Move the folder with the comics into TextWrangler to get a listing of all the file names. Use Excel to create HTML code that displays the file name first, then the image, then a marker that signals the start of the next image. Copy the code into a text document and save it as .htm to create one very large web page. Then print the page to .pdf (be careful with page breaks) and run OCR on the PDF.
  2. Extract the file name and the recognized text of each comic from the OCR'd document.
  3. Use a PHP script to insert the recognized text into the title information of the corresponding image.
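
The page-building part of step 1 and the extraction in step 2 could probably also be scripted instead of going through TextWrangler and Excel. What follows is only a rough sketch in Python, not something I have run against my real archive. The names `FILE_PREFIX`, `END_MARKER`, `list_images`, `build_page` and `parse_ocr_text` are my own inventions for illustration, and the sketch assumes that the OCR program reproduces the file-name line and the marker line without mangling them, which is not guaranteed.

```python
import html
from pathlib import Path

# Hypothetical markers (my own choice): they only need to survive OCR unchanged.
FILE_PREFIX = "FILE:"
END_MARKER = "END-OF-COMIC"

# Assumed set of image types in the comics folder.
IMAGE_SUFFIXES = {".gif", ".jpg", ".jpeg", ".png"}


def list_images(folder):
    """Return the sorted names of all image files directly inside `folder`."""
    return sorted(
        p.name for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )


def build_page(filenames):
    """Build one large HTML page: file name, then image, then end marker."""
    parts = ["<html><body>"]
    for name in sorted(filenames):
        safe = html.escape(name, quote=True)
        parts.append(f"<p>{FILE_PREFIX} {safe}</p>")
        parts.append(f'<img src="{safe}" alt="">')
        parts.append(f"<p>{END_MARKER}</p>")
    parts.append("</body></html>")
    return "\n".join(parts)


def parse_ocr_text(text):
    """Split OCR output at the end markers and map file name -> comic text."""
    comics = {}
    for chunk in text.split(END_MARKER):
        lines = [line.strip() for line in chunk.splitlines() if line.strip()]
        for i, line in enumerate(lines):
            if line.startswith(FILE_PREFIX):
                name = line[len(FILE_PREFIX):].strip()
                # Everything after the file-name line is the recognized text.
                comics[name] = " ".join(lines[i + 1:])
                break
    return comics
```

The idea would be to write the result of `build_page(list_images(folder))` to an .htm file inside the comics folder, print that page to PDF, run OCR on it, export the recognized text, and feed that text to `parse_ocr_text`. The resulting dictionary is the input that step 3 needs.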

This should work and give you searchable comics, so you can find the specific comic you are looking for. I’ll let the idea simmer for a while, but so far it looks like an easy way to get the information. You could also use Evernote for the comics (since it can recognize the text in images), but in my case I am not sure whether I would like to trust Evernote with 6844 Dilberts (a few are duplicates), 4867 Non Sequiturs and 846 PhDs … and I kinda like iPhoto for archiving my comics.

Update

Of course, it helps if the comic sites themselves offer search functionality. xkcd and Dilbert do, and if you remember a word or two from the strip you are searching for, you can find it quickly. Still, download the comic in any case: who knows how long the comic strips will be available (for free).