Scanning Books

2014-04-23 Daniel Doing Science, Improving your Creativity, Infrastructure, Learning to do Science, Other Programs, Other Tools, Science, Tools 4

“Imagine a banana. Or anything curved. Actually, don’t, cause it’s not curved or like a banana. Forget the banana!”
The Doctor, describing conceptual space, in “Doctor Who: Space and Time”

I’ve repeatedly written posts about the joy of reading books digitally and about how to quickly scan paper books. Given that I spent the last weekend at home and took care of the last books I had stored in my old room, I’d like to give a short update.

The books I scanned varied in size, color vs. b/w, number of pages, etc. I usually scan color books in color and at the best quality. In rare cases, I use gray (if it really is a b/w book with a lot of pictures), and for most paperbacks/novels I use b/w (except for the cover).
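The rule of thumb above can be written down as a tiny decision function. This is only a sketch: the `Book` fields and the `scan_mode` name are made up for illustration, and scanning the cover in color is my reading of "except for the cover", not a scanner setting.

```python
from dataclasses import dataclass


@dataclass
class Book:
    """Hypothetical description of a book, for illustration only."""
    title: str
    printed_in_color: bool
    many_pictures: bool


def scan_mode(book: Book, is_cover: bool = False) -> str:
    """Return 'color', 'gray', or 'bw' following the rule of thumb above."""
    # Assumption: covers are scanned in color even for b/w books.
    if is_cover or book.printed_in_color:
        return "color"
    # Rare case: a b/w book with a lot of pictures gets grayscale.
    if book.many_pictures:
        return "gray"
    # Most paperbacks and novels.
    return "bw"


novel = Book("some paperback", printed_in_color=False, many_pictures=False)
print(scan_mode(novel))                 # body pages
print(scan_mode(novel, is_cover=True))  # cover page
```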

These are the settings for my ScanSnap S1500M:

Color: [screenshots: color_1, color_2]

Gray: [screenshots: gray_1, gray_2]

B/W: (no compression setting)

However, what makes the scans worthwhile happens after the scan.

Using Acrobat (I still have Adobe Acrobat 9 Pro), I first run Reduce File Size (with “Retain existing” selected, but saved as a new file) and then OCR.

Saving the documents as new files lets you easily keep the original scans in best quality, in case you ever need them. The effect of reducing the file size is incredible: just compare the original sizes, the sizes after Reduce File Size, and the sizes after OCR (OCR adds a little).

[Screenshots: part1, part2]

(61 files, from 12.71 GB to 1.2 GB to 1.6 GB)
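To put these numbers into perspective, here is a quick back-of-the-envelope calculation. It only uses the three totals and the file count given above; the per-file averages assume 1 GB = 1000 MB, and individual books will of course differ. In short, the reduced files are roughly 90% smaller than the raw scans, and OCR puts about a third back on top of the reduced size.

```python
# Totals from the post (61 scanned books).
FILES = 61
ORIGINAL_GB = 12.71   # raw scans
REDUCED_GB = 1.2      # after Reduce File Size
OCR_GB = 1.6          # after OCR


def percent_saved(before: float, after: float) -> float:
    """Percentage by which `after` is smaller than `before`."""
    return (1 - after / before) * 100


def average_mb(total_gb: float, files: int) -> float:
    """Average size per file in MB (assuming 1 GB = 1000 MB)."""
    return total_gb * 1000 / files


print(f"Reduce File Size saves {percent_saved(ORIGINAL_GB, REDUCED_GB):.1f}%")
print(f"After OCR, still {percent_saved(ORIGINAL_GB, OCR_GB):.1f}% smaller than the raw scans")
print(f"OCR adds {(OCR_GB / REDUCED_GB - 1) * 100:.0f}% to the reduced size")
print(f"Average per book: {average_mb(ORIGINAL_GB, FILES):.0f} MB raw, "
      f"{average_mb(OCR_GB, FILES):.0f} MB final")
```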

After OCR is done (remember to select the correct language), you can read the books in “good-enough” quality and easily copy and paste interesting passages. A few books (about 1 in 20) have some OCR problems, and perhaps 1 in 50 is unusable for OCR.

But most of the time it works brilliantly.

I still have my books without them taking up any (relevant) space.

Happy reading 🙂

Michael W. Perry
2014-04-23 at 17:53

“I still have my books without them taking up any (relevant) space.”

Yes, but does the author of those books have any of your money for laboring to write the book that you perhaps borrowed before OCRing?
Daniel
2014-04-23 at 17:57

When I OCR a book, I cut off the spine and send it through a document scanner; afterwards the book is pretty much “kaput”. So, yeah, the author got the full price of the book, or, in the cases I prefer, the original buyer got some of his/her money back when I bought a used book. OCRing a borrowed book would mean (for me) unfolding every page manually and pressing a scan button on a copier/scanner. Possible, and I did do this when I was a student assistant (good thing I had a notebook then and spent the time watching videos), but I would not do this today.

In short, everything I put through my document scanner I have bought in one way or the other, and the book is removed from circulation. 🙂
Chad
2014-04-23 at 19:46

Thanks for the post. Do you automate the process in any way? I’ve been trying to figure out how to get Acrobat to open any documents that go into a specific folder, run its reduce file size mojo, ocr, and then save it in a new file. Still haven’t figured out how.
Daniel
2014-04-23 at 20:08

No, not really, although I am sure there are ways to do so. Probably via AppleScript or via external programs.

The only thing I do is use “Documents” > “Reduce File Size” and “Documents” > “OCR Text Recognition” > “Recognize Text in Multiple Files Using OCR …” to work in batches. You can select multiple files at once and determine the new file name. I use another folder and add _rdx for files with reduced file size and _ocr for files with text recognition, ending up with author_year_rdx_ocr. I usually first reduce the file size of all scanned books overnight, then run OCR, also overnight if possible. For OCR, I first sort the files into German vs. English books to select the correct language (this makes a huge difference, given that German has “Umlaute” like ä, ö, ü).

I have no qualms about letting my notebook work on processor-intensive tasks overnight, although I strongly recommend having a smoke detector in one’s bedroom. Not that anything ever happened, but you never know, and it’s the smoke that kills you, not the fire.
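For anyone who wants to script at least the bookkeeping around these batches, here is a minimal sketch of the naming scheme and the language sorting described above. It does not drive Acrobat at all; it only plans which file goes into which OCR batch and what the output files would be called. The function names, the folder names, the example file names, and the idea of keeping a file-to-language mapping by hand are all assumptions for illustration.

```python
from pathlib import Path


def reduced_name(scan: Path, out_dir: Path) -> Path:
    """author_year.pdf -> out_dir/author_year_rdx.pdf"""
    return out_dir / f"{scan.stem}_rdx{scan.suffix}"


def ocr_name(reduced: Path, out_dir: Path) -> Path:
    """author_year_rdx.pdf -> out_dir/author_year_rdx_ocr.pdf"""
    return out_dir / f"{reduced.stem}_ocr{reduced.suffix}"


def plan_batches(scans, languages, out_dir):
    """Group scans by OCR language and compute their output names.

    `languages` maps a file name to its OCR language; it has to be
    filled in by hand, since the language is not detected here.
    """
    batches = {}
    for scan in scans:
        lang = languages[scan.name]
        rdx = reduced_name(scan, out_dir)
        batches.setdefault(lang, []).append((scan, rdx, ocr_name(rdx, out_dir)))
    return batches


# Hypothetical example files.
scans = [Path("scans/mueller_2009.pdf"), Path("scans/smith_2011.pdf")]
languages = {"mueller_2009.pdf": "German", "smith_2011.pdf": "English"}

for lang, jobs in plan_batches(scans, languages, Path("processed")).items():
    print(lang)
    for scan, rdx, ocr in jobs:
        print(f"  {scan.name} -> {rdx.name} -> {ocr.name}")
```

Each language group can then be selected as one batch in the OCR dialog, so the language only has to be set once per batch.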
