(c) Randall Munroe, visit his website at xkcd.com
If you love books, don’t read this.
If you love to work with the content of books, especially in digital format, do.
I’ve got a pretty large bookshelf and it bothers me. It bothers me because it takes up space and I see myself more as a digital nomad than as someone who carries a lot of books around, and it bothers me because I like to read books in digital form. Sounds strange, but I like to read them as .pdf on my notebook, being able to copy interesting passages into my wiki, where each book has a page and where I store notes about the book.
I’m a huge fan of ebooks, if they aren’t “eyes only” DRM, but this doesn’t help me with the books I have — I’m not going to buy them again. Yes, I can find copies of most books “for free” online, but I actually believe in paying for what I use.
So, I’ve made an investment. I bought a Fujitsu ScanSnap S1500M, a really fast document scanner, to scan all the books I own, or rather, most of the books I own.
But wait, how can you scan books with a document scanner? It only takes sheets of paper, not bound books.
This is where it gets painful for librarians and book-lovers.
You take a cutter, open the book so that the thick cover is out of the way, and then you cut down near the spine until the whole spine is separated from the book. If you cut at the right distance (about 2 mm), the whole text stays on the pages, while the spine with the thread or glue binding that held the pages together is separate from the pages themselves. You might also have to turn the first page; sometimes it is glued to the cover in a way that makes the distance from the spine appear smaller than it actually is. When that happened to me, it cost me a book (I cut right through the first letters of each line). So mind the distance. Next, you cut the spine part from the front cover. This leaves you with a) the front cover, b) the book block as sheets of paper, and c) the back cover.
See, I told you not to read on.
After this painful procedure, the S1500M takes care of the rest — it scans the book block in “Best” Quality (B&W 600 dpi, duplex scan) as .pdf without OCR (this comes later in Acrobat) in the blink of an eye. Seriously, this scanner is fricking fast. You put in 50 sheets of paper (100 double sided scans) and you are hard pressed to do something useful in the short time it takes to scan the pages. I also scan the covers in “Best” Quality (Color 300 dpi, simplex scan, you can create scan profiles so switching goes really fast) and put them together in Acrobat with the book block.
Regarding file size, my favorite book, a 146-page book measuring 12.24 × 19.03 cm, is 8.1 MB after Acrobat is finished with it.
And it’s well worth it. 🙂
But wait, doesn’t that mean that I destroy my library?
Yes, in a way, I do. The books are fodder for the trash bin when I’m through with them. But on the other hand, the really important part, the content, lives on digitally. Sure, a lightning strike can turn my virtual library to cinders, but it could also do that to a paper library if a fire breaks out after the strike. The difference is that I probably wouldn’t have a copy of my paper library at the nearest bank, or at work, or even with me on a 16 GB USB stick.
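To put the USB stick into perspective, here is a rough back-of-the-envelope sketch. It assumes (and this is only an assumption) that every book ends up about as large as the 8.1 MB example above, that 1 GB counts as 1000 MB, and that file system overhead can be ignored:

```python
BOOK_MB = 8.1   # size of the 146-page example book from this post
STICK_GB = 16   # size of the USB stick mentioned above


def books_per_stick(stick_gb, book_mb):
    """Estimate how many scanned books of a given size fit on a stick.

    Assumes decimal units (1 GB = 1000 MB) and ignores file system overhead.
    """
    return int(stick_gb * 1000 // book_mb)


print(books_per_stick(STICK_GB, BOOK_MB))
```

Real books will vary in size, so treat the result as an order of magnitude, not a promise.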
Danger of loss is omnipresent — but you can deal with it.
But there are some … damn it, the book has finished scanning and I haven’t got the next one ready, I can’t keep up with this scanner … books I won’t scan, for example, books with character. This can be a book I have fond memories of or a well-worn book that belonged to a library. Some books shouldn’t be cut and scanned. But I have no problem with scanning a run-of-the-mill copy, one in a hundred thousand, of an internationally successful book.
I’ve scanned (or cut/killed/gutted, you name it) eleven books in the test run yesterday evening. Acrobat did the OCR while I was sleeping. Was it fast, you betcha! 🙂
Note: While this might sound like an advertisement for a specific scanner, it is not. It works with any scanner that is fast enough for you to work with (this varies depending on individual preferences). Also note that I take no responsibility, legal or otherwise, for what happens if you use the information and recommendations in this posting. And in the spirit of full disclosure, I paid the full price for my scanner, and I didn’t get any incentives to post this here (just the expectation of the warm and fuzzy feeling of sharing something that works for me and might work for others ;-)).
Would you be willing to share your settings for Acrobat? I too am embarking on this road (using the Mac citation program Sente to rename files and organize my digital library), but I can’t get my pdfs down that small after ocr.
Sure, but I’m not sure where I have deviated from the default settings … hmmm, the process is that I a) scan the files as mentioned in the posting (b&w in 600 dpi, there is no compression setting in ScanSnap Manager for this; and color in 300 dpi with medium compression). Then I b) combine the files with Acrobat Pro 9 using File > Combine > Merge Files into … with the Default File Size (icon in the middle), which should leave the PDFs as they are (it does, see below). Then I use c) Acrobat’s OCR (Multiple files) with PDF Output Style Searchable Image and Downsample Lowest (600 dpi) and check Fast Web View (but not PDF Optimizer); this greatly reduces the file size. As a last step I use d) Acrobat’s Document > Reduce File Size, which further reduces the file size (I didn’t use it until your comment; it takes ages but has a nice effect on the file size, and the page turning is much snappier).
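For anyone who wants to script step b) instead of clicking through Acrobat, here is a minimal sketch. It is not what I used: it swaps in the third-party Python package pypdf, and the file names (front.pdf, block_01.pdf, block_02.pdf, …, back.pdf) are purely hypothetical. It only puts the scans into reading order and concatenates them; OCR and file size reduction are not covered.

```python
from pathlib import Path


def merge_order(scan_dir: Path) -> list[Path]:
    """Return scan files in reading order: front cover, book block parts, back cover.

    Assumes hypothetical file names: front.pdf, block_01.pdf, block_02.pdf, ..., back.pdf.
    """
    blocks = sorted(scan_dir.glob("block_*.pdf"))  # zero-padded numbers sort correctly
    order = [scan_dir / "front.pdf", *blocks, scan_dir / "back.pdf"]
    # Keep only files that actually exist, so a missing cover does not break the merge.
    return [p for p in order if p.exists()]


def merge_scans(scan_dir: Path, output: Path) -> None:
    """Concatenate the scans into one PDF (needs the third-party pypdf package)."""
    from pypdf import PdfWriter  # imported lazily so merge_order works without pypdf

    writer = PdfWriter()
    for pdf in merge_order(scan_dir):
        writer.append(str(pdf))  # appends all pages of this file
    writer.write(str(output))
    writer.close()
```

Usage would be something like `merge_scans(Path("scans/mybook"), Path("mybook.pdf"))`, with both paths being placeholders.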
The file sizes for another book (290 pages, 15.103×23.069 cm, color front and back cover, pages b/w) are:
a) 8 files with 46.82 MB total
b) 1 file with 46.7 MB
c) 1 file with 13.2 MB (where would I find out what Acrobat did during the OCR to reduce the file size?)
d) 1 file with 3.4 MB (hmm, and the turning of the pages is extremely snappy with this one, the unreduced file size version is slower)
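To make the reduction easier to compare, here is a small sketch that expresses each step as a share of the raw scans and as an approximate size per page. The numbers are the ones listed above; dividing by 290 pages ignores the two cover scans and uses decimal kilobytes, so the per-page figure is only an approximation.

```python
# File sizes in MB after each step, taken from the list above.
sizes_mb = {
    "a) raw scans (8 files)": 46.82,
    "b) merged": 46.7,
    "c) after OCR": 13.2,
    "d) after Reduce File Size": 3.4,
}
PAGES = 290  # page count of the book; the covers are not counted


def percent_of_original(size_mb, original_mb):
    """Return size_mb as a percentage of original_mb."""
    return 100.0 * size_mb / original_mb


original = sizes_mb["a) raw scans (8 files)"]
for step, size in sizes_mb.items():
    share = percent_of_original(size, original)
    per_page_kb = size * 1000 / PAGES  # decimal kilobytes per page
    print(f"{step}: {size} MB, {share:.1f}% of the raw scans, about {per_page_kb:.0f} KB per page")
```

In other words, the OCR step leaves roughly 28% of the raw scan size, and Reduce File Size brings it down to roughly 7%.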
Regarding the quality of the scans, the reduced file size version has a higher contrast, but it is still at least as readable as the unreduced version.
For comparison (screen readability; I have no idea what the printed version looks like), the version from the first scan is on the left and the reduced file size version is on the right; ignore the shades of gray in between:
Great. I’ll give those a try. I too don’t care what a printed version looks like, but as computers will grow ever faster in the future, I was worried about compromising too much on image quality for the sake of file size today.
Good point. I will keep the original scans (as PDF, uncombined, although the color pages were scanned only with medium compression) somewhere as an archive in case I ever need the “high quality” scans again. I don’t think I will (it’s text, I just want to read it) but I might (e.g., if I need a good quality graphic from one book). Meanwhile I’ll work with the quick-to-open-and-turn-pages reduced file size versions.