Most of you probably don’t know what a raw scan looks like after it has been run through OCR (Optical Character Recognition) software. Let me just list a few things that a proofreader has to look for:
1) Dropped punctuation: I don’t know why, but it’s everywhere in a scanned book…or actually nowhere.
2) Misplaced punctuation: commas instead of periods, semi-colons instead of commas, apostrophes instead of open or closed quotes, and odd little marks such as bullets.
3) Wonky words: it’s amazing what OCR “sees” when it converts your “picture” (aka PDF) into editable text, but often it’s far, far from what you actually wrote!
4) Headers and footers: these show up as paragraphs within your document, because scanning to text only removes the header and footer areas themselves but not the text that was in them. So just delete those paragraphs.
5) Lots of misplaced paragraph returns, lots!
6) Lots of tabs, which are a no-no in e-books, but you can take them out globally with the Find/Replace function in Word (search for ^t).
7) Then of course you have to tweak the front matter, such as the copyright data, and insert a Table of Contents (TOC). You really don’t need one for a novel, but readers have gotten comfortable using those “Content” buttons on their e-Readers, so keep them happy and use one. In Smashwords it has to be in the front of your e-book, but for the epub and prc (Kindle) I move it to the back of the book and it’s fine there (see my post on Forcing your TOC).
8) Lost italics! During the scan/OCR process, your italics have run away from home. You must spend hours finding and marking the italics in your original manuscript. Then as you proof you must re-insert those italics. If you don’t have your original book any more, you’re screwed! That can happen, for instance, if you sent the book off to a company that just scans and OCRs but doesn’t turn the file into an e-book, so they throw away the loose-leaf novel after scanning…why should they keep it, they’re done with their work. Hopefully you have another copy! Anyway, it takes hours to find and mark italics and then more hours during your proof to re-insert them. You don’t have to do that with a strict proofread…well, sometimes you don’t. Think “nuclear purge”…but I’ll go into that horrific phrase in another blog.
9) And finally, graphics often creep in and are bypassed by OCR even though you tell it to convert to TEXT only. You can remove these globally with the Find/Special/Graphics on your Home bar in Word. Just be careful that none of your text is in Text Boxes (which are considered graphics) or those will be erased and you’ll have to physically type the text back into the doc.
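For anyone who prefers a script to Word’s Find/Replace, here is a rough Python sketch of three of the cleanups from the list above: taking out tabs, deleting repeated header/footer paragraphs, and rejoining paragraphs that were split by a stray return. This is not how I work in Word; it’s only an illustration. The function names, the “repeats at least 3 times and is under 60 characters” rule for spotting headers, and the “next piece starts with a lowercase letter” rule for spotting a bad paragraph break are all my own assumptions. The rejoining rule will also misfire wherever the OCR dropped a period, so you still have to proof by eye afterward.

```python
import re
from collections import Counter


def remove_tabs(text):
    """Replace each run of tab characters with a single space."""
    return re.sub(r"\t+", " ", text)


def drop_running_headers(paragraphs, min_repeats=3, max_len=60):
    """Drop short paragraphs that repeat many times.

    Assumption: a header or footer is a short line that shows up
    at least `min_repeats` times. Both thresholds are guesses.
    """
    counts = Counter(p.strip() for p in paragraphs)
    return [
        p for p in paragraphs
        if not (counts[p.strip()] >= min_repeats and len(p.strip()) < max_len)
    ]


def join_broken_paragraphs(paragraphs):
    """Merge a paragraph into the previous one when the previous one
    has no closing punctuation and this one starts in lowercase."""
    closing = re.compile(r"[.!?…:\"”’')]$")
    merged = []
    for p in paragraphs:
        p = p.strip()
        if not p:
            continue  # skip empty paragraphs
        if merged and not closing.search(merged[-1]) and p[0].islower():
            merged[-1] = merged[-1] + " " + p
        else:
            merged.append(p)
    return merged


def clean_ocr_text(text):
    """Run the three cleanups in order; one paragraph per line."""
    paragraphs = remove_tabs(text).split("\n")
    paragraphs = drop_running_headers(paragraphs)
    return "\n".join(join_broken_paragraphs(paragraphs))


if __name__ == "__main__":
    raw = (
        "MY NOVEL\n\tIt was a dark and\nstormy night.\n"
        "MY NOVEL\nThe rain fell.\nMY NOVEL\nShe waited"
    )
    print(clean_ocr_text(raw))
```

Note that nothing here touches italics, wonky words or dropped punctuation. Those still need a human with the original book in hand.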
I’m writing this blog because I don’t believe my clients realize that proofing their book from a scan is different from proofreading a manuscript file. When you proofread a book (in either print or digital format), you assume there will be misspelled words and punctuation errors, but you don’t expect wonky words, completely dropped punctuation or misplaced headers and footers (you probably want those in a book going to print). We OCR proofreaders have to contend with so much more, which greatly increases our proofing time. I can read most novels under 300 pages in 3 or 4 hours, but it will take me 8 or more to proof an OCR document. In fact, I scanned and proofed a Sci-Fi novel a few months ago, and it took 27 hours for a 380-page book! It had small original fonts, poor paper (which adds many layers of error into a scanned document)…and lots of italicized words. Thankfully, the novel was very well written and had a very exciting storyline. Got a headache but didn’t want to put a gun to my head! A successful proof!
So I guess the moral of this story is: Authors, if you want a good proof in an e-book (from a scanned document), you should expect to pay for it. It’s hard, mind-numbing work. Be grateful to your proofreader. Can you imagine what an e-book would look like from a raw scan? Yikes! Actually, I’ve seen some e-books which weren’t proofed after OCR. I’ve got one right now on my iPad which is full of my notes (don’t you just love the “note” and “highlight” feature!) about errors in proofing. I’m tempted to write to the author, but I think instead I’ll just use it as an object lesson for myself. I would be mortified if one of my formatted books ended up on Kindle or anywhere else looking like that…so I’m going to make it my mission to never have one that does! The goal is PERFECT…well, it’s a goal.