Raw Scans from a Formatter’s View

Probably most of you don’t know what a raw scan looks like after it has been run through OCR (Optical Character Recognition) software. Let me just list a few things that a proofreader has to look for:

1) Dropped punctuation. I don’t know why, but it’s everywhere in a scanned book…or actually nowhere.

2) Misplaced punctuation: commas instead of periods, semi-colons instead of commas, apostrophes instead of open or closed quotes, and odd little marks such as bullets.

3) Wonky words. It’s amazing what OCR “sees” when it converts your “picture” (aka PDF) into editable text, but often it’s far, far from what you actually wrote!

4) Headers and footers. These actually show up as paragraphs within your document, because scanning to text only strips out the header and footer areas themselves but keeps the text that was inside them. So just delete those paragraphs.

5) Lots of misplaced paragraph returns, lots!

6) Lots of Tabs, which are a no-no in e-books, but you can take them out globally with the Find/Replace function in Word.

7) Then of course you have to tweak the front matter, such as copyright data, and insert a Table of Contents (TOC). You really don’t need one for a novel, but readers have gotten comfortable using those “Content” buttons on their e-Readers, so keep them happy and use one. In Smashwords it has to be in the front of your e-book, but for the epub and prc (Kindle) I move it to the back of the book and it’s fine there (see my post on Forcing your TOC).

8) Lost italics! During the scan/OCR process, your italics have run away from home. You must spend hours finding and marking the italics in your original manuscript, and then more hours during your proof re-inserting them. If you don’t have your original book (for instance, you sent it off to a company that just scans and OCRs but doesn’t turn the file into an e-book, so they threw away the loose-leaf novel after scanning…why should they keep it, they’re done with their work), you’re screwed! Hopefully you have another copy! You don’t have to do any of that with a strict proofread…well sometimes you don’t. Think “nuclear purge”…but I’ll go into that horrific phrase in another blog.

9) And finally, graphics often creep in and slip past the OCR even though you tell it to convert to TEXT only. You can remove these globally with the Find/Special/Graphics on your Home bar in Word. Just be careful that none of your text is in Text Boxes (which are considered graphics) or it will be erased and you’ll have to physically type the text back into the doc.
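For anyone comfortable with a little scripting, a few of the purely mechanical items in the list above (tabs, stray header and footer paragraphs, misplaced paragraph returns) can be roughed out before the real proofread starts. What follows is only a sketch in Python, not the Word workflow described in this post. It assumes the OCR output has been saved as plain text with one paragraph per line, and the function name clean_ocr_text and its two thresholds are made up for illustration.

```python
import re
from collections import Counter


def clean_ocr_text(raw, min_repeats=3, max_header_len=60):
    """Rough mechanical cleanup of OCR text (hypothetical helper).

    Assumes one paragraph per line. The thresholds are guesses and
    would need tuning for each book.
    """
    # Tabs: replace with spaces, then collapse runs of spaces.
    lines = [re.sub(r" {2,}", " ", line.replace("\t", " ")).strip()
             for line in raw.splitlines()]
    paragraphs = [line for line in lines if line]

    # Headers and footers: drop bare page numbers and short paragraphs
    # that repeat word for word many times (likely a running head).
    counts = Counter(paragraphs)
    kept = []
    for p in paragraphs:
        is_page_number = re.fullmatch(r"\d+", p) is not None
        is_running_head = len(p) <= max_header_len and counts[p] >= min_repeats
        if not (is_page_number or is_running_head):
            kept.append(p)

    # Misplaced paragraph returns: a paragraph that starts with a
    # lowercase letter is probably the tail of the one before it.
    merged = []
    for p in kept:
        if merged and p[0].islower():
            merged[-1] = merged[-1] + " " + p
        else:
            merged.append(p)

    return "\n\n".join(merged)


if __name__ == "__main__":
    sample = "MY NOVEL\n\tShe opened the door and\nwalked into the hall.\n12\nMY NOVEL\nIt was dark.\n13\nMY NOVEL\n"
    print(clean_ocr_text(sample))
```

The heuristics are deliberately crude. A short line of dialogue that repeats often enough would be mistaken for a running head, so the result still has to be checked against the book. Nothing here touches dropped punctuation, wonky words or lost italics; those still need a human proofreader.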

I’m writing this blog because I don’t believe my clients realize that proofing their book from a scan is different than proofreading a manuscript file. When you proofread a book (in either print or digital format) you assume there will be misspelled words and punctuation errors, but you don’t expect wonky words, completely dropped punctuation or misplaced headers and footers (you probably want those in a book going to print). We OCR proofreaders have to contend with so much more, which greatly increases our proofing time. I can read most novels under 300 pages in 3 or 4 hours, but it will take me 8 or more to proof an OCR document. In fact, I scanned and proofed a Sci-Fi novel a few months ago, and it took 27 hours for a 380-page book! It had small original fonts, poor paper (which adds many layers of error into a scanned document) and lots of italicized words. Thankfully, the novel was very well written and had a very exciting storyline. Got a headache but didn’t want to put a gun to my head! A successful proof!

So I guess the moral of this story is: Authors, if you want a good proof in an e-book (from a scanned document), you should expect to pay for it.  It’s hard, mind-numbing work.  Be grateful to your proofreader.  Can you imagine what an e-book would look like from a raw scan? Yikes! Although I’ve seen some e-books which weren’t proofed after OCR. I’ve got one right now on my iPad which is full of my notes (don’t you just love the “note” and “highlight” feature!) about errors in proofing. I’m tempted to write the Author, but I think instead I’ll just use it as an object lesson for myself.  I would be mortified if one of my formatted books ended up on Kindle or anywhere else looking like that…so I’m going to make it my mission to never have one that does!  The goal is PERFECT…well, it’s a goal.


About athirstyblog

If you're a published author and are sitting on a basement full of backlisted books, then you've found the right blog. Although I formerly filled these pages with book reviews, they will now be filled with tips on eBook formatting, talk about the current technology of eBooks, and other stuff that interests me and hopefully interests you. I'm currently an eBook formatter, formerly a bookseller, archaeologist, illustrator and lover of all things historical and scientific. And I'm now a permanent citizen of DownEast Maine with my own beach and 175 year old house and everything! Come along for the Journey!

1 Response to Raw Scans from a Formatter’s View

  1. Tess mallory says:

    We appreciate you Pam!!! You’re the best! ; )

