Images of text are not useful when all you can do is look at them. How can you get text out of something like this if you need a chunk of it to put to use?

Wikimedia Commons image Infographic: We Live in a Nano World by InsightPublishers shared under a Creative Commons CC BY-SA License.

Go ahead, put your cursor there and try to grab a chunk.

I have a current project where a lot of content is given to me as images, like charts that would be better rendered, and more accessible, as web content. And there is another reason: to provide good alt text for images, I’d like to get more in there than just alt="We live in a Nano world poster". That’s pretty useless.

This is a pretty slick trick for putting The Google to work for you. I take a copy of that image above and upload it to Google Drive.

Once there, I use the contextual menu (right click on Windows, control click on Mac), or click the file and use the 3 dot menu in the top right. Either way you are looking for Open With, then select Google Docs.

That’s right.

Open an image in Google Docs.

You get a doc like this: the image inserted at the top, and below it… a damn fine bit of OCR to get the text out as … text.

Hello Text, Freed from the Image

Yes, the formatting is Crazy Clown Pants. But it’s text I can highlight and copy.

And this brings me to my next time-saving trick. I am brilliantly gobsmacked (as they say in the UK) (who says it?) (someone must!).

You’ve copied a glob of text from a formatted web page or a document. Paste it into most text editors and it carries the formatting/style of the original. What if you just want clean, plain text?

Command-Shift-V is my new friend. It does a paste and leaves the style behind.
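If you ever need the same "leave the style behind" step inside a script rather than at the keyboard, here is a minimal sketch using only the Python standard library. It is not what the keyboard shortcut does internally; it just illustrates the idea of keeping the text and dropping the markup. The `TextOnly` class and `strip_formatting` function are names made up for this example.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect only the text nodes, ignoring tags and their attributes."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for the text between tags
        self.parts.append(data)


def strip_formatting(html_fragment: str) -> str:
    """Return the text of an HTML fragment with markup and styling removed."""
    parser = TextOnly()
    parser.feed(html_fragment)
    parser.close()
    # Collapse the runs of whitespace left behind by the removed markup
    return " ".join("".join(parser.parts).split())


styled = (
    '<p style="color:green"><b>WE LIVE</b> IN A '
    '<span style="background:yellow">NANO WORLD</span></p>'
)
print(strip_formatting(styled))
```

This is a sketch for simple fragments; a full web page with embedded scripts or style sheets would need more care.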

Here’s the mishmosh of yellow, green, text created above, pasted cleanly in one motion:

WE LIVE IN A NANO WORLD
Nanomaterials- defined as having one dimension below 100 nanometres – are all around us.
POSSIBLE DANGERS
THE NEED FOR TESTING
A NANOMETRE HAS THE SAME RELATION TO A METRE AS THE DIAMETER OF A HAZELNUT HAS TO THE DIAMETER OF THE EARTH
THE HUMAN BODY USES NATURAL NANOMATERIALS, SUCH AS PROTEINS, TO CONTROL MANY SYSTEMS AND PROCESSES.

This is not the best example because of the all caps, but I can quickly run it through BBEdit or something like the Text Mechanic Letter Case Converter and it’s less… shouty.

We Live In A Nano World
Nanomaterials- Defined As Having One Dimension Below 100 Nanometres – Are All Around Us.
Possible Dangers
The Need For Testing
A Nanometre Has The Same Relation To A Metre As The Diameter Of A Hazelnut Has To The Diameter Of The Earth
The Human Body Uses Natural Nanomaterials, Such As Proteins, To Control Many Systems And Processes.
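The same de-shouting can be scripted. Below is a rough sketch in Python, standard library only, of two conversions: the capitalize-every-word style shown above, and a sentence-case variant. The function names `title_case` and `sentence_case` are my own for illustration, and this makes no claim about how BBEdit or Text Mechanic do it.

```python
import re


def title_case(text: str) -> str:
    """Capitalize the first letter of every word and lowercase the rest."""
    # The word pattern allows apostrophes so "DON'T" becomes "Don't", not "Don'T"
    return re.sub(
        r"[A-Za-z][A-Za-z'’]*",
        lambda m: m.group(0).capitalize(),
        text,
    )


def sentence_case(text: str) -> str:
    """Lowercase everything, then capitalize the start of each line and sentence."""
    lowered = text.lower()
    return re.sub(
        r"(^|[.!?]\s+)([a-z])",
        lambda m: m.group(1) + m.group(2).upper(),
        lowered,
        flags=re.MULTILINE,
    )


shouty = "WE LIVE IN A NANO WORLD\nPOSSIBLE DANGERS\nTHE NEED FOR TESTING"
print(title_case(shouty))
print(sentence_case(shouty))
```

Sentence case lowercases proper nouns too, so names like "Earth" would still need a manual touch-up.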

But letting Google extract the words out of a picture of words is soooooo useful.
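For the alt text use I mentioned earlier, the extracted text has to be flattened to one line and have its quotes escaped before it can sit inside an attribute. A small Python sketch of that step follows; the `img_tag` function and the file name `nano-world.png` are hypothetical, made up for this example.

```python
from html import escape


def img_tag(src: str, extracted_text: str) -> str:
    """Build an <img> tag whose alt attribute carries the extracted text."""
    # Collapse line breaks and repeated spaces into single spaces
    alt = " ".join(extracted_text.split())
    # quote=True also escapes double quotes so the attribute stays intact
    return f'<img src="{escape(src, quote=True)}" alt="{escape(alt, quote=True)}">'


ocr_text = 'We live in a "nano" world\nPossible dangers\nThe need for testing'
print(img_tag("nano-world.png", ocr_text))
```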

UPDATES

Thanks @noiseprofessor for letting me know Google Docs has the capitalization tools buried in the menus.

He’s right!


Featured Image: I bet Carl Wishes Google had an extractor!

carl’s extraction flickr photo by meigooni shared under a Creative Commons (BY) license

An early 90s builder of the web and blogging Alan Levine barks at CogDogBlog.com on web storytelling (#ds106 #4life), photography, bending WordPress, and serendipity in the infinite internet river. He thinks it's weird to write about himself in the third person.

Comments

  1. Hi Alan,
    Great trick.
    If you have an Office 365 sub (I get one through being a teacher in Scotland), the Office app on iPhone does amazing OCR. It can even take a photo of a table and turn it into an Excel doc pretty flawlessly. Gobsmacked here ;-)

  2. Man, Crazy Clown Pants. That takes me back to my old days of being the only person that knew how to support/run Caere (at that time) OmniPage OCR. That thing boy, I tell ya’ had more knobs, bells and whistles than any version of Adobe Photoshop. But the biggest tripping point, stumbling block was “recognize as”. TruPage Formatting was the default. And darned if that thing would not “try” as hard as it could to pull out a formatted text that “looked” like the original. Problem was it used every trick like tables, non-breaking spaces, hidden characters, ANYTHING it could do to make the OCR’d version look like the original paper document. Everyone would complain after we would run a 200-page doc through our scanner, “This is too goofy and hard for me to edit, don’t you have a way to pull the formatting out?” And so that would lead to the fully “unformatted” recognition that would look, smell and taste like the “Crazy Clown Pants” you see today from Google Docs. And we used to have to pay $600 every year to keep that stupid license up-to-date. I’m glad I don’t have to support that anymore.
