How I used Speech To Text to write Four books during Nanowrimo

Amar Vyas
8 min read · Mar 4, 2023

During Nanowrimo 2022, I wrote over 150,000 words in a single month, which resulted in four books. Learn how I achieved this goal using speech to text.

Image of a man wearing a suit and recording in front of a microphone

Introduction

Globally, the writer and author community celebrates November as National Novel Writing Month (Nanowrimo). For me personally, the 2022 edition was highly successful in terms of goals achieved and productivity. Below I describe my exploits using speech to text, or STT. My aim is not to brag, but to report with delight what I achieved in my November 2022 experiment.

What I achieved in November 2022

In a span of a month, the first drafts of the following books were recorded and transcribed:

- One fiction book on climate change: 84,000 words.

- Three non-fiction books: 53,500 words across the three books.

- Podcast script writing, 14,000 words.*

- Total writing: 151,500 words.

*These episodes will be repurposed into a non-fiction book, which is planned for the second half of 2023.

As an added bonus, well over 10,000 words each were in Marathi and Hindi. I use the word “write” because, through speech to text, all of my voice recordings were transcribed and I have the typed text ready for editing.

Image of a man standing in front of a microphone
Give me a mic, and hit record!

Goal setting for Nanowrimo 2022

I planned to participate in Nanowrimo 2022 with a single aim: To complete my long delayed fiction book project.

The topic is complex: the book is about climate change and is set in mid-century India. To get the flow of thoughts going, I voice recorded my novels in a recording studio for most of November. It may sound ironic considering I have a home recording studio, which I use for my podcasts as well as video calls. But I opted for a nearby studio in order to develop the discipline of going every day to record for at least one hour.

Original Objective: using Speech To Text on a trial basis

My aim was to use speech to text (STT) tools to convert my recorded voice to written format. At the beginning, I was not sure how this would work for a book. I have used STT extensively for recording blog posts for my blog.

A book is a different ballgame. Having worked in podcasting for seven years, I had become used to preparing a script or speaking notes. For interview-based shows, I had a questionnaire ready if I was the host; if I was the guest, my responses were based on my experience and areas of expertise, so not much preparation was needed up front.

For the books, I did not use any notes. My aim was to let the creativity flow as I spoke. If that idea worked, the following questions would arise: Which tools to use? Will the quality of the transcription be adequate? Who will do the editing?

Those questions required answers, but instead of finding answers at the beginning of the project, I decided to dive into the deep end of the pool — figuratively speaking.

Once the text was transcribed with Otter or HappyScribe speech to text software, I planned to do basic editing and formatting, and then send the script off to a professional editor.

You may wonder how practical this approach is in terms of the time and effort spent creating content. I have provided a breakdown of my efforts and their outcome towards the end. The results speak for themselves. I plan to repeat this experiment in a few months, possibly during Camp Nanowrimo in April. Time and other commitments will tell.

Victim of own success: from one book to four

I am excited to report that my experiment became a victim of its own success. Not only was I able to record four books, I was also able to narrate and transcribe a lot of content in Hindi and Marathi.

My workflow for writing Novels using Speech To Text during Nanowrimo

In this segment, you will find my step by step workflow for Nanowrimo 2022.

1. Recording time

For the entire month of November, I recorded for about 1.5 hours from Monday to Friday. Weekends were a mandatory break to recharge creative batteries. Overall, I recorded for 24 days, totaling nearly 40 hours of studio time.

In lieu of a studio, you can set up your recording space in a quiet area. But for better focus, discipline and habit forming, you should ideally rent a studio. That may not be possible for everyone because of time or cost. Overall, 1 hour of studio time can generate between 5,000 and 6,000 words. An uninterrupted 90-minute session allowed me to record over 10,000 words on very productive days.

My narration was without notes since I knew the subject matter rather well, particularly for non-fiction books. I would not recommend this approach.

In order to record so much content, you need around 3 to 4 hours of preparation time per day. For most users, let me reiterate that it is imperative to have speaking notes, or ideally an outline, handy for better results. The time taken to prepare the notes or the script outline will depend on the subject, your comfort with the ‘build as you speak’ approach, and many other factors. Discussing them is beyond the scope of this post. Maybe in a future post, I will discuss the specific challenges I encountered during my recording sessions.

Home recording studio and speech to text software can achieve similar results. Blog post by Amar Vyas www.amarvyas.in
A home recording studio can do the job perfectly well. No studio required.

2. Speech to Text conversion

Once the audio was recorded, the studio folks sent it to me over a cloud drive. We use Koofr for file transfer. Other fine tools such as WeTransfer, Dropbox, and Google Drive can also be used, depending on your choice and requirements. For each narration, I used Audacity or ocenaudio to clean up background noise. Using premium, professional applications such as Auphonic or Adobe Audition seemed like overkill to me. Note that I am not implying that Audacity is not professional grade.

Even though the audio was recorded in a studio, a basic clean-up can help the quality of the transcription. I reduced the audio speed to about 92 to 95% of the original. That is, each narration was about 5 to 8% slower than my recording speed. A slower speaking speed may not sound too pleasant to the human ear, but STT (speech to text) software loves it. The significance of this step will become clear shortly.

3. Export the edited audio

I would export the audio at a minimum of 192 kilobits per second (kbps) as .mp3, or as .ogg at similar quality. I used otter.ai or HappyScribe for transcribing.
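The clean-up and slow-down described here were done in Audacity or ocenaudio. For readers who prefer scripting, the slow-down and export could in principle be reproduced with the ffmpeg command-line tool instead. The sketch below is only an illustration under stated assumptions: ffmpeg is a substitute for the tools named in this post, the file names are hypothetical, and the speed of 0.93 is simply a value picked from the 92 to 95% range. It uses ffmpeg's `atempo` audio filter, which changes tempo while keeping pitch, so the result may not sound identical to a speed change made in an audio editor. The code only builds the command; it does not run it.

```python
import shlex


def build_slowdown_command(src, dst, speed=0.93, bitrate_kbps=192):
    """Build an ffmpeg argument list that slows `src` and exports it to `dst`.

    Assumptions (not from the original workflow): ffmpeg is installed,
    and `dst` ends in .mp3 so ffmpeg picks an MP3 encoder.
    """
    if not 0.5 <= speed <= 1.0:
        raise ValueError("this sketch expects a speed between 0.5 and 1.0")
    return [
        "ffmpeg",
        "-i", src,                        # input recording
        "-filter:a", f"atempo={speed}",   # slow the tempo, keep the pitch
        "-b:a", f"{bitrate_kbps}k",       # audio bitrate, e.g. 192 kbps
        dst,
    ]


def slowed_duration_minutes(original_minutes, speed):
    """Length of the audio after slowing it to `speed` times the original."""
    return original_minutes / speed


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    cmd = build_slowdown_command("session_01.wav", "session_01_slow.mp3")
    print(shlex.join(cmd))  # copy-paste the printed line into a terminal
    print(round(slowed_duration_minutes(90, 0.93), 1), "minutes after slowing")
```

Building the command as a list keeps file names with spaces safe, and the duration helper shows the trade-off: a slower file is a longer file, which matters if a transcription service bills by the minute.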

Update, March ’23: Several new tools, including OpenAI’s Whisper, are now available for transcribing audio. You can try a few, and opt for the one that gives the best accuracy.
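For anyone who wants to try Whisper locally, a minimal sketch could look like the following. It assumes the open-source `openai-whisper` Python package is installed (it also needs ffmpeg available on the system); the file name, the `base` model size and the helper names are illustrative choices, not part of the workflow described in this post. Accuracy will vary with the model size, the language and the recording.

```python
from pathlib import Path


def transcript_path(audio_path):
    """Put the transcript next to the audio file, with a .txt extension."""
    return Path(audio_path).with_suffix(".txt")


def transcribe(audio_path, model_name="base", language=None):
    """Transcribe one audio file with Whisper and return the text."""
    # Imported here so the path helper works even without the package.
    import whisper  # assumption: installed via `pip install openai-whisper`

    model = whisper.load_model(model_name)
    # language=None lets Whisper detect the language; "hi" or "mr" can be
    # passed explicitly for Hindi or Marathi recordings.
    result = model.transcribe(str(audio_path), language=language)
    return result["text"]


def transcribe_to_file(audio_path, **kwargs):
    """Transcribe `audio_path` and save the text beside it."""
    text = transcribe(audio_path, **kwargs)
    out = transcript_path(audio_path)
    out.write_text(text, encoding="utf-8")
    return out


if __name__ == "__main__":
    # Hypothetical file name, for illustration only.
    print(transcribe_to_file("chapter_01.mp3", language="hi"))
```

Larger models are generally slower but more accurate, so it may be worth comparing a short sample across model sizes before committing a full month of recordings.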

Slowing down the speed of narration helps improve the quality of the transcribing. I was able to get close to 90 % accuracy in the transcription.

Close to 90% may sound like a good ratio, considering my accent and talking speed (I talk rather fast); most apps claim 80 to 85% accuracy once ambient noise in India, accent and other factors are taken into account. But when one is dealing with a large quantity of text, 150,000 words in my case, 90% accuracy would still mean nearly 15,000 words were transcribed incorrectly.
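The arithmetic behind that figure is simple enough to check for any manuscript length. The sketch below treats accuracy as the share of words transcribed correctly, which is a simplification (transcription tools may measure accuracy differently), and the function name is illustrative.

```python
def expected_errors(total_words, accuracy):
    """Rough count of wrongly transcribed words at a given accuracy (0 to 1)."""
    return round(total_words * (1 - accuracy))


if __name__ == "__main__":
    total_words = 150_000  # approximate word count from this post
    for accuracy in (0.80, 0.85, 0.90):
        errors = expected_errors(total_words, accuracy)
        print(f"{accuracy:.0%} accuracy: about {errors:,} words to fix")
```

Even a few points of accuracy make a large difference at this scale, which is why the clean-up and slow-down steps before transcription are worth the effort.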

4. Read, read, and read

Proofreading the transcribed text played an important role in ensuring continuity, logic and, most importantly, accuracy of the written work. Coming back to my previous point, the 15,000 or so inaccurate words did prove to be a headache while I was proofreading the text. In many instances I had to go back to the original audio to understand what I had said or meant. This resulted in editing a word, a sentence, and in some extreme cases, an entire paragraph.

Transcripts of audio generated from speech to text do require a lot of formatting, proofreading and editing
Proof reading, formatting and editing turned out to be the hardest part of the process

5. Time and cost tradeoff

Overall, ‘writing’ these 150,000-odd words took the following time:

- Studio time: 40 hours

- Speech to text conversion: 8 hours

- Exporting to text document, formatting, compiling: 12 hours

- Total: 60 hours

Cost:

- Studio rental: INR 50,000, or nearly 600 US dollars

- Audio transcription services: one-time fee of INR 900, or 11 US dollars

I have not included time taken to travel to the studio, or the cost of fuel. The recording studio is close to my home, fortunately, and I would drop my wife off at her office and head for recording. In other words, I did not have to make a separate trip for most of the month.

I have also not included the time spent on proofreading, basic editing and formatting, which would have been required no matter what. This activity has taken me over 100 hours, and counting.

Was the effort worth it?

My short answer is, it depends on your specific objectives. In terms of writing goals and productivity, the results still outweigh the efforts. For nearly 60 hours of my time, and around 50,000 INR of investment, I was able to write 150,000 words.

In contrast, at my typing speed, I can typically get an output of 1,800 to 2,000 words per hour. I’ve been trying to get over the habit of editing as I type, which limits my speed. As a result, the same 150,000-odd words would have taken 75 hours or more to type.
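The comparison between typing and dictation can be reproduced with a few lines of arithmetic. The sketch below uses only figures reported in this post (150,000 words, 40 + 8 + 12 hours for the dictation workflow, and 1,800 to 2,000 words per hour for typing); the function and variable names are illustrative, and proofreading time is left out on both sides.

```python
WORDS = 150_000

# Hours reported in this post for the dictation workflow.
DICTATION_HOURS = {
    "studio time": 40,
    "speech to text conversion": 8,
    "export, formatting, compiling": 12,
}


def typing_hours(words, words_per_hour):
    """Hours needed to type `words` at a steady `words_per_hour`."""
    return words / words_per_hour


def hours_saved(words, words_per_hour, dictation_hours):
    """Typing time minus the total time spent on the dictation workflow."""
    return typing_hours(words, words_per_hour) - sum(dictation_hours.values())


if __name__ == "__main__":
    print("dictation total:", sum(DICTATION_HOURS.values()), "hours")
    for speed in (1_800, 2_000):
        typed = typing_hours(WORDS, speed)
        saved = hours_saved(WORDS, speed, DICTATION_HOURS)
        print(f"{speed} words/hour: {typed:.1f} hours typing, {saved:.1f} saved")
```

At 2,000 words per hour the typing estimate is 75 hours, 15 more than the 60 hours of dictation work; at 1,800 words per hour the gap is larger.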

Strictly speaking, I saved a little over 15 hours using this method. I did spend a good sum of money on studio rental, but I was able to get the first drafts of four books ready, and potentially a fifth too. Nearly 12,000 words each were in Marathi and Hindi. Those can be used for a short series of Hindi and Marathi essays for my blog.

In that sense, the whole project was clearly a win. Using a good speech to text converter will be well worth your time, particularly at the editing stage. I hope this sows the seed of a thought or two on how to achieve your writing goals in the coming year.

Update, 4 March ’23: The first of the four non-fiction books, titled An Eye for AI: How to Create Images Using Artificial Intelligence Technology, is ready for publication. You can learn more about it at www.artwithai.in

image of a man wearing headphones and holding a microphone for recording. Blog post by Amar Vyas on Using Speech To Text to Write four books
Check…3,2,1. and go!

Note: I had originally written this post in December 2022. Some sections have been updated in March ’23. All images used in this post have been generated using Stable Diffusion or Midjourney based AI imaging tools.


Amar Vyas

Author, Speaker. Cofounder, gaathastory podcasts and creator of Baalgatha, Devgatha and Fairytales of India Podcasts. Book "An Eye for AI" releasing soon.