Gary Sieling

Getting started with the Google Cloud Speech API

Google has a “speech to text” API. At the moment, they are advertising a $300 credit for new accounts, so I thought this might be a good fit for an app I’m working on to search/discover standalone lectures. In this article I’ll talk about how you go about setting up a proof-of-concept. My thoughts/opinions on the experience are at the end.

The Google Speech API is part of Google’s larger cloud platform, so if you’re doing this for the first time, you’ll need to follow a series of steps. While this may look like a lot of work, it’s still an order of magnitude easier than setting up the open source projects used for audio transcription (Kaldi, Sphinx).

1. Create an account
2. Enable Google Speech API
3. Create a service account
4. Install GCloud SDK
5. Activate GCloud SDK
6. Create a bucket
7. Install SoX [1]
8. Convert MP3s to RAW file format using Sox
9. Upload converted files (command line – gsutil has an rsync command)
10. Download the credentials (a JSON key file) for the service account
11. Set an environment variable to point to this file
12. Use Google’s Python demo example to transcribe your file [2]

Some of these steps are well-documented in Google’s docs, so I’m going to cover the areas that tripped me up. It appears that the Python examples are updated more regularly, so I recommend those. Whatever you choose, Google’s support does monitor GitHub tickets, which is very helpful.

A note about SoX: it is used for converting between audio formats. Unlike some of the competing products, Google requires you to convert your files into one of a small number of supported formats, which unfortunately offloads a lot of work onto you as a consumer of the API. When I did this, I missed a step in the instructions; my files played fine for me, but the API failed without anything being logged.

Sox can detect clipping in audio files, and will advise you when you can potentially fix the issue by reducing the audio volume. For my lecture search engine, this is a great finding, because I can use the presence of clipping to affect the ranking of files.
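If you want a clipping signal you can store alongside each file, here is a minimal sketch in Python. It assumes the audio has already been converted to headerless, signed 16-bit little-endian PCM, and it simply treats any sample sitting at full scale as clipped. That is a rough heuristic and not necessarily how SoX detects clipping; the function name is hypothetical.

```python
import array
import sys

FULL_SCALE_MAX = 32767   # largest signed 16-bit sample value
FULL_SCALE_MIN = -32768  # smallest signed 16-bit sample value


def clipping_ratio(raw_bytes):
    """Return the fraction of 16-bit samples that sit at full scale."""
    samples = array.array("h")
    # Drop a trailing odd byte, if any, so frombytes() sees whole samples.
    usable = len(raw_bytes) - (len(raw_bytes) % 2)
    samples.frombytes(raw_bytes[:usable])
    if sys.byteorder == "big":
        samples.byteswap()  # the data is assumed to be little-endian
    if not samples:
        return 0.0
    clipped = sum(
        1 for s in samples if s >= FULL_SCALE_MAX or s <= FULL_SCALE_MIN
    )
    return clipped / len(samples)
```

A ratio like this could feed into ranking directly: a file where a noticeable share of samples is pinned at full scale is probably unpleasant to listen to.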

To convert a mass of files, you’ll need to write a small script. Make sure you have a lot of disk space, because these files get big:

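rem Convert every WAV in wav\ to 16 kHz, 16-bit, mono raw audio.
rem The output folder matches the one uploaded with gsutil rsync.
for %%f in (wav\*.wav) do (
  sox -v 0.98 wav\%%~nf.wav --rate 16k --bits 16 --channels 1 d:\Data\raw\%%~nf.raw
)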
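The same loop can be sketched in Python, which is handy if you aren’t on Windows. This assumes sox is on your PATH; the directory and function names are placeholders, and the flags are the ones from the batch script above. The size helper shows why disk space matters: at 16 kHz, 16 bits, and one channel, raw audio takes 32,000 bytes per second, so an hour-long lecture comes to roughly 115 MB.

```python
import subprocess
from pathlib import Path


def raw_size_bytes(duration_s, rate_hz=16000, bits=16, channels=1):
    """Estimate the size of headerless PCM audio of the given length."""
    return int(duration_s * rate_hz * (bits // 8) * channels)


def sox_command(src, dest_dir):
    """Build the sox invocation for one file (same flags as the batch script)."""
    src = Path(src)
    dest = Path(dest_dir) / (src.stem + ".raw")
    return [
        "sox", "-v", "0.98", str(src),
        "--rate", "16k", "--bits", "16", "--channels", "1",
        str(dest),
    ]


def convert_all(src_dir, dest_dir):
    """Convert every .wav in src_dir, stopping at the first sox failure."""
    Path(dest_dir).mkdir(parents=True, exist_ok=True)
    for src in sorted(Path(src_dir).glob("*.wav")):
        subprocess.run(sox_command(src, dest_dir), check=True)


if __name__ == "__main__":
    convert_all("wav", "raw")  # placeholder directory names
```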

For a single file, I’d upload it to a bucket through the UI, but for a bunch, you can use rsync:

gsutil rsync -d d:\Data\raw gs://gsieling-flac

Set your credentials:

export GOOGLE_APPLICATION_CREDENTIALS=/d/Data/search-ff2e0539de94.json

If you get errors about missing libraries, you may need to install some missing dependencies. It’s very important to use the versions specified in the requirements.txt of the sample project – some of these have newer versions with large breaking changes.

pip install gcloud==0.18.2
pip install grpcio==1.0.0
pip install PyAudio==0.2.9
pip install grpc-google-cloud-speech-v1beta1==1.0.1
pip install six==1.10.0
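Since version drift is the usual culprit, a quick check of what is actually installed can save some head-scratching. The sketch below compares the pins listed above against the current environment. It relies on `importlib.metadata`, which needs Python 3.8 or newer, so treat it as an assumption about your setup; the helper names are made up for this example.

```python
from importlib import metadata

# The pins from the list above.
PINS = {
    "gcloud": "0.18.2",
    "grpcio": "1.0.0",
    "PyAudio": "0.2.9",
    "grpc-google-cloud-speech-v1beta1": "1.0.1",
    "six": "1.10.0",
}


def installed_version(name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def mismatches(pins, lookup=installed_version):
    """Map each package that is off its pin to (wanted, found)."""
    problems = {}
    for name, wanted in pins.items():
        found = lookup(name)
        if found != wanted:
            problems[name] = (wanted, found)
    return problems


if __name__ == "__main__":
    for name, (wanted, found) in sorted(mismatches(PINS).items()):
        print("{}: wanted {}, found {}".format(name, wanted, found))
```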

Then transcribe a file [3]:

python transcribe_async.py --encoding LINEAR16 gs://gsieling-flac/14823.raw
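For more than a handful of files, you can wrap that command in a loop. This is only a sketch: it assumes the sample script sits in the current directory and prints its results to standard output, and the function names are hypothetical. It runs the files one at a time, so it will be slow.

```python
import subprocess
import sys


def transcribe_command(bucket, object_name, script="transcribe_async.py"):
    """Build the command line shown above for one object in a bucket."""
    uri = "gs://{}/{}".format(bucket, object_name)
    return [sys.executable, script, "--encoding", "LINEAR16", uri]


def transcribe_all(bucket, object_names):
    """Run the sample script once per file and collect what it prints."""
    outputs = {}
    for name in object_names:
        completed = subprocess.run(
            transcribe_command(bucket, name),
            stdout=subprocess.PIPE,
            universal_newlines=True,
            check=True,  # stop at the first failing file
        )
        outputs[name] = completed.stdout
    return outputs


if __name__ == "__main__":
    results = transcribe_all("gsieling-flac", ["14823.raw"])
    for name, text in results.items():
        print(name, len(text), "characters of output")
```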

If you get the following error, you probably skipped the environment variable:

grpc.framework.interfaces.face.face.AbortionError: 
AbortionError(code=StatusCode.PERMISSION_DENIED, 
details="Google Cloud Speech API has not been used in
 project google.com:cloudsdktool before or it is disabled. 
Enable it by visiting 
https://console.developers.google.com/apis/api/speech.googleapis.com/overview?project=google.com:cloudsdktool 
then retry.

If you enabled this API recently, wait a few minutes 
for the action to propagate to our systems and retry.")
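Because that error message doesn’t mention credentials at all, a small pre-flight check can be worth having. The sketch below only looks at the environment variable and the file it points to. It assumes the key is a standard service account JSON file with a "type" field set to "service_account", and the function name is hypothetical.

```python
import json
import os


def credentials_problem(environ=os.environ):
    """Return a description of what looks wrong, or None if all seems fine."""
    path = environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        return "GOOGLE_APPLICATION_CREDENTIALS is not set"
    if not os.path.isfile(path):
        return "credentials file not found: " + path
    try:
        with open(path, encoding="utf-8") as handle:
            key = json.load(handle)
    except ValueError:  # covers malformed JSON and bad encodings
        return "credentials file is not valid JSON: " + path
    if not isinstance(key, dict) or key.get("type") != "service_account":
        return "credentials file is not a service account key: " + path
    return None


if __name__ == "__main__":
    problem = credentials_problem()
    print(problem or "credentials look plausible")
```

This doesn’t prove the key is accepted by Google, only that the variable points at something that looks like a key.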

If you get a “resource exhausted” error, it actually indicates that you didn’t convert your files to the correct format:

Traceback (most recent call last):
  File "D:\Software\Anaconda3\lib\site-packages\grpc\beta_client_adaptations.py", line 201, in blocking_unary_unary
    credentials=credentials(protocol_options))
  File "D:\Software\Anaconda3\lib\site-packages\grpc_channel.py", line 481, in __call__
    return _end_unary_response_blocking(state, False, deadline)
  File "D:\Software\Anaconda3\lib\site-packages\grpc_channel.py", line 432, in _end_unary_response_blocking
    raise _Rendezvous(state, None, None, deadline)
grpc._channel._Rendezvous:

Once you get this to work, you’ll get some output like so:

Waiting for server processing...
Waiting for server processing...
Waiting for server processing...
Waiting for server processing...
results {
  alternatives {
    transcript: "forced migration of you issue 44 September 2013"
    confidence: 0.787492573261
  }
}
results {
  alternatives {
    transcript: "voices from inside Australia\'s detention centres is Melissa Phillips"
    confidence: 0.87813615799
  }
}
results {
  alternatives {
    transcript: "the Harvest Island Beach Australia there is little sense of individual in question"
    confidence: 0.745137870312
  }
}
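If you only have this printed output to work with (say, captured from the sample script), a small parser can pull out the transcript and confidence pairs. This is a sketch that reads the text exactly as it appears above; it is not a general parser for the response format, and working with the response object inside the script would be more robust. The names are hypothetical.

```python
import re

# Matches a transcript line followed by its confidence line.
PAIR = re.compile(
    r'transcript: "(?P<text>.*)"\s+confidence: (?P<conf>[0-9.]+)'
)


def parse_results(output):
    """Return a list of (transcript, confidence) tuples from printed output."""
    pairs = []
    for match in PAIR.finditer(output):
        # The printed form escapes apostrophes as \' so undo that.
        text = match.group("text").replace("\\'", "'")
        pairs.append((text, float(match.group("conf"))))
    return pairs


def confident_text(pairs, threshold=0.8):
    """Join the transcripts whose confidence is at or above the threshold."""
    return " ".join(text for text, conf in pairs if conf >= threshold)
```

For a search index, something like `confident_text` lets you drop the shakier segments, although the 0.8 cutoff is arbitrary.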

Google’s API seems to envision three major use cases: streaming audio with commands (e.g. from an app, where you can specify words you anticipate), short transcriptions, and long, asynchronous transcriptions. When I was corresponding with support, they referred to my 15-minute file as “long”, which is unfortunate considering most of my files run about an hour. The docs also advise against using MP3s because the compression loses information. Unfortunately, an enormous amount of audio out there exists only as MP3, so if you are in this situation, it may not be the best API for you.

Performance-wise, I found that this API takes a very long time to complete (maybe ~1/3 the length of the file), but your mileage may vary.

One thing that surprises me about this API is that while Google has “buckets” for storage, you can’t have the output of your long-running jobs stored there when they finish. Instead, the results go into some mystery location that you have to poll until the job finishes, or else they disappear (but still count against your bill).
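Since the results are only available by polling, it pays to make the polling loop hard to get wrong. Here is a generic sketch. `fetch_operation` is a hypothetical callable that you would write around whatever client you use; I’m assuming it returns a dict with a "done" flag, which is not the real client API.

```python
import time


def poll_until_done(fetch_operation, interval_s=30.0, timeout_s=3600.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Call fetch_operation() until it reports done, or raise TimeoutError."""
    deadline = clock() + timeout_s
    while True:
        operation = fetch_operation()
        if operation.get("done"):
            return operation
        if clock() >= deadline:
            raise TimeoutError("operation did not finish in time")
        sleep(interval_s)
```

Whatever comes back, write it somewhere durable straight away, since you have already paid for it.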

  1. http://sox.sourceforge.net/
  2. https://github.com/GoogleCloudPlatform/python-docs-samples/tree/master/speech
  3. https://github.com/GoogleCloudPlatform/python-docs-samples/blob/master/speech/api-client/transcribe_async.py