Building a Speech-to-Text Web Application

In the previous part, we showed you how to add a Facebook Login button and handle the basic authentication flow. Now it is time to deal with audio recording and retrieving transcriptions from AWS Transcribe.

Audio Recording

Let’s imagine that the user will be able to start recording by clicking a button, which will cause the ::start-recording event to be dispatched. Let’s create an event handler for it:

(ns aws-transcribe-clj.transcribe.events
  (:require [re-frame.core :as rf]
            [aws-transcribe-clj.config :as config]
            [aws-transcribe-clj.transcribe.fx :as fx]))
...
(rf/reg-event-fx
  ::start-recording
  [rf/trim-v]
  (fn [{:keys [db]} _]
    (let [limit config/APP-AUDIO-LIMIT]
      {:db                (-> db
                              (update :transcribe #(dissoc % :transcription))
                              (assoc-in [:transcribe :recording] {:status   :initialized
                                                                  :progress {:max     limit
                                                                            :current 0}}))
       ::fx/record-audio! [{:delay     500
                            :limit     limit
                            :id        (random-uuid)
                            :mime-type "audio/wav"
                            :tick      {:interval 1000}}
                           {:on-started [::recording-started]
                            :on-tick    [::update-recording-progress]
                            :on-stop    [::stop-recording]
                            :on-error   [::recording-error]}]})))

It updates the application state by setting the recording status and progress and by clearing any previous transcription. Additionally, it produces the ::fx/record-audio! effect and passes a vector of two maps to its handler. The first map contains the recording parameters, and the second specifies which event to dispatch at each stage of the recording.
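To make the state change concrete, here is a minimal sketch of the same :db update pulled out into a plain function and applied to a sample map. The init-recording name, the sample-db value and the 30000 limit are assumptions made for illustration only; in the application the limit comes from config/APP-AUDIO-LIMIT.

```clojure
;; Hypothetical helper mirroring the :db update in ::start-recording.
(defn init-recording
  [db limit]
  (-> db
      ;; drop the transcription left over from a previous run
      (update :transcribe dissoc :transcription)
      ;; start a fresh recording entry with zero progress
      (assoc-in [:transcribe :recording] {:status   :initialized
                                          :progress {:max     limit
                                                     :current 0}})))

;; Sample state (assumed shape) left over from an earlier session.
(def sample-db
  {:transcribe {:transcription {:status :success}
                :recording     {:status :success}}})

(init-recording sample-db 30000)
;; => {:transcribe {:recording {:status   :initialized
;;                              :progress {:max 30000, :current 0}}}}
```

Because assoc-in replaces the whole :recording entry, nothing from the previous recording (file, S3 object, errors) survives a restart.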

This is what the effect handler is going to look like:

(ns aws-transcribe-clj.transcribe.fx
  (:require [cljs.core.async :as a]
            [re-frame.core :as rf]
            [aws-transcribe-clj.transcribe.fx.audio :as audio]))
...
(rf/reg-fx
  ::record-audio!
  (fn [[params {:keys [on-stop on-error] :as events}]]
    (let [output (a/chan)]
      (a/go
        (audio/record! (assoc params :output output) events)
        (let [result (a/<! output)
              event  (if (instance? js/Error result) on-error on-stop)]
          (rf/dispatch (conj event result)))))))

Here we create a channel that will receive the recording result and add it to the parameters map passed to the record! function. When the result is ready, we dispatch the appropriate event, depending on whether the result is an error.
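The choice of event can be looked at in isolation. The sketch below is not the code from the repository: result->event and error-result? are hypothetical names, plain keywords replace the namespaced event ids, and the error check uses ex-data so that it runs outside the browser, whereas the handler above checks (instance? js/Error result).

```clojure
;; Stand-in error check so the sketch runs outside the browser;
;; the effect handler itself uses (instance? js/Error result).
(defn error-result? [x]
  (some? (ex-data x)))

;; Picks the event vector for a recording result and appends the result to it.
(defn result->event
  [{:keys [on-stop on-error]} result]
  (conj (if (error-result? result) on-error on-stop) result))

(def events {:on-stop  [:stop-recording]
             :on-error [:recording-error]})

(result->event events {:file "recording.wav"})
;; => [:stop-recording {:file "recording.wav"}]

(first (result->event events (ex-info "Microphone access denied" {:code :denied})))
;; => :recording-error
```

Since the result is appended with conj, it arrives as the last element of the event vector, which is why the handlers can destructure it as [recording] or [error] after trim-v.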

The aws-transcribe-clj.transcribe.fx.audio namespace contains the functions responsible for starting and stopping the recording; you can find them here. They use the recorder-js library which, apart from taking care of the recording itself, converts the recording to WAV, a format supported by AWS Transcribe.

Now let’s move on to defining handlers for events dispatched during recording:

(ns aws-transcribe-clj.transcribe.events
  ...)

(rf/reg-event-db
  ::recording-started
  [rf/trim-v]
  (fn [db _]
    (assoc-in db [:transcribe :recording :status] :pending)))

(rf/reg-event-db
  ::update-recording-progress
  [rf/trim-v]
  (fn [db [tick-delay]]
    (update-in db [:transcribe :recording :progress :current] (partial + tick-delay))))

(rf/reg-event-fx
  ::stop-recording
  [rf/trim-v]
  (fn [{:keys [db]} [recording]]
    {:db       (-> db
                  (assoc-in [:transcribe :recording :status] :success)
                  (update-in [:transcribe :recording :progress] (fn [{:keys [max] :as progress}]
                                                                  (assoc progress :current max))))
     :dispatch [::store-recording-locally recording]}))

(rf/reg-event-db
  ::recording-error
  [rf/trim-v]
  (fn [db [error]]
    (assoc-in db [:transcribe :recording] {:status :failure
                                          :error  (err-map error)})))

(rf/reg-event-fx
  ::store-recording-locally
  [rf/trim-v]
  (fn [{:keys [db]} [recording]]
    {:db       (update-in db [:transcribe :recording] #(merge % recording))
     :dispatch [::store-recording-on-s3 recording]}))

::recording-started, ::update-recording-progress and ::recording-error are responsible only for updating the application state: the recording status and its current progress. ::stop-recording, in turn, besides updating the status, dispatches the ::store-recording-locally event. We could send the audio file to S3 right away, but we want the user to be able to listen to what they recorded. After the recording is saved locally, we send it to S3 by dispatching the ::store-recording-on-s3 event to make it available to Transcribe.
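As a rough illustration of how the progress value evolves, the sketch below replays three ticks and then the stop step against a sample state. The tick and finish functions are hypothetical plain-function versions of the handlers, and the 1000 ms tick value and 5000 ms limit are assumed numbers.

```clojure
(def progress-path [:transcribe :recording :progress])

;; Same update as ::update-recording-progress, as a plain function.
(defn tick [db tick-delay]
  (update-in db (conj progress-path :current) + tick-delay))

;; Same progress update as ::stop-recording: jump straight to the limit.
(defn finish [db]
  (update-in db progress-path (fn [{:keys [max] :as progress}]
                                (assoc progress :current max))))

;; Assumed state right after ::recording-started.
(def started
  {:transcribe {:recording {:status   :pending
                            :progress {:max 5000 :current 0}}}})

(def after-three-ticks (reduce tick started [1000 1000 1000]))

(get-in after-three-ticks (conj progress-path :current))
;; => 3000

(get-in (finish after-three-ticks) (conj progress-path :current))
;; => 5000
```

Setting :current to :max on stop means a progress bar ends up full even when the user stops the recording before the limit is reached.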

Uploading the file

In the same namespace, let’s create a handler for the event mentioned above. It is going to update the status of the S3 object related to the recording and perform the upload. In order to access the bucket, we will need the Facebook access token, which we obtained earlier after successful authentication. Additionally, we will use the Facebook user id to generate the name of the object in the bucket. Since the upload is a side effect, our event handler will produce a re-frame effect. Let’s call it ::fx/upload!. Here’s what it will look like:

(rf/reg-event-fx
  ::store-recording-on-s3
  [rf/trim-v]
  (fn [{:keys [db]} [recording]]
    (let [user-id      (get-in db [:facebook :auth :userID])
          access-token (get-in db [:facebook :auth :accessToken])]
      {:db          (assoc-in db [:transcribe :recording :s3-object :status] :pending)
       ::fx/upload! [{:recording recording
                      :user-id   user-id}
                     {:access-token access-token
                      :role-arn     config/AWS-ROLE-ARN
                      :bucket-name  config/AWS-RECORDINGS-BUCKET
                      :provider-id  config/AWS-WEB-IDENTITY-PROVIDER-ID}
                     {:on-success [::store-recording-on-s3-success]
                      :on-error   [::store-recording-on-s3-error]}]})))

The upload effect handler will take a vector of three maps as an argument:

the context, which will contain the recording and Facebook user id,
params used by S3 service,
the map of events which are going to be dispatched in case of a successful operation or an error.

Let’s implement the effect handler now:

(ns aws-transcribe-clj.transcribe.fx
  (:require ...
            [oops.core :refer [oget]] ; assumption: oget comes from the cljs-oops library
            [aws-transcribe-clj.aws.sdk :as aws]))
...
(rf/reg-fx
  ::upload!
  (fn [[{:keys [recording user-id]} {:keys [bucket-name] :as opts} {:keys [on-success on-error]}]]
    (let [file      (:file recording)
          obj-key   (str "facebook-" user-id "/" (oget file :name))
          bucket    (aws/bucket-service opts)]
      (aws/put-object
        bucket
        #js{:Key         obj-key
            :ContentType (oget file :type)
            :Body        file}
        (fn [err _]
          (if (some? err)
            (rf/dispatch (conj on-error err))
            (rf/dispatch (conj on-success {:recording recording
                                          :s3-object {:object-key  obj-key
                                                      :bucket-name bucket-name}}))))))))

It uses the PutObject endpoint of the S3 service. When the operation finishes, the callback function is called and dispatches the appropriate event, depending on whether an error was encountered. In case of failure, the event is ::store-recording-on-s3-error, and its handler looks like this:

(rf/reg-event-db
  ::store-recording-on-s3-error
  [rf/trim-v]
  (fn [db [error]]
    (assoc-in db [:transcribe :recording :s3-object] {:status :failure
                                                      :error  (err-map error)})))

Its task is to assign the appropriate status and error to the S3 object in the recording.
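The err-map helper is used by all the error handlers but is not shown in this article. The sketch below is only a guess at what such a helper could do: it turns an error object into a plain map before it is stored. The err-map body, the s3-upload-failed name and the sample error are all assumptions.

```clojure
;; Hypothetical stand-in for err-map, which is not shown in the article.
(defn err-map [error]
  {:message (ex-message error)
   :data    (ex-data error)})

;; Plain-function version of the ::store-recording-on-s3-error update.
(defn s3-upload-failed [db error]
  (assoc-in db [:transcribe :recording :s3-object]
            {:status :failure
             :error  (err-map error)}))

(s3-upload-failed {} (ex-info "Access Denied" {:code 403}))
;; => {:transcribe {:recording {:s3-object {:status :failure
;;                                          :error  {:message "Access Denied"
;;                                                   :data    {:code 403}}}}}}
```

Keeping a plain map rather than the error object in the application state makes the error easy to display and compare in views.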

When the object is added to the bucket successfully, the ::store-recording-on-s3-success event will be dispatched. At this point we will save the S3 object info and trigger the transcribing process by dispatching the ::start-transcription event:

(rf/reg-event-fx
  ::store-recording-on-s3-success
  [rf/trim-v]
  (fn [{:keys [db]} [{:keys [recording s3-object]}]]
    (let [s3-object' (assoc s3-object :status :success)
          recording' (assoc recording :s3-object s3-object')]
      {:db       (update-in db [:transcribe :recording] #(merge % recording'))
       :dispatch [::start-transcription recording']})))

Transcription

The recording transcription process is started by the handler of the ::start-transcription event. The handler expects the recording data in the event vector and uses it to produce the ::fx/transcribe! effect. As with the upload, we are going to use Facebook credentials to access AWS Transcribe, and the effect handler is going to receive a vector of three maps as an argument:

(rf/reg-event-fx
  ::start-transcription
  [rf/trim-v]
  (fn [{:keys [db]} [recording]]
    (let [user-id      (get-in db [:facebook :auth :userID])
          access-token (get-in db [:facebook :auth :accessToken])]
      {:db              (assoc-in db [:transcribe :transcription :status] :pending)
       ::fx/transcribe! [{:recording recording
                          :user-id   user-id}
                         {:access-token       access-token
                          :role-arn           config/AWS-ROLE-ARN
                          :provider-id        config/AWS-WEB-IDENTITY-PROVIDER-ID
                          :transcripts-bucket config/AWS-TRANSCRIPTS-BUCKET}
                         {:on-started [::transcription-started]
                          :on-done    [::transcription-job-success]
                          :on-error   [::transcription-job-error]}]})))

Now let’s define the effect handler:

(ns aws-transcribe-clj.transcribe.fx
  (:require ...
            [cljs.core.async :as a]
            [re-frame.core :as rf]
            [aws-transcribe-clj.aws.sdk :as aws]
            [aws-transcribe-clj.common :as c]
            [aws-transcribe-clj.transcribe.fx.transcription-job :as tjob]))
...
(rf/reg-fx
  ::transcribe!
  (fn [[{:keys [recording user-id] :as context}
        {:keys [transcripts-bucket] :as opts}
        {:keys [on-error] :as events}]]
    (let [{:keys [object-key bucket-name]} (:s3-object recording)
          transcribe (aws/transcribe-service opts)
          watcher    (a/chan)
          opts'      (assoc opts :aws-transcribe transcribe :watcher watcher)]
      (a/go
        (aws/start-transcription-job
          transcribe
          #js{:TranscriptionJobName (str user-id "-" (:id recording))
              :LanguageCode         "en-US"
              :Media                #js{:MediaFileUri (str "s3://" bucket-name "/" object-key)}
              :OutputBucketName     transcripts-bucket}
          (tjob/start-transcription-job-callback context opts' events))
        (let [result (a/<! watcher)]
          (if (c/js-error? result)
            (rf/dispatch (conj on-error result))
            (tjob/track-transcription-job
              (assoc context :transcription-job result)
              opts'
              events)))))))

Its task is to start an asynchronous transcription job in AWS Transcribe and to monitor it. When the job is started, the :on-started event passed in the events map is dispatched and the transcription job info is put onto the watcher channel. Next, if the job has started successfully, we track its execution using the GetTranscriptionJob endpoint until it is finished.

All the helper functions we used can be found in the aws-transcribe-clj.transcribe.fx.transcription-job namespace.
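The tracking step boils down to asking for the job repeatedly until it reaches a terminal status. The following is a simplified, synchronous sketch of that idea and not the code from the repository: track-job, fake-fetch and the "GAVE_UP" marker are invented for illustration, and fetch-job stands in for a GetTranscriptionJob call. The real tracking is asynchronous and would wait between requests. QUEUED, IN_PROGRESS, COMPLETED and FAILED are the job statuses reported by Transcribe.

```clojure
(def terminal-statuses #{"COMPLETED" "FAILED"})

;; fetch-job is a hypothetical zero-argument function standing in for a
;; GetTranscriptionJob call; it returns a map with a :status string.
(defn track-job [fetch-job max-attempts]
  (loop [attempt 1]
    (let [job (fetch-job)]
      (cond
        ;; the job finished, successfully or not
        (terminal-statuses (:status job)) job
        ;; stop polling; "GAVE_UP" is a marker invented for this sketch
        (>= attempt max-attempts)         (assoc job :status "GAVE_UP")
        :else                             (recur (inc attempt))))))

;; Test double that replays a fixed sequence of statuses.
(defn fake-fetch [statuses]
  (let [remaining (atom statuses)]
    (fn []
      (let [status (first @remaining)]
        (swap! remaining rest)
        {:status status}))))

(track-job (fake-fetch ["QUEUED" "IN_PROGRESS" "COMPLETED"]) 10)
;; => {:status "COMPLETED"}

(track-job (fake-fetch ["QUEUED" "IN_PROGRESS" "COMPLETED"]) 2)
;; => {:status "GAVE_UP"}
```

Capping the number of attempts is a design choice of the sketch; it keeps a job that never finishes from being polled forever.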

Once the transcription starts, we will want to update the status in our application to match the one on Transcribe. To do this, let’s define a handler for the event ::transcription-started:

(rf/reg-event-fx
  ::transcription-started
  [rf/trim-v]
  (fn [{:keys [db]} _]
    {:db (assoc-in db [:transcribe :transcription :job :status] :pending)}))

Let’s do the same for any error we encounter:

(rf/reg-event-fx
  ::transcription-job-error
  [rf/trim-v]
  (fn [{:keys [db]} [error]]
    {:db (-> db
            (assoc-in [:transcribe :transcription] {:status :failure
                                                    :error  (err-map error)})
            (assoc-in [:transcribe :transcription :job] {:status :failure
                                                          :error  (err-map error)}))}))

When everything goes well, we want to save the currently processed job, update the status and send an event that will initiate the process of retrieving results so that we can display them.

(rf/reg-event-fx
  ::transcription-job-success
  [rf/trim-v]
  (fn [{:keys [db]} [job]]
    (let [duration (- (:CompletionTime job) (:StartTime job))
          job'     (assoc job :status :success :duration duration)]
      {:db       (assoc-in db [:transcribe :transcription :job] job')
       :dispatch [::get-transcript-content job']})))

Fetching results

There is one more thing to do: getting the results from the output bucket. After the transcription is completed, they are stored there in JSON format. Therefore, we need to get the appropriate S3 object, parse its content and store it locally. At the end of the previous section, we told the ::transcription-job-success event handler to dispatch the ::get-transcript-content event. Its handler looks like this:

(rf/reg-event-fx
  ::get-transcript-content
  [rf/trim-v]
  (fn [{:keys [db]} [transcription-job]]
    (let [access-token (get-in db [:facebook :auth :accessToken])]
      {::fx/fetch-transcription! [{:transcription-job transcription-job}
                                  {:access-token access-token
                                   :role-arn     config/AWS-ROLE-ARN
                                   :provider-id  config/AWS-WEB-IDENTITY-PROVIDER-ID
                                   :bucket-name  config/AWS-TRANSCRIPTS-BUCKET}
                                  {:on-success [::store-transcript-content]
                                   :on-error   [::get-transcript-content-error]}]})))

We again pass a vector of three maps to the handler of the produced effect: the context, the parameters and the events. This is what the handler for the ::fx/fetch-transcription! effect looks like:

(rf/reg-fx
  ::fetch-transcription!
  (fn [[{:keys [transcription-job]} {:keys [bucket-name] :as opts} {:keys [on-success on-error]}]]
    (let [bucket      (aws/bucket-service opts)
          op-params   #js{:Bucket bucket-name
                          :Key    (str (:TranscriptionJobName transcription-job) ".json")}]
      (aws/get-object bucket
                      op-params
                      (fn [err resp]
                        (if (some? err)
                          (rf/dispatch (conj on-error err))
                          (rf/dispatch (conj on-success resp))))))))

As you may have noticed, it looks very similar to the ::upload! handler. The main difference is that here we are using the GetObject endpoint to retrieve the object’s content.
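The three effect handlers are tied together by naming conventions: the recording key, the job name and the transcript key are all derived from the same ids. The sketch below collects those string constructions in one place. The function names and sample values (user id, file name, bucket name) are hypothetical; only the string formats are taken from the handlers shown in this article.

```clojure
;; Key of the uploaded recording, as built in the ::upload! handler.
(defn recording-key [user-id file-name]
  (str "facebook-" user-id "/" file-name))

;; Job name, as built in the ::transcribe! handler.
(defn job-name [user-id recording-id]
  (str user-id "-" recording-id))

;; Media URI passed to Transcribe, as built in the ::transcribe! handler.
(defn media-uri [bucket-name object-key]
  (str "s3://" bucket-name "/" object-key))

;; Key of the transcript, as built in the ::fetch-transcription! handler.
(defn transcript-key [job-name]
  (str job-name ".json"))

(recording-key "1234567890" "rec-01.wav")
;; => "facebook-1234567890/rec-01.wav"

(media-uri "my-recordings" (recording-key "1234567890" "rec-01.wav"))
;; => "s3://my-recordings/facebook-1234567890/rec-01.wav"

(transcript-key (job-name "1234567890" "rec-01"))
;; => "1234567890-rec-01.json"
```

The transcript lookup works because the job is started with only an output bucket, in which case Transcribe names the result after the job.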

Now let’s handle an error by defining an event handler for it:

(rf/reg-event-db
  ::get-transcript-content-error
  [rf/trim-v]
  (fn [db [error]]
    (assoc-in db [:transcribe :transcription] {:status :failure
                                              :error  (err-map error)})))

and store the parsed results when it succeeds:

(rf/reg-event-db
  ::store-transcript-content
  [rf/trim-v]
  (fn [db [content]]
    (let [body (as-> content $
                    (js->clj $ :keywordize-keys true)
                    (:Body $)
                    (ocall! $ :toString "utf8")
                    (ocall! js/JSON :parse $)
                    (js->clj $ :keywordize-keys true))]
      (update-in db [:transcribe :transcription] #(merge % {:results (-> body :results)
                                                            :status  :success})))))
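To show what ends up in the application state, here is a small sketch that starts from an already parsed body. The sample-body map is an assumed, heavily trimmed example of a Transcribe result (a real one also contains per-word items), and store-results and transcript-text are hypothetical helper names.

```clojure
;; Assumed shape of the parsed transcript body, trimmed to the fields used here.
(def sample-body
  {:jobName "1234567890-rec-01"
   :status  "COMPLETED"
   :results {:transcripts [{:transcript "Hello world."}]
             :items       []}})

;; Plain-function version of the :db update in ::store-transcript-content.
(defn store-results [db body]
  (update-in db [:transcribe :transcription]
             merge {:results (:results body)
                    :status  :success}))

;; Hypothetical accessor that a subscription or view could use.
(defn transcript-text [db]
  (get-in db [:transcribe :transcription :results :transcripts 0 :transcript]))

(transcript-text (store-results {} sample-body))
;; => "Hello world."
```

Because the results are merged into the existing :transcription entry, the :job data saved by ::transcription-job-success is kept alongside them.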

Presentation

The next step will be to build the user interface. We have to implement necessary subscriptions and the component display logic. I assume that at this stage you have the knowledge to do it yourself with the help of this article and the repository or other materials available on the web. Therefore, I invite you to watch a short video showing how I designed the appearance of the application and how it actually works.

I hope you find this article helpful! If you have any suggestions, write me an email. Thanks for your time and good luck!