>> Hi.
This is Steve Michelotti from
the Azure Government Engineering Team.
Today, I'm joined by Vishwas Lele,
the CTO of Applied Information Sciences.
And we're going to be talking about
Custom Speech Service for Government. Welcome, Vishwas.
>> Thank you Steve. Thank you for having me.
>> Great. So, why don't we start out with
a high-level overview of what we're
talking about with Custom Speech?
And give us some background to start out.
>> Yeah. So Steve,
Custom Speech Service is one
of the many cognitive APIs that are available.
And I thought it would be interesting to talk
about Custom Speech Service
in the context of government scenarios,
because we are seeing a lot of interest in
using a service like
this to solve a number of federal,
state, and local government problems.
>> Absolutely.
>> The focus will be to
talk about Custom Speech Service.
But before I jump in, Steve,
if you think it is valuable,
I'd like to talk a little bit about how we got here.
Because these days we pick up our phones
and expect them to recognize our voice
and understand our commands.
So, some of your listeners may look at this and say,
"Okay, so what is Custom Speech Service
doing, and how is it working?"
So I thought a historical perspective might be helpful.
>> Otherwise it's all magic
and we don't know how we got here.
>> Yes. I have a couple of slides
before we get into the Custom Speech Service itself.
Speech research has a long history.
It really started in 1971,
with the formation of the Speech Recognition Study Group.
Then in 1976, DARPA and Carnegie Mellon
started doing more detailed work on speech.
Libraries like Dragon,
which we know of today,
and Sphinx, a very popular library,
came out of that effort that started in 1976.
And then, Microsoft has
been associated with speech research
for a long time also.
Back in '95, and you're too young to remember this.
>> I know I'm not. I wish I was.
>> There was a speech API
that shipped with Windows 95,
which allowed you to write programs with speech.
>> And then, here's the funny part.
In 2001 at CES,
Bill Gates demonstrated a prototype
of something called the MiPad,
which was a Windows CE-based prototype
that allowed you to interact with the device
not only with touch and a stylus, but also with voice.
>> Wow. Windows CE.
That's something I haven't heard for a while.
But Microsoft's been on the cusp of this for a long time.
>> A long time.
And just to continue the flavor
for your audience here.
The way speech recognition works is,
you take a piece of audio,
and you basically create a statistical model from it.
And then you try to apply some probability,
and say, "Is this what the speaker meant to say?"
Fundamentally, that is what it's all about.
Before 2010, people used
approaches like hidden Markov models;
those are the algorithms people used.
But post-2010, when
neural networks became really important,
people have been using a combination of
those algorithms and neural networks to push things forward.
That's pretty much the state of the art.
I really want to call out something that
happened in 2017, which is really important.
The speech industry has been
using a benchmark called the Switchboard test,
which is 20 years' worth of recordings
of strangers discussing politics and sports.
And the human error rate in
recognizing those conversations is about 5.1 percent.
And many companies have
been trying to break that barrier.
And last year, I believe in August,
the Microsoft team achieved an error rate of 5.1 percent,
which is comparable to the human error rate.
>> So that was the first time that ever happened.
>> That's the first time that ever happened.
>> Okay.
>> And I was reading some articles about it.
That study, or experiment,
was done on the basis of GPU-based processing,
deep neural networks, and also CNTK.
>> Okay. So CNTK is the Cognitive Toolkit that we have,
Microsoft's sort of analogue to TensorFlow.
>> That's correct.
>> GPUs we have on Azure Government.
All of these are tools that have homes in Azure Government.
>> That's exactly right.
So, if your customers are looking
for a deeper dive on deep neural network libraries,
they can go to Azure Government, get GPUs,
and run CNTK, or other libraries for that matter.
>> Yeah. Absolutely.
>> So, that's a brief history of how we got here.
So, what has changed?
I gave you a chronology of events.
But, what has changed?
If you are wondering, "Why have we
gotten so much better at this?"
What has changed is an
abundance of computing power, of course, with the cloud.
We talked about that a moment ago.
And also, there's nothing better
than training algorithms on more and more and more data.
>> Right.
>> So as people have been using these services,
more data has become available.
So, a larger training data set is available.
And then there are some very interesting algorithms,
which we won't get into the details of,
but just to give your viewers an understanding:
if you speak a word X,
what is the probability that you're
going to follow it up with a particular sequence of words?
That's an interesting problem.
You would think that it's
a pattern-matching problem.
It is not, because the number of
possible patterns is astronomical.
So, there are some very interesting algorithms
that have been developed.
And I say all this because, when
we get to the Custom Speech API part,
Steve, it will just be a REST API. And people say, "Hey,
it's really easy to get started," and it is.
But understand that you are leveraging many,
many years' worth of research
when you're using that capability.
>> Okay. All right. Great.
>> So with that said,
let's transition over and
describe what Custom Speech Service is.
So Custom Speech Service,
in the simplest possible terms,
is a speech-to-text transcription service.
But it is more than a transcription service,
because it allows you to tailor it to your scenarios.
What do I mean by that? Well, you
might be having a conversation with someone,
and you may be using very technical,
or very domain-specific, words.
>> Or it could even be slang, right?
>> It could be slang.
>> Or differences in
regional dialects, that kind of thing?
>> Absolutely. Differences in regional dialects,
you could be using highly technical terms,
and I have a demo of that.
Or you could be operating in
an environment that has
a lot of ambient noise, for example.
So, for example, I was working on a prototype for
the Department of Transportation
before the Custom Speech Service came along,
and the scenario is interesting.
They have these workers who inspect
tracks, and if they see
a security violation they notify the authorities about it.
And because they are out there on the tracks,
the department doesn't want them
looking at their screens, because of the safety issue.
>> Right.
>> They want them interacting with
their applications through speech.
But the problem is
that speech recognition can be harder
because of all of the ambient noise.
So you can take
commands that have been spoken in those environments to
train a service like this, so that
your ability to detect those commands is far, far higher.
So that's another example of how you can use
the Custom Speech Service and
highly customize it for your domain.
>> Okay, so with the Custom Speech Service
we're talking about a couple of things.
One is differences in vocabulary,
whether it's regional dialects or
highly technical terms.
The other is environmental factors:
ambient noise or background noise,
which you just mentioned.
>> Yes.
>> Okay. So, different aspects to it. Cool.
>> And then we talked about technical terms,
which are not part of the standard language models.
These algorithms have already been
trained on a generic language model.
>> Yeah.
>> So they understand that already.
We are just building on top of that by teaching
these models additional domain specific terms.
>> Okay. Great.
>> So that's what Custom Speech Service is.
>> And you have something called the
pronunciation file. What's that?
>> So, the pronunciation file. I talked
about the language model, where you can tell
the service which words are likely to occur,
and what sequences of words people are using.
The acoustic model is short fragments of audio files;
you provide transcriptions with them
so that you can train the service. But then you can
also help the service with the pronunciations of certain words.
>> Does this include
what the text is that will be output?
>> Yes.
>> Okay. So, to use a Star Wars example, C-3PO and R2-D2:
I can tell it to use the letter C rather
than the word "see." Something like that.
>> That is correct. That's correct.
So, that's Custom Speech Service.
>> Okay.
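(For reference: a pronunciation file is just a plain-text list of pairs, the display form and the spoken form. A minimal sketch of the C-3PO example, assuming the tab-separated two-column layout described in the service documentation; the exact formatting should be checked against the current docs:)

```
C-3PO	see three pea oh
R2-D2	are two dee two
```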
>> Let me just quickly describe
the workflow before we look at the demo.
Understanding this flow will
help you understand the demo.
>> All right.
>> The first thing you do is find samples of
audio files and the associated transcriptions,
and you upload them.
Once you've done that,
then you can customize the model
further with pronunciation files and things like that.
>> Right. >> Create acoustic models.
Then you'll train your service
based on these artifacts that we mentioned.
>> And when I train my service, I don't have to
have a PhD in data science?
I don't have to know CNTK?
>> You won't have to learn any CNTK.
In fact, we will see it is a matter of
following three or four steps:
uploading an acoustic model,
uploading your language model,
and then, once you've uploaded them,
the service runs and trains on that.
>> Okay.
>> Once the training has been completed, you create
an endpoint that's specific to your service.
And then once you have an endpoint,
you can start interacting with it
just like any other REST service.
>> And in this workflow I'm seeing,
am I correct in saying that I can use
an acoustic model, or
a pronunciation file, or both at the same time?
I'm not required to use everything.
If I just care about
the acoustic model, that's what I use.
>> That is correct.
>> Or I can use everything together. It's my choice.
>> That is correct. So I mean
you need the acoustic model for sure.
>> Right.
>> You need to have the audio files and
transcription texts, but anything else,
like the pronunciation file, is optional.
>> Okay. >> That's true.
>> Great.
>> That's true.
>> All right.
So you've got me interested here, but I
think we need to get you to prove it.
>> So, let me show you an example.
And one thing that I wanted to call out is,
how do you get
those sample files and transcription texts?
Because the service expects
you to have this training data in a certain format.
>> Right. >> It has to be a WAV file.
It has to have a sampling rate of a certain type.
>> Right.
>> So how do you get that data in the right format?
>> Right.
>> And I'm going to show you some open source code
that's available that can make the job easier.
>> Right. I mean, it could be an MP3,
and we need it to be a WAV, or whatever you're talking about here.
>> Right, with the right sampling rate,
stored in the right format, and things like that.
As we know, machine learning algorithms are
great because you have
a lot of this knowledge built into them.
But at the same time,
getting the right training data is important as well.
>> Yeah. >> So we will focus on that.
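(To make that preprocessing concrete: a minimal Python sketch that shells out to ffmpeg to produce mono, 16 kHz, 16-bit PCM WAV files from a source video or MP3. The sample rate and codec requirements here are assumptions; verify them against the current Custom Speech Service documentation.)

```python
import subprocess

def to_wav(src_path: str, dest_path: str) -> None:
    """Convert an audio or video file to mono 16 kHz PCM WAV using ffmpeg."""
    subprocess.run(
        [
            "ffmpeg",
            "-i", src_path,          # input video or MP3
            "-vn",                   # drop any video stream
            "-ac", "1",              # mono
            "-ar", "16000",          # 16 kHz sampling rate
            "-acodec", "pcm_s16le",  # 16-bit PCM encoding
            dest_path,
        ],
        check=True,
    )

# Hypothetical usage:
# to_wav("lecture.mp4", "sample001.wav")
```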
The scenario that I have for you, Steve,
is a domain related to Parkinson's disease.
>> Okay. >> So, I collected
some video files and ran them through
this code, which created these WAV samples for me,
along with the associated transcription text.
>> Okay. >> And then, what I did was,
I didn't sit there and transcribe these videos myself.
I actually used another service
to get the transcriptions.
>> Okay.
>> And then I fed those into this open source library,
which converted them into samples,
and then we'll upload those samples.
And then, because each of
these steps can take two, three, or four minutes,
I'm not going to train
these models in real time for you.
I just did that before we started this presentation.
What I'm going to do, however,
is take a trained model,
take the REST endpoint,
go to our favorite tool, Postman,
and then try to call
this trained endpoint with an audio sample.
>> All right. Sounds good.
>> That will be our demo.
>> All right.
>> The first thing I'm going to do is take you to
this portal here, which
is the Custom Speech Service portal.
It still says cris.ai.
>> Cris.ai.
Interesting term you use. What does that mean?
>> So, you have to
understand that this service
was called something else before.
There's a branding change happening.
It's called Custom Speech Service now.
But all of the branding changes
have not been made across all of the-
>> So when you see CRIS, just think
Custom Speech Service, if it hasn't changed already.
>> Just do the translation for now.
So right now, here I have
created this Parkinson's language model.
And I have a few options in this portal.
It's really simple.
There are only three or four screens here.
I'll start out by showing you the acoustic model.
>> So the acoustic model
corresponds to the background noise,
the ambient noise you were referring to earlier.
>> That is correct. The acoustic model is a bunch of
audio files, plus the transcribed text
related to those audio files.
>> Okay.
>> In fact, I can very quickly show
you what that looks like here.
Let me just open this here.
>> So it shows the WAV file.
>> It shows the WAV file and it
shows you the transcription text.
Okay. So that's the acoustic model.
Okay. Let's just go back here.
So that's our acoustic model.
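(A sketch of what an acoustic dataset upload looks like: a set of WAV files plus one plain-text transcription file, with one tab-separated line per audio file. The file names and text here are hypothetical, and the exact layout should be checked against the service docs:)

```
sample001.wav	the corticospinal pathway carries motor signals
sample002.wav	a resting tremor is one of the common symptoms
```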
Then I also have some language models.
If you look at the language models,
it'll refresh in a second,
and while it refreshes, let me just
show you the language model right here.
You can see, these are some
of the terms that are really complicated to use.
>> Highly technical terms.
>> Highly technical terms.
For example, "corticospinal"
is one of the technical terms.
>> Right.
>> If you spoke this against a generic speech model,
this word would not be recognized.
>> Right. >> But we are feeding
these language models to the service here.
>> In some cases, it might be slang,
in this case, it's technical vocabulary.
>> Technical vocabulary.
>> Okay, great.
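(The language model data is similarly simple: a plain-text file with one sentence or phrase per line, written the way speakers would actually say it. A hypothetical sketch for this Parkinson's domain:)

```
the corticospinal pathway
a resting tremor
bradykinesia and rigidity
deep brain stimulation
```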
>> Right. So, coming back here.
We talked about the acoustic model,
we talked about the language model,
and I don't have a pronunciation file,
which is optional here.
then I can come into the "Deployments" tab. There you go.
So, if you look at the "Deployments" here,
this the Deployment, that was created for us here.
And you have the ability,
to scale this up.
So, what does a Deployment mean?
The model has been trained for me,
and then has been deployed to a REST endpoint.
And then depending on how many people are
accessing this custom speech model, I can scale up.
>> Yeah, the elastic scaling of the Cloud of course.
>> Elastic scaling of the Cloud.
>> Okay.
>> So right now, since
we are the only ones doing the demonstration,
I have only one scale unit here.
>> Okay.
>> And if I click on
the "Details" section right here,
you can see it will give me information
about that REST endpoint.
>> For example, what the endpoint is?
>> What the endpoint is.
>> So, I know what to call. Okay.
>> So, let me just go ahead,
and here's a summary of everything we have done, Steve.
This is the language model that we specified.
This is the acoustic model.
And then right here
is the REST endpoint.
>> It's not just a REST endpoint,
I see we also have WebSockets.
>> That is true.
>> And other options there.
>> That is true. So, we
have WebSockets, and that's important.
Because this is a transcription service,
but you might also want to have a conversation,
a two-way conversation,
which WebSockets lend themselves to much better.
>> Right.
>> For those kinds of scenarios.
>> So, you can pick the implementation that's
most appropriate for your scenario.
>> That is exactly right.
>> Okay.
>> So, in this case, we are using Postman,
so we will just pick the REST endpoint here.
I'll go capture this,
and then let's see this in action.
>> Okay.
>> So, I'm going to go to Postman here,
and we are going to invoke this.
>> Okay, so Postman is a tool with which we can make
simple HTTP calls without having to fire up a browser,
and customize whatever calls we want.
>> That's exactly right.
>> And play around with them really easily.
>> That's exactly right. So, before
we show you this demonstration, Steve,
let me just very quickly play our test sample here.
>> ...or corticospinal pathway.
In addition, the resting tremor...
>> So, that is highly technical.
>> A highly technical audio clip here,
describing some Parkinson's disease-related terms.
>> Right.
>> So, what we will do is, we'll go back to Postman.
We will get ourselves a token,
and authenticate against this endpoint.
Then once we get that token,
we'll take this test WAV file,
and send it to the service that we just trained.
>> Okay.
>> And see if we can get back the results.
>> Sounds good. Let's do it.
>> So the first thing I'm going to
do here is go to Postman.
And as I was saying earlier, Steve,
it's a very handy tool for making REST API calls.
>> All right.
>> And in fact, what I've done is,
not only can you make these calls,
but you can also create these custom collections,
which makes it super easy.
>> Yes.
>> So, the first thing I'm going to do is
go get myself a token,
because my token may have expired.
>> So, based on
the subscription key I got from the portal,
I can go get my token,
which just lasts for a finite period of time.
>> That is correct.
>> Okay.
>> So, I went to the portal, got the subscription key,
and now I have the token.
I'm going to go capture this token.
>> Okay.
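(Outside Postman, the same token exchange is a single POST. A minimal Python sketch, assuming the standard Cognitive Services issueToken endpoint; the URL varies by region and will differ in Azure Government:)

```python
import requests

# Hypothetical values; take the real URL and key from your own subscription.
TOKEN_URL = "https://api.cognitive.microsoft.com/sts/v1.0/issueToken"
SUBSCRIPTION_KEY = "<your-subscription-key>"

def get_token() -> str:
    """Exchange the subscription key for a short-lived bearer token."""
    resp = requests.post(
        TOKEN_URL,
        headers={"Ocp-Apim-Subscription-Key": SUBSCRIPTION_KEY},
    )
    resp.raise_for_status()
    return resp.text  # the token body, typically valid for about ten minutes
```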
>> And then, let's spend a moment here.
So, this is the endpoint
that we got from the CRIS portal.
Remember, we were just looking at this.
>> You highlighted that in the portal, yeah.
>> And then, what we're going to do is,
we are making a POST call right here.
>> Yeah.
>> And in the case of the body here,
I've selected binary, which
allows me to choose a file that I want to send up.
>> Right. We are sending up a binary file to
the server that contains our audio.
>> That's exactly right.
So, let's just choose the file.
I think I played test four, if I remember correctly.
And what we're going to do is, let's just-
>> You got that authentication token in here?
>> Ah, thank you for reminding me.
>> Okay.
>> I better get that.
>> All right, great. Now, we are in business.
>> And we copy the authentication token here.
And let's just call this API here.
And if you heard the audio earlier,
we were indeed talking about the corticospinal pathway.
>> Wow, look at that. That's pretty impressive.
So, we heard the file,
and we have a flawless transcription here.
>> We have a flawless transcription here.
And that should not be very surprising, because we trained
our algorithm using these domain-specific terms.
>> So, we have
a custom vocabulary for whatever our domain might be.
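(The Postman call maps directly to code. A hedged Python sketch of the same request, reusing get_token() from the earlier sketch; the endpoint URL comes from the Deployments "Details" page, and the exact Content-Type string is an assumption to verify against current docs:)

```python
import requests

# Hypothetical endpoint; copy the real one from the portal's "Details" page.
ENDPOINT = "https://<your-custom-speech-endpoint>"

def transcribe(wav_path: str, token: str) -> dict:
    """POST a WAV file to the trained endpoint and return the JSON result."""
    with open(wav_path, "rb") as f:
        resp = requests.post(
            ENDPOINT,
            headers={
                "Authorization": f"Bearer {token}",
                "Content-Type": "audio/wav; codec=audio/pcm; samplerate=16000",
            },
            data=f,  # raw binary body, like Postman's "binary" option
        )
    resp.raise_for_status()
    return resp.json()

# Hypothetical usage:
# print(transcribe("test4.wav", get_token()))
```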
>> Right. So Steve,
before we move away from this demonstration,
I wanted to show a tool that
helped me create this training data,
because sometimes, that's really the hard part.
You have the videos, of course,
so you may have a collection of videos.
But then you have to translate
that set of videos into small audio files.
>> So, in your case, you are actually starting
with video and extracting audio.
Other people might be starting with audio.
In your case, it was video.
>> Yes.
>> Okay.
>> So, what I did was, I took these videos,
sent them to a transcription service,
so I have the video and the transcribed text.
>> Quick question. Was there any issue
where maybe the transcription service
transcribed it wrong?
>> So, that is a good point.
So, once the transcription came back,
it was manually edited to make sure.
>> Make sure it was correct.
>> Make sure it was correct.
>> So, just to save some time.
>> To save some time.
That's a very good point. That had to be done.
Once that was done,
I took those two pieces of artifacts,
and used some open source code here,
which I want to call
attention to, called Acoustic Model Machine.
That's a GitHub project
which takes these files that we talked about,
and converts them into a format
that is acceptable to the Custom Speech Service.
>> Okay. So, that specific WAV file format you mentioned,
it does that kind of preprocessing beforehand to get
that format, so the files are ready
to send to the Custom Speech Service.
>> That is right.
>> Okay, great.
>> So, this project
really allowed me to create the artifacts.
Just to show you quickly the kinds
of artifacts that it generated for me,
let me go into the learning piece.
And then, if I open this here,
this format was generated for me on demand.
>> So, not only did it
create the WAV files in the right format,
but it also gave you this text file
that you needed. Awesome.
>> And then it generated this format,
which I was then able to upload.
>> Okay, great. So, we've seen a demo of what it can do.
Can you just talk a little bit about the use cases?
>> Sure. Now that you have seen
the Custom Speech Service in action,
I want to motivate a discussion about
this service through a few use cases.
We talked about a situation
where a worker is outside with a lot of ambient noise.
Being able to train the service with
that ambient noise is one use case.
We are seeing a lot of interest
in people wanting to create bots,
whether it is a Q&A bot or some other bot.
And when you're creating these bots,
you often have to describe things using,
let's say, LUIS, the Language
Understanding Intelligent Service.
You oftentimes present that
service with a sequence of words,
and then you can go back and see how
your users are interacting with your bot.
>> All right. Okay.
>> You can take some of those commands,
and treat them as language models
for your Custom Speech Service.
>> Interesting, okay.
>> So, there's a cross-pollination between creating bots,
and being able to understand
what the user means to say in a given context.
So bots, and then finally,
of course, we should talk about translation.
There's a translation service that I'm
sure your listeners are familiar with.
You can combine translation with
the custom transcription service to
enhance the quality of the translation.
Because you're trying to translate
something which may be hard to understand,
domain-specific, or maybe in a specific dialect,
you can take advantage of
Custom Speech Service to do the transcription,
and then send
the output to a translation service,
and get yourself better results in that manner.
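(To illustrate that chaining, a rough Python sketch that feeds the custom transcription into the Translator Text API v3, reusing transcribe() and get_token() from the sketches above. The response field names and the non-government endpoint shown are assumptions to verify against current documentation:)

```python
import requests

# Hypothetical values; Azure Government endpoints and keys will differ.
TRANSLATOR_URL = "https://api.cognitive.microsofttranslator.com/translate"
TRANSLATOR_KEY = "<your-translator-key>"

def transcribe_then_translate(wav_path: str, token: str, to_lang: str = "es") -> str:
    """Pipeline sketch: custom speech transcription first, then translation."""
    # "DisplayText" is the assumed field name for the recognized text.
    text = transcribe(wav_path, token)["DisplayText"]
    resp = requests.post(
        TRANSLATOR_URL,
        params={"api-version": "3.0", "to": to_lang},
        headers={
            "Ocp-Apim-Subscription-Key": TRANSLATOR_KEY,
            "Content-Type": "application/json",
        },
        json=[{"Text": text}],
    )
    resp.raise_for_status()
    return resp.json()[0]["translations"][0]["text"]
```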
>> Awesome, okay.
So, there are a lot of
very relevant scenarios,
particularly for government customers.
>> That's right.
>> Okay great.
>> That's right.
>> All right.
Well, this has been an extremely
informative talk, as always.
Thank you very much for joining us.
This has been Steve Michelotti, along with Vishwas Lele,
talking about the Custom Speech Service
for Government. Thanks for listening.