- Cloud Vendor Based NoOps
- Transcription
- Diarization
- Language Detection
- Amazon Transcribe
- Prerequisites: a valid, activated AWS account and permissions to use the Amazon Transcribe service
- Prepare to configure AWS CLI
NB. Do not use the AWS account root user access key. It gives full access to all resources for all AWS services, including billing information, and its permissions cannot be reduced.
- Create a GROUP in the Console, such as "cognitive", and attach the "AmazonTranscribeFullAccess" and "AmazonS3FullAccess" policies
NB. Select one or more policies to attach. Each group can have up to 10 policies attached.
- Create a USER in the Console, such as "aiuser", assign it to the GROUP, and save the "credentials.csv" file (store it securely and keep it secret)
- Set a PASSWORD for the user
- Run the "aws configure" command to configure the AWS CLI using the keys for the USER ("aiuser")
NB. The command prompts for the access key, secret access key, AWS Region, and output format, and stores them in a profile ("default"). This profile is used whenever an AWS CLI command runs without explicitly specifying another profile.
$ aws configure list
      Name                    Value             Type    Location
      ----                    -----             ----    --------
   profile                <not set>             None    None
access_key     ****************MYVZ shared-credentials-file
secret_key     ****************nEac shared-credentials-file
    region                <not set>             None    None
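To confirm that the configured keys belong to the IAM user and not to the account root user, the active identity can be inspected with "aws sts get-caller-identity". The following is a minimal sketch; the helper names "is_root_arn" and "check_identity" are hypothetical, and the check assumes that root user ARNs end in ":root".

```shell
#!/usr/bin/env bash
# Sketch: warn when the active credentials belong to the account root user.
# is_root_arn and check_identity are hypothetical helper names.

# Root user ARNs look like arn:aws:iam::<account-id>:root (assumption stated above).
is_root_arn() {
  case "$1" in
    arn:*:iam::*:root) return 0 ;;
    *) return 1 ;;
  esac
}

check_identity() {
  local arn
  # get-caller-identity reports the identity behind the active profile.
  arn=$(aws sts get-caller-identity --query Arn --output text) || return 2
  if is_root_arn "$arn"; then
    echo "WARNING: root credentials in use: $arn" >&2
    return 1
  fi
  echo "OK: $arn"
}
```

Running "check_identity" after "aws configure" should print the ARN of "aiuser" if the keys from "credentials.csv" were entered.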
- Create S3 Bucket
- In this case the bucket is named "blobbucket" and set to private, with LocationConstraint set to the specified region
$ aws s3api create-bucket --bucket blobbucket --acl private --region us-east-2 --create-bucket-configuration LocationConstraint=us-east-2
http://blobbucket.s3.amazonaws.com/
- Upload files to the S3 Bucket (s3 and s3api commands)
$ aws s3 cp --recursive ../data/ s3://blobbucket/
$ aws s3api put-object --bucket blobbucket --key texttyped1.png --body ../data/texttyped1.png --acl private
- List objects (files) in the S3 Bucket (s3 and s3api commands)
$ aws s3 ls s3://blobbucket
$ aws s3api list-objects --bucket blobbucket --query 'Contents[].{Key: Key}' | jq -r '.[].Key'
- Unauthenticated access to this bucket over HTTP is denied
<Error>
<Code>AccessDenied</Code>
<Message>Access Denied</Message>
<RequestId>090832BE4B92F4DC</RequestId>
<HostId>27Ec+Sx6rPwGJFpWIQ4ktZrdlG5m710m+yUKjXJ9IfWE3GWXde6e2OdaY0OdKnV6Y3NEUSOI4iw=</HostId>
</Error>
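Bucket creation can be made repeatable by checking for the bucket first. The sketch below is one possible approach, not part of the original walkthrough; "ensure_bucket" is a hypothetical helper name. It also covers the us-east-1 case, where an explicit LocationConstraint must be omitted. Note that S3 bucket names are globally unique, so a name such as "blobbucket" may already be taken by another account.

```shell
#!/usr/bin/env bash
# Sketch: create a private bucket only if it is not already accessible.
# ensure_bucket is a hypothetical helper name.
ensure_bucket() {
  local bucket="$1"
  local region="$2"
  # head-bucket exits 0 when the bucket exists and the caller may access it.
  # A non-zero exit can mean "not found" or "owned by someone else".
  if aws s3api head-bucket --bucket "$bucket" 2>/dev/null; then
    echo "exists: $bucket"
    return 0
  fi
  if [ "$region" = "us-east-1" ]; then
    # us-east-1 rejects an explicit LocationConstraint, so leave it out.
    aws s3api create-bucket --bucket "$bucket" --acl private --region "$region"
  else
    aws s3api create-bucket --bucket "$bucket" --acl private --region "$region" \
      --create-bucket-configuration "LocationConstraint=$region"
  fi
}
```

For the bucket used here the call would be "ensure_bucket blobbucket us-east-2".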
- transcribe
- transcribe-input
- FLAC, MP3, MP4, or WAV file format
- API_StartTranscriptionJob
- Verify that the file is in the S3 Bucket; if not, copy it there
$ aws s3 ls s3://blobbucket/audio2.wav || aws s3 cp ../data/audio2.wav s3://blobbucket/audio2.wav
upload: ../data/audio2.wav to s3://blobbucket/audio2.wav
- Create JSON formatted request file (request.json)
$ JOBNO=$RANDOM
$ cat <<-EOD > request.json
{ "TranscriptionJobName": "job$JOBNO", "LanguageCode": "en-US", "MediaFormat": "wav", "Media": { "MediaFileUri": "s3://blobbucket/audio2.wav" } }
EOD
$ cat request.json
{ "TranscriptionJobName": "job26816", "LanguageCode": "en-US", "MediaFormat": "wav", "Media": { "MediaFileUri": "s3://blobbucket/audio2.wav" } }
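The request file can also be generated by a small function that derives "MediaFormat" from the file extension and rejects formats outside the four listed earlier (FLAC, MP3, MP4, WAV). This is a minimal sketch; "make_request" is a hypothetical helper name, and the default language "en-US" simply mirrors the example request.

```shell
#!/usr/bin/env bash
# Sketch: print a StartTranscriptionJob request for a given media URI.
# make_request is a hypothetical helper name.
make_request() {
  local job="$1"
  local uri="$2"
  local lang="${3:-en-US}"
  local ext
  # Take the text after the last dot and lowercase it (e.g. WAV -> wav).
  ext=$(printf '%s' "${uri##*.}" | tr '[:upper:]' '[:lower:]')
  case "$ext" in
    flac|mp3|mp4|wav) ;;
    *) echo "unsupported media format: $ext" >&2; return 1 ;;
  esac
  cat <<EOD
{
  "TranscriptionJobName": "$job",
  "LanguageCode": "$lang",
  "MediaFormat": "$ext",
  "Media": { "MediaFileUri": "$uri" }
}
EOD
}
```

A possible invocation is make_request "job$JOBNO" s3://blobbucket/audio2.wav > request.json. Because $RANDOM only yields values from 0 to 32767 and job names must be unique within the account, a timestamp such as JOBNO=$(date +%s) may be a safer choice for repeated runs.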
- Submit the job (input: JSON file "request.json"; output: JSON file "result-start-$JOBNO.json")
$ aws transcribe start-transcription-job --region us-east-2 --cli-input-json file://request.json | tee result-start-$JOBNO.json
{
"TranscriptionJob": {
"TranscriptionJobName": "job26816",
"TranscriptionJobStatus": "IN_PROGRESS",
"LanguageCode": "en-US",
"MediaFormat": "wav",
"Media": {
"MediaFileUri": "s3://blobbucket/audio2.wav"
},
"CreationTime": 1570858854.632
}
}
- Check progress of the Job
$ aws transcribe list-transcription-jobs --region us-east-2 --status IN_PROGRESS | tee result-list-$JOBNO.json
{
"Status": "IN_PROGRESS",
"TranscriptionJobSummaries": [
{
"TranscriptionJobName": "job26816",
"CreationTime": 1570871217.434,
"LanguageCode": "en-US",
"TranscriptionJobStatus": "IN_PROGRESS",
"OutputLocationType": "SERVICE_BUCKET"
}
]
}
$ aws transcribe list-transcription-jobs --region us-east-2 --status IN_PROGRESS | tee result-list-$JOBNO.json
{
"Status": "IN_PROGRESS",
"TranscriptionJobSummaries": []
}
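Instead of re-running the list command by hand until the job disappears from the IN_PROGRESS list, the status of a single job can be polled. The following is a sketch under stated assumptions: "wait_for_job" is a hypothetical helper name, and the 10-second interval and 60 attempts are arbitrary defaults.

```shell
#!/usr/bin/env bash
# Sketch: poll one transcription job until it completes, fails, or times out.
# wait_for_job is a hypothetical helper; interval and attempts are arbitrary.
wait_for_job() {
  local job="$1"
  local region="$2"
  local interval="${3:-10}"
  local tries="${4:-60}"
  local status
  while [ "$tries" -gt 0 ]; do
    # --query/--output text reduce the response to the bare status string.
    status=$(aws transcribe get-transcription-job --region "$region" \
      --transcription-job-name "$job" \
      --query 'TranscriptionJob.TranscriptionJobStatus' --output text) || return 2
    case "$status" in
      COMPLETED) echo "$status"; return 0 ;;
      FAILED)    echo "$status"; return 1 ;;
    esac
    # QUEUED or IN_PROGRESS: wait and try again.
    tries=$((tries - 1))
    sleep "$interval"
  done
  echo "TIMEOUT"
  return 3
}
```

For example, wait_for_job "job$JOBNO" us-east-2 returns 0 once the job reports COMPLETED, so it can gate the later retrieval steps.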
- Get details about the Job
$ aws transcribe get-transcription-job --region us-east-2 --transcription-job-name "job26816" | tee result-get-$JOBNO.json
{
"TranscriptionJob": {
"TranscriptionJobName": "job26816",
"TranscriptionJobStatus": "COMPLETED",
"LanguageCode": "en-US",
"MediaSampleRateHertz": 44100,
"MediaFormat": "wav",
"Media": {
"MediaFileUri": "s3://blobbucket/audio2.wav"
},
"Transcript": {
"TranscriptFileUri": "https://s3.us-east-2.amazonaws.com/aws-transcribe-us-east-2-prod/598691507898/job26816/92e124ae-d054-480f-a850-72d68a61bbc0/asrOutput.json?X-Amz-Security-Token=AgoJb3JpZ2luX2VjEPz%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FwEaCXVzLWVhc3QtMiJHMEUCIQC%2FU4Y2jp7gWkaTY8JQztfwfXxeSNIcQdQOMFxl4IVFhgIgdfv%2FtLHAXXgOYGjwZdVsAngpjlpRWGIV0sfEbwbfVq4q4wMI5v%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FARABGgwyNDYzNjEzMjI3NTIiDPMx2qiU32KAOSeN%2Byq3A8o1hjejbXr%2B0odnwNNLleH2ve9oqLvb3k8HBQIDr0Oh2X9h277vD%2BoXI6ZgfL2NF2rPw3NtFaj25OYBbWdcRYfHNel6uJD8wq49a8oGGPh8GblmvfgpW9kqzP82L1NJaTxOoKOYpHi9aIo16G7ygMjwqqeQgNk3JOuIm4J6YMzNs3Gyp7aLOd180JGGTjgzkJ%2BWFD%2BCj5EG%2BZjjDO%2BWz7G7jqdk7Md498bWa%2BVjkEo3a2Kytc9v5W3X2tpJT%2ByqS3o%2FuoFUJj2f%2FbVhOV%2BoPvch7UWwz9spi0kO1pqp%2FivmZ%2B2e3VxYrTTUwMIfssW7r%2FZe755sRlUcjcMNDZk0UTJHA7VIv63VGLpI7VtCt0nXLylq8Hre1Y479Y83mz4ZF3PvQ%2B4ms3HSm62XNlDxjqfXnqhXxU69YZlMHf%2FysaAqQZWAUrecIDaGgsDUm0g5yLOKTsEDIMpmCp9e4fsiWTI44gQo2fKoxgyaSRW9nTx%2B%2FcMCQiN2Iutpl7A%2BRXjFu5qJVxg1wr%2Bh5aOSaIq%2FLsFUBFLtTpWnggmLerbP3Hdv%2BnFTJkAYTkxbU79FbIkkCL91FTQn5fwwgauF7QU6tAGjF8Oe7uXrvHiac3gSGKNbpB2GKa%2FzGdbXMmIbCnkENx0aoRSaB2kqq3oVGeNF70XJoa1xvLzLrml2YYmLpUKFyeEH6segX%2F0hkhF0d2Haegw27do4rLyoLFRnub58M0zQCWLc5aYoDo2R9fYoxwR%2BOFdmJJk7%2BoI6R44vURaLnhoR%2FD1C3wkq0kfqnMIZ7i3TQVl%2BlaSPX6XTqJxHNqgYuypw6tPXiP9MQvTGNhbIuVFh1zo%3D&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20191012T055032Z&X-Amz-SignedHeaders=host&X-Amz-Expires=899&X-Amz-Credential=ASIATSXCHOUAOYARCSPL%2F20191012%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Signature=0ffe7b88113dacb547f1861105c21fcebbf3022f70dc831f6a1193c544a4d732"
},
"CreationTime": 1570858854.632,
"CompletionTime": 1570858902.423,
"Settings": {
"ChannelIdentification": false
}
}
}
- Retrieve the JSON output file from the Job
$ wget "$(jq -r '.TranscriptionJob.Transcript.TranscriptFileUri' result-get-$JOBNO.json)" --output-document=results-output-job$JOBNO.json
- Review the transcription result from the Job
$ jq -r '.jobName,.status,.results.transcripts[0].transcript' results-output-job26816.json
job26816
COMPLETED
checking in with another show for H p. R. Um, In the car on my way to a client's gonna be a short show. I'm think I'm gonna be there in 10 minutes, but I want to do, you know, shoot something up the flagpole here, uh, wanted to talk about the state of podcasting these days. These days, I I sound old because in podcasting terms, I am. I've been around since 4 4000 Started producing shows since 2005. Have been listening to podcasts and daily since 2004. I came across, um, my own archives from shows that I used to download back then and listen to which I had burned to a CD, and I've put them on my nads. And I've started streaming them while at work the last couple of weeks and I've had a ball listening to old podcast episodes of
- diarization
- API_StartTranscriptionJob
- To turn on speaker identification, set the "MaxSpeakerLabels" and "ShowSpeakerLabels" fields of the "Settings" field when you make a call to the StartTranscriptionJob operation.
- Verify that the file is in the S3 Bucket; if not, copy it there
$ aws s3 ls s3://blobbucket/audio2.wav || aws s3 cp ../data/audio2.wav s3://blobbucket/audio2.wav
upload: ../data/audio2.wav to s3://blobbucket/audio2.wav
- Create JSON formatted request file (request.json)
$ JOBNO=28912
$ cat <<-EOD > request.json
{ "TranscriptionJobName": "job28912", "LanguageCode": "en-US", "Settings": { "MaxSpeakerLabels": 2, "ShowSpeakerLabels": true }, "MediaFormat": "wav", "Media": { "MediaFileUri": "s3://blobbucket/audio2.wav" } }
EOD
- Submit the job (input: JSON file "request.json"; output: JSON file "result-start-$JOBNO.json")
$ aws transcribe start-transcription-job --region us-east-2 --cli-input-json file://request.json | tee result-start-$JOBNO.json
{
"TranscriptionJob": {
"TranscriptionJobName": "job28912",
"TranscriptionJobStatus": "IN_PROGRESS",
"LanguageCode": "en-US",
"MediaFormat": "wav",
"Media": {
"MediaFileUri": "s3://blobbucket/audio2.wav"
},
"CreationTime": 1570871217.434,
"Settings": {
"ShowSpeakerLabels": true,
"MaxSpeakerLabels": 2
}
}
}
- Check progress of the Job
$ aws transcribe list-transcription-jobs --region us-east-2 --status IN_PROGRESS | tee result-list-$JOBNO.json
{
"Status": "IN_PROGRESS",
"TranscriptionJobSummaries": [
{
"TranscriptionJobName": "job28912",
"CreationTime": 1570871217.434,
"LanguageCode": "en-US",
"TranscriptionJobStatus": "IN_PROGRESS",
"OutputLocationType": "SERVICE_BUCKET"
}
]
}
$ aws transcribe list-transcription-jobs --region us-east-2 --status IN_PROGRESS | tee result-list-$JOBNO.json
{
"Status": "IN_PROGRESS",
"TranscriptionJobSummaries": []
}
- Get details about the Job
$ aws transcribe get-transcription-job --region us-east-2 --transcription-job-name "job28912" | tee result-get-$JOBNO.json
{
"TranscriptionJob": {
"TranscriptionJobName": "job28912",
"TranscriptionJobStatus": "COMPLETED",
"LanguageCode": "en-US",
"MediaSampleRateHertz": 44100,
"MediaFormat": "wav",
"Media": {
"MediaFileUri": "s3://blobbucket/audio2.wav"
},
"Transcript": {
"TranscriptFileUri": "https://s3.us-east-2.amazonaws.com/aws-transcribe-us-east-2-prod/598691507898/job28912/d18357e9-b7f8-419e-b6c8-c37516a66f8f/asrOutput.json?X-Amz-Security-Token=AgoJb3JpZ2luX2VjEAEaCXVzLWVhc3QtMiJGMEQCIFc8TsgswisbjUbQEAkddBvqUCfu%2BjBl%2B9o30RWxKZwhAiAqYZWg9g%2BxAMXw2yzI2JLABe4h%2BlS4Xc%2B%2B4jQvSFZvRyrjAwjq%2F%2F%2F%2F%2F%2F%2F%2F%2F%2F8BEAEaDDI0NjM2MTMyMjc1MiIM7gm%2FH9KjxBU60vG2KrcDJFfi1ztE13grWZYWJStMu%2BotR%2FSk%2FlD%2Fqi0%2BCxyjnIx9oYKG2LCARSPZKprNpXZloFI0A7i6%2FBXrnGf6P%2F9zbTNzeWZE3rxr3GTUjq067HMpNOoZnx3kLDnjRk1NE90CN7XS3VcHKha8eFiBdhtTiMvBmN7rpS%2BdiWpQH6cD3UGAyXkr17jswn3hsV8yc9DkrEZ5sRzVqDEaWHuRp8JczkHV07wGIPKlbFF%2F%2Blw%2Fs401PWKJKncqtVhYuwG97rzhloifNAdVEgs7u5Lip4SfBpFV1Lr%2B3%2FKTT2azj%2FJDKSEfVJLnzGwmDcL34Z88efajRyTobFTbrSufkA7v0fPLBxSURD87YgN7bQh%2B3WB3pxWl4rkKcw8r6QJgKHgBskSMWC2uHWjfUVji5RcRsAuSeedJLcrUdQ1NSsEI13Vkzr7oR8WYsiHz3rPmVsKCLfMgfifPMmU0MNqAcPBZhi3UlCJ0bh5nM7Bb2m9nMiTHk7kRdg1mbra8eckKFXUcHwFtdIcQwxTPdiuxkiM5eMc%2BsWCyaDLUZ7VZE2w%2BfOv6PyMXYj%2BVQKax9c%2F4V9Jirlv7MtfbiB41czC8oYbtBTq1ARgc5qb%2ByLnlbtOCr7yYRsUorkH0iz5WHepFpB73AHDcttA%2BtN2Irkz2FPWfV%2BBUrKMU7G012niqsDu8qXBZeIxQ2ZA7z0sCdaZBaJqODywo4o5CeH173FKhmy00YFvfXXTtSZHOy3XYxuf%2BEDFix1q6bfiRe18eNA5mCR%2BPwLoFSEUdB5eLiJFvFpM6MXAUrWpEal3%2FAvIzXcEqqb0RPAb6YZid1%2BDV%2FzjBO%2B7W9JzVUrgKKdI%3D&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20191012T092107Z&X-Amz-SignedHeaders=host&X-Amz-Expires=900&X-Amz-Credential=ASIATSXCHOUAPLLBLSNZ%2F20191012%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Signature=ee56417aac64d5428616d74a93bd82bd6c7ec48bb77eec4a2611637203daaf48"
},
"CreationTime": 1570871217.434,
"CompletionTime": 1570871343.764,
"Settings": {
"ShowSpeakerLabels": true,
"MaxSpeakerLabels": 2,
"ChannelIdentification": false
}
}
}
- Retrieve the JSON output file from the Job
$ wget "$(jq -r '.TranscriptionJob.Transcript.TranscriptFileUri' result-get-$JOBNO.json)" --output-document=results-output-job$JOBNO.json
--2019-10-12 13:22:20-- https://s3.us-east-2.amazonaws.com/aws-transcribe-us-east-2-prod/598691507898/job28912/d18357e9-b7f8-419e-b6c8-c37516a66f8f/asrOutput.json?X-Amz-Security-Token=AgoJb3JpZ2luX2VjEAEaCXVzLWVhc3QtMiJGMEQCIFc8TsgswisbjUbQEAkddBvqUCfu%2BjBl%2B9o30RWxKZwhAiAqYZWg9g%2BxAMXw2yzI2JLABe4h%2BlS4Xc%2B%2B4jQvSFZvRyrjAwjq%2F%2F%2F%2F%2F%2F%2F%2F%2F%2F8BEAEaDDI0NjM2MTMyMjc1MiIM7gm%2FH9KjxBU60vG2KrcDJFfi1ztE13grWZYWJStMu%2BotR%2FSk%2FlD%2Fqi0%2BCxyjnIx9oYKG2LCARSPZKprNpXZloFI0A7i6%2FBXrnGf6P%2F9zbTNzeWZE3rxr3GTUjq067HMpNOoZnx3kLDnjRk1NE90CN7XS3VcHKha8eFiBdhtTiMvBmN7rpS%2BdiWpQH6cD3UGAyXkr17jswn3hsV8yc9DkrEZ5sRzVqDEaWHuRp8JczkHV07wGIPKlbFF%2F%2Blw%2Fs401PWKJKncqtVhYuwG97rzhloifNAdVEgs7u5Lip4SfBpFV1Lr%2B3%2FKTT2azj%2FJDKSEfVJLnzGwmDcL34Z88efajRyTobFTbrSufkA7v0fPLBxSURD87YgN7bQh%2B3WB3pxWl4rkKcw8r6QJgKHgBskSMWC2uHWjfUVji5RcRsAuSeedJLcrUdQ1NSsEI13Vkzr7oR8WYsiHz3rPmVsKCLfMgfifPMmU0MNqAcPBZhi3UlCJ0bh5nM7Bb2m9nMiTHk7kRdg1mbra8eckKFXUcHwFtdIcQwxTPdiuxkiM5eMc%2BsWCyaDLUZ7VZE2w%2BfOv6PyMXYj%2BVQKax9c%2F4V9Jirlv7MtfbiB41czC8oYbtBTq1ARgc5qb%2ByLnlbtOCr7yYRsUorkH0iz5WHepFpB73AHDcttA%2BtN2Irkz2FPWfV%2BBUrKMU7G012niqsDu8qXBZeIxQ2ZA7z0sCdaZBaJqODywo4o5CeH173FKhmy00YFvfXXTtSZHOy3XYxuf%2BEDFix1q6bfiRe18eNA5mCR%2BPwLoFSEUdB5eLiJFvFpM6MXAUrWpEal3%2FAvIzXcEqqb0RPAb6YZid1%2BDV%2FzjBO%2B7W9JzVUrgKKdI%3D&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20191012T092107Z&X-Amz-SignedHeaders=host&X-Amz-Expires=900&X-Amz-Credential=ASIATSXCHOUAPLLBLSNZ%2F20191012%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Signature=ee56417aac64d5428616d74a93bd82bd6c7ec48bb77eec4a2611637203daaf48
Resolving s3.us-east-2.amazonaws.com (s3.us-east-2.amazonaws.com)... 52.219.104.114
Connecting to s3.us-east-2.amazonaws.com (s3.us-east-2.amazonaws.com)|52.219.104.114|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 30007 (29K) [application/octet-stream]
Saving to: 'results-output-job28912.json'
results-output-job28912.json 100%[==============================================================================================>] 29.30K 143KB/s in 0.2s
2019-10-12 13:22:22 (143 KB/s) - 'results-output-job28912.json' saved [30007/30007]
- Review the transcription result from the Job
$ jq -r '.jobName,.status,.results.speaker_labels.speakers,.results.transcripts[0].transcript' results-output-job28912.json
job28912
COMPLETED
1
checking in with another show for H p. R. Um, In the car on my way to a client's gonna be a short show. I'm think I'm gonna be there in 10 minutes, but I want to do, you know, shoot something up the flagpole here, uh, wanted to talk about the state of podcasting these days. These days, I I sound old because in podcasting terms, I am. I've been around since 4 4000 Started producing shows since 2005. Have been listening to podcasts and daily since 2004. I came across, um, my own archives from shows that I used to download back then and listen to which I had burned to a CD, and I've put them on my nads. And I've started streaming them while at work the last couple of weeks and I've had a ball listening to old podcast episodes of
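The query above reports only the number of detected speakers. Who spoke when is recorded in the "segments" array under "speaker_labels" in the same output file. A minimal sketch for listing those segments with jq follows; "speaker_segments" is a hypothetical helper name, and it assumes the output file was produced with "ShowSpeakerLabels" enabled.

```shell
#!/usr/bin/env bash
# Sketch: print one line per diarization segment as
# speaker<TAB>start_time<TAB>end_time.
# speaker_segments is a hypothetical helper name; requires jq.
speaker_segments() {
  jq -r '.results.speaker_labels.segments[]
         | [.speaker_label, .start_time, .end_time]
         | @tsv' "$1"
}
```

For the job in this section the call would be "speaker_segments results-output-job28912.json". With a single detected speaker, every line is expected to carry the same label.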
N/A