Overview of our final Speech-To-Text application

Objective

A tutorial to create and build your own Speech-To-Text application with Python.

At the end of this first article, your Speech-To-Text application will be able to receive an audio recording and will generate its transcript!

The final code of the app is available in our dedicated GitHub repository.

In the previous notebook tutorials, we have seen how to translate speech into text, how to punctuate the transcript and how to summarize it. We have also seen how to distinguish speakers and how to generate video subtitles, all the while managing potential memory problems. Now that we know how to do all this, let's combine all these features together into a Speech-To-Text application!

➡ To create this app, we will use Streamlit, a Python framework that turns scripts into a shareable web application. If you don't know this tool, don't worry, it is very simple to use.

In the following articles, we will see how to implement the more advanced features (diarization, summarization, punctuation, …), and we will also learn how to build and use a custom Docker image for a Streamlit application, which will allow us to deploy our app on AI Deploy!

⚠️ Since this article uses code already explained in the previous notebook tutorials, we will not re-explain its usefulness here. We therefore recommend that you read the notebooks first.

1. Set up the environment

To start, let's create our Python environment. To do this, create a file named requirements.txt and add the following text to it. This will allow us to specify the version of each library required by our Speech-To-Text project.
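The exact pinned versions used by the original project are not reproduced here; a plausible requirements.txt, assuming only the libraries imported in app.py below, could look like this:

```
librosa==0.9.1
torch==1.11.0
transformers==4.18.0
pydub==0.25.1
streamlit==1.9.0
```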
Then, you can install all these elements in only one command. To do so, you just have to open a terminal and enter the following command:

```
pip install -r requirements.txt
```

2. Import the libraries

Once your environment is ready, create a file named app.py and import the required libraries we used in the notebooks. They will allow us to use the artificial intelligence models, to manipulate audio files, times, …

```python
# Models
from transformers import Wav2Vec2Processor, Wav2Vec2Tokenizer, HubertForCTC

# Audio manipulation, tensors, times and app display (used by the functions below)
import librosa
import torch
import streamlit as st
from pydub import AudioSegment
from pydub.silence import detect_silence
from datetime import timedelta
```

We also need to use some previous functions, so you will probably recognize some of them.

⚠️ Reminder: All this code has been explained in the notebook tutorials. That's why we will not re-explain its usefulness here.

To begin, let's create the function that allows you to transcribe an audio chunk:

```python
def transcribe_audio_part(filename, stt_model, stt_tokenizer, myaudio, sub_start, sub_end, index):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    try:
        new_audio = myaudio[sub_start:sub_end]  # Works in milliseconds
        path = filename + "audio_" + str(index) + ".mp3"
        new_audio.export(path)  # Exports to a mp3 file in the current path

        # Load audio file with librosa, set sound rate to 16000 Hz because the model we use was trained on 16000 Hz data
        input_audio, _ = librosa.load(path, sr=16000)

        # Return a PyTorch torch.Tensor instead of a list of python integers thanks to return_tensors='pt'
        input_values = stt_tokenizer(input_audio, return_tensors="pt").to(device).input_values

        # Get logits from the data structure containing all the information returned by the model and get our prediction
        logits = stt_model.to(device)(input_values).logits
        prediction = torch.argmax(logits, dim=-1)

        # Decode & lower our string (model's output is only uppercase)
        if isinstance(stt_tokenizer, Wav2Vec2Tokenizer):
            transcription = stt_tokenizer.batch_decode(prediction)[0]
        elif isinstance(stt_tokenizer, Wav2Vec2Processor):
            transcription = stt_tokenizer.decode(prediction[0])
        return transcription.lower()

    except Exception:
        # Means we have a chunk with a case where value1 > value2 (invalid start/end values)
        st.error("Sorry, seems we have a problem on our side.")
```

Then, create the functions that implement the silence detection method, which we explained in the first notebook tutorial.

The first one gets the timestamps of the silences:

```python
def detect_silences(audio):
    # Get Decibels (dB) so silences detection depends on the audio instead of a fixed value
    dbfs = audio.dBFS
    silence_list = detect_silence(audio, min_silence_len=750, silence_thresh=dbfs - 14)
    return silence_list
```

The second one gets the middle value of each timestamp:

```python
def get_middle_silence_time(silence_list):
    length = len(silence_list)
    index = 0
    while index < length:
        diff = silence_list[index][1] - silence_list[index][0]
        if diff < 3500:
            # Get the middle value of the timestamp
            silence_list[index] = silence_list[index][0] + diff / 2
            index += 1
        else:
            # The silence is very long: replace it by two values, one near each of its ends
            adapted_diff = 1500
            silence_list.insert(index + 1, silence_list[index][1] - adapted_diff)
            silence_list[index] = silence_list[index][0] + adapted_diff
            length += 1
            index += 2
    return silence_list
```

The third one distributes these timestamps regularly between the start and the end of the audio, so that two consecutive split points are never separated by more than max_space milliseconds:

```python
def silences_distribution(silence_list, min_space, max_space, start, end):
    # Shift the end according to the start value, then convert it to milliseconds
    end = (end - start) * 1000

    # Add the start value
    newsilence = [0]

    # Create a regular distribution between the start and the first element of silence_list,
    # so we don't get a gap > max_space and run out of memory
    # example: newsilence = [0] and silence_list starts with 100000 => it would create a massive gap [0, 100000]
    if silence_list[0] - max_space > newsilence[0]:
        for i in range(int(newsilence[0]), int(silence_list[0]), max_space):  # int bc float can't be in a range loop
            value = i + max_space
            if value < silence_list[0]:
                newsilence.append(value)

    # Keep the detected silences that are at least min_space apart, filling gaps > max_space the same way
    for silence in silence_list:
        if silence - newsilence[-1] > max_space:
            for i in range(int(newsilence[-1]), int(silence), max_space):
                value = i + max_space
                if value < silence:
                    newsilence.append(value)
        if silence - newsilence[-1] >= min_space:
            newsilence.append(silence)

    # Add the final value (end)
    if end - newsilence[-1] > min_space:
        newsilence.append(end)
    else:
        # Final value and last value of new silence are too close, need to merge
        newsilence[-1] = end
    return newsilence
```

Once this is done, all you have to do is display all the elements and link them using the transcription() function, whose signature is `def transcription(stt_tokenizer, stt_model, filename, uploaded_file=None):`. Inside it, each chunk is transcribed with transcribe_audio_part() and appended to the final transcript with `txt_text += transcription + " "`, so the x-second sentences are separated; when subtitles are generated, each segment is timestamped with `str(timedelta(milliseconds=sub_end)).split(".")[0]`.

We also need to handle the uploaded_file parameter, which is currently set to None: if the audio comes from the YouTube extracting mode, the audio is downloaded to a local file rather than uploaded, so the uploaded_file stays None and the audio is read from disk.
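The full body of transcription() is explained in the notebooks and available in the repository. As a minimal sketch, assuming the imports and helper functions defined above and hypothetical chunking values for min_space and max_space, it can wire everything together like this:

```python
def transcription(stt_tokenizer, stt_model, filename, uploaded_file=None):
    # If the audio comes from the YouTube extracting mode, it has been downloaded
    # to `filename`, so uploaded_file stays None and we read the file from disk
    myaudio = AudioSegment.from_file(uploaded_file if uploaded_file else filename)

    # Silence detection method: detect the silences, keep their middle value,
    # then distribute them so no chunk is too long (the values here are assumptions)
    silence_list = detect_silences(myaudio)
    silence_list = get_middle_silence_time(silence_list)
    silence_list = silences_distribution(silence_list, min_space=25000, max_space=45000,
                                         start=0, end=int(myaudio.duration_seconds))

    # Transcribe each chunk and accumulate the transcript and the subtitles
    txt_text, srt_text = "", ""
    for index in range(len(silence_list) - 1):
        sub_start, sub_end = silence_list[index], silence_list[index + 1]
        transcription = transcribe_audio_part(filename, stt_model, stt_tokenizer,
                                              myaudio, sub_start, sub_end, index)
        txt_text += transcription + " "  # So the x-second sentences are separated
        srt_text += (str(index + 1) + "\n"
                     + str(timedelta(milliseconds=sub_start)).split(".")[0] + " --> "
                     + str(timedelta(milliseconds=sub_end)).split(".")[0] + "\n"
                     + transcription + "\n\n")

    # Display the result in the app
    st.subheader("Transcription")
    st.write(txt_text)
    return txt_text, srt_text
```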
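Finally, to check that everything is linked correctly, here is a minimal, hypothetical Streamlit entry point. The model checkpoint, the widget labels and the file handling are assumptions for illustration; the actual app in the GitHub repository also handles the YouTube extracting mode:

```python
if __name__ == "__main__":
    st.title("Speech-To-Text App")

    # Upload an audio file and save it to disk so the functions above can re-read it
    uploaded_file = st.file_uploader("Upload an audio file", type=["mp3", "wav"])
    if uploaded_file is not None:
        filename = uploaded_file.name
        with open(filename, "wb") as f:
            f.write(uploaded_file.getbuffer())

        # Hypothetical checkpoint choice; the notebooks use Wav2Vec2 / Hubert models
        stt_tokenizer = Wav2Vec2Processor.from_pretrained("facebook/hubert-large-ls960-ft")
        stt_model = HubertForCTC.from_pretrained("facebook/hubert-large-ls960-ft")

        transcription(stt_tokenizer, stt_model, filename)
```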