Author Topic: [AI][Speech Recognition]Run speech recognition locally on CPU with object pascal  (Read 8103 times)

csukuangfj

  • New Member
  • *
  • Posts: 16
Hi all,

Just want to share an open-source project, https://github.com/k2-fsa/sherpa-onnx, with you. It provides Object Pascal APIs
for speech recognition that runs locally on CPU.

Currently, it supports the following models:

- Whisper
- Zipformer
- Paraformer
- SenseVoice
- TeleSpeech ASR

It has been tested on the following platforms:
- Linux
- macOS
- Windows

You can find the documentation at
https://k2-fsa.github.io/sherpa/onnx/pascal-api/index.html


We also provide an example using Lazarus that generates subtitles.
The documentation is at
https://k2-fsa.github.io/sherpa/onnx/lazarus/generate-subtitles.html



Attached are some screenshots of generating subtitles on different platforms.

Pre-built Lazarus apps for generating subtitles can be found at
https://k2-fsa.github.io/sherpa/onnx/lazarus/download-generated-subtitles.html
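
For a quick sanity check before touching the Pascal API, the same prebuilt release archive used later in this thread also ships command-line decoders. Something along these lines should work; note the Whisper model URL, the file names inside the archive, and the flag names here are my assumptions based on the documentation linked above, so please verify them there:

```shell
# Fetch the prebuilt static release (same archive as in the TTS test below)
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/v1.10.22/sherpa-onnx-v1.10.22-linux-x64-static.tar.bz2
tar xf sherpa-onnx-v1.10.22-linux-x64-static.tar.bz2

# Fetch a small Whisper model (URL assumed; see the asr-models release page)
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-whisper-tiny.en.tar.bz2
tar xf sherpa-onnx-whisper-tiny.en.tar.bz2

# Decode one of the bundled test wav files locally on the CPU
./sherpa-onnx-v1.10.22-linux-x64-static/bin/sherpa-onnx-offline \
  --whisper-encoder=./sherpa-onnx-whisper-tiny.en/tiny.en-encoder.onnx \
  --whisper-decoder=./sherpa-onnx-whisper-tiny.en/tiny.en-decoder.onnx \
  --tokens=./sherpa-onnx-whisper-tiny.en/tiny.en-tokens.txt \
  ./sherpa-onnx-whisper-tiny.en/test_wavs/0.wav
```

The recognized text is printed to the console, so you can confirm the models and runtime work before wiring up the Object Pascal API.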





TRon

  • Hero Member
  • *****
  • Posts: 4377
I am posting because you also responded in this thread about text-to-speech.

I had a few minutes to spare and tried TTS (the instructions were buried deep in the documentation), and my results were very poor.

To me it did not even sound like human speech, but rather like some sort of inflating balloon with a consonant here and there (I assume the German model that I downloaded was sufficiently pre-trained).

My poor results might be due to the fact that I do not fully understand how to set things up correctly or provide the correct parameters. I would love to discuss this further, but at the moment I have very little time to invest.

Just letting you know in case you are not aware.
Today is tomorrow's yesterday.

csukuangfj

  • New Member
  • *
  • Posts: 16
Could you describe in detail what commands you have used?

We have a Hugging Face Space where you can try it from within your browser:
https://huggingface.co/spaces/k2-fsa/text-to-speech

You don't need to install anything to use it.

The quality you get from the above Hugging Face Space is the same as what you will get once we wrap it in Object Pascal.

TRon

  • Hero Member
  • *****
  • Posts: 4377
Could you describe in detail what commands you have used?
Sure.

By the way, I was mistaken about the language (I tested several and got confused). It was actually the Dutch voice that seemed to act strangely for me.


Code: Bash
mkdir sherpa && cd sherpa
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/v1.10.22/sherpa-onnx-v1.10.22-linux-x64-static.tar.bz2
tar xf sherpa-onnx-v1.10.22-linux-x64-static.tar.bz2

wget https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/vits-piper-nl_NL-mls-medium.tar.bz2
tar xf vits-piper-nl_NL-mls-medium.tar.bz2

./sherpa-onnx-v1.10.22-linux-x64-static/bin/sherpa-onnx-offline-tts \
  --vits-model=./vits-piper-nl_NL-mls-medium/nl_NL-mls-medium.onnx \
  --vits-tokens=./vits-piper-nl_NL-mls-medium/tokens.txt \
  --vits-data-dir=./vits-piper-nl_NL-mls-medium/espeak-ng-data \
  --output-filename=./hallo.wav \
  "hallo wereld"
Resulting wav attached. (You probably need to log in to be able to see and download the attachment; that happens sometimes when the forum is experiencing issues.)

You would have to verify with a native Dutch speaker to make sure that the output is wrong, but phonetically the wav file makes no sense to me (I know only a little Dutch).
« Last Edit: August 20, 2024, 11:45:32 am by TRon »
Today is tomorrow's yesterday.

csukuangfj

  • New Member
  • *
  • Posts: 16
I see. It turns out the models whose filenames contain `mls` don't perform well.

Could you try other models, e.g.,
vits-piper-nl_BE-nathalie-medium

I am deleting the `mls` models for Dutch.
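
Assuming the same command pattern TRon used above, trying the suggested model should just be a matter of changing the paths (the `.onnx` file name inside the archive is my assumption, following the usual piper naming; check after extracting):

```shell
# Fetch and extract the suggested Belgian-Dutch voice
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/vits-piper-nl_BE-nathalie-medium.tar.bz2
tar xf vits-piper-nl_BE-nathalie-medium.tar.bz2

# Same invocation as before, only the model paths change
./sherpa-onnx-v1.10.22-linux-x64-static/bin/sherpa-onnx-offline-tts \
  --vits-model=./vits-piper-nl_BE-nathalie-medium/nl_BE-nathalie-medium.onnx \
  --vits-tokens=./vits-piper-nl_BE-nathalie-medium/tokens.txt \
  --vits-data-dir=./vits-piper-nl_BE-nathalie-medium/espeak-ng-data \
  --output-filename=./hallo.wav \
  "hallo wereld"
```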

TRon

  • Hero Member
  • *****
  • Posts: 4377
I see. It turns out the models whose filenames contain `mls` don't perform well.
Ah, OK, that explains it.

Do you happen to know whether the cause is that the model was insufficiently or improperly trained, or something else?

Quote
Could you try other models, e.g.,
vits-piper-nl_BE-nathalie-medium
Yes, that result sounds more like what I expected (although with a Belgian accent).  8-)

I have attached the result again (for comparison).
Today is tomorrow's yesterday.

csukuangfj

  • New Member
  • *
  • Posts: 16
> Do you happen to know if the cause for that is because the model was not enough or improperly trained or something else

Sorry, I have no idea. I have just removed the MLS-related models for all supported languages.

> although with a Belgian accent

By the way, we support many models; you can try each of them if you like.
(I suggest that you try them using the above Hugging Face Space. It is faster.)

csukuangfj

  • New Member
  • *
  • Posts: 16
FYI: I have just wrapped the text-to-speech functions in the Object Pascal API in the following pull request:
https://github.com/k2-fsa/sherpa-onnx/pull/1273

Now you can use hundreds of text-to-speech models from sherpa-onnx with Object Pascal.

It supports more than 40 languages!

Everything runs locally on your CPU, and it is open source.

Please see

https://github.com/k2-fsa/sherpa-onnx/tree/master/pascal-api-examples/tts

for examples.

Note that I have only added two examples, for models from
https://github.com/rhasspy/piper

TRon

  • Hero Member
  • *****
  • Posts: 4377
FYI: I have just wrapped the text-to-speech functions in the Object Pascal API in the following pull request:
https://github.com/k2-fsa/sherpa-onnx/pull/1273
Thank you for that and the example.

Question: after you create the OfflineTTS class, is it possible/allowed to change the configuration so that you can change (voice) models on the fly? Are there any specific conditions for being able to do that, or do you need to destroy the OfflineTTS instance first and create it anew with a different voice-model configuration?

I am aware that the Pascal implementation does not currently offer such behaviour (the configuration is private), but I wondered whether the runtime library allows the configuration to be changed on the fly.
« Last Edit: August 22, 2024, 06:23:38 am by TRon »
Today is tomorrow's yesterday.

csukuangfj

  • New Member
  • *
  • Posts: 16
You have to create a new instance of OfflineTts for a new model.

TRon

  • Hero Member
  • *****
  • Posts: 4377
Ok I understand. Thank you csukuangfj
Today is tomorrow's yesterday.

csukuangfj

  • New Member
  • *
  • Posts: 16
Ok I understand. Thank you csukuangfj

By the way, are you able to run the examples? Have you encountered any usage issues?

VisualLab

  • Hero Member
  • *****
  • Posts: 706
I checked the pronunciation for a dozen or so languages (from the following groups: Romance, Germanic, Slavic, Baltic, Finno-Ugric, Turkic). I used the website https://huggingface.co/spaces/k2-fsa/text-to-speech. I don't know these languages, but the quality of the generated WAV files can be compared, for example, to what Google Translate generates.

I noticed that the generated audio for English and Spanish is terrible, i.e. the generated WAV file sounds like it came from a crappy speech generator from the late '80s/early '90s. For the other languages, the pronunciation seems closer to natural.

I also noticed that the speed and volume of pronunciation differ quite a lot between languages (I set the same speech speed for each language). For some languages the content was pronounced quickly (and less clearly), and for others much more slowly (and more clearly). Similarly, for some languages the content was pronounced quietly, and for others much louder (so you had to adjust the volume in the system).

TRon

  • Hero Member
  • *****
  • Posts: 4377
By the way, are you able to run the examples? Have you encountered any usage issues?
It took me some time to get familiar with your repository.

Resulting build & run log-file attached.

The log shows that it failed on the examples making use of PortAudio. That is because I forgot to install the libportaudio dev package before running the tests  :-X

BTW: with regard to the MLS model(s): I experimented a bit more with them, and when you provide longer sequences of text, the model seems to pick up on the pronunciation after several words/sentences.

@VisualLab:
I do not know for sure what the culprit might be there (I am also new when it comes to sherpa-onnx), but better voice training usually yields better results. How you can do that can, for example, be seen in this video (that channel is interesting anyway if you are interested in this kind of software).

That the speed and volume differ might be caused by training as well. In case you did not already realize it: some languages are spoken much faster/louder than you might be used to (or slower/softer, in case yours is a fast-paced language). Some engines handle that better (e.g. automatically) than others (where you have to change things manually).

The API allows setting details about the voice model/output that can't be set via the command-line programs.
« Last Edit: August 23, 2024, 04:58:17 am by TRon »
Today is tomorrow's yesterday.

csukuangfj

  • New Member
  • *
  • Posts: 16
> I noticed that the generated audio for English and Spanish is terrible, i.e. the generated WAV file sounds like it was generated from a crappy speech generator from the late 80's/early 90's

We have more than 20 English models. I think some of them generate natural speech, e.g., vits-piper-en_US-libritts_r-medium.
The German models also sound good to me.

 
