Author Topic: Writing a dictionary software for a spoken language  (Read 1563 times)

Thaddy

  • Hero Member
  • *****
  • Posts: 15641
  • Censorship about opinions does not belong here.
Re: Writing a dictionary software for a spoken language
« Reply #15 on: September 03, 2024, 02:58:35 pm »
I don't think this is about programming, but about a notational phonetic language. Such a thing exists, is standardized, and is called IPA. Wikipedia, for example, uses it consistently:
https://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Pronunciation
I think, for now, this is more helpful.
IPA is a notational standard for pronunciation, although very confusing to me, but I am not a linguist.
IPA notation is a way to preserve how a natural language or accent is spoken.
So if you want to preserve a language that is spoken by only a few people, use IPA to document it and start conversations with linguists on how to do that. Linguists love that, so look up some contacts at universities in your area. You do not need a degree of any sort to trigger their interest.

https://www.internationalphoneticalphabet.org/

Once you get up to speed with IPA, it is not unlikely that we can help you map most or all of it to Pascal to do text to speech and speech to text. We would need audio to detect the phonemes, though, and map that to the notation. Phonemes and pronunciation differ, so we need both for expression (happy, sad, laughing, disapproving and the like; basically, how to express emotion).
I can help you with the audio part (slicing many samples).
« Last Edit: September 03, 2024, 03:25:05 pm by Thaddy »
If I smell bad code it usually is bad code and that includes my own code.

MarkMLl

  • Hero Member
  • *****
  • Posts: 7622
Re: Writing a dictionary software for a spoken language
« Reply #16 on: September 03, 2024, 04:49:36 pm »
IPA is a notational standard for pronunciation, although very confusing to me, but I am not a linguist.

At least two messages already mention it. However I think it's worth adding

https://fumbling.it/posts/building-ipa-keyboard-part-four/

which discusses a physical IPA keyboard, which I suspect would end up much easier than the contortions needed to type it on a standard QWERTY.

MarkMLl
MT+86 & Turbo Pascal v1 on CCP/M-86, multitasking with LAN & graphics in 128Kb.
Logitech, TopSpeed & FTL Modula-2 on bare metal (Z80, '286 protected mode).
Pet hate: people who boast about the size and sophistication of their computer.
GitHub repositories: https://github.com/MarkMLl?tab=repositories

bobby100

  • Full Member
  • ***
  • Posts: 246
    • Malzilla
Re: Writing a dictionary software for a spoken language
« Reply #17 on: September 03, 2024, 07:54:21 pm »
I am not ignoring IPA, but... until some software automatically converts audio to IPA, there is not much help for me there.
Learning IPA from texts brings nothing. You need to hear something in order to make a link in your head between a sound and a symbol.
Audio recording should be enough for now.
As for a search for linguists - it is too much for a hobby project.

About existing projects:

https://www.paundurlic.com/vlaski.recnik/index.php - the problem here is that Mr. Durlic does not try to standardize the written form, but he rather uses just APHI (something like IPA, but not so widely used and also less detailed) - http://www.paundurlic.com/vlaski.recnik/sound.php . This means that every dialect gets its own written form of the same word
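One way to avoid a separate written form per dialect is to store a single standardized headword and attach the dialect spellings and recordings to it. The following is only a minimal sketch of that idea in Python, not anything taken from the projects above: the class names, field names, dialect labels, variant spelling and file name are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Variant:
    dialect: str          # hypothetical dialect label
    spelling: str         # how one source writes the word for that dialect
    audio_file: str = ""  # path to a recording; empty if none exists yet


@dataclass
class Entry:
    headword: str         # the single standardized written form
    meaning: str
    variants: List[Variant] = field(default_factory=list)


class Dictionary:
    """Stores entries by headword and finds them by any known spelling."""

    def __init__(self) -> None:
        self._entries: Dict[str, Entry] = {}
        self._by_spelling: Dict[str, str] = {}

    def add(self, entry: Entry) -> None:
        self._entries[entry.headword] = entry
        # The headword and every dialect spelling point at the same entry.
        self._by_spelling[entry.headword.lower()] = entry.headword
        for variant in entry.variants:
            self._by_spelling[variant.spelling.lower()] = entry.headword

    def lookup(self, spelling: str) -> Optional[Entry]:
        headword = self._by_spelling.get(spelling.lower())
        return self._entries.get(headword) if headword is not None else None


# Illustrative data only: "drak" and the dialect label are made up.
d = Dictionary()
d.add(Entry("drac", "a devil", [Variant("dialect A", "drak", "drac_a.wav")]))
found = d.lookup("Drak")
```

A search by the dialect spelling then returns the same entry as a search by the standardized headword, so the audio recordings stay attached to one word rather than to several unrelated spellings.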

http://www.gergina.org.rs/vlasko-pismo/ - this is a monstrosity - and it is recognized by the Serbian state as the official Walachian alphabet (Slavic people call us Walachian/Vlach/Vlah - https://en.wikipedia.org/wiki/Vlachs ). They added Walachian-specific letters to the Serbian alphabet, in both its Latin and Cyrillic versions. It is something like the way the Russian state forced its own alphabet on the Turkic nations of the USSR - there is a bunch of letters that we do not need at all. Even the name of the website is written wrong - it should be Gherghina (the flower dahlia), and not Gergina (h is a separator, not a sound). This is an example of incompetent people doing a bad job. The head of the organization and the inventor of this alphabet is a gynecologist. If I leave this job to him, you know where the future of this language is going... They aren't standardizing anything. They just write "by feeling" for the purpose of organizing cultural events (concerts, handicraft workshops and similar).

CM630

  • Hero Member
  • *****
  • Posts: 1168
  • I am not sure that I understand you.
    • http://sourceforge.net/u/cm630/profile/
Re: Writing a dictionary software for a spoken language
« Reply #18 on: September 04, 2024, 10:45:25 pm »
Your assumption that the Latin alphabet is better than the Cyrillic alphabet or the Armenian alphabet for writing Romanian might be true, but might be wrong.
The Turkish and Albanian implementations of the Latin alphabet are much, much better than the English one and even better than the German one. These implementations were done later; they were based on experience accumulated over the ages and avoided some mistakes.
Also, some decisions are made not because of linguistic reasons, but because of political ones.
Just an example: "Craiova" is read "Krajova" or "Kraiova", but "ce face" is read "t͡ʃe fat͡ʃe".


BTW, Romanian was written in Cyrillic until the middle of the nineteenth century.
Maybe calling the language from that time Romanian is wrong, I do not know.
« Last Edit: September 04, 2024, 10:48:12 pm by CM630 »
Lazarus 3.4 32 bit (sometimes 64 bit); FPC 3.2.2

bobby100

  • Full Member
  • ***
  • Posts: 246
    • Malzilla
Re: Writing a dictionary software for a spoken language
« Reply #19 on: September 05, 2024, 12:38:06 am »
Also, some decisions are made not because of linguistic reasons, but because of political ones.
Just an example: "Craiova" is read "Krajova" or "Kraiova", but "ce face" is read "t͡ʃe fat͡ʃe".
C followed by E or by I isn't read C (K) anymore, but rather like English CH in Charlie. Same goes for G followed by E or by I - it is like G in Georgia, not like G in Bulgaria.
It is in the nature of the language.
I see that you know Cyrillic script, so I'll use some Cyrillic transcription.
E.g. singular (un) Drac (a devil), transcribed to Cyrillic - Драк. The plural is built by adding an "I" at the end of the word - Draci (devils). Transcribed to Cyrillic - Драчи.
Back to your example with "ce face" (what is he doing): Ieu fac, noi facem (I do, we do) - transcribed to Cyrillic - Иеу фак, нои фачем.
Same for G - (un) Drag (a loved one) is with G like in Bulgaria, but the plural Dragi (loved ones) goes to G like in Georgia.
So, in Craiova, C is followed by R, and you can read it like K, but in "ce face" you have two CE groups, and this is where the transformation happens. In standard Romanian, the letter H is used to block the transformation, as in the word Cheie (a key). H isn't a sound here; it is used in writing only to separate E from C, so that the transformation does not happen.
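The rule described above (C and G soften before E or I, and an H in between blocks the softening) can be sketched as a small function. This is only an illustration in Python under simplifying assumptions: the function name `transcribe_c_g` is made up, only the letters C and G are handled, every other letter is passed through unchanged, and the IPA symbols are written as plain "tʃ" and "dʒ" without a tie bar. It is not a full Romanian or Vlach transcription.

```python
FRONT_VOWELS = {"e", "i"}  # the vowels that trigger the softening


def transcribe_c_g(word: str) -> str:
    """Rough transcription of c and g only; other letters pass through."""
    w = word.lower()
    out = []
    i = 0
    while i < len(w):
        ch = w[i]
        nxt = w[i + 1] if i + 1 < len(w) else ""
        nxt2 = w[i + 2] if i + 2 < len(w) else ""
        if ch in ("c", "g"):
            hard = "k" if ch == "c" else "g"
            soft = "tʃ" if ch == "c" else "dʒ"
            if nxt == "h" and nxt2 in FRONT_VOWELS:
                # "che", "chi", "ghe", "ghi": h only separates, stays silent
                out.append(hard)
                i += 2
                continue
            out.append(soft if nxt in FRONT_VOWELS else hard)
        else:
            out.append(ch)
        i += 1
    return "".join(out)


print(transcribe_c_g("ce face"))   # tʃe fatʃe
print(transcribe_c_g("Craiova"))   # kraiova
print(transcribe_c_g("Cheie"))     # keie
print(transcribe_c_g("Dragi"))     # dradʒi
```

With the same function, "Drac" gives "drak" and "Draci" gives "dratʃi", which matches the singular and plural forms in the Cyrillic examples (Драк, Драчи).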

As for Romanian and Cyrillic script - yes, the church used it because of the influence of the Russian/Ukrainian church, from where the religion came to Romania. The liturgy in churches was in Church Slavic until the 18th century. There wasn't a lot of literacy outside the churches. But do not imagine the Romanian Cyrillic script to be anything like the modern Bulgarian, Russian or Serbian script, as it was derived from the Church Slavic Cyrillic script ( https://ro.wikipedia.org/wiki/Alfabetul_limbii_rom%C3%A2ne#/media/Fi%C8%99ier:Romanian_Cyrillic_-_Lord's_Prayer_text.svg ). It contains letters like șt, în, ia, ie or ou - a single letter for a group of sounds.
The Latin alphabet for the Romanian language was established in 1860.
In Bucharest, in the National Museum, you can also find tombstones written in Romanian using the Greek alphabet.

CM630

  • Hero Member
  • *****
  • Posts: 1168
  • I am not sure that I understand you.
    • http://sourceforge.net/u/cm630/profile/
Re: Writing a dictionary software for a spoken language
« Reply #20 on: September 06, 2024, 08:36:16 pm »
...
C followed by E or by I isn't read C (K) anymore, but rather like English CH in Charlie. Same goes for G followed by E or by I - it is like G in Georgia, not like G in Bulgaria.
It is in the nature of the language.
...
I am aware of that; that is why I gave it as an example. But the nature of the language is one thing, while the alphabet (which is much more artificial) is another thing. Your own examples show the ambiguous usage of „c“ and „g“, which might not be present in the Yugoslavian script (I believe you know whether it is so or not, I am just assuming).
Romanian history does not seem to be in your area of interest, but maybe you would find it interesting that the Serbian letter „Џ“ is actually an adopted Romanian/Wallachian letter for the same sound.
Since we have got far away from Lazarus, I think I shall say nothing more in this thread.
« Last Edit: September 06, 2024, 08:38:18 pm by CM630 »

MarkMLl

  • Hero Member
  • *****
  • Posts: 7622
Re: Writing a dictionary software for a spoken language
« Reply #21 on: September 06, 2024, 09:39:30 pm »
I am aware of that; that is why I gave it as an example. But the nature of the language is one thing, while the alphabet (which is much more artificial) is another thing.

All of which reinforces the concept of Notation as a Tool of Thought: I believe that one can only get so far with stored audio snippets.

MarkMLl

 
