If I had any knowledge of .po files, I wouldn't mind having a go at it.
If I understand any of it, then msgid and msgstr (if not "") should hold the same format arguments?
I assume msgstr is the translation of msgid?
I also assume we would check the created bla.xx.po against bla.po?
This should not be too difficult then?
What exactly do you mean by unused resourcestrings?
Can you also give an example of a duplicate resourcestring?
Would we want it to be a GUI or a console program (with batch processing of all .po files)?
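The format-argument check asked about above could be sketched roughly like this. It is only a sketch under simplifying assumptions: the names ExtractFormatArgs and SameFormatArgs are made up for the example, only the conversion characters of Format() are compared (index, width and precision are skipped), and an empty msgstr is treated as "not translated yet".

```pascal
program FormatArgsDemo;
{$mode objfpc}{$H+}

{ Collects the conversion characters of all Format() placeholders in S,
  in order of appearance. '%%' is a literal percent sign and is skipped.
  Simplification: index, width and precision are skipped, not compared. }
function ExtractFormatArgs(const S: string): string;
var
  i: Integer;
begin
  Result := '';
  i := 1;
  while i <= Length(S) do
  begin
    if S[i] = '%' then
    begin
      Inc(i);
      if (i <= Length(S)) and (S[i] = '%') then
      begin
        Inc(i);          // '%%' is not a placeholder
        Continue;
      end;
      // skip an optional index, flags, width and precision
      while (i <= Length(S)) and (S[i] in ['0'..'9', ':', '-', '.', '*']) do
        Inc(i);
      if (i <= Length(S)) and
         (UpCase(S[i]) in ['D', 'U', 'E', 'F', 'G', 'N', 'M', 'P', 'S', 'X']) then
        Result := Result + UpCase(S[i]);
    end
    else
      Inc(i);
  end;
end;

{ True if the translation uses the same placeholders as the original.
  An empty MsgStr means "not translated yet" and is not reported. }
function SameFormatArgs(const MsgId, MsgStr: string): Boolean;
begin
  Result := (MsgStr = '') or
            (ExtractFormatArgs(MsgId) = ExtractFormatArgs(MsgStr));
end;

begin
  WriteLn(SameFormatArgs('Unable to open %s', 'Kan %s niet openen'));
  WriteLn(SameFormatArgs('Unable to open %s', 'Kan s% niet openen'));
  WriteLn(SameFormatArgs('Line %d of %s', ''));
end.
```

A translation that legitimately reorders arguments with explicit indexes (%1:s before %0:d) would be flagged by this simple comparison, so a real check would have to take the index into account.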
The spec is here: http://www.gnu.org/s/hello/manual/gettext/PO-Files.html
"Unused resourcestrings" means that a string is defined but not used anywhere in the project's Pascal source.
To find out whether the string is used, you need to scan all the source files. A simple search operation without context checking should be enough.
You will always find one instance of the string, which is its definition. You can make a rule:
- no instances found -> something is wrong, should not happen.
- one instance found -> unused in source.
- more than one instance -> ok, used.
Again, if the tool is to be a general-purpose tool, then resourcestrings are only part of it. A lot of people, including myself, use gnugettext exclusively with the _() function. The advantage of this method is that the strings and the code are kept close together.
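The counting rule above (no instance, one instance, more than one) could be sketched as follows. This is an illustration only: CountOccurrences, ClassifyUsage and the sample source texts are invented for the example, and each item in the list is assumed to hold the complete text of one source file.

```pascal
program UnusedDemo;
{$mode objfpc}{$H+}

uses
  SysUtils, Classes;

type
  TUsage = (uNotFound, uUnused, uUsed);

const
  UsageNames: array[TUsage] of string = ('not found', 'unused', 'used');

{ Counts non-overlapping, case-insensitive occurrences of Ident in Source.
  No context checking: comments and longer identifiers also count. }
function CountOccurrences(const Ident, Source: string): Integer;
var
  LIdent, LSource: string;
  p: Integer;
begin
  Result := 0;
  if Ident = '' then
    Exit;
  LIdent := LowerCase(Ident);
  LSource := LowerCase(Source);
  p := Pos(LIdent, LSource);
  while p > 0 do
  begin
    Inc(Result);
    Delete(LSource, 1, p + Length(LIdent) - 1);
    p := Pos(LIdent, LSource);
  end;
end;

{ Applies the rule: 0 hits = something is wrong, 1 hit = only the
  definition was found (unused), more than 1 hit = used. }
function ClassifyUsage(const Ident: string; Sources: TStrings): TUsage;
var
  i, Total: Integer;
begin
  Total := 0;
  for i := 0 to Sources.Count - 1 do
    Inc(Total, CountOccurrences(Ident, Sources[i]));
  case Total of
    0: Result := uNotFound;
    1: Result := uUnused;
  else
    Result := uUsed;
  end;
end;

var
  Sources: TStringList;
begin
  Sources := TStringList.Create;
  try
    // Each item stands for the full text of one (made-up) source file.
    Sources.Add('resourcestring rsOpen = ''Open''; rsClose = ''Close'';');
    Sources.Add('Button1.Caption := rsOpen;');
    WriteLn('rsOpen: ', UsageNames[ClassifyUsage('rsOpen', Sources)]);
    WriteLn('rsClose: ', UsageNames[ClassifyUsage('rsClose', Sources)]);
    WriteLn('rsMissing: ', UsageNames[ClassifyUsage('rsMissing', Sources)]);
  finally
    Sources.Free;
  end;
end.
```

Because there is no context checking, an identifier such as rsOpen would also be counted inside rsOpenFile or inside a comment, so the result is a hint rather than a proof.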
I don't know exactly how to use the gnugettext _() function.
Do you just pass the string as a parameter for the function, and you don't need any resourcestring sections?
That's it. No resourcestring. Very easy also when translation comes as a late requirement in the project: just put a _() around the strings that need to be translated.
If there is existing open source .po parser code it can be used of course.
Yet, parsing the .po file is not very difficult even with the string concatenation rules.
Parsing the .po file is indeed the easy part. The dxgettext code is the interesting part. It contains code to extract all kinds of strings from sources (dpr, pas, dfm, rc, ...) and even exe files. LFM and LPR are probably easy to add. It does some basic Pascal parsing to extract only resourcestrings and gnugettext function parameters (i.e. the text to translate).
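As an illustration of the string concatenation rule mentioned above (a msgid or msgstr value may continue on following lines that each hold another quoted string), here is a small sketch. QuotedPart and ReadPoValue are hypothetical helper names, and escape sequences such as \n are deliberately left untouched.

```pascal
program PoConcatDemo;
{$mode objfpc}{$H+}

uses
  SysUtils, Classes;

{ Returns the text between the first and the last double quote of Line,
  or '' if there is no quoted part. Escape sequences are left as-is. }
function QuotedPart(const Line: string): string;
var
  First, Last: Integer;
begin
  Result := '';
  First := Pos('"', Line);
  Last := Length(Line);
  while (Last > First) and (Line[Last] <> '"') do
    Dec(Last);
  if (First > 0) and (Last > First) then
    Result := Copy(Line, First + 1, Last - First - 1);
end;

{ Reads the value that starts at Lines[Index] (a msgid or msgstr line)
  and appends all directly following continuation lines, i.e. lines that
  start with a double quote. Index is left on the last line consumed. }
function ReadPoValue(Lines: TStrings; var Index: Integer): string;
begin
  Result := QuotedPart(Lines[Index]);
  while (Index + 1 < Lines.Count) and
        (Copy(TrimLeft(Lines[Index + 1]), 1, 1) = '"') do
  begin
    Inc(Index);
    Result := Result + QuotedPart(Lines[Index]);
  end;
end;

var
  Lines: TStringList;
  i: Integer;
begin
  Lines := TStringList.Create;
  try
    Lines.Add('msgid ""');
    Lines.Add('"First part, "');
    Lines.Add('"second part"');
    Lines.Add('msgstr ""');
    i := 0;
    WriteLn(ReadPoValue(Lines, i));  // prints: First part, second part
  finally
    Lines.Free;
  end;
end.
```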
You are making this too big for me.
I can see myself writing a tool for processing po files and identifying wrong format parameters and duplicate resourcestring values (in one po file, so not across different po files, or is this "required" too?), but making it a general-purpose translation checking tool is way beyond me.
By the way, looking for unused resourcestrings: can't we get that info from the compiler (doesn't it generate a hint) if we build Lazarus?
There are probably also other standard (and highly specialized) tools available for checking vast amounts of files for the occurrence of a particular string?
If you want a simple tool, I think I can write one.
It can have a GUI, and it can have a console front-end (which can be used for automated testing).
If you want more than that, I don't think I can participate.
But seriously, often people are too shy or afraid or something with their own code. In reality they can create good code. Maybe the core developers' critical comments scare them away.
I have no experience with string-string maps, search trees and so on.
So, my first job would be to think about a suitable data model: how do I store the strings/string-pairs, in a way I can search it fast.
Given enough time, I'll come up with something.
Where to share my initial code then?
I ended up adapting the code from Translations unit, stripping everything out that is not needed atm.
Still, it takes up to 2-3 seconds to load lazaruside.xx.po (the bloody thing being more than 0.5 MB).
[... skip ...]
Juha: can you take a look? Is the TSimplePoFile class feasible for what we want to achieve, or should I go another way?
I'd rather hear this now than try to rewrite the base class of the app much later in the process.
Original code | Skip adding entries to TStringHashList | Skip adding entries altogether
5713 ticks | 5798 ticks | 5105 ticks
I rewrote the ReadPoText() procedure to use Strings instead of PChars.
Original code using PChars: 6427 ticks.
New code using Strings: 300 ticks.
Wow: 20 times faster!
Is there a cross-platform alternative to "GetTickCount"? I had to comment the calls out when testing on Linux.
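One portable way to time a piece of code, sketched here under the assumption that millisecond resolution is good enough, is to use Now from SysUtils together with MilliSecondsBetween from DateUtils. The loop is just a stand-in for the real work (loading a .po file). Newer FPC releases also appear to ship a GetTickCount64 in SysUtils, and the LCL has a GetTickCount in LCLIntf, but check what your version offers.

```pascal
program TimingDemo;
{$mode objfpc}{$H+}

uses
  SysUtils, DateUtils;

var
  Start: TDateTime;
  i: Integer;
  Sum: Int64;
begin
  Start := Now;

  // Placeholder workload; replace with the code to be measured.
  Sum := 0;
  for i := 1 to 10000000 do
    Inc(Sum, i);

  WriteLn('Sum: ', Sum);
  WriteLn('Elapsed: ', MilliSecondsBetween(Now, Start), ' ms');
end.
```

The resolution of Now depends on the platform, so very short runs are better measured over several repetitions.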
The container speed was also a surprise for me. I guess the reason is that there are only a few thousand items. The speed differences become relevant only when there is much more data.
To me it looks as if handling Pascal strings in a Pascal manner is faster than handling the entire data as a PChar.
In ReadPOText(s: String) the string is typecast to a PChar and then there is a lot of pointer arithmetic to determine line endings etc.
This overhead is redundant once we treat the data as a stringlist.
Could you please leave the old parser code in there too, behind an IFDEF or other switch, so the two are easy to compare?
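Keeping both implementations behind a compile-time switch could look roughly like this. The symbol USE_PCHAR_PARSER and the CountLines function are invented for the example; they only stand in for the old PChar-based and the new String-based parser code.

```pascal
program ParserSwitchDemo;
{$mode objfpc}{$H+}

// Uncomment the next line (or pass -dUSE_PCHAR_PARSER to fpc)
// to build with the old PChar-based variant.
// {$DEFINE USE_PCHAR_PARSER}

{$IFDEF USE_PCHAR_PARSER}
{ Old style: walk the data with a PChar until the terminating #0. }
function CountLines(const S: string): Integer;
var
  p: PChar;
begin
  Result := 0;
  if S = '' then
    Exit;
  Result := 1;
  p := PChar(S);
  while p^ <> #0 do
  begin
    if p^ = #10 then
      Inc(Result);
    Inc(p);
  end;
end;
{$ELSE}
{ New style: index the Pascal string directly. }
function CountLines(const S: string): Integer;
var
  i: Integer;
begin
  Result := 0;
  if S = '' then
    Exit;
  Result := 1;
  for i := 1 to Length(S) do
    if S[i] = #10 then
      Inc(Result);
end;
{$ENDIF}

begin
  WriteLn(CountLines('msgid "a"'#10'msgstr "b"'));  // prints: 2
end.
```

Both variants share one signature, so the calling code and any timing code stay the same whichever one is compiled in.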
In your GUI, the "select all tests" still creates a RunError 202. It may be caused by a recursive event handler loop and a stack overflow. Tested on Linux with QT bindings.
I was thinking if this is made a package, ...
I included an experimental po-file highlighter for the synedit in the results form.
Updated sources at: http://home.tiscali.nl/~knmg0017/software/gpocheck_bron.zip
I would like to offer the highlighter as an addition to the synedit package of Lazarus.
@Juha: Just saw you filed a bugreport on the issue (#20927 (http://bugs.freepascal.org/view.php?id=20927)), so it seems it's not my part of the code O:-)
I'll have a look at it when I have time. Probably not before January though.
I started to add it.
It appears in the IDE now (for the source editor),
but it has empty default colors.
If someone has a preference for colors, send me an export; otherwise I will put in some random colors.
<Color Version="8">
<Langpo_language_files Version="8">
<SchemeDefault>
<Key Style="fsBold"/>
<Flags Foreground="clTeal"/>
<String Foreground="clFuchsia"/>
<Comment Style="fsItalic" Foreground="clGreen"/>
<Identifier Style="fsBold" Foreground="clGreen"/>
<Previous_value Style="fsItalic" Foreground="clOlive"/>
</SchemeDefault>
</Langpo_language_files>
</Color>
@Juha: time to add this to Lazarus tools?
Feel free to make it a package.
I would still like to have this tool available as a separate project for fpc/Lazarus users though...
When you commit the code, credits are appreciated O:-)
Ok, I did so. Now the package can be selected in the list of IDE packages. It installs under Tools menu.
Please test.
The stand-alone application version should be maintained somewhere else. Do you have access to CCR?
There could also be project files under the package directory and one could build the application from the same sources.
However, that will not work when the package gets more IDE integration, so I don't consider it a good idea.
As long as the SimplePoFiles and PoFamilies units, which basically do all the reading and testing, remain independent of the IDE (and I see no reason why they shouldn't) a simple stand-alone version is IMHO still a feasible project to maintain.
I think the shared code also needs more attention. I guess it is a char-encoding problem.
For example "lazaruside.he.po" reports 260 errors in Format params although they look ok.
Looking at it, I see several s% instead of %s.
I know Hebrew (I guess it is Hebrew, I cannot display the characters on my system) is RTL, but should this affect the %s as I can see them in a UTF-8 capable (SynEdit) editor?
The lazaruside.zh_CN.po file (which is some kind of Chinese, I guess, and so also RTL) on the other hand has only 13 errors in this test.
So maybe the Hebrew translation actually is off???
Bart, I added the project files for PoChecker into "Proj" directory. The project uses the same source files except for the main form file which I copied from the package. (Earlier I improved its anchors and layout).
Please test.
The test for duplicate values is broken. If you have an idea how to fix it, please tell me. Otherwise I will learn the code better at some time.
#: dummy.wiseditform
msgid "&Edit"
msgstr ""
#: dummy.wiseditsource
msgid "&Edit"
msgstr ""
In lazaruside.po, can you please point out one duplicate (with the corresponding line numbers) so I can re-test?
I found approximately 250 of them, but all of them have a context specified.
I excluded items with a context from the check, based upon the documentation I found in the wiki: http://wiki.lazarus.freepascal.org/Translations_/_i18n_/_localizations_for_programs#Fuzzy_entries
I can simply add a check for all duplicates if you want, but I think 2 (or more) entries with the same text, but with a context specified, should not be considered an error?
Where do you see 250 of them?
for i := FMaster.Count - 1 downto 0 do
begin
  PoItem := FMaster.PoItems[i];
  Dup := FMaster.OriginalToItem(PoItem.Original);
  if Assigned(Dup) and (Dup.Identifier <> PoItem.Identifier) and (Dup.Context = '') then
to:
for i := FMaster.Count - 1 downto 0 do
begin
  PoItem := FMaster.PoItems[i];
  Dup := FMaster.OriginalToItem(PoItem.Original);
  // remove the check for (Dup.Context = '')
  if Assigned(Dup) and (Dup.Identifier <> PoItem.Identifier) then
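To make the effect of the context check concrete, here is a small self-contained sketch. It is not the code from pofamilies.pp: TPoEntry and CountDuplicates are invented for the example, the rule is simplified (an entry is skipped when it carries its own msgctxt), and the third entry is made up to show an item with a context. The lookup is a plain nested loop instead of a hash list.

```pascal
program DuplicateDemo;
{$mode objfpc}{$H+}

type
  TPoEntry = record
    Identifier: string;
    Context: string;   // msgctxt, '' if absent
    Original: string;  // msgid
  end;

{ Counts entries whose msgid also occurs in an earlier entry.
  With IgnoreWithContext = True, an entry that carries a msgctxt
  is not reported. }
function CountDuplicates(const Entries: array of TPoEntry;
  IgnoreWithContext: Boolean): Integer;
var
  i, j: Integer;
begin
  Result := 0;
  for i := 1 to High(Entries) do
  begin
    if IgnoreWithContext and (Entries[i].Context <> '') then
      Continue;
    for j := 0 to i - 1 do
      if Entries[j].Original = Entries[i].Original then
      begin
        Inc(Result);
        Break;
      end;
  end;
end;

var
  Entries: array[0..2] of TPoEntry;
begin
  Entries[0].Identifier := 'dummy.wiseditform';
  Entries[0].Context := '';
  Entries[0].Original := '&Edit';

  Entries[1].Identifier := 'dummy.wiseditsource';
  Entries[1].Context := '';
  Entries[1].Original := '&Edit';

  // Hypothetical third entry that does have a context.
  Entries[2].Identifier := 'dummy.wiseditmenu';
  Entries[2].Context := 'dummy.wiseditmenu';
  Entries[2].Original := '&Edit';

  WriteLn(CountDuplicates(Entries, True));   // prints: 1
  WriteLn(CountDuplicates(Entries, False));  // prints: 2
end.
```

With the context check in place only the second entry is reported; without it, the entry that has a msgctxt is reported as well, which is how a large number of hits can appear for entries that all carry a context.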
--------------------------------------------------
Errors reported by CheckDiplicateOriginals for:
lazaruside.po
--------------------------------------------------
[Line: 17722]
This resourcestring:
#: lazarusidestrconsts.uemsetfreebookmark
msgid "Set a Free Bookmark"
msgctxt "lazarusidestrconsts.uemsetfreebookmark"
has the same value as idenftifier lazarusidestrconsts.lismenusetfreebookmark at line 10971
For this entry it is recommended to set: msgctxt="lazarusidestrconsts.uemsetfreebookmark"
...snip ...
[Line: 12]
This resourcestring:
#: lazarusidestrconsts.dbgbreakgroupdlgcaptionenable
msgid "Select Groups"
msgctxt "lazarusidestrconsts.dbgbreakgroupdlgcaptionenable"
has the same value as idenftifier lazarusidestrconsts.dbgbreakgroupdlgcaptiondisable at line 7
For this entry it is recommended to set: msgctxt="lazarusidestrconsts.dbgbreakgroupdlgcaptionenable"
Found 253 errors.
--------------------------------------------------
Total errors found: 253
Should we have separate Errors and Warnings (and then separate counters for them)?
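Separate counters for errors and warnings could be kept along these lines. This is only a sketch of the idea: TSeverity, TReport, InitReport and AddFinding are hypothetical names, and which findings count as a warning rather than an error is an assumption made for the example.

```pascal
program SeverityDemo;
{$mode objfpc}{$H+}

type
  TSeverity = (svWarning, svError);

  TReport = record
    Counts: array[TSeverity] of Integer;
  end;

const
  SeverityNames: array[TSeverity] of string = ('Warning', 'Error');

procedure InitReport(out R: TReport);
begin
  R.Counts[svWarning] := 0;
  R.Counts[svError] := 0;
end;

{ Prints the finding and bumps the counter for its severity. }
procedure AddFinding(var R: TReport; Severity: TSeverity; const Msg: string);
begin
  Inc(R.Counts[Severity]);
  WriteLn(SeverityNames[Severity], ': ', Msg);
end;

var
  R: TReport;
begin
  InitReport(R);
  AddFinding(R, svError, 'format arguments differ between msgid and msgstr');
  AddFinding(R, svWarning, 'duplicate msgid, but msgctxt is set');
  WriteLn('Errors: ', R.Counts[svError], ', warnings: ', R.Counts[svWarning]);
end.
```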
[...]
In pofamilies.pp, procedure TPoFamily.CheckDuplicateOriginals, change these lines