
Author Topic: Idea for a Lazarus addition: resource string validator  (Read 32214 times)

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 1458
Idea for a Lazarus addition: resource string validator
« on: December 06, 2011, 04:33:28 pm »
Even translated strings can cause crash bugs in Lazarus. See:
 Lazarus IDE shortcuts can't be changed
 http://bugs.freepascal.org/view.php?id=20811

If someone is looking for an idea for a project, here is one:

Make a Lazarus package that checks for Format() parameter errors in translated .po files.
It should also check for unused and duplicate resource strings. There are plenty of them.

Juha

Bart

  • Hero Member
  • *****
  • Posts: 1458
    • Bart en Mariska's Webstek
Re: Idea for a Lazarus addition: resource string validator
« Reply #1 on: December 06, 2011, 11:08:52 pm »
If I would have any knowledge about .po files I wouldn't mind having a go at it.

If I understand any of it, then msgid and msgstr (if not "") should hold the same format arguments?
I assume msgstr is the translation of msgid?
I also assume we would check the created bla.xx.po against bla.po?

This should not be too difficult then?

What exactly do you mean by unused resourcestrings?
Can you also give an example of a duplicate resourcestring?

Would we want it to be a gui or console program (with batch processing all po files?)?

Bart

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 1458
Re: Idea for a Lazarus addition: resource string validator
« Reply #2 on: December 07, 2011, 01:00:09 am »
Quote
If I would have any knowledge about .po files I wouldn't mind having a go at it.

If I understand any of it, then msgid and msgstr (if not "") should hold the same format arguments?
I assume msgstr is the translation of msgid?
I also assume we would check the created bla.xx.po against bla.po?

Yes, msgstr is the translation of msgid and msgid is originally defined in a pascal source file under a resourceString section.
For the validator program's purposes, the "master .po file", bla.po in your example can be used as a main source for resource strings.
The format params (%x) in those strings should be compared with the translated strings. The language code in translation file names is typically 2 chars (like .de.po) but can be 5 chars when it includes a country (like .pt_BR.po).
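The comparison described above can be sketched roughly as follows. This is a minimal illustration in Python, not the validator's code: the names `format_args` and `same_format_args` are made up for the example, and the regular expression is an assumption about the placeholder syntax of Pascal's Format() (`%[index:][-][width][.precision]type`, with `%%` as a literal percent sign).

```python
import re

# Assumed shape of a Pascal Format() placeholder:
#   %[index:][-][width][.precision]type
# "%%" is a literal percent sign and consumes no argument.
SPEC = re.compile(
    r"%(?:(\d+):)?-?(?:\d+|\*)?(?:\.(?:\d+|\*))?([dDeEfFgGmMnNpPsSuUxX%])"
)

def format_args(text):
    """Return the set of (argument index, type letter) pairs a string uses."""
    args = set()
    position = 0
    for match in SPEC.finditer(text):
        index, kind = match.groups()
        if kind == "%":
            continue  # literal percent sign
        if index is not None:
            position = int(index)  # an explicit index resets the position
        args.add((position, kind.lower()))
        position += 1
    return args

def same_format_args(msgid, msgstr):
    """An empty msgstr means 'not translated yet' and is not an error."""
    if msgstr == "":
        return True
    return format_args(msgid) == format_args(msgstr)
```

Comparing sets of (index, type) pairs means a translation that reorders its arguments with explicit indexes, such as `%1:d ... %0:s`, is still accepted, while a `%s` that became a `%d` is reported.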

Quote
This should not be too difficult then?

What exactly do you mean by unused resourcestrings?
Can you also give an example of a duplicate resourcestring?

"Unused resourcestrings" means that a string is defined but not used anywhere in the project's pascal source.
To find out if the string is used or not you need to scan all the source files. A simple search operation without context checking should be enough.
You will always find one instance of the string which is its definition. You can make a rule:
- if no instances are found -> something is wrong, should not happen.
- one instance found -> unused in source.
- more than one instance -> ok, used.
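The three-way rule above could look something like this. It is only a sketch in Python under simple assumptions: `classify_usage` is a hypothetical name, the sources are passed in as a dictionary of file name to text, and the search ignores case because Pascal identifiers are case-insensitive. As suggested, there is no context checking, so a hit inside a comment also counts.

```python
import re

def classify_usage(identifier, sources):
    """Classify a resourcestring name by how often it occurs in the sources.

    sources maps a file name to the Pascal source text of that file.
    """
    # Whole-word, case-insensitive match; no attempt to skip comments or strings.
    pattern = re.compile(r"\b" + re.escape(identifier) + r"\b", re.IGNORECASE)
    count = sum(len(pattern.findall(text)) for text in sources.values())
    if count == 0:
        return "missing"   # not even the definition was found
    if count == 1:
        return "unused"    # only the definition itself
    return "used"
```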

"Duplicate resourcestring" means that 2 or more string definitions have the exact same text.
However they can't always be combined into 1 resourcestring because their meaning may depend on context and they may need a different translation.

In practice you need to keep all the string names and values in a hash- or tree-map for a fast lookup.
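As a rough illustration of that lookup structure, here is a small Python sketch that groups string names by their value using a hash map. The function name `find_duplicates` and the sample identifiers in the usage note are invented for the example.

```python
from collections import defaultdict

def find_duplicates(entries):
    """Group resourcestring names by value and keep only values used twice or more.

    entries is an iterable of (name, value) pairs.
    """
    names_by_value = defaultdict(list)
    for name, value in entries:
        names_by_value[value].append(name)
    return {
        value: names
        for value, names in names_by_value.items()
        if len(names) > 1
    }
```

Calling it with `[("rsOpen", "Open"), ("rsMenuOpen", "Open"), ("rsClose", "Close")]` reports that "Open" is defined under two names. Whether the two can really be merged is still a human decision, since the meaning may depend on context.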

Quote
Would we want it to be a gui or console program (with batch processing all po files?)?

Being a Lazarus package it should have a GUI.  It could show reports of its findings in listboxes for example.
It could be simple first. It usually happens that people (and the author himself) start to get ideas for improvements after a first simple version is done.
Like in my Tools -> Example Projects ... feature. It was very simple first. Then Martin suggested a load of improvements and I figured some more myself.
It means, initially keep it simple. The refinement and complication comes later by itself.

This validator would benefit all translated applications made with Lazarus, not only the Lazarus project.

Juha
« Last Edit: December 07, 2011, 01:05:51 am by JuhaManninen »

ludob

  • Hero Member
  • *****
  • Posts: 1173
Re: Idea for a Lazarus addition: resource string validator
« Reply #3 on: December 07, 2011, 09:10:44 am »
Quote
If I understand any of it, then msgid and msgstr (if not "") should hold the same format arguments?
I assume msgstr is the translation of msgid?
Spec is here : http://www.gnu.org/s/hello/manual/gettext/PO-Files.html.
If the tool is to be a general po checker then you need to pay attention to some oddities such as the string concatenation rule found at the bottom of the spec. PO files can be generated with all kinds of external tools.
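To show what the concatenation rule means for a parser, here is a deliberately small Python sketch. It is not the parser from the Translations unit or from dxgettext. It only understands `msgid` and `msgstr`, ignores comments, `msgctxt` and plural forms, and the names `unquote` and `parse_po` are made up for the example.

```python
import re

ESCAPES = {"n": "\n", "t": "\t", '"': '"', "\\": "\\"}

def unquote(token):
    """Strip the surrounding quotes and resolve the common backslash escapes."""
    body = token.strip()[1:-1]
    return re.sub(r"\\(.)", lambda m: ESCAPES.get(m.group(1), m.group(1)), body)

def parse_po(text):
    """Return a list of (msgid, msgstr) pairs from the text of a .po file."""
    entries = []
    current = {}
    field = None  # keyword whose string is currently being continued

    def flush():
        if "msgid" in current and "msgstr" in current:
            entries.append((current["msgid"], current["msgstr"]))
        current.clear()

    for line in text.splitlines():
        line = line.strip()
        if line.startswith('"') and field:
            # Concatenation rule: adjacent quoted strings form one string.
            current[field] += unquote(line)
        elif line.startswith(("msgid ", "msgstr ")):
            keyword, rest = line.split(" ", 1)
            if keyword == "msgid":
                flush()  # a new msgid starts a new entry
            field = keyword
            current[field] = unquote(rest)
        else:
            field = None  # comment, blank line, or a keyword this sketch ignores
    flush()
    return entries
```

A msgid written as `msgid ""` followed by two quoted continuation lines comes out as one joined string, which is the form the format-argument check needs to see.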

Quote
"Unused resourcestrings" means that a string is defined but not used anywhere in the project's pascal source.
To find out if the string is used or not you need to scan all the source files. A simple search operation without context checking should be enough.
You will always find one instance of the string which is its definition. You can make a rule:
- if no instances are found -> something is wrong, should not happen.
- one instance found -> unused in source.
- more than one instance -> ok, used.
Again if the tool is to be a general purpose tool, then resource strings are only part of it.  A lot of people, including myself, use gnugettext exclusively with the _() function. Advantage of this method is that the strings and code are kept close together.

An interesting open source (MozillaPL) project and a good starting point is dxgettext http://dxgettext.po.dk/download/ written in Delphi but including support for different pascal dialects. It contains a po parser, a string extractor, a "what strings have changed in the source compared to po" tool, etc.

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 1458
Re: Idea for a Lazarus addition: resource string validator
« Reply #4 on: December 07, 2011, 10:23:16 am »
I don't know exactly how to use the gnugettext _() function.
Do you just pass the string as a parameter for the function, and you don't need any resourcestring sections?

Lazarus does not use that. I would suggest the validator tool first supports this resourcestring system and has the gettext func support as a ToDo item.
If there is existing open source .po parser code it can be used of course.
Yet, parsing the .po file is not very difficult even with the string concatenation rules.

Juha

ludob

  • Hero Member
  • *****
  • Posts: 1173
Re: Idea for a Lazarus addition: resource string validator
« Reply #5 on: December 07, 2011, 11:08:39 am »
Quote
I don't know exactly how to use the gnugettext _() function.
Do you just pass the string as a parameter for the function, and you don't need any resourcestring sections?
That's it. No resourcestring. Very easy also when translation comes as late requirement in the project: just put a _() around the strings that need to be translated.

Quote
If there is existing open source .po parser code it can be used of course.
Yet, parsing the .po file is not very difficult even with the string concatenation rules.
po file parsing is indeed the easy part. The dxgettext code is the interesting part. It contains code to extract all kinds of strings from sources (dpr, pas, dfm, rc,..) and even exe files. LFM and LPR are probably easy to add. It does some basic pascal parsing to extract only resourcestrings and gnugettext function parameters (i.e. text to translate).
At least all the basic building blocks seem to be there to get a quick start on this project.
 

Bart

  • Hero Member
  • *****
  • Posts: 1458
    • Bart en Mariska's Webstek
Re: Idea for a Lazarus addition: resource string validator
« Reply #6 on: December 07, 2011, 02:43:27 pm »
You are making this too big for me.
I can see myself writing a tool for processing po files and identifying wrong format parameters, and duplicate resourcestring values (in one po file, so not across different po files, or is this "required" too?), but making it a general purpose translation checking tool is way beyond me.

B.t.w. looking for unused resourcestrings, can't we get that info from the compiler (doesn't it generate a hint), if we build lazarus?
There are also probably other standard (and highly specialized) tools available for checking vast amounts of files for the occurrence of a particular string?

If you want a simple tool, I think I can write one.
It can have a GUI, and it can have a console front-end (which can be used for automated testing).

If you want more than that, I don't think I can participate.

Bart

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 1458
Re: Idea for a Lazarus addition: resource string validator
« Reply #7 on: December 07, 2011, 10:26:28 pm »
Quote
You are making this too big for me.

You are a "Hero Member", how can it be big for you?
... ok, just kidding :)
But seriously, often people are too shy or afraid or something with their own code. In reality they can create good code. Maybe the core developers' critical comments scare them away.

For example I can't handle very big or complex things myself, yet I commit code to a big Lazarus project. The secret is to go in small steps. Learn existing code, copy and modify other people's code shamelessly, make a small improvement... Repeat that and finally you have a big feature.

Quote
I can see myself writing a tool for processing po files and identifying wrong format parameters, and duplicate resourcestring values (in one po file, so not across different po files, or is this "required" too?), but making it a general purpose translation checking tool is way beyond me.

You can go ahead with that. Later someone will add features anyway.
BTW, checking duplicate string values from many files is as easy as from one file. In practice you keep the values and names in a string-string map and you check if a value already exists there. It makes no difference if those values came from 2 files.

Quote
B.t.w. looking for unused resourcestrings, can't we get that info from the compiler (doesn't it generate a hint), if we build lazarus?
There are also probably other standard (and highly specialized) tools available for checking vast amounts of files for the occurrence of a particular string?

It may be easier to search the files than parse the compiler output. Besides it would require compiling all files in the project when you just want a report of resource strings.
There may be specialized tools for searching a word but this one would be even more specialized. It would search all resource strings in all source files of a Lazarus project. No such tool exists yet.
But, this "unused resourcestring" feature can be a ToDo item if you just implement the 2 things (format param check and duplicate value).

Quote
If you want a simple tool, I think I can write one.
It can have a GUI, and it can have a console front-end (which can be used for automated testing).

If you want more than that, I don't think I can participate.

People will come with improvement ideas and patches. I also will participate in it.
It is a psychological effect that people find it easier to improve an existing feature than to create a whole new feature. You just have to realize it and not be jealous for your code when someone wants to change it. That is how open source works.

I already mentioned the Example Projects dialog. Another one is the "Add Used Unit" dialog (Alt-F11). There was a feature request of such dialog for years but nobody had implemented it, until I did.
Then many people started to push patches to improve it and it still continues. It is getting too complex for my taste already but that is OK.

Juha

Bart

  • Hero Member
  • *****
  • Posts: 1458
    • Bart en Mariska's Webstek
Re: Idea for a Lazarus addition: resource string validator
« Reply #8 on: December 07, 2011, 11:25:10 pm »
Quote
But seriously, often people are too shy or afraid or something with their own code. In reality they can create good code. Maybe the core developers' critical comments scare them away.

I kind of am the unofficial maintainer of the maskedit unit, which at one stage I almost completely rewrote.
However, that kind of coding I find myself comfortable with.

Criticism of my coding is welcome, and it doesn't scare me. In the past this criticism from Lazarus/FPC devels has inspired me to improve my coding and explore new things in Pascal.
I have posted fixes in the bugtracker which didn't meet the standards and rewritten them until they did.
If it turns out to be more than I can chew on, I (gracefully) stand down.
I feel no shame in that.

I have no experience with string-string maps, search trees and so on.
So, my first job would be to think about a suitable data model: how do I store the strings/string-pairs, in a way I can search it fast.

Given enough time, I'll come up with something.

Where to share my initial code then?

Bart

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 1458
Re: Idea for a Lazarus addition: resource string validator
« Reply #9 on: December 08, 2011, 12:07:52 am »
Quote
I have no experience with string-string maps, search trees and so on.
So, my first job would be to think about a suitable data model: how do I store the strings/string-pairs, in a way I can search it fast.

Given enough time, I'll come up with something.

Where to share my initial code then?

TStringToStringTree in unit AvgLvlTree is a good one.
I could commit the code, just copy here or to bugtracker or somewhere.

Juha

Bart

  • Hero Member
  • *****
  • Posts: 1458
    • Bart en Mariska's Webstek
Re: Idea for a Lazarus addition: resource string validator
« Reply #10 on: December 08, 2011, 06:31:11 pm »
I'll have a look at it.

If I come up with anything acceptable (to me), I'll post it in the bugtracker (which category b.t.w.?), and I'll assign it to you, if that's ok.
May take a while though...

Bart

Bart

  • Hero Member
  • *****
  • Posts: 1458
    • Bart en Mariska's Webstek
Re: Idea for a Lazarus addition: resource string validator
« Reply #11 on: December 11, 2011, 05:44:29 pm »
ATM I have the following working:
  • Check for incompatible format() arguments
  • Check for missing identifiers

I ended up adapting the code from Translations unit, stripping everything out that is not needed atm.
Still it takes up to 2-3 seconds to load lazaruside.xx.po (the bloody thing being up to > 0.5 MB)

I found 13 format argument errors in lazaruside.nl.po file to start with.

Source is at http://home.tiscali.nl/~knmg0017/software/gpocheck_bron.zip
(Lazarus 0.9.31 r33459 FPC 2.4.4 i386-win32-win32/win64)

Mind you this is just a testing example, to begin with you need to alter the hardcoded paths for the po-files in main.pp!

No error checking whatsoever has been added.
Test at your own risk.

Plans:
  • Option to iterate over all .xx.po files belonging to a master .po file
  • Option to test all .po files in batch mode
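The first planned option, finding the translations that belong to a master file, might be sketched like this in Python. The helper name `translations_of` is hypothetical, and the pattern assumes the naming convention mentioned earlier in the thread: a 2-letter language code, optionally followed by an underscore and a 2-letter country code.

```python
import re

def translations_of(master, file_names):
    """Map language code -> file name for translations of a master .po file.

    Assumes translations are named <base>.xx.po or <base>.xx_YY.po.
    """
    base = master[:-3] if master.endswith(".po") else master
    pattern = re.compile(re.escape(base) + r"\.([a-z]{2}(?:_[A-Z]{2})?)\.po")
    found = {}
    for name in file_names:
        match = pattern.fullmatch(name)
        if match:
            found[match.group(1)] = name
    return found
```

A batch mode could then loop over every master file in a directory and run the checks on each pair.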

Known bugs:
  • The line number isn't correct (see TSimplePoFile.ReadPOFile)

Juha: can you take a look. Is the TSimplePoFile class feasible for what we want to achieve, or should I go another way?
I'd rather hear this now than try to rewrite the base class of the app much later on in the process.

Bart

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 1458
Re: Idea for a Lazarus addition: resource string validator
« Reply #12 on: December 12, 2011, 12:57:23 pm »
Quote
I ended up adapting the code from Translations unit, stripping everything out that is not needed atm.
Still it takes up to 2-3 seconds to load lazaruside.xx.po (the bloody thing being up to > 0.5 MB)

[... skip ...]

Juha: can you take a look. Is the TSimplePoFile class feasible for what we want to achieve, or should I go another way?
I'd rather hear this now than try to rewrite the base class of the app much later on in the process.

It is good to use the existing parser as you did, from Translations unit. I had never looked at it before.
Your code found format param errors also in Finnish translation.

About the speed, this kind of reporting tool is not very speed critical.
Yet it would be interesting to see how much faster it becomes if you replace the StringHashList container with a real hash map.

StringHashList is a weird combination of containers. I was reading about it already when I used Delphi and did some tests. Its performance is poor, especially when adding items.
It calculates an integer hash key for the strings, but instead of using the key as an index for a bucket array, it inserts the keys into a sorted list!
Inserting items into a sorted list is always an expensive operation.
To find the strings it does a binary search in the integer hash list. It is a little faster than searching a sorted StringList because comparing integers is faster than comparing strings. It still needs to calculate the hash key before the binary search, which pretty much nullifies the speed gain.
The real benefit of this class is that you don't need to decide the bucket array size in the beginning as you need with a hash map, and also it saves some memory.
Maybe it made sense in the 1980s or early 90s with very limited memory but I don't see much use for this container any more.
Somehow it made its way to Lazarus code base. I bet Mattias didn't do any speed tests before using it!
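To make the description concrete, here is a toy model in Python of a container that works the way described above: integer hash keys kept in a sorted list, found again by binary search. It is not the code of TStringHashList. The class name `SortedHashList` is invented, and CRC32 is only a stand-in for whatever hash function the real class uses.

```python
import bisect
import zlib

class SortedHashList:
    """Toy model: hash keys in a sorted list instead of a bucket array."""

    def __init__(self):
        self._hashes = []  # integer hash keys, kept sorted
        self._items = []   # (key, value) pairs, parallel to _hashes

    @staticmethod
    def _hash(key):
        return zlib.crc32(key.encode("utf-8"))  # stand-in hash function

    def add(self, key, value):
        h = self._hash(key)
        pos = bisect.bisect_right(self._hashes, h)
        # list.insert shifts every later element: this is the expensive part.
        self._hashes.insert(pos, h)
        self._items.insert(pos, (key, value))

    def find(self, key):
        h = self._hash(key)
        pos = bisect.bisect_left(self._hashes, h)  # binary search on integers
        # Different strings may share a hash, so compare the strings too.
        while pos < len(self._hashes) and self._hashes[pos] == h:
            if self._items[pos][0] == key:
                return self._items[pos][1]
            pos += 1
        return None
```

A real hash map avoids both the element shifting in `add` and the binary search in `find`, which is where the expected difference for large inputs comes from.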

If you want to test the speed difference, you could use this:
 http://wiki.lazarus.freepascal.org/StringHashMap
or the similar class in LCL somewhere (forgot the name). My wild guess is that > 50% of .po file reading time is spent for adding items to the container. Using a hash map with O(1) time can make a BIG difference.

[Edit] In LCL there is TMap which is based on TAvgLvlTree. It is a balanced tree and pretty fast.
Still, hash maps are superior when there is lots of data. FCL contnrs has TFPDataHashTable which is a string-pointer hash map. I have not tested it myself.

Juha
« Last Edit: December 12, 2011, 03:34:45 pm by JuhaManninen »

Bart

  • Hero Member
  • *****
  • Posts: 1458
    • Bart en Mariska's Webstek
Re: Idea for a Lazarus addition: resource string validator
« Reply #13 on: December 13, 2011, 04:19:42 pm »
Felipe: I have done some speedtesting.
Mind you my machine is an 11 years old Intel Celeron 700 Mhz with 512 MB memory.

Original code | Skip adding entries to TStringHashList | Skip adding entries altogether
5713 ticks    | 5798 ticks                             | 5105 ticks

It seems that the internally used TFPList and TStringHashList are fast enough.
It is the parsing of the > 500 KB of strings that is the bottleneck here.

Bart

Bart

  • Hero Member
  • *****
  • Posts: 1458
    • Bart en Mariska's Webstek
Re: Idea for a Lazarus addition: resource string validator
« Reply #14 on: December 13, 2011, 04:23:28 pm »
I rewrote the ReadPoText() procedure to use Strings instead of PChars.

Original code using PChars: 6427 ticks.
New code using Strings: 300 ticks.

Wow: 20 times faster!

I updated the sources (see link in previous post), so you can test it.

Bart
« Last Edit: December 13, 2011, 06:03:58 pm by Bart »

 
