Lazarus

Programming => General => Topic started by: Paolo on August 21, 2021, 07:12:23 pm

Title: common sub-expressions
Post by: Paolo on August 21, 2021, 07:12:23 pm
Dear all,

I wrote a tool to compute math formulas (postifix notation). It can compute derivatives too, in principle of any order.
At the moment it is base on a tree representation, where nodes are operator/function, leaves are value/variable/parameter(another tree).

However, in particular with trigonometric function and derivatives many sub-expression appears several times in the formula tree: the question is how identify common sub-expressions (sub-trees) ?

In the annexed picture you see on top the main expression (the formula itself has only the purpose to explain  what I am looking for) and its tree representation, and below the sub-trees that I have to identify as well as the way the expression is computed (first y, then  x and finally the expression).

All this because the computations are done a large number of times and I want to speed the computation pre-computing the sub-tree expressions.
(PS: as further “complication" could come to the fact that, for example, from the math point of view a+b = b+a, but at the moment I am focuses only on "identical" sub-trees).

I can imagine that this kind of problem has already addressed, but looking on internet I did not find useful information. An example of code or pseudo-code should be very useful to me.

Thank you in advance.
Title: Re: common sub-expressions
Post by: marcov on August 21, 2021, 07:36:49 pm
Dear all,

I wrote a tool to compute math formulas (postifix notation). It can compute derivatives too, in principle of any order.
At the moment it is base on a tree representation, where nodes are operator/function, leaves are value/variable/parameter(another tree).

However, in particular with trigonometric function and derivatives many sub-expression appears several times in the formula tree: the question is how identify common sub-expressions (sub-trees) ?

I did the same once (see unit symbolic), where the derivative was used to establish taylor polynomals that could be more quickly repeatedly evaluated.

I never came as far as common subexpression elimination, but I did try to simple rearrangement in the hope that terms would cancel (if you ignore border conditions). IIRC they are part of the symbolic unit

Title: Re: common sub-expressions
Post by: MarkMLl on August 21, 2021, 08:20:38 pm
Could you label it like a Merkle Tree? Then if two nodes had the same hash you'd have a reasonable degree of confidence that they represented the same expression, irrespective of complexity. (Heckel's Algorithm is an equivalent for e.g. line-by-line text matching.)

MarkMLl
Title: Re: common sub-expressions
Post by: marcov on August 21, 2021, 09:42:24 pm
Could you label it like a Merkle Tree? Then if two nodes had the same hash you'd have a reasonable degree of confidence that they represented the same expression, irrespective of complexity. (Heckel's Algorithm is an equivalent for e.g. line-by-line text matching.)

(In the end I never did the real research for common subexpression elimination, but I would expect for this to work, you would first have to transform the tree into some normative form. (so that x+1 and 1+x get resolved to one of both, iow generating the same hash))
Title: Re: common sub-expressions
Post by: MarkMLl on August 21, 2021, 10:07:40 pm
(In the end I never did the real research for common subexpression elimination, but I would expect for this to work, you would first have to transform the tree into some normative form. (so that x+1 and 1+x get resolved to one of both, iow generating the same hash))

In practical terms I'd expect that a single programmer would almost always stick to a single expression order, at least in the same statement.

However, if every leaf were converted to a hash (i.e. even a small numeric literal became a hash) and every node was sorted provided that the operator were commutative, then I think things would fall out nicely.

It obviously wouldn't recognise --1 as equivalent to 1, but while I can think of a few cases where I've written that I'd not realistically expect the equivalence to be recognised.

All of which leads on to another of my interests which is the rich operator vocabulary of languages such as APL, and whether it can be reconciled with an ALGOL-based language which has the customary operator precedence. But I know nobody else around here cares about that :-)

MarkMLl
Title: Re: common sub-expressions
Post by: Paolo on August 27, 2021, 10:10:04 am
Here my reasoning…
I think I have to work on postfix string and search there the common substring. This could be possible since the math tree has to follow some rules and, as shown in the picture, to be a valid math expression the tree must have a certain structure, there could not be ambiguity, in the picture all the three trees have the same “string” but the latest two are not allowed. At the end I’ll try this route (here largely simplified for sake of simplicity):
1-   Write the passed expression (see first posted picture) in the example ‘++*+abc*+abc+ab’
2-   looking for a triplet made of (op, operand, operand)
3-   Once found check if more then one is present
4-   If 2. Yes remove the sub string, add the sub-formula to the list of sub formulas to be pre-computed, and put there the “precomputed value”
5-   Repeat step 2 until not found substrings triplets
In the example:
At the begin, Sub-formulas list = nil, formla.string= ‘++*+abc*+abc+ab’
Do step 2
Do step 3, found ‘+ab’
Do step 4, found other two ‘+ab’, add Y=sub-formulas[0]=’+ab’, formula now ‘++*Yc*YcY’
Second run, at step 4 found other two ‘*cY’, add X=sub-formulas[1]=’*Yc’, formula now ‘++XXY’
Third run, END.
In my application I have functions like (log, tan, etc..), first do the step above (to precompute the argument of functions) and then looking for triples made of (op, function or operand, function or operand).

That’s should work, I’ll try asap.

Title: Re: common sub-expressions
Post by: MarkMLl on August 27, 2021, 10:43:58 am
I would advise against manipulating stuff at the text level. Build a tree in memory and manipulate that.

I'm speaking from experience here. I relied excessively on text-level operations when I was doing my masters project quite some number of years ago, I did it with a custom command processor rather more recently, and the code that runs the business is largely based on a derivative of Meta-2 which consumes ALGOL-like source and spits out assembly-like object for interpretation without intermediate storage... and the lack of that intermediate storage causes headaches. I plan to migrate stuff when I can to something derived from Tree Meta which /does/ have intermediate storage, and anticipate it will help solve various problems.

MarkMLl
Title: Re: common sub-expressions
Post by: damieiro on August 27, 2021, 11:31:15 am
Paolo, i see your reasoning ok.

I would do:

1.- Make the tree first (as pointed by MarkMLI)
2.- Normalice in someway the triplet
    (For example alphabetically order, so a+b and b+a would be a+b. Or b+1 and 1+b would be 1+b.)
3.- I do not know if more normalizations are needed for your case. For examble a+a being 2*a.
4.- And even Reductions :
-1*b+a -> a-b.
Would your algorithm find this example subformula?
-1*b+a+a-b like Y+Y? -> well +YY






Title: Re: common sub-expressions
Post by: Paolo on August 27, 2021, 12:29:43 pm
thanks to all,

@damieiro,

I am just reasoning on expression like a+b+a+b that is written as string coming from my tree is '+a+b+ab' and here my algorithm cannot catch that there are two "+ab" just comparing sub-string triplets. But I am moving step-by-step and at the moment I am focused on "well written" input formula and then creating derivatives-tree as much as simplified as possible (derivatives tend to have many common sub-expressions a lot of "0" and so on..).

to your notes (first of all thank you for the replay):
1- I have the "user" input string, I parse it and then create the tree, so at first step I make the tree.
2- thank for the suggestion, here each triplet can be (op, number/function/parameter, number/function/parameter), parameter is another tree essentially
3- some normalization are done at fly on the input string (eg triplet like +32 = becomes just a node with "5", and so on), and on the derivatives computation, where formula are necessarily generic, I remove a lot of redundancy (eg f=x^2 in the tree (^,x,2), f'=2*x^1 in the tree (*,2,X) if you apply directly the derivation rules to be generic as much as possible (in this case f'(x)=(h(x)*g(x)^z(x))') you will se the tree grow a lot with useless operation it should be striclty speaking (*,*,0,^,x,1,*,2,*,1,^,x,0) being (a*b)'=a'b+ab' and (x^1)' = 1*X^0 ..) with a lot of saving in computation and tree minimization.
4- f=-b is represented in the tree either f=0-b or f=-1*b to have a standard triplet for operators like (+,-,*..), as said before during the derivative computation some adjustments are done I think I have done something to minimize of -b+a appearance (to check anyway)
Yes I know, see the at the begin of my answer even simple express is not recognized as possible simplification by this approach, but honestly at the moment I still never really attempted to do that... but I am going to... any suggestion is welcome !

concerning at you example a+a=2*a if "a" is a number is immediately substituted by the value of a+a in the tree, if "a" is a parameter it is convenient left a+a being "a" pre computed, if "a" is a function it should be put first in the parameter list (that is what I am looking for at the moment for the generic sub-expression) so again a+a should be more conveniently left as a+a than 2*a (if you did not recognize the two function be the same you are duing two times the same computation o "a")
Title: Re: common sub-expressions
Post by: Martin_fr on August 27, 2021, 01:30:19 pm
I am just reasoning on expression like a+b+a+b that is written as string coming from my tree is '+a+b+ab' and here my algorithm cannot catch that there are two "+ab" just comparing sub-string triplets.
Interesting example.

"a+b+a+b" is easy to spot for a human.
You might be able to find it by doing "text matching". But that becomes very quickly very expensive (step increase in run time, when expressions grows).

Also text matching is flawed because
 "a+b+2*a+b" would be wrong to match the 2nd "a+b".
 "a+a+b+b" is the same, but would not be found.

The latter also applies across multiple individual terms:
2*(a+b+a+b) + 3*(a+a)   // a+a is a subexpression
2*(a+b+a+b) + 3*(b+a+a)   // b+a+a is a subexpression

Also noteworthy that optimizing "a+b+a+b" into "c = a+b" and then "c+c" is about the cost of the "+".
It has nothing to do with the cost of calculating "a" or "b", since even if the are expensive expressions in itself, they would have to be found before (by hash in a tree), and each would be calculated once only.
The transformation simple reduces the calls to "+" from 3 times to twice. So spending time on finding those only matters if "+" is an operator that is in it self expensive.



In any case, the first step must be to break by operator precedence. That can be done by parsing into a tree.
Once you have a tree, you can work in the tree.

Lets say you give each tree-node a hash.
Not sha, but something you can put into a hash map for quick lookup. E.g. an integer in a range from 0 to 1.5*nodes.count (your hashmap then is an array of that size).  There should be ready made code for those hashes...

Leaf node (e.g. "a"): => hash('a')
"+" node: sort the 2 child nodes by their current hash, and then => hash(inttostr(lowchild)+inttostr(highchild))

That already takes care of "a+b" vs "b+a".



To get stuff like "a+b+a+b" you need to flatten out the "+" subtree.
There could be ONE  "+" operator with 4 children: "a","a","b","b" (children sorted)

Then you could apply custom searches on that list.
In this case, if several node, all appear more than once (there are 2 "a" and 2 "b") then you can add one of each first, and then add the result to itself. (but finding this case my cost more time, than just adding them as they were given....)


You could also compare with other "+" nodes, if they have a subset of your list. But be careful, the amount of possible combinations can grow rapidly => then the optimization will quickly get to take more time that the calculation would have.
Lets say
   2*(a+a+b) + 3*(a+b+b) + 4*(a+a+b+b)

which subsets to calculate?
You can do
   c=a+a ; d=b+b;
   2*(c+b) + 3*(a+d) + 4*(c+d)

or
   c=a+b
   2*(a+c) + 3*(c+b) + 4*(c+c)

As you can see, even so the 2nd only extracted one common term (while the first example found 2 common terms), the 2nd example has one "+" less.

or
   c=a+a+b
   2*(c) + 3*(a+b+b) + 4*(c+a)

This example is most expensive (but could be found, if you scan left to right => meaning you would first look at "a+a+b" and find it in the last term.
But if you continue this then you could end up
   d=a+b 
   c=a+d
   2*(c) + 3*(d+b) + 4*(c+a)
And that is as good as the 2nd from the first 2 examples.

In the end, it may be faster to just add them up as they are.

Title: Re: common sub-expressions
Post by: marcov on August 27, 2021, 02:36:19 pm
postfix to tree is fairly simple.  Easier than infix to tree.  Again, the various expression parsers in FPC should give some clues.

Symbolic also uses postfix arrays with one symbol per array entry (a variant record). This is also easier to manipulate than string based in- or postfix. It uses the term RPN for postfix, which I got from HP48(g) calculators that I had in my student days.
Title: Re: common sub-expressions
Post by: damieiro on August 27, 2021, 03:02:43 pm
other view (functional programming) could be find here
https://www.glc.us.es/~jalonso/exercitium/subexpresiones-aritmeticas/

this will give you a list of all subexpresions (common ones or not).

This is in haskell, but can be translated to imperative paradigm.
Note: page is in spanish, but examples are clear and perhaps helps as a starting ground.

Martin_fr gives you good advice (well, all did). I fully agree with his post.
Title: Re: common sub-expressions
Post by: Paolo on August 27, 2021, 03:43:58 pm
Thanks, Martin, for the comments. I’ll go in deep asap on that
Here my first comments:
General comment to introduce the discussion
What I have now is a formula calculator as follow
It has several strings like “id-parameter = formula” (in polish notation, each of them have the corresponding tree), they are “parameters formula”, the fist one Hidden is the variable “string[0]= x = value” (x becomes a reserved word for the user, at the moment just one variable is considered) over iterate the formula evaluation
The last string is the formula “Formula = String[last]”
Then I am able to compute derivatives (sometimes up to 4th ) and also them are on the string list and on the their trees.
The computation proceeds first evaluating the parameters formulas then the final formula so everything is in the parameter’s formula is compute just one time
What I see is that even modest formulas the derivatives can have a large number of common terms, here an example
User input, in human format (a and b are defined by user “z” is a reserved word for the formula result)
a = cos(x)
b = sin(x)^2
z = 10*Log(a/b)
the polish notation coming from the trees (each term separated by “(“ and “)”) (* note the drivatives are wrt “x” variable just to clarify)
Formula = [(*)(10)(Log)(/)(a)(b)]
Formula D1 = [(*)(10)(/)(/)(-)(*)(*)(-1)(Sin)(x)(b)(*)(a)(*)(*)(Sin)(x)(Cos)(x)(2)(Sqr)(b)(*)(2.30258509299405)(/)(a)(b)]
Formula D2 = [(*)(10)(/)(-)(*)(/)(-)(*)(-)(+)(*)(*)(-1)(Cos)(x)(b)(*)(*)(-1)(Sin)(x)(*)(*)(Sin)(x)(Cos)(x)(2)(+)(*)(*)(-1)(Sin)(x)(*)(*)(Sin)(x)(Cos)(x)(2)(*)(a)(*)(+)(*)(Cos)(x)(Cos)(x)(*)(Sin)(x)(*)(-1)(Sin)(x)(2)(Sqr)(b)(*)(-)(*)(*)(-1)(Sin)(x)(b)(*)(a)(*)(*)(Sin)(x)(Cos)(x)(2)(*)(*)(2)(b)(*)(*)(Sin)(x)(Cos)(x)(2)(Sqr)(Sqr)(b)(*)(2.30258509299405)(/)(a)(b)(*)(/)(-)(*)(*)(-1)(Sin)(x)(b)(*)(a)(*)(*)(Sin)(x)(Cos)(x)(2)(Sqr)(b)(*)(2.30258509299405)(/)(-)(*)(*)(-1)(Sin)(x)(b)(*)(a)(*)(*)(Sin)(x)(Cos)(x)(2)(Sqr)(b)(Sqr)(*)(2.30258509299405)(/)(a)(b)]
You will see here that even on such short formula the D2 is grow enormously (please note that at this stage of development the software does not recognize if sin(x) and cos(x) are already parametrized by the user). And also it is important that the expressions, Formula and derivitives, are computed a large number of times.
If I am not wrong in formula, D1 and D2 I can count :
9 times “a”, 
15 times “b”,
10 times “cos(x)” that should be replaced by “a”,
15 times “sin(x)”.
What I would have is
a = cos(x)
c = sin(x) -> added “hidden-to-the user” string (and tree)
b = c^2 -> param “b” take advantage of added param “c”
z = 10*Log(a/b)
now the computed terms are
1 time cos(x) (as “a”) instead of 19, 
1 time sin(x) (as “c”) instead of 15,
1 time Sin(x)^2 (as “b”) instead of 15.

Then I am looking for sub-common expression for further simplification.

For example, there are 5 times SQR(B), But then I have also 5 “c*a”, 6 “2*e”,  4 “a/b”, they can be parametrized too, maybe others… again this saves time because the parameter formulas are computed one time
a = cos(x)
c = sin(x) -> added “hidden-to-the user” string (and tree)
b = c^2 -> param “b” take advantage of added param “c”
d = sqr(b) -> added “hidden-to-the user” string (and tree)
e = c*a -> added “hidden-to-the user” string (and tree)
f= a/b -> added “hidden-to-the user” string (and tree)
g = 2*e -> added “hidden-to-the user” string (and tree)
z = 10*Log(a/b)

Formula = [(*)(10)(Log)(f)]
Formula D1 = [(*)(10)(/)(/)(-)(*)(*)(-1)(c)(b)(*)(a)(g)(d)(*)(2.30258509299405)(f)]
Formula D2 = [(*)(10)(/)(-)(*)(/)(-)(*)(-)(+)(*)(*)(-1)(a)(b)(*)(*)(-1)(c)(g)(+)(*)(*)(-1)(c)(g)(*)(a)(*)(+)(*)(a)(a)(*)(c)(*)(-1)(c)(2)(d)(*)(-)(*)(*)(-1)(c)(b)(*)(a)(g)(*)(*)(2)(b)(g)(Sqr)(d)(*)(2.30258509299405)(f)(*)(/)(-)(*)(*)(-1)(c)(b)(*)(a)(*)(*)(b)(a)(2)(d)(*)(2.30258509299405)(/)(-)(*)(*)(-1)(c)(b)(*)(a)(g)(d)(Sqr)(*)(2.30258509299405)(f)]


Quote
"a+b+a+b" is easy to spot for a human.
Yes

Quote
You might be able to find it by doing "text matching". But that becomes very quickly very expensive (step increase in run time, when expressions grows).
Also text matching is flawed because
 "a+b+2*a+b" would be wrong to match the 2nd "a+b".
 "a+a+b+b" is the same, but would not be found.
OK

Quote
The latter also applies across multiple individual terms:
2*(a+b+a+b) + 3*(a+a)   // a+a is a subexpression
2*(a+b+a+b) + 3*(b+a+a)   // b+a+a is a subexpression
Yes, that is my situation

Quote
Also noteworthy that optimizing "a+b+a+b" into "c = a+b" and then "c+c" is about the cost of the "+".
as showed in the introduction a and b could be themself complex expressions computed several times

Quote
It has nothing to do with the cost of calculating "a" or "b", since even if the are expensive expressions in itself, they would have to be found before (by hash in a tree), and each would be calculated once only.
that is exactly what I do, pre computed value (one time) each time the input variable changes.

Quote
The transformation simple reduces the calls to "+" from 3 times to twice. So spending time on finding those only matters if "+" is an operator that is in it self expensive.
the attempt here is to reduce the computation time of a and b not of + itsefl

Quote
In any case, the first step must be to break by operator precedence. That can be done by parsing into a tree.
Once you have a tree, you can work in the tree.
done

Quote
Lets say you give each tree-node a hash.
Not sha, but something you can put into a hash map for quick lookup. E.g. an integer in a range from 0 to 1.5*nodes.count (your hashmap then is an array of that size).  There should be ready made code for those hashes...

Leaf node (e.g. "a"): => hash('a')
"+" node: sort the 2 child nodes by their current hash, and then => hash(inttostr(lowchild)+inttostr(highchild))

That already takes care of "a+b" vs "b+a".


To get stuff like "a+b+a+b" you need to flatten out the "+" subtree.
There could be ONE  "+" operator with 4 children: "a","a","b","b" (children sorted)

Then you could apply custom searches on that list.
In this case, if several node, all appear more than once (there are 2 "a" and 2 "b") then you can add one of each first, and then add the result to itself. (but finding this case my cost more time, than just adding them as they were given....)
that my next step, to have for the eg for the specific operator all the tems that can be intarcahnged and looking for duplicate (being intechagable the order does not count like for "+" operator)

Quote
You could also compare with other "+" nodes, if they have a subset of your list. But be careful, the amount of possible combinations can grow rapidly => then the optimization will quickly get to take more time that the calculation would have.
Lets say
   2*(a+a+b) + 3*(a+b+b) + 4*(a+a+b+b)

which subsets to calculate?
You can do
   c=a+a ; d=b+b;
   2*(c+b) + 3*(a+d) + 4*(c+d)

or
   c=a+b
   2*(a+c) + 3*(c+b) + 4*(c+c)

As you can see, even so the 2nd only extracted one common term (while the first example found 2 common terms), the 2nd example has one "+" less.

or
   c=a+a+b
   2*(c) + 3*(a+b+b) + 4*(c+a)

This example is most expensive (but could be found, if you scan left to right => meaning you would first look at "a+a+b" and find it in the last term.
But if you continue this then you could end up
   d=a+b
   c=a+d
   2*(c) + 3*(d+b) + 4*(c+a)
And that is as good as the 2nd from the first 2 examples.
here seems to be a sort of arbitrariety in the choice, eg is it better a+a or 2*a ? if "a" is a parameter there is no difference, maybe "+" a little bit faster, on the contraty if "a" is a function not pre-computed I do twice the same computation

Quote
In the end, it may be faster to just add them up as they are.
not tested but I think in my case the saving can be a lot

Title: Re: common sub-expressions
Post by: damieiro on August 27, 2021, 03:52:55 pm
Quote
here seems to be a sort of arbitrariety in the choice, eg is it better a+a or 2*a ? if a is a parameter there is no difference, amybe + alittle bit faster, on the contraty if a i a function not pre computed I do twice the same computation

Rule of thumb: Reduce number of expresions (letters) allways and go for high precedence if possible.. It's better n*a (n being a known number) that a+a.. n times.  and its better a^2 than a*a, for example. But there are more than 25 years i have studied it in university..

Title: Re: common sub-expressions
Post by: Paolo on August 27, 2021, 03:58:57 pm
@damieiro

Yes. for me about 30 years..  :'(
Title: Re: common sub-expressions
Post by: damieiro on August 27, 2021, 04:03:03 pm
 :D :D
It could be funny having a poll of free pascal forum ages...

When did you born?
50-60s
60-70s
70s-80->i' m here more near than 70s :D

And so on..

Title: Re: common sub-expressions
Post by: Paolo on August 27, 2021, 04:17:02 pm
ehm.. let me say just before '70s :-[
Title: Re: common sub-expressions
Post by: Paolo on August 27, 2021, 04:31:25 pm
@markov,

Are there examples of symbolic unit usage ?
Title: Re: common sub-expressions
Post by: J-G on August 27, 2021, 05:42:22 pm
:D :D
It could be funny having a poll of free pascal forum ages...

When did you born?
50-60s
60-70s
70s-80->i' m here more near than 70s :D

And so on..
You started too late   ----  40's !!   8-)
Title: Re: common sub-expressions
Post by: BobDog on August 28, 2021, 11:04:23 am

You are going the wrong way chronologically, you must go back deeper into the mists of time to catch me!

Title: Re: common sub-expressions
Post by: PascalDragon on August 28, 2021, 11:42:03 am
Everyone, please, back to topic. This is not about ones age, but about sub expressions.
Title: Re: common sub-expressions
Post by: MarkMLl on August 28, 2021, 12:09:24 pm
Seconded, FWIW. However I'd suggest that peoples' background /can/ be important in this area: as an engineer (and one who predates most CS departments) I'm inclined to be dismissive of stuff that sounds like CS geekery: but in this case the concept of an Abstract Syntax Tree that can be manipulated systematically is deeply important.

One thing I would say though is that it might turn out to be necessary to store the original source text at each node for debugging/listing purposes, but this can obviously be difficult to manipulate. So one might perforce end up with both storage for a source fragment, plus a way of regenerating something that approximates source from a node where e.g. two leaves have been exchanged to bring it in line with others which might at a later stage be elided as common subexpressions.

MarkMLl
Title: Re: common sub-expressions
Post by: marcov on August 28, 2021, 12:15:21 pm
@markov,

Are there examples of symbolic unit usage ?

there are some in packages\symbolic\examples, but they are quite basic. 
TinyPortal © 2005-2018