Thanks, Martin, for the comments. I’ll go in deep asap on that
Here my first comments:
General comment to introduce the discussion
What I have now is a formula calculator as follow
It has several strings like “id-parameter = formula” (in polish notation, each of them have the corresponding tree), they are “parameters formula”, the fist one Hidden is the variable “string[0]= x = value” (x becomes a reserved word for the user, at the moment just one variable is considered) over iterate the formula evaluation
The last string is the formula “Formula = String[last]”
Then I am able to compute derivatives (sometimes up to 4th ) and also them are on the string list and on the their trees.
The computation proceeds first evaluating the parameters formulas then the final formula so everything is in the parameter’s formula is compute just one time
What I see is that even modest formulas the derivatives can have a large number of common terms, here an example
User input, in human format (a and b are defined by user “z” is a reserved word for the formula result)
a = cos(x)
b = sin(x)^2
z = 10*Log(a/b)
the polish notation coming from the trees (each term separated by “(“ and “)”) (* note the drivatives are wrt “x” variable just to clarify)
Formula = [(*)(10)(Log)(/)(a)(b)]
Formula D1 = [(*)(10)(/)(/)(-)(*)(*)(-1)(Sin)(x)(b)(*)(a)(*)(*)(Sin)(x)(Cos)(x)(2)(Sqr)(b)(*)(2.30258509299405)(/)(a)(b)]
Formula D2 = [(*)(10)(/)(-)(*)(/)(-)(*)(-)(+)(*)(*)(-1)(Cos)(x)(b)(*)(*)(-1)(Sin)(x)(*)(*)(Sin)(x)(Cos)(x)(2)(+)(*)(*)(-1)(Sin)(x)(*)(*)(Sin)(x)(Cos)(x)(2)(*)(a)(*)(+)(*)(Cos)(x)(Cos)(x)(*)(Sin)(x)(*)(-1)(Sin)(x)(2)(Sqr)(b)(*)(-)(*)(*)(-1)(Sin)(x)(b)(*)(a)(*)(*)(Sin)(x)(Cos)(x)(2)(*)(*)(2)(b)(*)(*)(Sin)(x)(Cos)(x)(2)(Sqr)(Sqr)(b)(*)(2.30258509299405)(/)(a)(b)(*)(/)(-)(*)(*)(-1)(Sin)(x)(b)(*)(a)(*)(*)(Sin)(x)(Cos)(x)(2)(Sqr)(b)(*)(2.30258509299405)(/)(-)(*)(*)(-1)(Sin)(x)(b)(*)(a)(*)(*)(Sin)(x)(Cos)(x)(2)(Sqr)(b)(Sqr)(*)(2.30258509299405)(/)(a)(b)]
You will see here that even on such short formula the D2 is grow enormously (please note that at this stage of development the software does not recognize if sin(x) and cos(x) are already parametrized by the user). And also it is important that the expressions, Formula and derivitives, are computed a large number of times.
If I am not wrong in formula, D1 and D2 I can count :
9 times “a”,
15 times “b”,
10 times “cos(x)” that should be replaced by “a”,
15 times “sin(x)”.
What I would have is
a = cos(x)
c = sin(x) -> added “hidden-to-the user” string (and tree)
b = c^2 -> param “b” take advantage of added param “c”
z = 10*Log(a/b)
now the computed terms are
1 time cos(x) (as “a”) instead of 19,
1 time sin(x) (as “c”) instead of 15,
1 time Sin(x)^2 (as “b”) instead of 15.
Then I am looking for sub-common expression for further simplification.
For example, there are 5 times SQR(B), But then I have also 5 “c*a”, 6 “2*e”, 4 “a/b”, they can be parametrized too, maybe others… again this saves time because the parameter formulas are computed one time
a = cos(x)
c = sin(x) -> added “hidden-to-the user” string (and tree)
b = c^2 -> param “b” take advantage of added param “c”
d = sqr(b) -> added “hidden-to-the user” string (and tree)
e = c*a -> added “hidden-to-the user” string (and tree)
f= a/b -> added “hidden-to-the user” string (and tree)
g = 2*e -> added “hidden-to-the user” string (and tree)
z = 10*Log(a/b)
Formula = [(*)(10)(Log)(f)]
Formula D1 = [(*)(10)(/)(/)(-)(*)(*)(-1)(c)(b)(*)(a)(g)(d)(*)(2.30258509299405)(f)]
Formula D2 = [(*)(10)(/)(-)(*)(/)(-)(*)(-)(+)(*)(*)(-1)(a)(b)(*)(*)(-1)(c)(g)(+)(*)(*)(-1)(c)(g)(*)(a)(*)(+)(*)(a)(a)(*)(c)(*)(-1)(c)(2)(d)(*)(-)(*)(*)(-1)(c)(b)(*)(a)(g)(*)(*)(2)(b)(g)(Sqr)(d)(*)(2.30258509299405)(f)(*)(/)(-)(*)(*)(-1)(c)(b)(*)(a)(*)(*)(b)(a)(2)(d)(*)(2.30258509299405)(/)(-)(*)(*)(-1)(c)(b)(*)(a)(g)(d)(Sqr)(*)(2.30258509299405)(f)]
"a+b+a+b" is easy to spot for a human.
Yes
You might be able to find it by doing "text matching". But that becomes very quickly very expensive (step increase in run time, when expressions grows).
Also text matching is flawed because
"a+b+2*a+b" would be wrong to match the 2nd "a+b".
"a+a+b+b" is the same, but would not be found.
OK
The latter also applies across multiple individual terms:
2*(a+b+a+b) + 3*(a+a) // a+a is a subexpression
2*(a+b+a+b) + 3*(b+a+a) // b+a+a is a subexpression
Yes, that is my situation
Also noteworthy that optimizing "a+b+a+b" into "c = a+b" and then "c+c" is about the cost of the "+".
as showed in the introduction a and b could be themself complex expressions computed several times
It has nothing to do with the cost of calculating "a" or "b", since even if the are expensive expressions in itself, they would have to be found before (by hash in a tree), and each would be calculated once only.
that is exactly what I do, pre computed value (one time) each time the input variable changes.
The transformation simple reduces the calls to "+" from 3 times to twice. So spending time on finding those only matters if "+" is an operator that is in it self expensive.
the attempt here is to reduce the computation time of a and b not of + itsefl
In any case, the first step must be to break by operator precedence. That can be done by parsing into a tree.
Once you have a tree, you can work in the tree.
done
Lets say you give each tree-node a hash.
Not sha, but something you can put into a hash map for quick lookup. E.g. an integer in a range from 0 to 1.5*nodes.count (your hashmap then is an array of that size). There should be ready made code for those hashes...
Leaf node (e.g. "a"): => hash('a')
"+" node: sort the 2 child nodes by their current hash, and then => hash(inttostr(lowchild)+inttostr(highchild))
That already takes care of "a+b" vs "b+a".
To get stuff like "a+b+a+b" you need to flatten out the "+" subtree.
There could be ONE "+" operator with 4 children: "a","a","b","b" (children sorted)
Then you could apply custom searches on that list.
In this case, if several node, all appear more than once (there are 2 "a" and 2 "b") then you can add one of each first, and then add the result to itself. (but finding this case my cost more time, than just adding them as they were given....)
that my next step, to have for the eg for the specific operator all the tems that can be intarcahnged and looking for duplicate (being intechagable the order does not count like for "+" operator)
You could also compare with other "+" nodes, if they have a subset of your list. But be careful, the amount of possible combinations can grow rapidly => then the optimization will quickly get to take more time that the calculation would have.
Lets say
2*(a+a+b) + 3*(a+b+b) + 4*(a+a+b+b)
which subsets to calculate?
You can do
c=a+a ; d=b+b;
2*(c+b) + 3*(a+d) + 4*(c+d)
or
c=a+b
2*(a+c) + 3*(c+b) + 4*(c+c)
As you can see, even so the 2nd only extracted one common term (while the first example found 2 common terms), the 2nd example has one "+" less.
or
c=a+a+b
2*(c) + 3*(a+b+b) + 4*(c+a)
This example is most expensive (but could be found, if you scan left to right => meaning you would first look at "a+a+b" and find it in the last term.
But if you continue this then you could end up
d=a+b
c=a+d
2*(c) + 3*(d+b) + 4*(c+a)
And that is as good as the 2nd from the first 2 examples.
here seems to be a sort of arbitrariety in the choice, eg is it better a+a or 2*a ? if "a" is a parameter there is no difference, maybe "+" a little bit faster, on the contraty if "a" is a function not pre-computed I do twice the same computation
In the end, it may be faster to just add them up as they are.
not tested but I think in my case the saving can be a lot