Introducing the `TRestExpressionEvaluation` process

Vindaar · March 16, 2021, 8:10pm

Related pull request with the code: Add `TRestExpressionEvaluationProcess` by Vindaar · Pull Request #25 · rest-for-physics/framework · GitHub

Over the last few days I’ve built something that, as far as I understand, fills a niche I’ve heard talked about multiple times. Namely: “do what TRestAnalysisPlot does, but keep the results in the tree”. Or in more concrete terms: evaluate arbitrary expressions yielding boolean or float values for each event in a tree and create new observables from those values, based on strings stored in an RML file.

I wrote a small standalone header-only library to handle the actual parsing and evaluation of the given strings, because that functionality is completely independent of REST and thus there is no need to force it into a REST process or similar. There are of course much fancier libraries of this kind out there, but I thought it would be better to write our own rather than add another real external dependency (I’ve put the code into the external dependency more for convenience and being unsure where to place it than anything else). Aside from that, once the ROOT dataframe implementation is wrapped into REST all of this becomes obsolete anyway.

Since we are stuck on C++11 I had to implement a basic Either type in the library. std::variant would be a reasonable choice, but that was only added in C++17.
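As a rough illustration of what such a type can look like without std::variant, here is a minimal C++11 sketch. It is not the type from the pull request: the class layout and the names `makeFloat`, `makeBool`, `asFloat` and `asBool` are assumptions made up for this example.

```cpp
#include <cassert>

// Hypothetical sketch, not the implementation from the pull request:
// a value that holds either a double or a bool, written for C++11.
class Either {
public:
    enum class Kind { Float, Bool };

    static Either makeFloat(double v) {
        Either e;
        e.kind_ = Kind::Float;
        e.f_ = v;
        return e;
    }
    static Either makeBool(bool v) {
        Either e;
        e.kind_ = Kind::Bool;
        e.b_ = v;
        return e;
    }

    bool isFloat() const { return kind_ == Kind::Float; }
    bool isBool() const { return kind_ == Kind::Bool; }

    // Checked accessors: reading the wrong alternative is a programming error.
    double asFloat() const { assert(isFloat()); return f_; }
    bool asBool() const { assert(isBool()); return b_; }

private:
    Either() : kind_(Kind::Float), f_(0.0) {}

    Kind kind_;          // tag telling which union member is active
    union {              // anonymous union: only one member is live at a time
        double f_;
        bool b_;
    };
};
```

This is far less general than std::variant, but two trivially copyable alternatives are all that float and boolean results need.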

Hopefully this can be of use to some people. At the moment it may be a bit brittle around the edges. There are test cases, but they could be

Explanation taken from the docstring in `TRestExpressionEvaluationProcess.cxx`

Both floating point as well as boolean expressions are supported. For boolean results the data is stored as integers for better compatibility with other boolean variables used in REST.

The return type is determined automatically from the expression.

The expression strings are defined within a special <expressionset> tag in the RML file under this process to allow for iteration over all child tags. Within that tag an arbitrary number of <item> tags can be added, each of which must have a name and an expr field. The name field is the name of the resulting observable. The expr field follows a syntax similar to that of the ROOT cut strings given to TTree::Draw to filter a tree before drawing it, with a few small differences (see below).

<addProcess type="TRestExpressionEvaluationProcess" name="my custom expression" value="ON" verboseLevel="silent">
  <expressionset>
    <!-- a simple, constant boolean expression -->
    <item name="boolExpr" expr="5<10"/>
    <!-- a simple, constant float expression -->
    <item name="floatExpr" expr="5 * 10"/>
    <!-- a float expression using an existing REST observable -->
    <item name="floatExprObs" expr="hitsAna_energy / 1000.0"/>
    <!-- a bool expression using an existing REST observable -->
    <item name="boolExprObs" expr="hitsAna_energy < 5000.0"/>
    <!-- a bool expression of multiple statements combined by an `and` -->
    <item name="boolAndExprMultiple" expr="hitsAna_energy < 5000.0 and tckAna_nTracks_X==1"/>
    <!-- a bool expression of multiple statements combined by an `or` -->
    <item name="boolOrExprMultiple" expr="hitsAna_energy < 5000.0 or tckAna_nTracks_X==1"/>
    <!-- a complicated expression of nested parenthesis and multiple expressions -->
    <item name="boolComplicated" expr="(hitsAna_energy / 1000.0) < 5.0 or (tckAna_nTracks_X==1 and tckAna_nTracks_Y==1)"/>
  </expressionset>
</addProcess>

The main differences to ROOT cut strings are that 1. the strings are not limited to boolean expressions and 2. the boolean operators `&&` and `||` are instead written as `and` and `or` respectively. The latter is because & is a reserved character in XML that has to be escaped, so parsing a raw && is broken (or the user has to write &amp;, which is kind of unacceptable). Each expression is stored as REST metadata in the form of a lisp-like representation of the input expression. For example the last boolComplicated example is represented as:

(|| (< (/ hitsAna_energy 1000.0) 5.0) (&& (== tckAna_nTracks_X 1) (== tckAna_nTracks_Y 1)))
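To make the prefix notation concrete, here is a small sketch of how a binary expression tree can be rendered in that form. The `Node` struct and the helpers `leaf`, `binary` and `toSExpr` are hypothetical names chosen for this example and are not taken from the library in the pull request.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical node type: a leaf (number or observable name) or a
// binary operation with two children.
struct Node {
    std::string token;          // operator symbol or leaf text
    std::shared_ptr<Node> lhs;  // null for leaves
    std::shared_ptr<Node> rhs;  // null for leaves
};

std::shared_ptr<Node> leaf(const std::string& text) {
    auto n = std::make_shared<Node>();
    n->token = text;
    return n;
}

std::shared_ptr<Node> binary(const std::string& op,
                             std::shared_ptr<Node> lhs,
                             std::shared_ptr<Node> rhs) {
    assert(lhs && rhs);  // a binary node always has both children
    auto n = std::make_shared<Node>();
    n->token = op;
    n->lhs = lhs;
    n->rhs = rhs;
    return n;
}

// Prefix rendering: "(op lhs rhs)" for operations, the bare text for leaves.
std::string toSExpr(const std::shared_ptr<Node>& n) {
    if (!n->lhs && !n->rhs) return n->token;
    return "(" + n->token + " " + toSExpr(n->lhs) + " " + toSExpr(n->rhs) + ")";
}
```

For instance, `toSExpr(binary("<", leaf("hitsAna_energy"), leaf("5000.0")))` yields `(< hitsAna_energy 5000.0)`, so every operator appears in front of its two operands.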

Final words

Note that this process does not perform any kind of filtering or similar. It only creates masks (for boolean expressions) or computes new float values. It is up to the user to combine this with e.g. TRestAnalysisPlot to allow plotting of filtered data or complex expressions.

In addition, due to its inherently runtime-evaluated nature, it is always going to be slower than a native REST process. Because of this it should not be overused. Computations that are done a lot (i.e. in a systematic fashion over many RML files) should be implemented natively. It does however provide a way to quickly glean insights into the data (and store the applied transformations as metadata) without having to write a ROOT macro or even a REST process.

Finally, some functionality has not been implemented so far, namely application of pre-defined mathematical functions (e.g. sqrt, exp etc.) and computing powers (e.g. myObservable^2). These can be added rather easily, if this process is something that is considered of value.

(btw: why the heck are people running long running simulation jobs on sultan? Isn’t there a proper cluster for this kind of thing?)

jgalan · March 16, 2021, 9:37pm

Hi Sebastian, I think that is a very interesting process and it will be useful to many people.

Let me see if I understood. In practice, this process will create a new branch in the analysisTree for each of the items inside expressionset?

How is the observable name generated? We have the standard observable naming using the process name + observable name.

Let me guess … the following definition

<addProcess type="TRestExpressionEvaluationProcess" name="myExpressions" value="ON" verboseLevel="silent">
  <expressionset>
     <item name="boolComplicated" expr="(hitsAna_energy / 1000.0) < 5.0 or (tckAna_nTracks_X==1 and tckAna_nTracks_Y==1)"/>
  </expressionset>
...

will generate a new observable in the analysisTree named myExpressions_boolComplicated?

======

Here goes my second question. Why do we need an external library/header/code?

At first sight I would think that creating an observable from a combination of observables is simpler.

Inside ProcessEvent I would just do:

fAnalysisTree->SetObservableValue( itemName, fAnalysisTree->EvaluateCuts( expressionStr ) );

where itemName is the name field of the item, and expressionStr is the expr field. I didn’t know about && problems in XML, but if that is the case, then there should also be problems with the globalCutString definition inside TRestAnalysisPlot. And I am sure I did a quick check, but who knows, it might be connected with XML editors? I use a raw text editor.

That implementation would make it straightforward. Of course, that would work only for expressions returning booleans. But we could simply have a method inside TRestAnalysisTree that could be named EvaluateExpression.

We could use TFormula to automatically include sqrt, exp and other complex mathematical formulations, as we do in the following methods:

std::string REST_StringHelper::ReplaceMathematicalExpressions(std::string buffer, std::string errorMessage);
std::string REST_StringHelper::EvaluateExpression(std::string exp);

So, I think there could be better synergies with existing REST code, without the need for an external package. Perhaps @nkx also has some insights about it.

=====

Did you consider other process names? At the beginning I didn’t see the connection with the analysisTree.

Something like:

TRestAddComplexObservablesProcess

or

TRestAddEvaluatedObservablesProcess

Vindaar · March 16, 2021, 10:29pm

In practice, this process will create a new branch in the analysisTree for each of the items inside expressionset?

Yes, exactly (well, is it a branch or a leaf? I’m not sure about ROOT terminology here).

How is the observable name generated? We have the standard observable naming using the process name + observable name.

The name is the one given in the name tag of each item. I’m a bit confused about SetObservableValue. I think the overload using the fAnalysisTree implicitly automatically prepends the process name and the one taking an explicit tree creates the branch with the name given. I’m using the one with the explicit tree (partially because I had issues with the other one, but it could have been something else I fixed along the way).

So in this case the observables will just be called boolComplicated.

Why do we need an external library/header/code?

Multiple more or less good reasons:

- the cut strings seem extremely limited to me. In particular I was not impressed the last time I went through the ROOT docs trying to figure out how to even apply a TCutString (or whatever that class is called) to a tree in the context of actually filtering a tree instead of just for plotting. I think that’s simply not supported.
- I wasn’t even aware of the existence of the API for TFormula. I knew these things existed in ROOT internally, but didn’t know they were exposed. Even with these though, I’m not sure how well that can be applied, for instance in terms of accessing observables of a tree. Is that supported by TFormula? Probably not directly (indirectly certainly, but it may result in more complex strings). In addition: the aforementioned && case cannot be solved using TFormula as far as I understand it (well, one can introduce <AND>, <OR> etc. tags into the RML syntax and concat strings based on that, but that seems verbose and complex).
- personal reasons: I’m neither super experienced with ROOT, nor a big fan of it. So less ROOT = better in my book. Also writing the expression evaluator was fun (and annoying) so there’s that.

I didn’t know about && problems in XML, but if that is the case, then there should also be problems with the globalCutString definition inside TRestAnalysisPlot. And I am sure I did a quick check, but who knows, it might be connected with XML editors? I use a raw text editor.

I use emacs. That’s not the problem. The issue is that the XML standard requires & to be escaped. That means any XML parser that follows the standard will eat those characters (since they are escape characters like \ in a shell) before presenting the string to the user. Try accessing an XML tag with a && in it and see what it looks like after having it parsed by tinyxml. The && simply “disappears”.
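For reference, this is roughly what a cut string has to go through before it can legally sit inside an XML attribute. The helper name `escapeXmlAttribute` is made up for this sketch and is not part of REST or tinyxml. In attribute values & and < must be escaped; > and the double quote are escaped here as well to stay on the safe side.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: escape the characters that are special inside a
// double-quoted XML attribute value.
std::string escapeXmlAttribute(const std::string& raw) {
    std::string out;
    out.reserve(raw.size());
    for (char c : raw) {
        switch (c) {
            case '&': out += "&amp;"; break;
            case '<': out += "&lt;"; break;
            case '>': out += "&gt;"; break;
            case '"': out += "&quot;"; break;
            default:  out += c; break;
        }
    }
    // Escaping only ever replaces one character by several.
    assert(out.size() >= raw.size());
    return out;
}
```

A cut like `a < 5 && b == 1` becomes `a &lt; 5 &amp;&amp; b == 1`, which shows why spelling the operators as words is so much easier on the person writing the RML file.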

So, I think there could be better synergies with existing REST code, without the need for an external package. Perhaps @nkx also has some insights about it.

Well, the additional library has a single public function (parseExpression) and two types Expression and Either. In that sense it is nice to me as it barely has any overlapping logic that makes things complicated. It adds more code though, which is a disadvantage for sure.
The latter I don’t think is a big problem for the reason stated in the OP, namely that once the DataFrame API is wrapped it can be neatly used for such things in a similar manner (just with even more functionality) and in particular much higher performance (since cling JIT compiles expressions for it afaik). At that point this whole library can be thrown out again.

But please, if you explain to me how the functionality provided here can be easily achieved using TFormula I can write that EvaluateExpression function (which shouldn’t return a string, but something like Either).

Did you consider other process names? At the beginning I didn’t see the connection with the analysisTree.

Feel free to bikeshed over naming. This name seemed to me to be the one explaining what goes on and sticking to REST terminology the best. But if you want me to change the name, just pick one.

jgalan · March 17, 2021, 9:11am

I am not sure if it is recommended to call the explicit fAnalysisTree->SetObservableValue(), @nkx?

No, TFormula evaluates valid mathematical expressions. You would need to construct the expression by accessing the analysisTree. This could be implemented in a method EvaluateExpression.

This is not connected to TFormula; we are simply already using something similar in globalCutString. So I am surprised that this is now a problem in your implementation.

I see. So is this problem happening to us with the tinyXML library, @nkx?

I guess we first need a method std::string TRestAnalysisTree::ReplaceObservableValues(std::string s) that replaces the identified observable names by their values, i.e. looping over all observable names, find and replace, and then use TFormula. I guess that’s what TCut is doing, so there is probably some ROOT code that could be reused. But implementing it ourselves has the advantage that we have control over it.

And, yes, that’s probably not extremely efficient, but for me it is simple enough to code at a first attempt without excessive coding time. The optimization phase may come later on, once things work, we have validation pipelines running, and we can benchmark the timing.
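A minimal sketch of the find-and-replace step, under some assumptions: the free function `replaceObservableValues` and the `std::map` of current values are stand-ins invented for this example (the proposed method would live in TRestAnalysisTree and read from the tree), and the TFormula evaluation that would follow is left out. The sketch matches whole identifiers rather than substrings, so one observable name being a prefix of another does not cause wrong substitutions.

```cpp
#include <cassert>
#include <cctype>
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper: substitute every known observable name in `expr`
// by its current value, leaving everything else untouched.
std::string replaceObservableValues(const std::string& expr,
                                    const std::map<std::string, double>& values) {
    std::string out;
    std::size_t i = 0;
    while (i < expr.size()) {
        const unsigned char c = static_cast<unsigned char>(expr[i]);
        if (std::isalpha(c) || c == '_') {
            // Consume one whole identifier: letters, digits and underscores.
            const std::size_t start = i;
            while (i < expr.size() &&
                   (std::isalnum(static_cast<unsigned char>(expr[i])) || expr[i] == '_')) {
                ++i;
            }
            assert(i > start);
            const std::string name = expr.substr(start, i - start);
            const auto it = values.find(name);
            if (it == values.end()) {
                out += name;  // not an observable (e.g. `and`, `or`): keep as is
            } else {
                std::ostringstream os;
                os.precision(17);  // enough digits for a double to survive the text form
                os << it->second;
                out += os.str();
            }
        } else {
            out += expr[i];  // operators, digits, whitespace, parentheses
            ++i;
        }
    }
    return out;
}
```

With `hitsAna_energy` set to 4200, the string `hitsAna_energy < 5000.0` turns into `4200 < 5000.0`, which is the kind of purely numerical expression a formula evaluator can take.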

It is also clear that this process does not need to be slowing down the main data processing chain, since it is a pure analysis process and it can be launched by users at a final data processing stage.

nkx · March 17, 2021, 10:09am

Calling fAnalysisTree->SetObservableValue() is also supported. The effect is the same apart from the observable name prefix.

The behavior of escape characters is not guaranteed in tinyxml. Actually the characters > and < should also be escaped. But we tested in tinyxml that they can be read correctly. For the character & it may indeed just disappear.

This should not be regarded as a problem. We chose XML, so we have to deal with the escape characters. If we put them in the file unescaped anyway, the behavior is unpredictable.

The only way out is to use another format. For example the json config format is more and more used in programming these days. We can consider including it.

Vindaar · March 17, 2021, 11:31am

@jgalan

I guess we first need a method std::string TRestAnalysisTree::ReplaceObservableValues(std::string s) that replaces the identified observable names by their values, i.e. looping over all observable names, find and replace, and then use TFormula. I guess that’s what TCut is doing, so there is probably some ROOT code that could be reused. But implementing it ourselves has the advantage that we have control over it.

Ah, now I understand what you mean. That is imo a bad idea. This would imply having to:

1. read the observable values for each event (same as in my implementation)
2. perform string interpolation, replacing each observable reference by its float / bool value, to construct the string that can be evaluated by TFormula, again for each event. Note that string operations are slow; in particular converting floats to strings and parsing floats from strings is slow. In addition there are all sorts of round-trip conversion problems with floats (hence things like ryu exist).
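The round-trip problem is easy to reproduce with the standard library alone. In the sketch below (function names are made up for the example) `std::to_string` formats a double with six digits after the decimal point, so small or non-terminating values do not survive the trip through text, whereas printing 17 significant digits does preserve an IEEE double.

```cpp
#include <cassert>
#include <cmath>
#include <sstream>
#include <string>

// double -> text -> double using std::to_string (fixed, 6 decimals).
double roundTripDefault(double x) {
    assert(std::isfinite(x));
    return std::stod(std::to_string(x));
}

// double -> text -> double using 17 significant digits.
double roundTripPrecise(double x) {
    assert(std::isfinite(x));
    std::ostringstream os;
    os.precision(17);
    os << x;
    return std::stod(os.str());
}
```

Here `roundTripDefault(1e-7)` comes back as exactly `0.0`, because the intermediate string is `0.000000`. The precise variant avoids the loss, but at the cost of longer strings and more formatting and parsing work per event.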

In my implementation instead we do:

1. parse the string once into a binary tree representing the unary/binary math operations
2. for each event evaluate the tree. If a leaf is an observable, read that observable and use its value in place of the string

Aside from having values wrapped in an Either type, which requires a single pointer indirection to access the data, we are working with native types the whole way. The overhead compared to native code is walking the tree, function call overhead (and currently existing asserts for sanity checks), missing optimizations the compiler can apply to native math expressions and in particular extraction of the observables from the tree. The latter is currently done using the string names of the observables, but that can easily be replaced by ID-based lookups (replace the strings by identifiers once and use those directly to access the correct leaf from the tree; I assume that’s possible) to avoid hashing the string for each event.
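The parse-once, evaluate-per-event idea with ID-based lookups can be sketched as follows. Everything here is an assumption made for illustration: the `Expr` struct, the helper names, the use of a plain `std::vector<double>` as a stand-in for one event's observables, and returning boolean results as 1.0 / 0.0 instead of wrapping them in an Either. The actual library in the pull request is organised differently.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical node: a constant, an observable, or a binary operation.
struct Expr {
    enum class Kind { Constant, Observable, Binary };
    Kind kind;
    double constant;   // used by Kind::Constant
    std::string name;  // used by Kind::Observable (as written in the RML)
    std::size_t slot;  // used by Kind::Observable (filled in by resolve)
    char op;           // used by Kind::Binary: '+', '-', '*', '/', '<'
    std::shared_ptr<Expr> lhs, rhs;
};

std::shared_ptr<Expr> makeNode(Expr::Kind kind) {
    auto e = std::make_shared<Expr>();
    e->kind = kind;
    e->constant = 0.0;
    e->slot = 0;
    e->op = '\0';
    return e;
}

std::shared_ptr<Expr> constant(double v) {
    auto e = makeNode(Expr::Kind::Constant);
    e->constant = v;
    return e;
}

std::shared_ptr<Expr> observable(const std::string& name) {
    auto e = makeNode(Expr::Kind::Observable);
    e->name = name;
    return e;
}

std::shared_ptr<Expr> binary(char op, std::shared_ptr<Expr> l, std::shared_ptr<Expr> r) {
    assert(l && r);
    auto e = makeNode(Expr::Kind::Binary);
    e->op = op;
    e->lhs = l;
    e->rhs = r;
    return e;
}

// Done once, before the event loop: turn observable names into indices.
void resolve(Expr& e, const std::map<std::string, std::size_t>& slots) {
    if (e.kind == Expr::Kind::Observable) {
        e.slot = slots.at(e.name);  // throws std::out_of_range for unknown names
    } else if (e.kind == Expr::Kind::Binary) {
        resolve(*e.lhs, slots);
        resolve(*e.rhs, slots);
    }
}

// Done for every event: indexed reads and arithmetic only, no strings.
double evaluate(const Expr& e, const std::vector<double>& event) {
    switch (e.kind) {
        case Expr::Kind::Constant:
            return e.constant;
        case Expr::Kind::Observable:
            assert(e.slot < event.size());
            return event[e.slot];
        case Expr::Kind::Binary:
            break;
    }
    const double a = evaluate(*e.lhs, event);
    const double b = evaluate(*e.rhs, event);
    switch (e.op) {
        case '+': return a + b;
        case '-': return a - b;
        case '*': return a * b;
        case '/': return a / b;
        case '<': return a < b ? 1.0 : 0.0;
        default:  assert(false && "unknown operator"); return 0.0;
    }
}
```

The only string work (the map lookup in `resolve`) happens once per expression; the per-event cost is the tree walk itself.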

It is also clear that this process does not need to be slowing down the main data processing chain, since it is a pure analysis process and it can be launched by users at a final data processing stage.

While that is somewhat true, it only holds until the user wants to apply such evaluations to their whole data. In that case this is sort of a case of Amdahl’s law, i.e. the slowest non-parallelizable part of a processing chain will be the bottleneck. It doesn’t matter that the whole processing chain before the evaluation is fast if the evaluation still takes e.g. 10x the time of the rest of the full chain.
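To put rough numbers on the 10x example (a back-of-the-envelope estimate; the symbols T_chain, T_eval and the speed-up factor s are introduced here purely for illustration): if the rest of the chain is made s times faster while the evaluation time stays fixed, the overall gain is bounded at 10% no matter how large s gets.

```latex
T_{\text{total}}(s) = \frac{T_{\text{chain}}}{s} + T_{\text{eval}},
\qquad T_{\text{eval}} = 10\,T_{\text{chain}}
\quad\Longrightarrow\quad
\frac{T_{\text{total}}(1)}{T_{\text{total}}(s)} = \frac{11}{10 + 1/s} < 1.1
```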

@nkx:

The behavior of escape characters is not guaranteed in tinyxml. Actually the characters > and < should also be escaped. But we tested in tinyxml that they can be read correctly. For the character & it may indeed just disappear.

Indeed. But since I noticed that < and > are already used in the cut string application, I considered this undefined behavior to be acceptable in REST. Who knows what would happen when using a different XML library or updating tinyxml.

The only way out is to use another format. For example the json config format is more and more used in programming these days. We can consider including it.

I agree that this would be the most sane solution. In particular because XML is not really human friendly. However, json has the problem that it does not support comments in the file. yaml is problematic due to insane complexity in the standard.
Personally I would propose TOML files. TOML is a simple (from an implementation standpoint, hence less undefined behavior, less bugs etc), but powerful (from a user standpoint) standard.

jgalan · March 17, 2021, 11:40am

That’s why I mentioned some time ago encapsulating the TinyXML routines inside TRestMetadata, because if in the future we want to migrate to something else, we would just need to update TRestMetadata. But the TinyXML methods are public, and used in inherited classes.

Anyway, I have added a patch so that we can use AND and OR in our construction of cut conditions.

I also added the keywords ABOVE and BELOW, to be replaced by > and <. Notice that a white space is required before and after the keyword.

However, after testing with @ddiez it seems that < and > are properly interpreted.
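The keyword substitution described above could look roughly like the following sketch. This is not the patch itself: the function names `replaceAll` and `expandCutKeywords` are invented for the example, and only the four keywords and the surrounding-whitespace rule are taken from the description.

```cpp
#include <cassert>
#include <string>

// Replace every occurrence of `from` in `text` by `to`.
std::string replaceAll(std::string text, const std::string& from, const std::string& to) {
    assert(!from.empty());
    std::size_t pos = 0;
    while ((pos = text.find(from, pos)) != std::string::npos) {
        text.replace(pos, from.size(), to);
        pos += to.size();  // continue after the inserted text
    }
    return text;
}

// Hypothetical helper: expand the XML-friendly keywords into the operators
// ROOT expects. Keywords only match when surrounded by single spaces.
std::string expandCutKeywords(const std::string& cut) {
    std::string out = cut;
    out = replaceAll(out, " AND ", " && ");
    out = replaceAll(out, " OR ", " || ");
    out = replaceAll(out, " ABOVE ", " > ");
    out = replaceAll(out, " BELOW ", " < ");
    return out;
}
```

Requiring the spaces keeps observable names that merely contain one of the keywords as a substring from being rewritten.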

jgalan · March 17, 2021, 1:42pm

Right, that’s possible, we can also retrieve the observable by ID.

I know, but as I said, this process will not be running in the main data chain. It will be more of an additional user process.

jgalan · March 17, 2021, 1:44pm

Looks nice! There will be a lot of work updating the code to TOML, and probably an RML to TOML migration tool will be needed. If you want to develop that in an experimental branch, I will be happy to test.
