I have come across the interesting concept of RDataFrame, which was an experimental feature of ROOT until version 6.14, when it was officially released.
DataFrames are widely used in the data science industry (for example through the Python library pandas). They are similar to a table with rows and columns.
I believe the ROOT implementation was created to close the gap between industry and HEP community practices. It feels like a higher-level structure on top of a tree, with built-in optimizations such as multithreading.
Maybe if our process pipeline made use of RDataFrames, we could avoid having to optimize many aspects of our code by hand (such as multithreading) and end up with much simpler and more powerful code.
Yes, I think this is the way to go for the future. I will study it in more detail but so far it sounds very promising.
The syntax is much more intuitive to work with, and the performance advantages also seem very promising (and enabling them is as simple as calling ROOT::EnableImplicitMT() once).
On top of that, I believe RDataFrame is built in such a way that it would be possible to transform, for example, TRestAnalysisTree into an RDataFrame-based tree without having to make severe modifications to the code all at once (they share many interfaces). I believe the AnalysisTree would benefit greatly from this, as all the calculations such as means, sigmas, etc. would be done in parallel. I also think it would be an elegant way to hide the useless variables the user sees when exploring with the browser. I will take a look into this if @jgalan agrees.
Yes, I agree we should explore that. I am just concerned that we should also give people time to adapt to changes. If using an RDataFrame is similar to using a TTree, people should adapt quickly, I guess. Anyway, it would be good if you could prepare some summary slides, so that in our next meeting we can try to understand the benefits of making the effort to introduce these changes.
Then we can have an RDataFrame_experimental in the repository for testing this.
Right, I forgot it got solved; it just crossed my mind. The idea was that maybe with RDataFrame there is a more natural way to separate the physical variables (observables) from the ones that don’t have a direct physical meaning. I am not sure it can bring much value in this regard, though, since it’s already implemented in the issue you mentioned. It’s just a thought.
I will study the implications and benefits and prepare a presentation for the next REST meeting.
I think we shouldn’t be so aggressive. Saving to TTree is still needed. Maybe we can develop a plugin to enable the RDataFrame feature?
For example, if we compile and install a library libRestRDataFrame.so, the data in the output file will be saved as an RDataFrame type, and if we remove that library file, it goes back to the old way.
Interfaces would need to be defined for the class TRestProcessRunner, I guess.
Yes, I agree we should not be aggressive. We should have an experimental branch for this. The experimental branch would never be merged into master, or at least not completely. I think it would be useful to have something like this for running this kind of test.
So, in the experimental branch there is no need to think carefully about the implementation.