Improvement idea: implementing RDataFrame

lobis · May 29, 2019, 8:39am

Hello,

I have come across the interesting concept of RDataFrame, which was an experimental feature of ROOT until version 6.14, when it was officially released.

DataFrames are widely used in the data science industry (for example through the Python library pandas). They are similar to a table with rows and columns.

I believe the ROOT implementation came into existence to close the gap between industry and HEP community practices. It feels like a higher-level structure on top of a tree, with some built-in optimizations such as multithreading.

Maybe if our process pipeline made use of RDataFrames we could avoid having to optimize many aspects of our code ourselves (such as multithreading) and end up with much simpler and more powerful code.

This is just a thought. Below you can find some slides explaining the advantages of RDataFrame from the last ROOT meetup (link to all contributions: ROOT Users' Workshop (10-13 September 2018): Contribution List · Indico)

https://root.cern.ch/doc/v614/group__tutorial__dataframe.html

jgalan · May 30, 2019, 8:27am

Looks interesting. We need to understand the advantages this might bring.

I love the following graph in those slides.

This is per user.

nkx · May 30, 2019, 3:29pm

Sounds interesting. Maybe in the future we can make AnalysisTree and EventTree RDataFrame-based?

lobis · May 30, 2019, 4:19pm

Yes, I think this is the way to go for the future. I will study it in more detail but so far it sounds very promising.

The syntax is much more intuitive to work with, and the performance advantages also seem very promising (enabling multithreading is as simple as calling ROOT::EnableImplicitMT() once).

On top of that, I believe RDataFrame is built in such a way that it would be possible to transform, for example, TRestAnalysisTree into an RDataFrame-based tree without having to make severe modifications to the code all at once (they share many interfaces). I believe the AnalysisTree would benefit greatly from this, as all the calculations such as means, sigmas, etc. would be done in parallel. I also think it would be an elegant way to eliminate the useless variables the user sees when exploring with the browser. I will take a look into this if @jgalan agrees.

jgalan · May 30, 2019, 5:57pm

Yes, I agree we should explore that. I am just concerned that we should also give people time to adapt to the changes. If using an RDataFrame is similar to using a TTree, people should adapt quickly, I guess. Anyway, it would be good if you could prepare some summary slides so that, at our next meeting, we can try to understand whether the benefits justify the effort of making these changes.

Then we can have an RDataFrame_experimental in the repository for testing this.

I thought this issue was solved in v2.2.10_dev?

Do you mean the following issue?

lobis · May 30, 2019, 6:09pm

Right, I forgot it got solved; it just crossed my mind. The idea was that maybe with RDataFrame there is a more natural way to separate the physical variables (observables) from the ones that don’t have a direct physical meaning, but I am not sure it can bring much value in this regard, since this is already implemented in the issue you mentioned. It’s just a thought.

I will study the implications and benefits and prepare a presentation for the next REST meeting.

nkx · May 31, 2019, 5:36am

I think we shouldn’t be so aggressive. TTree saving is still needed. Maybe we can develop a plugin to enable the RDataFrame feature?

For example, if we compile and install a library libRestRDataFrame.so, the data in the output file will be saved as RDataFrame type, and if we remove that library file, it goes back to the old way.

Interfaces need to be defined for class TRestProcessRunner, I guess.
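A very rough sketch of what such an interface could look like, using only the standard library. Every name here (`OutputWriter`, `TreeWriter`, `DataFrameWriter`, `PluginFactory`, `MakeWriter`) is hypothetical and nothing like it exists in REST. Loading the actual library file is left out, and the sketch only shows the fallback logic: use the plugin if it registered itself, otherwise keep the TTree path.

```cpp
#include <functional>
#include <memory>
#include <string>

// Hypothetical interface the process runner would talk to.
class OutputWriter {
public:
    virtual ~OutputWriter() = default;
    virtual std::string Format() const = 0;  // label of the output type
};

// Default behaviour: the existing TTree-based output.
class TreeWriter : public OutputWriter {
public:
    std::string Format() const override { return "TTree"; }
};

// What the optional plugin library would provide.
class DataFrameWriter : public OutputWriter {
public:
    std::string Format() const override { return "RDataFrame"; }
};

using WriterFactory = std::function<std::unique_ptr<OutputWriter>()>;

// Slot that a plugin fills when it is loaded. Empty means "no plugin installed".
WriterFactory& PluginFactory()
{
    static WriterFactory factory;
    return factory;
}

// Called by the runner: prefer the plugin, otherwise fall back to the old way.
std::unique_ptr<OutputWriter> MakeWriter()
{
    if (PluginFactory()) {
        return PluginFactory()();
    }
    return std::make_unique<TreeWriter>();
}
```

With this layout the runner never needs to know which writer it got, so removing the plugin only changes what `MakeWriter()` returns.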

jgalan · May 31, 2019, 8:23am

Yes, I agree we should not be aggressive. We should have an experimental branch. The experimental branch will never be merged into master, or at least not completely. I think it would be useful to have something like this for running this kind of test.

So, in the experimental branch there is no need to think carefully about the implementation.