I’m processing raw data from .aqs files and I’d like to process all of them at the same time, as in TRestAnalysisPlot, not one by one.
I’ve been trying with restManager --c processing_file.rml --f "/path/file_number_*.aqs", but even though TRestProcessRunner takes all the files into account for the progress bar, it only processes the first one. For example, if I have 10 files, it stops at 10%. I’m trying with just two processes in my rml file: TRestRawMultiFEMINOSToSignalProcess and TRestRawSignalAnalysisProcess.
For my purpose I don’t mind whether the output of all processed files is merged into one single root file or there is one output file for each input file, but I guess the latter option is in general more convenient.
Do you know how to do this?
What I have done is simply to add a “new” .aqs file by creating a duplicate during the pipeline job.
It is clear that right now it only processes the first file, because the validation succeeded, and the validation macro checks the total number of final entries (processing the duplicate too would have changed that count).
I never tested this feature before, so, @nkx, the question is: did this feature already exist?
Also, in case this functionality exists, what is the default behaviour?
Will it merge the resulting files into one? I guess that, if outputFileName produces unique names, the processing will be done in parallel, so that each input file produces its own output file.
But, for the trexdm case, the output filename is RawData.root, so it is common to all files. Will it merge the files in this case?
In the case that we go for one-to-one processing, would doing a fork() be the best way to parallelise jobs?
It is possible to process multiple binary files, as implemented in TRestRawMultiCoBoAsAdToSignalProcess, but it doesn’t seem to be implemented in TRestRawMultiFEMINOSToSignalProcess. We can update the code to make it read multiple files.
During initialization REST will open all the input binary files and save them (as FILE*) in the vector TRestRawToSignalProcess::fInputFiles. We can access this vector to read all the files in the actual processes.
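As an illustration, a minimal sketch (not the actual FEMINOS decoder) of looping over those already-opened descriptors inside a process deriving from TRestRawToSignalProcess; the buffer handling here is hypothetical:

for (FILE* f : fInputFiles) {
    unsigned char buffer[1024];
    size_t n;
    while ((n = fread(buffer, 1, sizeof(buffer), f)) > 0) {
        // decode the binary frames in buffer[0..n) and fill signal events
    }
}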
I think it would be more generic if restManager detected a glob pattern in the --f argument.
I.e. suppose we find something like:
RunA.root
RunB.root
RunC.root
Then, if we do:
restManager --c processEvents.rml --f "Run*root"
Then restManager would internally construct the command for each file, fork a child process, execute the command in the child, and exit the child; something like:
for (int x = 0; x < nFiles; x++) {
    std::string command = "restManager --c processEvents.rml --f " + file[x];
    if (fork() == 0) {                  // child process
        int out = system(command.c_str());
        exit(out);                      // child exits; parent continues the loop
    }
}
That would launch all files in parallel.
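For the pattern-expansion step that would build file[], a minimal sketch assuming POSIX glob(3); restManager may well rely on its own file-search helper instead:

#include <glob.h>
#include <string>
#include <vector>

// Expand a shell-style pattern such as "Run*root" into a file list.
std::vector<std::string> ExpandPattern(const std::string& pattern) {
    std::vector<std::string> files;
    glob_t g;
    if (glob(pattern.c_str(), 0, nullptr, &g) == 0) {
        for (size_t i = 0; i < g.gl_pathc; i++)
            files.push_back(g.gl_pathv[i]);
        globfree(&g);
    }
    return files;
}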
In fact, I have added a --fork option to restManager at the following commit.
It will launch a job for every file found matching the glob pattern. The problem here is that the output is not centralised into the parent; perhaps that can be achieved, if it is interesting (see the sketch below).
If no --fork option is given, it will continue working as it was doing before.
Could you test it? What do you think about this option?
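On centralising the output: a minimal sketch of one possible approach, assuming a single shared pipe that all children write to (their lines may interleave); this is not what the committed --fork option does:

#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>
#include <cstdlib>
#include <string>
#include <vector>

int main() {
    std::vector<std::string> files = {"file_1.aqs", "file_2.aqs"};  // hypothetical input list
    int fd[2];
    pipe(fd);  // shared pipe: children write, the parent reads
    for (const auto& f : files) {
        if (fork() == 0) {
            close(fd[0]);
            dup2(fd[1], STDOUT_FILENO);  // the child's stdout goes to the pipe
            std::string cmd = "restManager --c processEvents.rml --f " + f;
            exit(system(cmd.c_str()));
        }
    }
    close(fd[1]);  // the parent keeps only the read end
    char buf[4096];
    ssize_t n;
    while ((n = read(fd[0], buf, sizeof(buf))) > 0)
        fwrite(buf, 1, n, stdout);  // all children's output, centralised in the parent
    while (wait(nullptr) > 0) {}  // reap the children
}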
What is the reason you need to merge into one single file? Do you have event IDs split between different rawdata files?
Usually we want to preserve the runNumber and subRunNumber of each file. Or are you merging all subRunNumbers into a single runNumber?
In that case, could we have something generic such that, if the output filename is common to all the data processing chains, we apply sequential merging of the input files?
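If we went that way, ROOT's TFileMerger might serve as the building block for the sequential merge; a sketch with hypothetical intermediate per-subrun filenames:

#include "TFileMerger.h"

// Sequentially merge hypothetical per-subrun outputs into the common filename.
TFileMerger merger;
merger.OutputFile("RawData.root");
merger.AddFile("RawData_subrun0.root");  // hypothetical intermediate names
merger.AddFile("RawData_subrun1.root");
merger.Merge();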
In my opinion, if one could get a single ROOT file out of several .aqs files it would be very useful. I work with data and sometimes it’s hard to handle hundreds of subruns…
Just to share what I do in case it’s helpful: I also use the for loop to launch jobs, although I include --j 8 instead of & because, as I said, I sometimes have hundreds of subruns. It’s true, though, that the multithreading sometimes crashes with some processes (I think it does with TRestDetectorHitsGaussAnalysisProcess). In that case the command I use is: ls -1 /path/Run* | xargs -P 8 -n 1 restManager --c processEvents.rml --f
If you find out that it is that process and you manage to reproduce it, please create an issue or post the problem with instructions so that others can reproduce it.
In case it’s useful for someone: when one wants to make a plot for each run, and each run has many subruns, I found an automatic way of generating a list like the one I paste in the image here:
This list can then be pasted into a bash script and executed. Very useful when one has 300 runs with 15 subruns each.
The command is: for x in `ls /path/FileNamePattern* | cut -d_ -f1 | sort | uniq`; do echo restManager --c hitMapXe.rml --f "\"${x}*\""; done
This will launch all processes in parallel. There is a risk of collapsing the system temporarily, although currently it is limited to a maximum of 32 jobs being launched in parallel. This would not be a straightforward solution for sending 300 jobs in one go, but it is a bit faster because jobs are sent in parallel.
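For reference, one way such a 32-job cap could be implemented; a sketch under the same fork-and-system assumptions as the earlier loop, not necessarily how restManager actually does it:

#include <sys/wait.h>
#include <unistd.h>
#include <cstdlib>
#include <string>
#include <vector>

const int kMaxJobs = 32;

void LaunchAll(const std::vector<std::string>& files) {
    int active = 0;
    for (const auto& f : files) {
        if (active == kMaxJobs) { wait(nullptr); active--; }  // block until a slot frees up
        if (fork() == 0) {
            std::string cmd = "restManager --c hitMapXe.rml --f " + f;
            exit(system(cmd.c_str()));
        }
        active++;
    }
    while (wait(nullptr) > 0) {}  // wait for the remaining jobs
}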
Additionally, inside a macro it is possible to use the method
Talking about fork, it is also possible to fork from the pause menu. When running restManager in the foreground, we can first press “p” to call the pause menu, then press “d” to detach the process.