I believe I have found a bug. When launching jobs to a cluster, some of them get terminated early because they exceed the memory limit; even after increasing the memory limit I still get this problem.
The job is cancelled after the following warning:
INFO: Primary generation
DEBUG: Generator type: virtualwall
DEBUG: Event origin: (-250.776, 355, 178.424) mm
DEBUG: Particle name: mu-
DEBUG: Particle excited energy: 0
DEBUG: Energy distribution: TH1D
DEBUG: Particle energy: 9.29589e+07 keV
DEBUG: Angular distribution: TH1D
DEBUG: Event direction (normalized): (-0.0109355, -0.999921, -0.00617909)
INFO: Start of event ID 342439 (342440 of 1000000)
WARNING : The process ePairProd was not found. It must be added to TRestG4Track::GetProcessID()
slurmstepd: Job 1138696 exceeded memory limit (18099436 > 16777216), being killed
slurmstepd: Exceeded job memory limit
slurmstepd: *** JOB 1138696 CANCELLED AT 2019-09-11T13:16:35 *** on node040
I have added the process “ePairProd” to TRestG4Track in the dev version, but this problem is bound to happen again when another new process is added. I have not had time to look into this, but it is a bit strange that the memory usage jumps to arbitrarily high values when this warning is issued.
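For reference, this is roughly the kind of string-to-ID lookup involved (a minimal sketch only; the function name, the process list and the ID values below are placeholders, not the actual TRestG4Track::GetProcessID() implementation):

```cpp
// Sketch of a process-name-to-ID lookup, similar in spirit to what
// TRestG4Track::GetProcessID() does. Names and values are assumptions.
#include <iostream>
#include <map>
#include <string>

int GetProcessIDSketch(const std::string& processName) {
    // Known Geant4 process names mapped to arbitrary integer IDs.
    static const std::map<std::string, int> knownProcesses = {
        {"eIoni", 1}, {"eBrem", 2}, {"compt", 3},
        {"phot", 4},  {"msc", 5},   {"ePairProd", 6}  // newly added entry
    };

    auto it = knownProcesses.find(processName);
    if (it != knownProcesses.end()) return it->second;

    // Unknown process: warn and return a sentinel ID.
    std::cout << "WARNING : The process " << processName
              << " was not found. It must be added to TRestG4Track::GetProcessID()"
              << std::endl;
    return -1;
}
```

The point is that any process name missing from the table falls through to the warning branch, so every new Geant4 process requires another entry by hand.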
I would test this kind of problem locally. Is it clear that the problem is connected with the unidentified process? If that is the case, there must be something in the code that causes it.
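One way to check locally (a sketch, assuming a Linux system; PrintResidentMemory is a hypothetical helper, not part of REST) is to print the resident memory every N events and see whether it only grows when the warning appears:

```cpp
// Hypothetical helper: print the resident memory (VmRSS) of the current
// process on Linux, to be called periodically from a local test run.
#include <fstream>
#include <iostream>
#include <string>

void PrintResidentMemory(int eventID) {
    std::ifstream status("/proc/self/status");
    std::string line;
    while (std::getline(status, line)) {
        if (line.rfind("VmRSS:", 0) == 0) {  // line starts with "VmRSS:"
            std::cout << "Event " << eventID << " -> " << line << std::endl;
            break;
        }
    }
}
```

If the memory stays flat until the first “process not found” warning and then jumps, that would confirm the connection.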
Do you have a RML that we can use to reproduce the memory leak?
It is hard to reproduce this problem since it takes a considerable amount of time for it to happen, but every time I have had an error of this kind (not enough memory) this “ePairProd” process has appeared in the last event, so I believe it is connected (I found it 3/3 times).
I attach the RML file. This only seems to happen to me in muon simulations, not in neutron ones; I guess it is because the process that causes this (which I assume is any process that is not correctly indexed) does not occur with cosmic neutrons, or is extremely rare.
With that info it is hard to link this memory leak to the missing process id. I believe the only thing that happens is that a common “unknown” process id is assigned to the hit. Is it possible that somewhere in the code we are allocating memory only in the case where the id is “unknown”?
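To illustrate what I mean, a purely hypothetical pattern (not taken from the actual sources) that would behave exactly like this is an “unknown” branch that appends to a container that is never cleared:

```cpp
// Hypothetical leak pattern, for illustration only: memory is allocated
// only when the process id is unknown and is never released.
#include <string>
#include <vector>

static std::vector<std::string> unknownProcessLog;  // grows without bound

int AssignProcessID(const std::string& processName, bool known) {
    if (known) return 0;  // normal path: no extra allocation
    // Problematic path: one new string per unidentified hit.
    unknownProcessLog.push_back("unknown process: " + processName);
    return -1;
}
```

A single muon event producing many such hits would then make the memory jump suddenly, while events containing only known processes would be unaffected.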
Are you sure that the problem is solved for muons just by adding the process id?
I imagine muon simulations may require more resources than neutron simulations, and that is more likely to be the cause of the memory issue.