I believe I have found a bug. When launching jobs to a cluster, some of them get terminated early because they exceed the memory limit; even after increasing the memory limit I still get this problem.
The job is cancelled after the following warning:
INFO: Primary generation
DEBUG: Generator type: virtualwall
DEBUG: Event origin: (-250.776, 355, 178.424) mm
DEBUG: Particle name: mu-
DEBUG: Particle excited energy: 0
DEBUG: Energy distribution: TH1D
DEBUG: Particle energy: 9.29589e+07 keV
DEBUG: Angular distribution: TH1D
DEBUG: Event direction (normalized): (-0.0109355, -0.999921, -0.00617909)
INFO: Start of event ID 342439 (342440 of 1000000)
WARNING : The process ePairProd was not found. It must be added to TRestG4Track::GetProcessID()
slurmstepd: Job 1138696 exceeded memory limit (18099436 > 16777216), being killed
slurmstepd: Exceeded job memory limit
slurmstepd: *** JOB 1138696 CANCELLED AT 2019-09-11T13:16:35 *** on node040
I have added the process “ePairProd” to TRestG4Track in the dev version, but this problem is bound to happen again when another new process is added. I have not had time to look into this, but it is a bit strange that the memory usage jumps to arbitrarily high values when this warning is issued.
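For reference, this is roughly the kind of string-to-ID lookup involved (a minimal sketch only; the function name, the process list and the ID values below are placeholders, not the actual TRestG4Track::GetProcessID() implementation):

```cpp
// Sketch of a process-name-to-ID lookup, similar in spirit to what
// TRestG4Track::GetProcessID() does. Names and values are assumptions.
#include <iostream>
#include <map>
#include <string>

int GetProcessIDSketch(const std::string& processName) {
    // Known Geant4 process names mapped to arbitrary integer IDs.
    static const std::map<std::string, int> knownProcesses = {
        {"eIoni", 1}, {"eBrem", 2}, {"compt", 3},
        {"phot", 4},  {"msc", 5},   {"ePairProd", 6}  // newly added entry
    };

    auto it = knownProcesses.find(processName);
    if (it != knownProcesses.end()) return it->second;

    // Unknown process: warn and return a sentinel ID.
    std::cout << "WARNING : The process " << processName
              << " was not found. It must be added to TRestG4Track::GetProcessID()"
              << std::endl;
    return -1;
}
```

The point is that any process name missing from the table falls through to the warning branch, so every new Geant4 process requires another entry by hand.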
I would test this kind of problem locally. Is it clear that the problem is connected with the unidentified process? If that is the case, there must be something in the code that causes it.
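One way to check locally (a sketch, assuming a Linux system; PrintResidentMemory is a hypothetical helper, not part of REST) is to print the resident memory every N events and see whether it only grows when the warning appears:

```cpp
// Hypothetical helper: print the resident memory (VmRSS) of the current
// process on Linux, to be called periodically from a local test run.
#include <fstream>
#include <iostream>
#include <string>

void PrintResidentMemory(int eventID) {
    std::ifstream status("/proc/self/status");
    std::string line;
    while (std::getline(status, line)) {
        if (line.rfind("VmRSS:", 0) == 0) {  // line starts with "VmRSS:"
            std::cout << "Event " << eventID << " -> " << line << std::endl;
            break;
        }
    }
}
```

If the memory stays flat until the first “process not found” warning and then jumps, that would confirm the connection.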
Do you have a RML that we can use to reproduce the memory leak?
It is hard to reproduce this problem since it takes a considerable amount of time for it to happen, but every time I have had an error of this kind (not enough memory) this “ePairProd” process has appeared in the last event, so I believe it is connected (I found it 3/3 times).
I attach the RML file. This only seems to happen to me in muon simulations, not in neutron ones; I guess it is because the process that causes this (which I assume is any process that is not correctly indexed) does not occur with cosmic neutrons, or is extremely rare.
With that info it is hard to link this memory leak to the missing process id. I believe the only thing that happens is that a common “unknown” process id is assigned to the hit. Is it possible that somewhere in the code we are allocating memory only in the case where the id is “unknown”?
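To illustrate what I mean, a purely hypothetical pattern (not taken from the actual sources) that would behave exactly like this is an “unknown” branch that appends to a container that is never cleared:

```cpp
// Hypothetical leak pattern, for illustration only: memory is allocated
// only when the process id is unknown and is never released.
#include <string>
#include <vector>

static std::vector<std::string> unknownProcessLog;  // grows without bound

int AssignProcessID(const std::string& processName, bool known) {
    if (known) return 0;  // normal path: no extra allocation
    // Problematic path: one new string per unidentified hit.
    unknownProcessLog.push_back("unknown process: " + processName);
    return -1;
}
```

A single muon event producing many such hits would then make the memory jump suddenly, while events containing only known processes would be unaffected.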
Are you sure that the problem is solved for muons just by adding the process id?
I imagine muon simulations may require more resources than neutron simulations, and that is more likely to be the cause of the memory issue.