Segmentation fault when launching simulations in batch

System : Linux
REST version : v2.2.19

Hello, I am having problems when launching multiple simulations at once on a cluster. When I launch around 600 simultaneous jobs (the problem also appears when the jobs are identical), some of them (1-2%) terminate abruptly because of a problem with the GDML file (all jobs use the same GDML file), while the other jobs complete correctly.

This is the error of one of the failed jobs:

Loading library : libRestCore.so
Loading library : libRestEvents.so
Loading library : libRestMetadata.so
Loading library : libRestProcesses.so
Loading library : libRestTools.so
Adding sources to geant4
Sensitive volume : gas
gas
GDML: initializating variables
GDML: replacing expressions in GDML
GDML: creating temporary file
Info in <TGeoManager::Import>: Reading geometry from file: /home/zar30002/.rest/gdml/SetupSingleTop.gdml
Info in <TGeoManager::TGeoManager>: Geometry GDMLImport, Geometry imported from GDML created
Error in <TXMLEngine::ParseFile>: Unexpected end of xml file
Error in <TGeoManager::Import>: Cannot open file

 *** Break *** segmentation violation



===========================================================
There was a crash.
This is the entire stack trace of all threads:
===========================================================
#0  0x00000035cbeac61e in waitpid () from /lib64/libc.so.6
#1  0x00000035cbe3e609 in do_system () from /lib64/libc.so.6
#2  0x00002aaaac8a3a3c in TUnixSystem::StackTrace() () at /home/zar30002/gitlab/root/core/unix/src/TUnixSystem.cxx:2119
#3  0x00002aaaac8a60b3 in TUnixSystem::DispatchSignals(ESignals) () at /home/zar30002/gitlab/root/core/unix/src/TUnixSystem.cxx:3644
#4  <signal handler called>
#5  0x00002aaab7413438 in TRestG4Metadata::ReadStorage() () from /LUSTRE/home/zar30002/opt/REST/REST_dev/install/lib/libRestMetadata.so
#6  0x00002aaab741a920 in TRestG4Metadata::InitFromConfigFile() () from /LUSTRE/home/zar30002/opt/REST/REST_dev/install/lib/libRestMetadata.so
#7  0x00002aaab771ce9c in TRestMetadata::LoadConfigFromFile(TiXmlElement*, TiXmlElement*, std::vector<TiXmlElement*, std::allocator<TiXmlElement*> >) () from /LUSTRE/home/zar30002/opt/REST/REST_dev/install/lib/libRestCore.so
#8  0x00002aaab77232de in TRestMetadata::LoadConfigFromFile(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) () from /LUSTRE/home/zar30002/opt/REST/REST_dev/install/lib/libRestCore.so
#9  0x00002aaab741d1d1 in TRestG4Metadata::TRestG4Metadata(char*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) () from /LUSTRE/home/zar30002/opt/REST/REST_dev/install/lib/libRestMetadata.so
#10 0x000000000041aa89 in main ()
===========================================================


The lines below might hint at the cause of the crash.
You may get help by asking at the ROOT forum http://root.cern.ch/forum
Only if you are really convinced it is a bug in ROOT then please submit a
report at http://root.cern.ch/bugs Please post the ENTIRE stack trace
from above as an attachment in addition to anything else
that might help us fixing this issue.
===========================================================
#5  0x00002aaab7413438 in TRestG4Metadata::ReadStorage() () from /LUSTRE/home/zar30002/opt/REST/REST_dev/install/lib/libRestMetadata.so
#6  0x00002aaab741a920 in TRestG4Metadata::InitFromConfigFile() () from /LUSTRE/home/zar30002/opt/REST/REST_dev/install/lib/libRestMetadata.so
#7  0x00002aaab771ce9c in TRestMetadata::LoadConfigFromFile(TiXmlElement*, TiXmlElement*, std::vector<TiXmlElement*, std::allocator<TiXmlElement*> >) () from /LUSTRE/home/zar30002/opt/REST/REST_dev/install/lib/libRestCore.so
#8  0x00002aaab77232de in TRestMetadata::LoadConfigFromFile(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) () from /LUSTRE/home/zar30002/opt/REST/REST_dev/install/lib/libRestCore.so
#9  0x00002aaab741d1d1 in TRestG4Metadata::TRestG4Metadata(char*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) () from /LUSTRE/home/zar30002/opt/REST/REST_dev/install/lib/libRestMetadata.so
#10 0x000000000041aa89 in main ()
===========================================================

I also get this error on occasion:

Loading library : libRestCore.so
Loading library : libRestEvents.so
Loading library : libRestMetadata.so
Loading library : libRestProcesses.so
Loading library : libRestTools.so
Adding sources to geant4
Sensitive volume : gas
gas
GDML: initializating variables
GDML: replacing expressions in GDML
GDML: creating temporary file
Info in <TGeoManager::Import>: Reading geometry from file: /home/zar30002/.rest/gdml/SetupDoubleTop.gdml
Info in <TGeoManager::TGeoManager>: Geometry GDMLImport, Geometry imported from GDML created
Error in <TXMLEngine::ParseFile>: XML syntax error at line 138
Error in <TGeoManager::Import>: Cannot open file

 *** Break *** segmentation violation



===========================================================
There was a crash.
This is the entire stack trace of all threads:
===========================================================
#0  0x00000035cbeac61e in waitpid () from /lib64/libc.so.6
#1  0x00000035cbe3e609 in do_system () from /lib64/libc.so.6
#2  0x00002aaaac8a3a3c in TUnixSystem::StackTrace() () at /home/zar30002/gitlab/root/core/unix/src/TUnixSystem.cxx:2119
#3  0x00002aaaac8a60b3 in TUnixSystem::DispatchSignals(ESignals) () at /home/zar30002/gitlab/root/core/unix/src/TUnixSystem.cxx:3644
#4  <signal handler called>
#5  0x00002aaab7413438 in TRestG4Metadata::ReadStorage() () from /LUSTRE/home/zar30002/opt/REST/REST_dev/install/lib/libRestMetadata.so
#6  0x00002aaab741a920 in TRestG4Metadata::InitFromConfigFile() () from /LUSTRE/home/zar30002/opt/REST/REST_dev/install/lib/libRestMetadata.so
#7  0x00002aaab771ce9c in TRestMetadata::LoadConfigFromFile(TiXmlElement*, TiXmlElement*, std::vector<TiXmlElement*, std::allocator<TiXmlElement*> >) () from /LUSTRE/home/zar30002/opt/REST/REST_dev/install/lib/libRestCore.so
#8  0x00002aaab77232de in TRestMetadata::LoadConfigFromFile(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) () from /LUSTRE/home/zar30002/opt/REST/REST_dev/install/lib/libRestCore.so
#9  0x00002aaab741d1d1 in TRestG4Metadata::TRestG4Metadata(char*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) () from /LUSTRE/home/zar30002/opt/REST/REST_dev/install/lib/libRestMetadata.so
#10 0x000000000041aa89 in main ()
===========================================================


The lines below might hint at the cause of the crash.
You may get help by asking at the ROOT forum http://root.cern.ch/forum
Only if you are really convinced it is a bug in ROOT then please submit a
report at http://root.cern.ch/bugs Please post the ENTIRE stack trace
from above as an attachment in addition to anything else
that might help us fixing this issue.
===========================================================
#5  0x00002aaab7413438 in TRestG4Metadata::ReadStorage() () from /LUSTRE/home/zar30002/opt/REST/REST_dev/install/lib/libRestMetadata.so
#6  0x00002aaab741a920 in TRestG4Metadata::InitFromConfigFile() () from /LUSTRE/home/zar30002/opt/REST/REST_dev/install/lib/libRestMetadata.so
#7  0x00002aaab771ce9c in TRestMetadata::LoadConfigFromFile(TiXmlElement*, TiXmlElement*, std::vector<TiXmlElement*, std::allocator<TiXmlElement*> >) () from /LUSTRE/home/zar30002/opt/REST/REST_dev/install/lib/libRestCore.so
#8  0x00002aaab77232de in TRestMetadata::LoadConfigFromFile(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) () from /LUSTRE/home/zar30002/opt/REST/REST_dev/install/lib/libRestCore.so
#9  0x00002aaab741d1d1 in TRestG4Metadata::TRestG4Metadata(char*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) () from /LUSTRE/home/zar30002/opt/REST/REST_dev/install/lib/libRestMetadata.so
#10 0x000000000041aa89 in main ()
===========================================================

Thanks.

It looks like this happens because another job is writing the GDML file while this one is reading it. Can you try waiting 1-2 seconds before submitting the next job? I expect the error will then go away.

In principle, one could define a start time with the Slurm batch system using something like:

#SBATCH --begin=now+15

where 15 is the number of seconds. I use a script to generate a different value for each job.
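A minimal sketch of such a script, in case it is useful. The function name and the spacing value are only illustrative (they are not part of REST or Slurm); the only thing taken from above is the `--begin=now+<seconds>` form.

```python
def staggered_begin_directives(n_jobs, spacing_s=2):
    """Return one '#SBATCH --begin=now+<seconds>' line per job.

    Hypothetical helper: job i (counting from 1) is delayed by i * spacing_s
    seconds, so no two jobs are asked to start at the same moment.
    """
    return [f"#SBATCH --begin=now+{(i + 1) * spacing_s}" for i in range(n_jobs)]


# Example: four jobs spaced 15 seconds apart.
directives = staggered_begin_directives(4, spacing_s=15)
for line in directives:
    print(line)
```

Each generated line would go into the header of the corresponding job script. This only spaces out the earliest allowed start time; it does not by itself guarantee that two jobs never touch the file at once.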

However, I believe that if the batch system is saturated and jobs are pending, they might still be started in parallel.

I was trying to find a way to lock the file and unlock it once the job has finished with it. For the GDML file, perhaps it could be solved by adding a prefix with the runNumber.
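To make the two ideas concrete, here is a rough sketch in Python rather than in the REST C++ code. It assumes an advisory lock taken with `fcntl.flock` on a separate `.lock` file; the helper names (`locked`, `write_gdml`, `read_gdml`, `prefixed_gdml_name`) and the file naming scheme are hypothetical, not existing REST functions.

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def locked(lock_path):
    """Hold an exclusive advisory lock on lock_path while the block runs."""
    with open(lock_path, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks until the lock is free
        try:
            yield
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


def write_gdml(path, text):
    """Write the processed GDML file while holding the lock (hypothetical)."""
    with locked(path + ".lock"):
        with open(path, "w") as f:
            f.write(text)


def read_gdml(path):
    """Read the GDML file under the same lock, so no half-written file is seen."""
    with locked(path + ".lock"):
        with open(path) as f:
            return f.read()


def prefixed_gdml_name(run_number, name):
    """Alternative idea: give each run its own copy, so nothing is shared."""
    return f"run{run_number:05d}_{name}"


# Small demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    gdml_path = os.path.join(tmp, "SetupSingleTop.gdml")
    write_gdml(gdml_path, "<gdml></gdml>")
    content = read_gdml(gdml_path)

unique_name = prefixed_gdml_name(42, "SetupSingleTop.gdml")
```

The lock only helps if every job that reads or writes the file takes it, and whether `flock` is honoured across cluster nodes depends on the shared filesystem and how it is mounted. The per-run prefix avoids sharing altogether, at the cost of one temporary file per job.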

However, we face another problem if we wish to use the auto flag for the runNumber: the runNumber is also accessed simultaneously by the jobs. A straightforward solution is to specify the runNumber manually, but it would be interesting to have a way to block concurrent access. I believe semaphores would do it.
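As a sketch of what blocking access to the runNumber could look like: the example below uses an exclusive file lock (`fcntl.flock`) instead of a semaphore, since a lock on the counter file itself is easy to show in a few lines. The counter file and the function `next_run_number` are assumptions for illustration; this is not how REST stores the runNumber.

```python
import fcntl
import os
import tempfile


def next_run_number(counter_path):
    """Read, increment and store a run number kept in a small text file.

    The whole read-modify-write sequence happens under an exclusive lock,
    so two jobs calling this at the same time cannot get the same number.
    """
    # O_CREAT creates the file on first use without truncating an existing one.
    fd = os.open(counter_path, os.O_RDWR | os.O_CREAT, 0o644)
    with os.fdopen(fd, "r+") as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # blocks until no other job holds it
        try:
            content = f.read().strip()
            run_number = int(content) + 1 if content else 1
            f.seek(0)
            f.truncate()
            f.write(str(run_number))
            f.flush()
            os.fsync(f.fileno())  # push the new value to disk before unlocking
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)
    return run_number


# Small demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    counter = os.path.join(tmp, "runNumber")
    numbers = [next_run_number(counter) for _ in range(3)]
```

As with the GDML file, this relies on the shared filesystem honouring advisory locks between nodes, which would need to be checked on the cluster in question.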

It is probably system dependent

This is the problem I experience from time to time. It has just happened in the pipeline.

But if I press Retry, I am sure it will succeed.

And it succeeded.