Check the job.err file for the cause of the error.
Sometimes the issue is due to the machine, and the job can just be re-triggered. Some examples of this are:
Fatal error in PMPI_Init_thread
Address not mapped to object
An MPI_Abort in the file means the model failed.
This is usually due to an instability which occurs occasionally in this version of the model at high resolution.
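The distinction above can be sketched as a small check on the contents of job.err. This is only an illustration: the signature strings are the ones quoted above, while the function name, the category labels, and the choice to check MPI_Abort first are assumptions rather than part of the suite.

```python
# Signatures quoted in the text above. The names and categories are illustrative.
MACHINE_SIGNATURES = (
    "Fatal error in PMPI_Init_thread",
    "Address not mapped to object",
)
MODEL_SIGNATURE = "MPI_Abort"


def classify_job_err(text: str) -> str:
    """Classify the contents of job.err.

    Returns 'model-failure' if an MPI_Abort is present, 'machine-fault' if
    only a machine-related signature is present, and 'unknown' otherwise.
    Checking MPI_Abort first is an assumption made for this sketch.
    """
    if MODEL_SIGNATURE in text:
        return "model-failure"
    if any(signature in text for signature in MACHINE_SIGNATURES):
        return "machine-fault"
    return "unknown"
```

A 'machine-fault' result suggests the job can simply be re-triggered; a 'model-failure' result means the component logs need to be inspected.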
To see which component has caused the error, check the following files:
work/<cycle>/coupled/ocean.output
work/<cycle>/coupled/pe_output/1.<suite-id>.fort6.pe0000
job.out
See below for instructions on getting past these instabilities. These should be documented in the suite page.
If these solutions do not work, you will need to go back and apply the remedies to the previous cycle. See the instructions for re-running the previous cycle.
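As a rough aid for identifying the failing component, the log text can be searched for the messages quoted in the rest of this section. The sketch below assumes those messages appear verbatim in the logs; the dictionary, the component labels, and the function name are hypothetical.

```python
# Error messages as quoted in this section, grouped by component.
# The grouping labels and function name are illustrative only.
COMPONENT_SIGNATURES = {
    "ocean": (
        "stpctl: the zonal velocity is larger than",
        "NEMO abort",
    ),
    "atmosphere": (
        "North/South halos too small for advection",
    ),
    "sea ice": (
        "Vertical thermo error",
    ),
}


def failing_components(log_text: str) -> list[str]:
    """Return the components whose known error messages occur in log_text."""
    return [
        component
        for component, signatures in COMPONENT_SIGNATURES.items()
        if any(signature in log_text for signature in signatures)
    ]
```

The function would be applied to the contents of each of the files listed above in turn; an empty result means none of the known messages was found and the log needs to be read by hand.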
Instabilities in the ocean will appear as:
stpctl: the zonal velocity is larger than 20 m/s
======
kt= 87159 max abs(U): 6.0407E+07, i j k: 2914 3389 8
...
===>>> : E R R O R
===========
MPPSTOP
NEMO abort from dia_wri_state
Re-run the cycle with a reduced timestep in the ocean.
In rose-suite.conf, change the following:
CLOCK=7,0,0
UM_OPT_KEYS='... orca12_config_3min_1m'
Run rose suite-run --reload and re-trigger the failed task.
Once the failed task completes successfully, revert to the original settings in rose-suite.conf:
CLOCK=4,0,0
UM_OPT_KEYS='... orca12_config_5min_1m'
Run rose suite-run --reload again, and re-trigger the paused task.
Atmosphere instabilities usually appear as:
Error message: North/South halos too small for advection
Perturb the atmos dump by submitting /work/n02/n02/annette/scripts/perturb/submit_perturb.slurm with sbatch.
The script leaves a copy of the original and perturbed dumps as .orig and .perturb.
It also sets PYTHONHASHSEED=0 to create a deterministic perturbation.
Once the script has completed and generated a new dump, re-trigger the failed task.
CICE instabilities appear as:
ABORT: Global i and j: 3347, 3604
ABORT: Lat, Lon: 83.826494669668421, 73.000001621592844
ABORT: aice: 0.9873568057277784
...
About to call abort ice with: Vertical thermo error
ice: Vertical thermo error
In this case, reduce the ocean timestep (see above).
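The timestep change and its later reversal amount to toggling two settings in rose-suite.conf. A minimal sketch of that edit on the file's text is shown below, assuming each setting sits on its own line starting with CLOCK= or UM_OPT_KEYS=; the function name is hypothetical, and the values are the ones quoted in the ocean instructions above.

```python
# Values quoted in the ocean timestep instructions above.
REDUCED = {"clock": "7,0,0", "key": "orca12_config_3min_1m"}
ORIGINAL = {"clock": "4,0,0", "key": "orca12_config_5min_1m"}


def set_ocean_timestep(conf_text: str, reduced: bool) -> str:
    """Return conf_text with the reduced or the original timestep settings.

    Assumes (for this sketch) that CLOCK= and UM_OPT_KEYS= each start a line.
    All other lines, and the rest of UM_OPT_KEYS, are left untouched.
    """
    target, other = (REDUCED, ORIGINAL) if reduced else (ORIGINAL, REDUCED)
    lines = []
    for line in conf_text.splitlines():
        if line.startswith("CLOCK="):
            line = "CLOCK=" + target["clock"]
        elif line.startswith("UM_OPT_KEYS="):
            line = line.replace(other["key"], target["key"])
        lines.append(line)
    return "\n".join(lines) + "\n"
```

After writing the modified text back, the suite still needs rose suite-run --reload and a re-trigger of the failed task, as described above; the same function with reduced=False restores the original settings.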