5. Solving Common UM Problems
Aims
- In this section you will learn:
How to troubleshoot common UM errors
How to stop, reload and restart a workflow
This section exposes you to more typical UM errors and hints at how to find and fix those errors.
You may encounter other errors, often as a result of mistyping, for which solution hints are not provided.
5.1. Set up N96 GA7.0 AMIP workflow
Find and make a copy of suite u-dp084.
Firstly make the essential changes required to run the workflow. That is:
The account code (‘n02-training’ if you’re on an organised training event)
Your ARCHER2 user name
The queue to run in.
Hint
Look in the suite conf section. For organised training events you will see that this suite is set up to use reservations listed as Tuesday, Wednesday, Thursday; select the appropriate day. For self-study switch off “Use a reservation on ARCHER2” and select the short queue.
5.2. Errors resolved in the code extraction
Save the suite and then Run it.
The workflow should fail in the fcm_make_um task. This is the task that extracts all the required code from the repository including any branches. The failure will be indicated in the Cylc TUI/GUI with a red square and the state failed.
Question
What is the error?
Examine the job.err and job.out files to find the cause of the problem, either on the command line or via the Cylc GUI or TUI.
Hint
- In the Cylc TUI:
Click on the fcm_make_um task
Press the <Enter> key and select Log
Press the <Enter> key to bring up the “Select File” menu
Select the file you wish to view
Press the <Enter> again.
- In the Cylc GUI:
Click on the fcm_make_um task
Select Log to view the job logs
From the “Select File” dropdown select the file you wish to view.
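On the command line the same logs can be read directly from the run directory. The following is a minimal sketch, assuming the standard Cylc 8 layout in which job logs live under log/job/<cycle>/<task>/NN/ and NN links to the latest submission. The helper names job_log_path and show_job_log_tail are made up for illustration.

```shell
# Print the path of a job log file for the latest submission of a task.
# Assumes the Cylc 8 layout: <run-dir>/log/job/<cycle>/<task>/NN/<file>,
# where NN is a symlink to the most recent submit number.
job_log_path() {
    # $1 = run directory, $2 = cycle point, $3 = task name, $4 = job.err or job.out
    printf '%s/log/job/%s/%s/NN/%s\n' "$1" "$2" "$3" "$4"
}

# Show the last 20 lines of a job log, which is often where the failure is reported.
show_job_log_tail() {
    log=$(job_log_path "$1" "$2" "$3" "$4")
    if [ -f "$log" ]; then
        tail -n 20 "$log"
    else
        echo "No such log file: $log" >&2
        return 1
    fi
}
```

For example, calling show_job_log_tail with the run directory ~/cylc-run/<workflow-name>/run1, the cycle point of the task, fcm_make_um and job.err prints the end of the error log for the failed extraction.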
The error indicates that the branch cannot be found due to an incorrect branch name. You will need to look at the UM code repository through Trac on MOSRS (https://code.metoffice.gov.uk/trac/um/browser) to determine the correct name.
To fix the error go to panel fcm_make_um –> env –> Sources and correct the branch name in um_sources.
Save the suite.
Now stop the suite and then re-run it.
On the PUMA2 command line type:
puma2$ cylc stop <workflow-name>
puma2$ cylc vip <workflow-name>
Note
You can also stop the suite from the Cylc TUI or GUI by selecting <workflow-name>/run1 and selecting Stop from the pop up menu.
The suite will fail in the fcm_make_um task again.
Question
What is the error?
Hint
Again look in the job.err file. This kind of error results when changes made in two or more branches affect the same bit of code and the FCM system cannot work out how to resolve them.
Question
Which file does the problem occur in?
In practice, you would need to edit the code branch to fix the problem with the code conflict. To proceed in this case, navigate to fcm_make_um –> env –> Sources and remove the branch called vn13.5_training_merge_error by clicking on it and then clicking the - sign.
Save the suite.
Last time we stopped the suite and then re-ran it. However, it is possible to reload the suite definition and then re-trigger the failed task without first stopping the running suite. To do this, change to the suite directory:
puma2$ cd ~/roses/<workflow-name>
We then reload the suite definition by running the following Cylc command:
puma2$ cylc vr <workflow-name>
Enter y when asked if you wish to Continue [y/n]. Wait for this command to complete before continuing.
Finally in the Cylc TUI or GUI select the failed task and then select Trigger.
The fcm_make_um task will then submit again.
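If you prefer the command line, the Trigger step can also be done with cylc trigger. This is a minimal sketch, assuming Cylc 8 task IDs of the form <workflow-name>//<cycle>/<task>. The helper name retrigger_task and the DRY_RUN switch are hypothetical and exist only for this illustration.

```shell
# Re-trigger a task from the command line instead of the TUI/GUI.
# With DRY_RUN=1 the command is only printed, not executed.
retrigger_task() {
    # $1 = workflow name, $2 = cycle point, $3 = task name
    task_id="$1//$2/$3"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "cylc trigger $task_id"
    else
        cylc trigger "$task_id"
    fi
}
```

Running it first with DRY_RUN=1 lets you check the task ID before anything is sent to the running workflow.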
Question
Is there an error in fcm_make_um this time?
If you look in the job.err file now it should be empty and the job.out file indicates SUCCESS.
5.3. Errors resolved in the compile and run
Questions
Has the fcm_make2_um (compilation) task completed successfully?
You should have a failure. Open the job.err file - what does it indicate?
Which routine has an error?
What is the error?
What line of the Fortran file does it occur on?
In practice, you would need to fix the error in your branch on PUMA2 and then restart the suite. In this case, navigate to fcm_make_um –> env –> Sources and remove the branch vn13.5_training_compile_error. Save the suite, Stop the failed run and then Run it again.
Tip
This time we chose to shut down the failed suite rather than do a reload. In this scenario we need to redo the code extraction (fcm_make_um) step, so doing a reload would be slightly more complex; you would need to Reload and then Trigger both the fcm_make_um and the fcm_make2_um tasks. With experience you get to know when it’s better to do a Reload and when to Stop a suite.
Note again that the task submitted successfully.
Questions
Did the fcm_make2_um task succeed this time?
What about the install_ainitial task?
What is the error?
Does the start dump exist?
What is the name of the correct start dump?
Hint
Look in the directory where it thinks the start file should be - is there a candidate in there?
Point your suite to the correct start dump. Fixing this problem isn’t quite as easy as it sounds. A search in the Rose edit GUI for the dump file name co764a.da19880901_00_err will not locate anything. For this suite it is not possible to fix this issue through the GUI; for some other suites you can edit the initial dump location in the panel um –> namelist –> Reconfiguration and Ancillary Control –> General technical options.
Suites can be and are set up differently and there will be times when you need to edit the suite definition files directly.
In your suite directory on PUMA2 (~/roses/<suite-name>) use grep -r to search for the start dump name co764a.da19880901_00_err in the suite files. You should see 2 occurrences listed:
ros@puma2$ grep -r co764a.da19880901_00_error *
site/meto_cray.cylc:{% set AINITIAL = AINITIAL_DIR + 'N96L85/co764a.da19880901_00_error' %}
site/archer2.cylc:{% set AINITIAL = AINITIAL_DIR + 'N96L85/co764a.da19880901_00_error' %}
Edit the dump name in the appropriate .cylc file for the HPC we are running on, to point to the correct initial dump file.
Hint
This workflow is set up to run on multiple platforms; make sure you edit the file appropriate to ARCHER2.
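The edit can be made in any text editor. As an alternative, here is a minimal sketch of a scripted replacement that keeps a backup copy of the file. The helper name replace_dump_name is hypothetical, and the replacement name you pass in must be the correct dump you identified earlier; no name is assumed here.

```shell
# Replace every literal occurrence of one dump file name with another in a
# suite file, keeping a backup copy alongside it as <file>.bak.
replace_dump_name() {
    # $1 = file to edit, $2 = current name, $3 = replacement name
    # Escape characters that are special to sed so both names are taken literally.
    old=$(printf '%s' "$2" | sed 's|[][\.*^$/]|\\&|g')
    new=$(printf '%s' "$3" | sed 's|[\/&]|\\&|g')
    cp "$1" "$1.bak" &&
    sed "s/$old/$new/g" "$1.bak" > "$1"
}
```

For example, call it with site/archer2.cylc, the faulty name shown in the grep output, and the correct dump name. Compare the result against the .bak copy before reloading the suite.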
Reload the suite definition and then Trigger the install_ainitial task. The task should succeed this time.
Question
Has the model run successfully?
This time the model should have failed with an error.
Question
What is the error message?
Hint
Try searching for ERROR - you will soon learn common phrases to help track down problems.
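On the command line this search can be done with grep. A minimal sketch follows; the helper name find_errors is made up for illustration, and the search ignores case so that it also catches lines such as "Error" or "error".

```shell
# List lines mentioning "error" (any case) in the given log files,
# prefixed with the file name and line number.
find_errors() {
    # The extra /dev/null argument makes grep print the file name
    # even when only one log file is given.
    grep -n -i 'error' "$@" /dev/null
}
```

For example, run find_errors job.err job.out from the job log directory of the atmos_main task.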
Question
Which PE Ranks signalled the Abort?
In general it can be useful to note which processors failed and then look at the detailed output for those processors. In this scenario, however, all the processors aborted. We’ll now take a look at the individual PE output file. Change to the pe_output directory for the atmos_main task. This is under ~/cylc-run/<workflow-name>/runX/work/<cycle>/atmos_main/pe_output.
Open the file called <workflow-name>.fort6.pe0. Sometimes extra information about the error can be found in the individual PE output files.
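When only some processors fail, it helps to find out which PE output files mention the problem before opening them. A minimal sketch, assuming the per-PE files follow the <workflow-name>.fort6.pe<N> naming described above; the helper name pe_files_matching is hypothetical.

```shell
# List the per-PE output files in a pe_output directory that contain a pattern.
pe_files_matching() {
    # $1 = pattern (matched case-insensitively), $2 = pe_output directory
    grep -l -i -- "$1" "$2"/*.fort6.pe*
}
```

For example, pe_files_matching error . run from inside the pe_output directory lists every PE file that contains the word "error".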
Question
At what timestep did the error occur?
The error message indicates that the model has suffered a convergence failure in the routine EG_BICGSTAB_MIXED_PREC. This means that the solver was not able to find a solution to the requested accuracy with the amount of effort specified (here, the maximum number of iterations). In this case the failure results from the value chosen for gcr_max_iterations. To find a suitable value you could check what setting similar models use (with the MOSRS repository you have access to all model setups), or look at the help within rose edit, which may point you in the right direction. Go to um –> namelist –> UM Science Settings –> Sections 10 11 12 - Dynamics settings –> Solver and set gcr_max_iterations to the suggested value. Save, Reload and Re-trigger.
The model should fail with the same error. So what’s gone wrong here? We’ve changed the number of iterations to a recommended value, so why didn’t it work? The first thing to check is that the new value has indeed been passed to the model. We do this by checking the variable in the namelists which are written by the Rose system.
On ARCHER2 navigate to the work directory for the atmos_main task (i.e. ~/cylc-run/<workflow-name>/runX/work/<cycle>/atmos_main). In here you will see several files with uppercase names (e.g. ATMOSCNTL, SHARED); these contain the Fortran namelists which are read into the model. Have a look inside one of them to see the structure. Now search (use grep) in these files for the maximum number of solver iterations variable, gcr_max_iterations.
Hint
Search for the string gcr_max_iterations=.
Question
What value does it have?
Is this what you changed it to in the Rose edit GUI?
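The namelist search can be sketched as a small helper. The name show_namelist_value is made up for illustration; it looks only at regular files directly inside the work directory, so subdirectories such as pe_output are skipped.

```shell
# Print each setting of a namelist variable found in the files directly
# inside a task work directory, as "<file>:<matching line>".
show_namelist_value() {
    # $1 = variable name, $2 = task work directory (defaults to the current one)
    for f in "${2:-.}"/*; do
        # Skip subdirectories; /dev/null forces grep to print the file
        # name even though only one real file is searched at a time.
        if [ -f "$f" ]; then
            grep "$1=" "$f" /dev/null || true
        fi
    done
}
```

For example, show_namelist_value gcr_max_iterations run from the atmos_main work directory shows which namelist file sets the variable and the value the model actually read.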
So why was the change not picked up? Go back to view the setting in the Rose GUI. By the side of the variable gcr_max_iterations there is a little icon of a hand on paper; this indicates that there is an “optional configuration override” for this variable.
Optional configuration overrides add to or overwrite the default configuration. They make it easier to switch between different configurations of the model, for example between different resolutions.
Click on the icon and the list of overrides appears. You will see that the variable is set to 1 in the training override file and it is this value that is being used in the model. Unfortunately optional configuration override files cannot be changed through the GUI so we will need to edit the Rose file directly. Override files for the um app live in the directory ~/roses/<suite-id>/app/um/opt. Open the file rose-app-training.conf and edit the value for gcr_max_iterations. Save, Reload and Re-trigger the suite.
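To see at a glance where a variable is set, you can search the main app configuration and its optional configuration files together. A minimal sketch, assuming the usual Rose layout of rose-app.conf plus opt/rose-app-<key>.conf files inside the app directory; the helper name show_app_settings is hypothetical.

```shell
# Show every place a variable is set in a Rose app: the main rose-app.conf
# and any optional configuration files under opt/.
show_app_settings() {
    # $1 = variable name, $2 = app directory, e.g. the app/um directory of the suite
    for f in "$2"/rose-app.conf "$2"/opt/rose-app-*.conf; do
        if [ -f "$f" ]; then
            # Print "<file>:<line number>:<setting>" for each match.
            grep -n "^$1=" "$f" /dev/null || true
        fi
    done
}
```

For example, show_app_settings gcr_max_iterations with the directory ~/roses/<suite-id>/app/um lists the value in the main configuration alongside the value in rose-app-training.conf, which makes the override easy to spot.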
Check the gcr_max_iterations variable in the namelist file again to confirm that it does now have the correct value. This time the model should run successfully. Check the output to confirm that there are no errors. Check that the model converged at all time steps.