Are there any known causes of IRIS entering a deadlock/hang state?
Based on your experience, do you know any reason why IRIS would enter a deadlock/hang state?
When this occurs, it's no longer possible to connect to the Portal or Studio, even though the IRIS service (IRIS.EXE processes) is still active. CPU/memory/network usage is usually very low (i.e. it does not happen because the server is overloaded). The only fix is a full restart of IRIS (e.g. by clicking the IRIS icon in the notification area and choosing the appropriate action).
I had that issue on a production server a few weeks ago. Any request sent to IRIS would lead to a timeout (and it was no longer possible to open Studio or the Portal). The only solution was a restart of the IRIS service. Apache seemed fine. Inspecting the logs and running a performance report (^pButtons) over the following days did not help me find out what went wrong.
I did some research and found at least two ways to recreate similar behavior:
1) Too many locks are created (far more than the locksiz parameter allows).
This simple loop will crash the system in a few seconds (do not try it!), requiring a full restart.
for i=1:1:1000000 {
    set ^A(i) = ""
    lock +^A(i)
}

Since locks use shared memory (sized by the gmheap parameter), could something else (e.g. string allocation) consume a lot of shared memory and leave very little for the locks themselves?
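For what it's worth, a more defensive variant of the loop above would use a timed LOCK so the process can at least detect that a lock was not granted instead of waiting forever. This is only a sketch (untested), and whether it avoids the hang once the lock table itself is exhausted probably depends on the version:

// Defensive variant: a zero-second timeout makes LOCK give up immediately
// and report the result in $TEST instead of waiting.
for i=1:1:1000000 {
    set ^A(i) = ""
    lock +^A(i):0
    if '$TEST {
        // the lock was not granted: report it and exit the FOR loop
        write "Lock on ^A(",i,") was not granted, stopping",!
        quit
    }
}
// argumentless LOCK releases every lock held by this process
lock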
2) There is no more free space on the disk where the journal is located.
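A minimal sketch of a check that could warn before the journal disk fills up, assuming %File.GetDirectorySpace and that flag 2 reports sizes in GB (the directory path and the 5 GB threshold below are just placeholders, use your instance's actual journal directory):

// Warn when free space in the journal directory drops below a threshold.
set dir = "c:\InterSystems\IRIS\mgr\journal\"  // hypothetical path
set sc = ##class(%File).GetDirectorySpace(dir, .free, .total, 2)  // flag 2 = values in GB
if $system.Status.IsOK(sc), +free < 5 {
    write "Journal disk is low on space: ", free, " GB free", !
}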
Do you know of any other reasons that can lead to the system being down (the symptoms I describe at the top of my post)?
Comments
Locks are allocated from gmheap, but you can reserve space for them in advance with locksiz.
So what is specified in locksiz is already "reserved" from gmheap? (i.e. you cannot run out of memory for locks because of excessive gmheap usage).
Locksiz is only allocated as needed from gmheap, so if gmheap is used up you could be unable to take out further locks.
I think this is only true for locksiz=0, which is the default.
If you set it to an explicit value, that is what gets used.
If you look at the locksiz class reference, it is described as "An upper bound on the amount of shared memory heap (see gmheap) that is allowed to be consumed by the lock table as a result of application-level locks."
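For reference, here is a minimal sketch to read the values an instance is currently running with; it assumes the Config.config API in the %SYS namespace and that the two settings are exposed under the property names "gmheap" and "locksiz" (check the class reference for your version):

// Run with sufficient privileges; the Config.* classes live in %SYS.
new $namespace
set $namespace = "%SYS"
set sc = ##class(Config.config).Get(.props)
if $system.Status.IsOK(sc) {
    write "gmheap:  ", $get(props("gmheap")), !
    write "locksiz: ", $get(props("locksiz")), !
}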
"Deadlock" is too broad to describe any possibility that could cause the instance to hang. I would recommend reaching out to the WRC/support when that occurs so they can analyze the system with you.
FWIW, the first place I would look is messages.log, which should point to the next investigative steps. Alexander's IRIShung suggestion is also a good one.