Troubleshooting an Ensemble instance
One of our development Ensemble instances is misbehaving. We suspect we'll just need to reinstall it - which would be a hassle. Before we do, we wanted to check we weren't missing some kind of easy fix. The symptoms we are seeing:
- A Cache process is running at 100% CPU on one core of the server - it's the TASKMGR process
- That Cache process resumes at 100% on Ensemble restart, and indeed after server reboot
- There might be evidence of corruption in the task schedule: there's a "next scheduled date" of 1840-12-31 00:05...! (yes, we know that's $HOROLOG zero), and a Description message that looks like it could be a badly copied/misaligned pointer from a previous description. See screengrab, highlighted:
- Opening any DTLs in the management portal, in Studio, or via VS Code, across all namespaces, results in "ERROR in page definition: ERROR #6301: SAX XML Parser Error: XML or TEXT declaration must start at line 1, column 1 while processing Schema at line 2 offset 7"
We don't know whether these symptoms are related or not. Nor can we identify anything in what we've done on that server recently that might correlate to problems starting.
Before we scrub and reinstall the Ensemble instance, anything you might try, places you might look?
Thanks!!
Comments
I would check the following:
1. Check the %SYS.Task class with SQL but also do an Integrity check, to see if there are any errors on those globals that hold that task manager data.
2. If the "corrupted"/"copied" task (with $h=0) is the one that consumes 100% of CPU, I would try to "re-schedule" it to see if the new "next date" is set to something else. If not, delete it (you don't need to re-create it; it looks like 1001 is a copy of 1000)
3. Monitor the 100% CPU task (SMP or JOBEXAM) to try to understand which commands it is "stuck" on
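For anyone unsure how to run checks 1 and 3, a rough sketch from a terminal session follows. It assumes a Cache/Ensemble terminal with administrative privileges, and that Name and Description are exposed as SQL columns of %SYS.Task (an assumption; adjust the column list to what your version provides). ^Integrity and ^JOBEXAM are interactive utilities, so enter the commands one at a time and follow their prompts.

```objectscript
// Sketch only: the comment lines are for the reader, not for typing in.
zn "%SYS"

// (1) List the scheduled tasks via SQL. The column list is an assumption.
set rs = ##class(%SQL.Statement).%ExecDirect(, "SELECT ID, Name, Description FROM %SYS.Task")
do rs.%Display()

// (1) Interactive integrity check; it prompts for what to check.
do ^Integrity

// (3) Interactive job examiner; find the TASKMGR process and look at
// the routine and state it reports.
do ^JOBEXAM
```

If the SQL query itself fails or hangs, that is useful information in its own right, since it points at the task data rather than at a single task.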
Thanks. Not sure how to do all the things you suggested (1 or 3), but tried (2) using the Scheduled Tasks page in management portal. Couldn't suspend or delete the problematic task - getting SQL errors about not being able to lock or access a table.
I'm not answering your question, but just as a side note - if you use Docker as a dev environment, you don't need to bother installing Ensemble or IRIS at all. You can switch IRIS version at any moment if something goes wrong or you need a fresher version.
I would try to export all the classes of the production and import them into a new namespace. After that, try to open a DTL and check if it works. If there is no error, the problem would be a corruption in the original namespace.
Thanks for suggestion, but we already know it affects all namespaces 😥
Have you tried using the console version of task manager?
I would do that first.
If you have another Ensemble instance, you could figure out where in the system the tasks are stored by using journalling and making a change to a task.
Once you know where the task is stored, you can just kill or change the associated global.
Thanks Alexander. Terminal Task Manager:
- can delete other tasks, but
- trying to delete the "corrupted" one, I get: "ERROR #5803: Failed to acquire exclusive lock on instance of '%SYS.Task'" (same as when using Management Portal)
Re figuring out where system tasks are stored via journalling: I understand the principle of what you are saying, but we reckon the effort in doing that is at least as great as scrubbing and reinstalling - we lose some config (we've got it documented, but the developer who did it originally has left), but no important running code.
Some odd questions:
1) Is the license still valid?
2) Is it possible to suspend the Task Schedule before changing the problem item?
Alex, thanks. Yes, license is still valid. And no, not possible to suspend before changing problem item - getting database locking errors when I try. No idea how we managed to corrupt this quite so thoroughly!!
From terminal / csession do you get "1" returned when running:
write ##class(%SYS.TaskSuper).SuspendSet(1)
Failing that, if the server is not currently being used, I would stop the Ensemble service, start it in Emergency Mode, and disable the TaskScheduler / remove the problem schedule. Then shut down the Ensemble service from Emergency Mode and start Ensemble up as normal.
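If the SuspendSet call returns something other than 1, the reason may be recoverable. The sketch below assumes the return value is a standard %Status, which is not confirmed anywhere in this thread, and uses the local variable name sc purely for illustration.

```objectscript
// Sketch only: assumes SuspendSet returns a %Status value.
zn "%SYS"
set sc = ##class(%SYS.TaskSuper).SuspendSet(1)

// Print the decoded error text if the call did not succeed.
if $system.Status.IsError(sc) do $system.Status.DisplayError(sc)
```

An error here that matches the lock error seen when deleting the task would suggest the same underlying cause.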
For the sake of having an "accepted" answer on this post: we never got to the bottom of what was going on, and scrubbed and reinstalled Ensemble. Thanks for suggestions - some useful learning, even if we were never able to get to the bottom of what was going on.