Written by

Integrations Developer at NHS Tayside
Question Colin Brough · Jun 6, 2023

Troubleshooting an Ensemble instance

One of our development Ensemble instances is misbehaving. We suspect we'll just need to reinstall it - which would be a hassle. Before we do, we wanted to check we weren't missing some kind of easy fix. The symptoms we are seeing:

  1. A Cache process is running at 100% CPU on one core of the server - it's the TASKMGR process
  2. That Cache process resumes at 100% on Ensemble restart, and indeed after server reboot
  3. There might be evidence of corruption in the task schedule: there's a "next scheduled date" of 1840-12-31 00:05...! (yes, we know that's $HOROLOG zero), and a Description message that looks like it could be a badly copied/misaligned pointer from a previous description. See screengrab, highlighted:
  4. Opening any DTLs in the management portal, in Studio, or via VS Code, across all namespaces, results in "ERROR in page definition: ERROR #6301: SAX XML Parser Error: XML or TEXT declaration must start at line 1, column 1 while processing Schema at line 2 offset 7"
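
As a quick check on symptom 3, and assuming the schedule holds its next run time in $HOROLOG form (days,seconds; not confirmed on this instance), a value of "0,300" would display as exactly the date and time shown in the task list:

```objectscript
 ; $HOROLOG counts days from 31 Dec 1840 (day 0) and seconds since midnight.
 ; "0,300" is an illustrative value, not one read from the affected instance.
 write $zdatetime("0,300",3)   ; ODBC format: 1840-12-31 00:05:00
```

That would be consistent with the date part of the field having been zeroed or never set while the time part (00:05) survived.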

We don't know whether these symptoms are related or not. Nor can we identify anything in what we've done on that server recently that might correlate to problems starting.

Before we scrub and reinstall the Ensemble instance, anything you might try, places you might look?

Thanks!!

Product version: Ensemble 2018.1
$ZV: Cache for Windows (x86-64) 2018.1 (Build 184U) Wed Sep 19 2018 09:09:22 EDT

Comments

Yaron Munz · Jun 6, 2023

I would check the following:

1. Query the %SYS.Task class with SQL, and also run an integrity check to see whether there are any errors in the globals that hold the task manager data.

2. If the "corrupted"/"copied" task (with $h=0) is the one consuming 100% of CPU, I would try to re-schedule it to see whether the new "next date" is set to something else. If not, delete it (you don't need to re-create it; it looks like 1001 is a copy of 1000).

3. Monitor the 100% CPU process (SMP or JOBEXAM) to try to understand which commands it's "stuck" on.
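
For anyone unsure how to run steps 1 and 3, here is a rough terminal sketch. It assumes a 2018.1 install with the standard %SYS utilities, and that the task data sits in the system database (CACHESYS). The query only uses the ID, Name and Description columns. ^INTEGRIT and ^JOBEXAM are interactive, so their prompts are not reproduced here.

```objectscript
 zn "%SYS"                  ; switch to the system namespace

 ; Step 1a: open the SQL shell, then at its prompt run
 ;   SELECT ID, Name, Description FROM %SYS.Task
 do $system.SQL.Shell()

 ; Step 1b: interactive integrity check; choose the system database
 ; (assumed here to be CACHESYS) when asked what to check
 do ^INTEGRIT

 ; Step 3: interactive process viewer; find the busy TASKMGR job and note
 ; the routine and line it is currently executing
 do ^JOBEXAM
```

If the routine and line shown by ^JOBEXAM never change between refreshes, that would point to a tight loop rather than a long-running but progressing task.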

Colin Brough  Jun 7, 2023 to Yaron Munz

Thanks. Not sure how to do all the things you suggested (1 or 3), but tried (2) using the Scheduled Tasks page in management portal. Couldn't suspend or delete the problematic task - getting SQL errors about not being able to lock or access a table.

Evgeny Shvarov · Jun 6, 2023

I'm not answering your question, but just as a side note: if you use Docker as a dev environment, you don't need to bother installing Ensemble or IRIS at all. You can switch IRIS versions at any moment if something goes wrong or you need a fresher version.

Luis Angel Pérez Ramos · Jun 6, 2023

I would try exporting all the classes of the production and importing them into a new namespace. After that, try to open a DTL and check whether it works. If there is no error, the problem would be corruption in the original namespace.

Colin Brough  Jun 7, 2023 to Luis Angel Pérez Ramos

Thanks for the suggestion, but we already know it affects all namespaces 😥

Alexander Pettitt · Jun 7, 2023

Have you tried using the console version of task manager?

I would do that first.

If you have another Ensemble instance, you could figure out where in the system the tasks are stored by using journalling and making a change to a task.

Once you know where the task is stored, just kill or change the associated global.

Colin Brough  Jun 7, 2023 to Alexander Pettitt

Thanks Alexander. Terminal Task Manager:

  • can delete other tasks, but
  • trying to delete the "corrupted" one gives: "ERROR #5803: Failed to acquire exclusive lock on instance of '%SYS.Task'" (same as when using Management Portal)
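
Error #5803 indicates that some other process already holds a lock on that %SYS.Task object. One way to see which process it is, sketched here on the assumption that the ^LOCKTAB utility is available in %SYS on this version, is to list the lock table and compare the owner's process ID with the one using 100% CPU:

```objectscript
 zn "%SYS"
 ; Interactive lock table viewer: lists each lock reference and its owner.
 ; It can also delete locks, which is best left alone until the owner is known.
 do ^LOCKTAB
```

If the owner turned out to be the runaway TASKMGR process itself, that would at least tie symptoms 1 and 3 together.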

Re figuring out where system tasks are stored via journalling: I understand the principle of what you are saying, but we reckon the effort of doing that is probably at least as great as scrubbing and reinstalling - we lose some config (we've got it documented, but the developer who did it originally has left), but no important running code.

Alex Woodhead · Jun 7, 2023

Some odd questions:

1) Is the license still valid?

2) Is it possible to suspend the Task Schedule before changing the problem item?

Colin Brough  Jun 12, 2023 to Alex Woodhead

Alex, thanks. Yes, license is still valid. And no, not possible to suspend before changing problem item - getting database locking errors when I try. No idea how we managed to corrupt this quite so thoroughly!!

Alex Woodhead  Jun 12, 2023 to Colin Brough

From terminal / csession do you get "1" returned when running:

write ##class(%SYS.TaskSuper).SuspendSet(1)

Failing that, if the server is not currently being used, I would stop the Ensemble service, start it in Emergency Mode, and disable the TaskScheduler / remove the problem schedule. Then shut down the Ensemble service from Emergency Mode and start Ensemble up as normal.
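
A sketch of that sequence, with placeholders throughout. The emergency start is usually done from an administrator command prompt in the instance's bin directory with something like `ccontrol start <instance> /EmergencyId=<user>,<password>`; treat that syntax as an assumption to check against the documentation for your version. Once logged in to the terminal as the emergency user:

```objectscript
 zn "%SYS"
 ; Suspend the task manager (same call as suggested above)
 write ##class(%SYS.TaskSuper).SuspendSet(1)

 ; Then try removing the problem task through the console task manager menus
 do ^TASKMGR
```

Whether the delete succeeds will depend on whether the lock on the task is still held in Emergency Mode, which is not something this sketch can guarantee.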

Colin Brough · Jul 18, 2023

For the sake of having an "accepted" answer on this post: we never got to the bottom of what was going on, and scrubbed and reinstalled Ensemble. Thanks for the suggestions - some useful learning, even without finding the root cause.
