Written by

Senior Developer at Greater Manchester Mental Health Services
Question Andy Stobirski · Nov 14, 2023

Cleaning text, removing characters which break XPATH

Hi All

I'm having a problem cleaning user-entered text from a healthcare system that my HealthConnect system interfaces with.

The input can be anything entered into an RTF box in an app, which is stored in Oracle and extracted by HealthConnect via an XML-based API.

When the XML is returned, various values are read out of it using %XML.XPATH.Document, and the presence of certain characters entered into the RTF fields causes XPATH to throw an error. For example:

  • Unicode character 8211 (en dash, U+2013) causes XPATH to throw ERROR #6901: XSLT XML Transformer Error: invalid character 0x19 in at line n offset nnnnn
  • Unicode character 8217 (right single quotation mark, U+2019) causes ERROR #6901: XSLT XML Transformer Error: invalid character 0x19 in at line n offset nnnnn

(Note that the character code causing the error is incorrectly identified)

Obviously, I can find and swap out the individual characters using $REPLACE, or even work out which characters break XPATH and write a routine to operate on the XML and clean it. These solutions seem inelegant and clumsy to me. Can anyone suggest anything simpler or better?

Cheers

Andy

Product version: IRIS 2022.1
$ZV: IRIS for Windows (x86-64) 2021.2.1 (Build 654U) Fri Mar 18 2022 06:09:35 EDT

Comments

Enrico Parisi · Nov 14, 2023

I suspect you have some inconsistency in the Character Encoding in your XML.

Is the XML Character Encoding declared? If yes, how?

I.e. does the first line contain something like "<?xml version="1.0" encoding="utf-8"?>"?

How are you creating the %XML.XPATH.Document instance from your XML?

It would be helpful if you could post a small code sample that reproduces the issue.

Enrico

Andy Stobirski  Nov 14, 2023 to Enrico Parisi

Thanks for your prompt reply. I can't post anything without editing the text and changing the XML structure, as it's confidential patient data from a proprietary system. I'll look into what I can do.
I can answer a few of your questions, though:
 XML Character Encoding is declared as <?xml version="1.0" encoding="UTF-8"?>

This is the XPATH I'm using

// code salient points
#dim tDocument as %XML.XPATH.Document
Set tSC=##class(%XML.XPATH.Document).CreateFromString(pXML,.tDocument)
Set tSC=tDocument.EvaluateExpression(pContext, pExpression,.tResults)

The XML response is retrieved as a string by an Operation that uses an EnsLib.SOAP.OutboundAdapter adapter. Here's the salient code:

// Salient code
Set ..Adapter.WebServiceURL  = ..URL
Set ..Adapter.WebServiceClientClass = "rocessMessageSoap"
Set tSC = ..Adapter.InvokeMethod("ProcessMessage",.ProcessMessageResult,tRequestMessage.requestMessageXml)  Quit:$$$ISERR(tSC) tSC
Set tSC = tRequestMessage.NewResponse(.pResponse)  Quit:$$$ISERR(tSC) tSC
Set pResponse.ProcessMessageResult=$get(ProcessMessageResult)

//where pResponse.ProcessMessageResult contains the XML response we are analysing
Enrico Parisi  Nov 14, 2023 to Andy Stobirski

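It seems that character 8211 (en dash) is held in your string as a wide (UTF-16) character rather than as UTF-8 bytes, so the data does not match the declared encoding. Google is your best friend here, as I'm not an expert in Unicode, UTF-8, UTF-16 etc.! 😊 Converting the string to UTF-8 before parsing works for me: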

Set xml="<?xml version=""1.0"" encoding=""UTF-8""?>"
Set xml=xml_"<Text>This is n-dash "_$wc(8211)_" in xml</Text>"
Set xml=$ZCONVERT(xml,"O","UTF8")
Set sc=##class(%XML.XPATH.Document).CreateFromString(xml, .xmlDoc)
Write sc
Set sc=xmlDoc.EvaluateExpression("/Text","text()",.result)
Write result.GetAt(1).Value,!

Result:

This is n-dash – in xml
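
The same conversion can be wrapped in a small helper with status checks. This is only a sketch: the method name EvaluateUtf8 and the class it would live in are hypothetical, and it assumes the incoming string holds Unicode characters (not bytes that are already UTF-8 encoded, which would end up double-encoded).

```objectscript
/// Hypothetical helper (not from the thread): evaluate an XPath expression
/// against XML held in an IRIS Unicode string.
ClassMethod EvaluateUtf8(pXML As %String, pContext As %String, pExpression As %String, Output pResults As %ListOfObjects) As %Status
{
    // Encode wide characters such as $wc(8211) as UTF-8 bytes so the data
    // matches the encoding="UTF-8" declaration in the XML prolog.
    Set tUtf8 = $ZCONVERT(pXML, "O", "UTF8")

    // Parse the encoded string; stop here if the parser reports an error.
    Set tSC = ##class(%XML.XPATH.Document).CreateFromString(tUtf8, .tDocument)
    If $SYSTEM.Status.IsError(tSC) Quit tSC

    // Evaluate the expression and hand back the status to the caller.
    Quit tDocument.EvaluateExpression(pContext, pExpression, .pResults)
}
```

A call might look like Set tSC = ##class(My.Utils).EvaluateUtf8(pResponse.ProcessMessageResult, "/Text", "text()", .tResults), where My.Utils is a placeholder class name.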

Enrico

Andy Stobirski  Nov 15, 2023 to Enrico Parisi

Thanks for your reply - that did the trick!

I did discover the $ZCONVERT function, but it never worked for me because I was converting to CP1252 (ANSI) and not UTF-8 as you did! Don't know why I did that 😐!
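
For anyone hitting the same thing, the difference between the two conversions can be inspected by dumping the bytes each one produces. This is a minimal sketch for a terminal session, and it assumes a translation table named "CP1252" is available on the instance. In UTF-8 the en dash is the three bytes E2 80 93, while in Windows-1252 it is the single byte 96, which is not a valid UTF-8 sequence and so cannot be parsed under an encoding="UTF-8" declaration.

```objectscript
    // Sample text containing an en dash (Unicode code point 8211)
    Set text = "dash "_$wc(8211)

    // Encode the same string two ways
    Set utf8 = $ZCONVERT(text, "O", "UTF8")
    Set cp1252 = $ZCONVERT(text, "O", "CP1252")  // table name is an assumption

    // Dump the resulting bytes in hex for comparison
    Write "UTF-8:",!
    ZZDUMP utf8
    Write !,"CP1252:",!
    ZZDUMP cp1252
```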
