Question Michael Lundberg · Apr 15, 2024

Extract XML from text

Hello!

I wonder if anyone has a smart idea to extract an XML fragment inside a text document (incoming from a stream)?

The XML fragment is surrounded by plain text.

Example:

text...........
text...........
<?xml version="1.0" encoding="UTF-8"?>
<Start>
...etc
</Start>
text...........
text...........

The XML is not represented by any class or object in the Namespace.

The XML can look different from time to time.

I would appreciate it if anyone knows how to use ObjectScript to extract the XML content.

Regards Michael

Product version: IRIS 2023.1

Comments

Julius Kavay  Apr 15, 2024 to Cristiano Silva

As you wrote, %XML.TextReader is used to read arbitrary XML documents. "A text where in the middle a little bit of XML structure sits" isn't XML!

Maybe there is a Python library for extracting XML from a text. If not, you probably have to read character by character, counting each "<" (+1) and ">" (-1); when the counter is 0, the text between the first "<" and the last ">" is probably a correct XML structure. Oh, and don't forget about <![CDATA[...]]> sequences, which make the reading more challenging.
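
A rough Python sketch of this scanning idea follows. It is a variant rather than the exact counter described above: a raw "<"/">" count returns to 0 after every single tag, so the sketch tracks how many elements are currently open instead. The function name extract_xml is made up for illustration, and the regular expression is a simplification that assumes attribute values never contain "<" or ">".

```python
import re

# One alternative per construct that starts with "<". CDATA sections,
# comments and processing instructions are matched first so that any
# tag-like text inside them is skipped as a unit.
TOKEN = re.compile(
    r"<!\[CDATA\[.*?\]\]>"          # CDATA section
    r"|<!--.*?-->"                  # comment
    r"|<\?.*?\?>"                   # XML declaration / processing instruction
    r"|<(/?)([A-Za-z_][\w.:-]*)(?:\s[^<>]*?)?(/?)>",  # start, end or empty tag
    re.DOTALL,
)

def extract_xml(text):
    """Return the first complete XML fragment found in text, or None."""
    start = None   # offset where the fragment begins
    depth = 0      # number of currently open elements
    for m in TOKEN.finditer(text):
        closing, name, empty = m.group(1), m.group(2), m.group(3)
        if name is None:
            # CDATA, comment or processing instruction: depth is unchanged.
            if start is None and m.group().startswith("<?xml"):
                start = m.start()
            continue
        if closing:
            if depth == 0:
                continue            # stray end tag in the surrounding text
            depth -= 1
        else:
            if start is None:
                start = m.start()   # no declaration seen: begin at the root tag
            if not empty:
                depth += 1
        if depth == 0:
            return text[start:m.end()]
    return None
```

Because nesting is tracked, an inner element that happens to share the root element's name does not end the fragment early.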

Cristiano Silva  Apr 17, 2024 to Julius Kavay

Hi @Julius Kavay, you are correct. I missed the part:

The XML fragment is surrounded by plain text.

Michael Lundberg · Apr 16, 2024

Hello and thanks for your answers. However, it is not possible to pass the stream to %XML.TextReader as it is without the status reporting an error. This is because the stream is not pure XML but is mixed with rubbish from other content.

I probably have to sit and extract the XML content manually as Julius describes. Thought I could get away with it :0)

Robert Cemper  Apr 16, 2024 to Michael Lundberg

If the XML content is well formed,
it might be sufficient to remove all leading text before
<?xml version="1.0" encoding="UTF-8"?>

Michael Lundberg  Apr 16, 2024 to Robert Cemper

Hello
Yes, probably, for the text that is before the XML block. The problem is that there is also text after the end tag, and the end tag can have different names.

Herman Slagman  Apr 16, 2024 to Michael Lundberg

The start tag would be right after the XML declaration, i.e. <StartTag (the element name ends when a space or > is encountered); the end tag would then be </StartTag. From there, find the closing bracket >.
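
The same idea as a minimal Python sketch, for illustration only. The function name extract_by_root_tag is hypothetical. It assumes that an XML declaration is present, that the root element is not an empty element such as <Start/>, and that no nested element shares the root element's name.

```python
import re

def extract_by_root_tag(text):
    """Return the XML declaration plus root element, or None if not found."""
    decl_start = text.find("<?xml ")
    if decl_start == -1:
        return None
    decl_end = text.find("?>", decl_start)
    if decl_end == -1:
        return None
    # The root element name runs from "<" up to the first whitespace, "/" or ">".
    # Excluding "!" and "?" skips comments and processing instructions.
    root = re.compile(r"<([^\s/>!?]+)").search(text, decl_end + 2)
    if root is None:
        return None
    name = root.group(1)
    # The end tag is "</name", optional whitespace, then the closing ">".
    end = re.compile(r"</" + re.escape(name) + r"\s*>").search(text, root.end())
    if end is None:
        return None
    return text[decl_start:end.end()]
```

Each lookup that fails returns None instead of a partial result, so the caller can tell "no XML found" apart from an extracted fragment.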

Manel Trèmols  Apr 16, 2024 to Michael Lundberg

Hi Michael,

Something like this ?

Search where "<?xml " starts

Search where it ends (first >)

Get first tag after xml header

Find where this tag ends

Remove characters in the middle.

test
	set complex=1
	set crlf=$c(13,10)
	set file="text 1"
	set file=file_crlf_"text 2"
	set file=file_crlf_"<?xml version=""1.0"" encoding='UTF-8'?>"
	if complex {
		set file=file_crlf_"<Results xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance'"
		set file=file_crlf_"     xmlns='urn:tcleDoctorReport'"
		set file=file_crlf_"         xsi:schemaLocation='urn:tcleDoctorReport DoctorReport.xsd'>"
	} else {
		set file=file_crlf_"<Results>"
	}
	
	set file=file_crlf_"	<ReportPageFormat/>"
	set file=file_crlf_"	<Department>"
	set file=file_crlf_"		<Section>"
	set file=file_crlf_"			<TestSet>"
	set file=file_crlf_"				<TestSetDesc>Blood Culture (Aerobic+Anaerobic)</TestSetDesc>"
	set file=file_crlf_"			</TestSet>"
	set file=file_crlf_"			<TestSet>"
	set file=file_crlf_"				<TestSetDesc>Blood Culture Positive Result</TestSetDesc>"
	set file=file_crlf_"			</TestSet>"
	set file=file_crlf_"		</Section>"
	set file=file_crlf_"	</Department>"
	set file=file_crlf_"	<EpisodeData>"
	set file=file_crlf_"		<EpisodeNumber>240000100</EpisodeNumber>"
	set file=file_crlf_"		<FirstName>Lily</FirstName>"
	set file=file_crlf_"	</EpisodeData>"
	set file=file_crlf_"</Results>"
	set file=file_crlf_"text 3"
	set file=file_crlf_"text 4"
	set xmlheadstart=$f(file,"<?xml ")-6
	set xmlheadend=$f(file,">",xmlheadstart)-1
	;zzdump $e(file,xmlheadstart,xmlheadend)
	set firsttag=$tr($p($e(file,xmlheadend+1,*),">",1)_">",$c(13,10))
	;zzdump firsttag
	set tag=$p($e($p(firsttag," ",1),2,*),">",1)
	;write !,tag
	set xmlend=$f(file,"</"_tag_">")
	
	zzdump $e(file,1,xmlheadstart-1)_$e(file,xmlend,*)

What I get:

USER>d ^test2
0000: 74 65 78 74 20 31 0D 0A 74 65 78 74 20 32 0D 0A         text 1..text 2..
0010: 0D 0A 74 65 78 74 20 33 0D 0A 74 65 78 74 20 34         ..text 3..text 4
USER>

Regards

Manel

Michael Lundberg  Apr 17, 2024 to Manel Trèmols

Thanks Manel

It worked great. Admittedly, I got the surrounding text out when I actually wanted the XML out, but with your example I was able to turn it around and get the XML out.

Working string: XMLstr

set xmlheadstart=$f(XMLstr,"<?xml ")-6
set xmlheadend=$f(XMLstr,">",xmlheadstart)-1
set firsttag=$tr($p($e(XMLstr,xmlheadend+1,*),">",1)_">",$c(13,10))
set tag=$p($e($p(firsttag," ",1),2,*),">",1)
set xmlend=$f(XMLstr,"</"_tag_">")
set NewXMLstr = $EXTRACT(XMLstr,xmlheadstart,xmlend-1)

Quit NewXMLstr

The NewXMLstr variable now contains the entire XML fragment.
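
One caveat that the thread does not cover: searching for the first "</tag>" stops too early if an element with the same name as the root is nested inside it. A cheap sanity check is to try parsing the extracted string before using it. The Python sketch below shows such a check. The helper name is_well_formed is made up for illustration, and it assumes that encoding the fragment as UTF-8 matches what the XML declaration says.

```python
import xml.etree.ElementTree as ET

def is_well_formed(fragment):
    """Return True if fragment parses as a single well-formed XML document."""
    try:
        # Parse bytes rather than str so the parser applies the encoding
        # named in the declaration itself (assumed to be UTF-8 here).
        ET.fromstring(fragment.encode("utf-8"))
    except ET.ParseError:
        return False
    return True
```

A fragment that was cut off too early, or that still carries surrounding text, fails this check, which is a signal to fall back to a more careful extraction.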
 
Many thanks!

Regards Michael
