Extract XML from text
Hello!
I wonder if anyone has a smart idea to extract an XML fragment inside a text document (incoming from a stream)?
The XML fragment is surrounded by plain text.
Example:
text...........
text...........
<?xml version="1.0" encoding="UTF-8"?>
<Start>
...etc
</Start>
text...........
text...........
The XML is not represented by any class or object in the namespace.
The XML can look different from time to time.
I would appreciate it if anyone knows how to use ObjectScript to extract the XML content.
Regards Michael
Comments
Hi Michael.
The class %XML.TextReader is used to read arbitrary XML documents.
As you wrote, %XML.TextReader is used to read arbitrary XML documents. "A text with a little bit of XML structure sitting in the middle" isn't XML!
Maybe there is a Python library for extracting XML from a text. If not, you probably have to read character by character, count each "<" (+1) and ">" (-1), and if the counter is 0 then between the first "<" and the last ">" you probably have a correct XML structure. Oh, and don't forget about <![CDATA[...]]> sequences, which make the reading more challenging.
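A rough sketch of the character-scanning idea, under stated assumptions. It is a variation rather than the literal counter described above: a raw "<"/">" counter drops back to 0 after every single tag, so this version counts element depth instead and stops when the root element closes. The class and method names are hypothetical. It assumes the first "<" in the text belongs to the XML, that no attribute value contains ">", and that a DOCTYPE (if any) has no internal subset.

```objectscript
/// Hypothetical demo class; not from the thread.
Class Demo.XmlScan
{

/// Returns the text from the first "<" up to the end of the root element, or "" on failure.
ClassMethod Scan(text As %String) As %String
{
    // $find returns the position AFTER the match, or 0 when nothing matches
    set start = $find(text, "<") - 1
    if start < 1 { return "" }
    set pos = start, depth = 0
    while pos > 0 {
        // pos always points at a "<"
        if $extract(text, pos, pos + 3) = "<!--" {
            // comment: skip to its end
            set next = $find(text, "-->", pos + 4)
        } elseif $extract(text, pos, pos + 8) = "<![CDATA[" {
            // CDATA section: its content may contain "<" and ">"
            set next = $find(text, "]]>", pos + 9)
        } elseif $extract(text, pos, pos + 1) = "<?" {
            // XML declaration or processing instruction
            set next = $find(text, "?>", pos + 2)
        } elseif $extract(text, pos, pos + 1) = "<!" {
            // DOCTYPE (simplified handling)
            set next = $find(text, ">", pos + 2)
        } else {
            set next = $find(text, ">", pos + 1)
            if next = 0 { return "" }
            if $extract(text, pos + 1) = "/" {
                // end tag
                set depth = depth - 1
            } elseif $extract(text, next - 2) '= "/" {
                // start tag (a self-closing tag leaves depth unchanged)
                set depth = depth + 1
            }
            if depth < 0 { return "" }
            // depth 0 here means the root element has just been closed
            if depth = 0 { return $extract(text, start, next - 1) }
        }
        if next = 0 { return "" }
        set pos = $find(text, "<", next) - 1
    }
    return ""
}

}
```

Called as set xml = ##class(Demo.XmlScan).Scan(text). If the surrounding plain text can itself contain "<", the scan would need a more reliable starting point, for example the position of the XML declaration.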
Hi @Julius Kavay you are correct. I missed the part:
The XML fragment is surrounded by plain text.
Hello and thanks for your answers. However, it is not possible to pass the stream to %XML.TextReader as it is without the status reporting an error. This is because the stream is not pure XML; the XML is surrounded by rubbish from other content.
I probably have to sit and extract the XML content manually as Julius describes. Thought I could get away with it :0)
If the XML content is well formed,
it might be sufficient to remove all the text that precedes
<?xml version="1.0" encoding="UTF-8"?>
Hello
Yes, probably, for the text that comes before the XML block. The problem is that there is also text after the end tag, and the end tag can have different names.
The start tag would be right after the XML declaration, i.e. <StartTag (the element name ends at the first space or >), and the end tag would then be </StartTag. From there, find the closing bracket >
Hi Michael,
Something like this ?
Search where "<?xml " starts
Search where it ends (first >)
Get first tag after xml header
Find where this tag ends
Remove characters in the middle.
test
    set complex=1
    set crlf=$c(13,10)
    set file="text 1"
    set file=file_crlf_"text 2"
    set file=file_crlf_"<?xml version=""1.0"" encoding='UTF-8'?>"
    if complex {
        set file=file_crlf_"<Results xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance'"
        set file=file_crlf_" xmlns='urn:tcleDoctorReport'"
        set file=file_crlf_" xsi:schemaLocation='urn:tcleDoctorReport DoctorReport.xsd'>"
    } else {
        set file=file_crlf_"<Results>"
    }
    set file=file_crlf_" <ReportPageFormat/>"
    set file=file_crlf_" <Department>"
    set file=file_crlf_" <Section>"
    set file=file_crlf_" <TestSet>"
    set file=file_crlf_" <TestSetDesc>Blood Culture (Aerobic+Anaerobic)</TestSetDesc>"
    set file=file_crlf_" </TestSet>"
    set file=file_crlf_" <TestSet>"
    set file=file_crlf_" <TestSetDesc>Blood Culture Positive Result</TestSetDesc>"
    set file=file_crlf_" </TestSet>"
    set file=file_crlf_" </Section>"
    set file=file_crlf_" </Department>"
    set file=file_crlf_" <EpisodeData>"
    set file=file_crlf_" <EpisodeNumber>240000100</EpisodeNumber>"
    set file=file_crlf_" <FirstName>Lily</FirstName>"
    set file=file_crlf_" </EpisodeData>"
    set file=file_crlf_"</Results>"
    set file=file_crlf_"text 3"
    set file=file_crlf_"text 4"
    set xmlheadstart=$f(file,"<?xml ")-6
    set xmlheadend=$f(file,">",xmlheadstart)-1
    ;zzdump $e(file,xmlheadstart,xmlheadend)
    set firsttag=$tr($p($e(file,xmlheadend+1,*),">",1)_">",$c(13,10))
    ;zzdump firsttag
    set tag=$p($e($p(firsttag," ",1),2,*),">",1)
    ;write !,tag
    set xmlend=$f(file,"</"_tag_">")
    zzdump $e(file,1,xmlheadstart-1)_$e(file,xmlend,*)
What I get:
USER>d ^test2
0000: 74 65 78 74 20 31 0D 0A 74 65 78 74 20 32 0D 0A         text 1..text 2..
0010: 0D 0A 74 65 78 74 20 33 0D 0A 74 65 78 74 20 34         ..text 3..text 4
USER>
Regards
Manel
Thanks Manel
It worked great. Admittedly, I got the surrounding text out when I actually wanted the XML out. But by your example I was able to turn it around and get the XML out.
Working string: XMLstr
set xmlheadstart=$f(XMLstr,"<?xml ")-6
set xmlheadend=$f(XMLstr,">",xmlheadstart)-1
set firsttag=$tr($p($e(XMLstr,xmlheadend+1,*),">",1)_">",$c(13,10))
set tag=$p($e($p(firsttag," ",1),2,*),">",1)
set xmlend=$f(XMLstr,"</"_tag_">")
set NewXMLstr = $EXTRACT(XMLstr,xmlheadstart,xmlend-1)
Quit NewXMLstr
The NewXMLstr variable now contains the entire XML fragment.
Many thanks!
Regards Michael
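For anyone reusing the approach above, here is a slightly more defensive sketch of the same $find-based idea. The class name Demo.XmlExtract and both method names are hypothetical, not something from the thread. It returns "" instead of a bogus substring when $find comes back with 0, lets the root element name end at whitespace, "/" or ">", and then hands the result to %XML.TextReader as a sanity check. It assumes the root element follows the XML declaration directly (no comment or DOCTYPE in between) and that no nested element shares the root element's name.

```objectscript
/// Hypothetical demo class; the names are not from the thread.
Class Demo.XmlExtract
{

/// Returns the XML fragment embedded in text, or "" if none is found.
ClassMethod Extract(text As %String) As %String
{
    // $find returns the position AFTER the match, or 0 when nothing matches
    set afterDecl = $find(text, "<?xml ")
    if afterDecl = 0 { return "" }
    set xmlStart = afterDecl - 6

    // The XML declaration ends at "?>"
    set afterHead = $find(text, "?>", afterDecl)
    if afterHead = 0 { return "" }

    // Assumption: the next "<" opens the root element
    set nameStart = $find(text, "<", afterHead)
    if nameStart = 0 { return "" }

    // The element name ends at the first space, tab, CR, LF, "/" or ">"
    set rest = $extract(text, nameStart, *)
    set tag = $piece($translate(rest, " /"_$char(9, 10, 13), ">>>>>"), ">", 1)
    if tag = "" { return "" }

    // Assumption: the first "</tag>" closes the root element
    set afterEnd = $find(text, "</"_tag_">", nameStart)
    if afterEnd = 0 { return "" }

    return $extract(text, xmlStart, afterEnd - 1)
}

/// Extracts the fragment and lets %XML.TextReader confirm that it parses.
ClassMethod ExtractAndCheck(text As %String, Output xml As %String) As %Status
{
    set xml = ..Extract(text)
    if xml = "" { return $system.Status.Error(5001, "No XML fragment found") }
    return ##class(%XML.TextReader).ParseString(xml, .reader)
}

}
```

Usage would look like set sc = ##class(Demo.XmlExtract).ExtractAndCheck(XMLstr, .xml), followed by a check of sc with $system.Status.IsError(sc) before using xml.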