How to read a PDF file that contains images
Hi team,
I have a PDF file which has some report data and images (like X-rays) in it. How can I read the text from a PDF file in Caché?
Thanks
Akshay
Comments
What does read mean?
1. I have a PDF file that I need to read from a folder location as text, put the data into an HL7 message, and send it to a downstream system.
2. I have a PDF file that I need to read from a folder location, encode in Base64, and put in OBX.5 of an MDM message.
1. I have a PDF file that I need to read from a folder location as text, put the data into an HL7 message, and send it to a downstream system.
Do you mean OCR/text layer extraction?
2. I have a PDF file that I need to read from a folder location, encode in Base64, and put in OBX.5 of an MDM message.
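For case 2 no text extraction is needed: the file is read as raw bytes, Base64-encoded, and placed in OBX.5 as an ED (encapsulated data) value. Below is a minimal Python sketch, assuming an IRIS version with Embedded Python (or any external Python step). The function name pdf_to_obx and the OBX.3 identifier PDF^Report are placeholders, not anything from this thread, so use whatever your receiving system expects.

```python
import base64
from pathlib import Path


def pdf_to_obx(pdf_path: str, set_id: int = 1) -> str:
    """Build an OBX segment whose OBX.5 carries the PDF as Base64 (ED datatype)."""
    # Read the file as binary; a PDF must never be read as text.
    raw = Path(pdf_path).read_bytes()
    # b64encode returns one continuous line without newlines.
    encoded = base64.b64encode(raw).decode("ascii")
    # ED components: source application ^ type of data ^ data subtype ^ encoding ^ data
    obx5 = "^".join(["", "application", "pdf", "Base64", encoded])
    # OBX.3 "PDF^Report" is a hypothetical identifier; OBX.4 (sub-ID) is left empty.
    fields = ["OBX", str(set_id), "ED", "PDF^Report", "", obx5]
    return "|".join(fields)
```

The Base64 alphabet (A-Z, a-z, 0-9, +, /, =) contains none of the default HL7 delimiters, so the encoded data needs no escaping. For large files, check the maximum message or field size that the downstream system accepts.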
Do you mean OCR/text layer extraction?
Yes.
If there's a text layer, use LibreOffice to convert to txt (InterSystems IRIS wrapper). For OCR you'll need a third-party tool; Tesseract, for example, can be used easily with Embedded Python.
UPD: Unfortunately, LibreOffice can't extract text from PDFs. Here's an Embedded Python solution:
Class User.PDF
{

/// zw ##class(User.PDF).GetText("/tmp/example.pdf", .text)
ClassMethod GetText(file, Output text) As %Status
{
    try {
        #dim sc As %Status = $$$OK
        kill text
        set dir = $system.Util.ManagerDirectory()_"python"
        do ##class(%File).CreateDirectoryChain(dir)
        // pip3 install --target /data/db/mgr/python --ignore-requires-python typing==3.10.0.0
        try {
            set pypdf2 = $system.Python.Import("PyPDF2")
        } catch {
            // PyPDF2 is not installed yet: install it into <mgr>/python and retry the import
            set cmd = "pip3"
            set args($i(args)) = "install"
            set args($i(args)) = "--target"
            set args($i(args)) = dir
            set args($i(args)) = "PyPDF2==2.10.0"
            set args($i(args)) = "dataclasses"
            set args($i(args)) = "typing-extensions==3.10.0.1"
            set args($i(args)) = "--upgrade"
            // $ZF(-100) returns the process exit code (0 on success), not a %Status,
            // so it must not be stored in sc
            set rc = $ZF(-100, "", cmd, .args)
            set pypdf2 = $system.Python.Import("PyPDF2")
        }
        return:'$d(pypdf2) $$$ERROR($$$GeneralError, "Unable to load PyPDF2")
        kill pypdf2
        set text = ..GetTextPy(file)
    } catch ex {
        set sc = ex.AsStatus()
    }
    quit sc
}

ClassMethod GetTextPy(file) [ Language = python ]
{
    from PyPDF2 import PdfReader
    reader = PdfReader(file)
    text = ""
    # Concatenate the text layer of every page, one page per line break
    for page in reader.pages:
        text += page.extract_text() + "\n"
    return text
}

}
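Note that PyPDF2 only reads the text layer. Pages that are pure images (scanned reports, X-rays) come back empty or nearly empty, and those are the ones that would need OCR. Here is a rough sketch of how to flag them, assuming you collect the per-page results of extract_text() into a list. The helper name pages_needing_ocr and the 20-character threshold are arbitrary choices for illustration, not part of PyPDF2.

```python
def pages_needing_ocr(page_texts, min_chars=20):
    """Return 0-based indexes of pages whose extracted text is too short
    to be a real text layer and that are therefore candidates for OCR."""
    needs_ocr = []
    for index, text in enumerate(page_texts):
        # Treat a missing result (None) like an empty string, then drop whitespace.
        cleaned = (text or "").strip()
        if len(cleaned) < min_chars:
            needs_ocr.append(index)
    return needs_ocr
```

With PyPDF2 the input could be built as [page.extract_text() for page in reader.pages]. The flagged pages would then go to an OCR tool such as Tesseract, and the threshold should be tuned to your own documents.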