Hi,

I've had a look round the web and drew a blank. Can anyone help?

TIA
Mark Patterson
Hi,

Yes, this is quite difficult. Most tools are expensive, I have not found a cheap way yet, and the free Adobe OCX is very limited.

First of all you need to understand that PDF is a page description format derived from PostScript, not a plain text file. Some PDFs contain no text at all, only pictures; that depends on the scanner or the PDF writer that was used. Normally you can check this by opening the PDF in Adobe Reader and seeing whether you can select the text: do you get a text cursor (text) or a rectangle selection (picture)? It is possible that you can select text only because the application is doing OCR on the fly to extract it. There are also PDFs that mix text and pictures. Finally, there are compressed and non-compressed PDFs (this also depends on the scanner / PDF writer).

There is a good PDF tool that allows you to save the PDF as readable text (if possible, see above). Download PDFTK
http://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/
and read the help file.

If you want to search in the PDF, you can do it in a dirty way. Open your PDF in Notepad. Do you see any readable characters? If not, try the uncompress operation of pdftk. If you still can't read the text, the PDF is probably a picture. If you can read text (with extra code characters in between), you can try to write your own parser in Delphi to search for it.

{code}
var
  ReadFile: TextFile;
  TextLine: string;
begin
  AssignFile(ReadFile, FFileName);      // FFileName: path of the (uncompressed) PDF
  SetLineBreakStyle(ReadFile, tlbsLF);  // treat LF as the line break
  Reset(ReadFile);
  try
    while not Eof(ReadFile) do
    begin
      ReadLn(ReadFile, TextLine);
      // ..Pos ..Copy ..Delete ..  (search TextLine and cut out what you need)
    end;
  finally
    CloseFile(ReadFile);
  end;
end;
{code}
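As a rough illustration of the "write your own parser" idea: in an uncompressed content stream, visible text usually appears as literal strings in parentheses in front of a `Tj` or `TJ` operator. The sketch below collects those strings from one line. `ExtractLiteralStrings` is a hypothetical helper name, and the handling of escapes is deliberately simplified (the character after a backslash is copied as-is, so `\n` and octal escapes are not decoded). PDFs that use hex strings or custom font encodings will not give readable output this way.

```pascal
program PdfLiteralStrings;
{$APPTYPE CONSOLE}

// Returns the concatenated contents of all (...) literal strings in Line.
// Simplified: nested parentheses are kept, backslash escapes are not decoded.
function ExtractLiteralStrings(const Line: string): string;
var
  I, Depth: Integer;
begin
  Result := '';
  Depth := 0;
  I := 1;
  while I <= Length(Line) do
  begin
    case Line[I] of
      '\':
        begin
          // Escaped character: skip the backslash, keep the next character
          // literally if we are inside a string.
          Inc(I);
          if (Depth > 0) and (I <= Length(Line)) then
            Result := Result + Line[I];
        end;
      '(':
        begin
          // A nested "(" belongs to the text; the outermost one does not.
          if Depth > 0 then
            Result := Result + '(';
          Inc(Depth);
        end;
      ')':
        begin
          if Depth > 0 then
            Dec(Depth);
          if Depth > 0 then
            Result := Result + ')';
        end;
    else
      // Ordinary character: keep it only while inside a string.
      if Depth > 0 then
        Result := Result + Line[I];
    end;
    Inc(I);
  end;
end;

begin
  WriteLn(ExtractLiteralStrings('(Hello World) Tj'));
  WriteLn(ExtractLiteralStrings('[(Hel) -20 (lo)] TJ'));
end.
```

A function like this could be called on each `TextLine` inside the read loop, followed by `Pos` to look for the search term in the returned text.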
Hi,

I use this:

http://www.verypdf.com/
http://www.verypdf.com/pdf2txt/pdf2txt.htm

thousands of PDFs... without problems...

Nils

> Hi,
>
> I've had a look round the web and drew a blank. Can anyone help?
>
> TIA
>
> Mark Patterson
> > I've had a look round the web and drew a blank. Can anyone help?
>

Try searching on "Delphi ifilter"

David
On 09.11.2010 12:45, Mark Patterson wrote:
> I've had a look round the web and drew a blank. Can anyone help?

In addition to the (mostly commercial) libraries for Delphi, and depending on the target system configuration and your knowledge of other programming languages, another option would be to simply call a non-Delphi application.

There are many high-quality open source PDF libraries for the Java platform, such as JPedal, whose text extraction functions are documented at http://www.jpedal.org/support_Extraction.php

Hope this helps
--
Michael Justin
habarisoft - Enterprise Messaging Software for Delphi®
http://www.habarisoft.com/
> Yes, this is quite difficult. Most tools are expensive, I have not found a cheap
> way yet, and the free Adobe OCX is very limited.
>
> First of all you need to understand that PDF is a page description format derived
> from PostScript, not a plain text file. Some PDFs contain no text at all, only
> pictures; that depends on the scanner or the PDF writer that was used.
> Normally you can check this by opening the PDF in Adobe Reader and seeing whether
> you can select the text: do you get a text cursor (text) or a rectangle selection (picture)?

The one I was testing on definitely allows select and copy.

> It is possible that you can select text only because the application is doing OCR
> on the fly to extract it. There are also PDFs that mix text and pictures.
> Finally, there are compressed and non-compressed PDFs (this also depends on the
> scanner / PDF writer).
>
> There is a good PDF tool that allows you to save the PDF as
> readable text (if possible, see above).
>
> Download PDFTK
> http://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/
> and read the help file.

I tried that but it didn't seem to do anything.
> Nils Bödeker wrote:
> Hi
>
> I use this:
>
> http://www.verypdf.com/
> http://www.verypdf.com/pdf2txt/pdf2txt.htm
>
> thousands of PDFs... without problems...

Ta, I tried it and found that, unless you pay, it limits how much of a given file it will process.
> David Wilcockson wrote:
>
> Try searching on "Delphi ifilter"

Thanks, I tried that, especially the code here:
http://www.experts-exchange.com/Programming/Languages/Pascal/Delphi/Q_20293579.html

But for the PDF I wanted, it only output what looks like a header. It even finishes with the words "Regards Carmel Mulhern Company Secretary", as if this is all that we are supposed to get. It's a company's Financial Results, and I am trying to get the data from it.

Regards
Mark
Mark Patterson wrote:
> Hi,
>
> I've had a look round the web and drew a blank. Can anyone help?
>

I've spent a lot of time on this and have found two commercial libraries that I use. You might want to take a look at these:

QuickPDF, Gnostice PDF Toolkit
Hi,

Did you try to uncompress the PDF first? You can also try to convert the PDF to another (lower) version. If you open the PDF in a text editor (Notepad), the first characters of the file give the version number.

1) Use the tool to uncompress (to be sure..)
2) Convert to version 1.3
3) Use the tool to extract the text.

Maybe your PDF is just a picture and the text selection tool in your PDF editor is doing OCR after the selection. It is difficult to say without an example.
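The version check does not need Notepad; it can be done from Delphi as well. A minimal sketch, assuming the file starts with the usual `%PDF-x.y` header (for example `%PDF-1.4`). `ReadPdfVersion` is a hypothetical helper name, not part of any library.

```pascal
program PdfVersion;
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

// Reads the first 8 bytes of the file and returns the version part of a
// '%PDF-x.y' header (e.g. '1.4'), or '' if no such header is found.
function ReadPdfVersion(const FileName: string): string;
var
  Stream: TFileStream;
  Header: array[0..7] of AnsiChar;
  BytesRead: Integer;
  S: AnsiString;
begin
  Result := '';
  Stream := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
  try
    BytesRead := Stream.Read(Header, SizeOf(Header));
  finally
    Stream.Free;
  end;
  // Copy only the bytes that were actually read into a string.
  SetString(S, PAnsiChar(@Header[0]), BytesRead);
  if Copy(S, 1, 5) = '%PDF-' then
    Result := string(Copy(S, 6, 3));
end;

var
  Version: string;
begin
  if ParamCount < 1 then
    WriteLn('Usage: PdfVersion <file.pdf>')
  else
  begin
    Version := ReadPdfVersion(ParamStr(1));
    if Version = '' then
      WriteLn('No %PDF- header found')
    else
      WriteLn('PDF version: ', Version);
  end;
end.
```

This only reports the version; the uncompress and convert steps still have to be done with an external tool.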
I am using Xpdf (http://www.foolabs.com/xpdf/): I convert the PDF to text and then parse it.
Mark Patterson wrote in news:304270@forums.embarcadero.com:

>> David Wilcockson wrote:
>>
>> Try searching on "Delphi ifilter"
>
> Thanks, I tried that, especially the code here:
> http://www.experts-exchange.com/Programming/Languages/Pascal/Delphi/Q_20293579.html
>
> But for the PDF I wanted, it only output what looks like a header. It
> even finishes with the words "Regards Carmel Mulhern Company
> Secretary", as if this is all that we are supposed to get. It's a
> company's Financial Results, and I am trying to get the data from it.
>
> Regards
> Mark

Could it be that the text you want is included as an image?
Nils Bödeker wrote:
> I use this:
>
> http://www.verypdf.com/
> http://www.verypdf.com/pdf2txt/pdf2txt.htm

Developer licence $2000. OUCH!!!

--
Andy Syms
Technosoft Systems Ltd
> Stanko Milošev wrote:
> I am using http://www.foolabs.com/xpdf/, convert PDF to text and then parse it...

+1

It's free, you just have to call one exe, and you'll get your text in a file. Very easy to use with Delphi.
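For anyone who wants to try the "call one exe" approach, here is a minimal sketch for Windows. It assumes the `pdftotext.exe` from the Xpdf download is on the PATH or next to your program, and uses its basic form `pdftotext <PDF-file> <text-file>`. `RunAndWait` and `ExtractPdfText` are hypothetical helper names made up for this example.

```pascal
program PdfToTextDemo;
{$APPTYPE CONSOLE}

uses
  Windows, SysUtils, Classes;

// Starts a command line hidden, waits until it finishes, returns its exit code.
function RunAndWait(const CommandLine: string): Cardinal;
var
  StartupInfo: TStartupInfo;
  ProcessInfo: TProcessInformation;
  Cmd: string;
begin
  Cmd := CommandLine;
  UniqueString(Cmd); // CreateProcess may modify the command line buffer
  FillChar(StartupInfo, SizeOf(StartupInfo), 0);
  StartupInfo.cb := SizeOf(StartupInfo);
  StartupInfo.dwFlags := STARTF_USESHOWWINDOW;
  StartupInfo.wShowWindow := SW_HIDE;
  if not CreateProcess(nil, PChar(Cmd), nil, nil, False, 0, nil, nil,
    StartupInfo, ProcessInfo) then
    RaiseLastOSError;
  try
    WaitForSingleObject(ProcessInfo.hProcess, INFINITE);
    GetExitCodeProcess(ProcessInfo.hProcess, Result);
  finally
    CloseHandle(ProcessInfo.hProcess);
    CloseHandle(ProcessInfo.hThread);
  end;
end;

// Converts PdfFile to TxtFile with pdftotext and returns the text.
function ExtractPdfText(const PdfFile, TxtFile: string): string;
var
  Lines: TStringList;
begin
  // Assumption: pdftotext.exe can be found via the PATH or the current folder.
  if RunAndWait(Format('pdftotext.exe "%s" "%s"', [PdfFile, TxtFile])) <> 0 then
    raise Exception.Create('pdftotext reported an error for ' + PdfFile);
  Lines := TStringList.Create;
  try
    Lines.LoadFromFile(TxtFile);
    Result := Lines.Text;
  finally
    Lines.Free;
  end;
end;

begin
  try
    if ParamCount < 2 then
      WriteLn('Usage: PdfToTextDemo <in.pdf> <out.txt>')
    else
      Write(ExtractPdfText(ParamStr(1), ParamStr(2)));
  except
    on E: Exception do
      WriteLn(E.ClassName, ': ', E.Message);
  end;
end.
```

For tabular data such as financial results it may be worth trying pdftotext's `-layout` option, which tries to keep the original physical layout of the page; whether that helps depends on the document.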