How do you extract text from a PDF file in Delphi?

Hi,

I've had a look round the web and drawn a blank. Can anyone help?

TIA

Mark Patterson
0
Mark
11/9/2010 11:45:57 AM
embarcadero.delphi.non-tech

Hi,

Yes, this is quite difficult. Most tools are expensive, I have not found a cheap
way yet, and the free Adobe OCX is very limited.

First of all, you need to understand that a PDF is not a text file. It is a page
description format based on the PostScript imaging model, and its text sits inside
(often compressed) content streams. Some PDFs have no text at all and contain only
pictures; that depends on the scanner or the PDF writer that was used.
You can usually tell by opening the PDF in Adobe Reader and trying to select text:
a text cursor means real text, a rectangle selection means a picture.

Even when you can select text, it may come from OCR (for example a hidden text layer
laid over a scanned page image) rather than from the original document. There are also
PDFs that mix text and pictures.
Finally, there are compressed and uncompressed PDFs (this also depends on the
scanner / PDF writer).

There is a good PDF tool that lets you save the PDF with its streams uncompressed,
so that any text in it becomes readable (if possible, see above).

Download pdftk
http://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/
and read the help file.

If you only want to search in the PDF, you can do it in a quick and dirty way.

Open your PDF in Notepad. Do you see any readable characters?
If not, try the uncompress option of pdftk.
If you still can't read the text, the PDF is probably a picture.
If you can read the text (with extra operator codes in between), you
can try to write your own parser in Delphi to search for it.

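{code}
procedure ScanPdfLines(const FFileName: string);
var
  ReadFile: TextFile;
  TextLine: string;
begin
  AssignFile(ReadFile, FFileName);
  SetLineBreakStyle(ReadFile, tlbsLF);  // LF-only line endings
  Reset(ReadFile);
  try
    while not Eof(ReadFile) do
    begin
      ReadLn(ReadFile, TextLine);
      // Pick the text out of TextLine here with Pos, Copy and Delete.
    end;
  finally
    CloseFile(ReadFile);  // close the file even if parsing raises
  end;
end;
{code}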
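
The "open it in Notepad" check can also be done in code. The following is only a rough sketch, not a parser: it assumes that compressed streams are marked with the standard /FlateDecode filter name, and the helper names LoadFileBytes and UsesFlateCompression are made up for this example.

```pascal
program PdfCompressionCheck;

{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

// Loads the whole file into an AnsiString so that binary bytes are kept as-is.
function LoadFileBytes(const FileName: string): AnsiString;
var
  Stream: TFileStream;
begin
  Stream := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
  try
    SetLength(Result, Stream.Size);
    if Length(Result) > 0 then
      Stream.ReadBuffer(Result[1], Length(Result));
  finally
    Stream.Free;
  end;
end;

// True if the raw PDF data names the Flate (zip-style) stream filter.
// Assumption: this is only a hint that the page text is not directly readable.
function UsesFlateCompression(const PdfData: AnsiString): Boolean;
begin
  Result := Pos(AnsiString('/FlateDecode'), PdfData) > 0;
end;

begin
  if ParamCount <> 1 then
    WriteLn('Usage: PdfCompressionCheck <file.pdf>')
  else if UsesFlateCompression(LoadFileBytes(ParamStr(1))) then
    WriteLn('Compressed streams found: uncompress before searching.')
  else
    WriteLn('No Flate-compressed streams found.');
end.
```

If the check reports compressed streams, searching the raw file for text will not work until the file has been uncompressed.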
0
Robert
11/9/2010 2:11:31 PM
Hi

I use this:

http://www.verypdf.com/
http://www.verypdf.com/pdf2txt/pdf2txt.htm

Thousands of PDFs... without problems...

Nils


> HI,
>
> I've had a look round the web and drew a blank. Can anyone help?
>
> TIA
>
> Mark Patterson
0
Utf
11/9/2010 2:34:18 PM
> 
> I've had a look round the web and drew a blank. Can anyone help?
> 


Try searching on "Delphi ifilter"

David
1
David
11/9/2010 2:57:21 PM
On 09.11.2010 12:45, Mark Patterson wrote:

> I've had a look round the web and drew a blank. Can anyone help?

In addition to the (mostly commercial) libraries for Delphi and 
depending on target system configuration and programming language 
knowledge, another option would be to simply call a non-Delphi application.
There are many high quality open source PDF libraries for the Java 
platform, like JPedal with text extraction functions documented in 
http://www.jpedal.org/support_Extraction.php

Hope this helps
-- 
Michael Justin
habarisoft - Enterprise Messaging Software for Delphi®
http://www.habarisoft.com/
0
Michael
11/9/2010 5:17:31 PM
> Yes this is quite difficult. The most tools are expensive and there is no cheap 
> way I have found yet and also the free OCX of adobe is very limited.
> 
> First of all you need to understand that PDF is a postscript language and
> not a textfile. Some PDF have no text at all but only contain pictures and that depends 
> on the scanner or the PDF writer that is used. 
> Normally you see this by opening the pdf in Adobe reader and see if there is a way to 
> select the text. Is there a textcursor select (text) or rectangle select (picture)

The one I was testing on definitely allows select and copy.
 
> It is possible that you can select a text but the application is doing an OCR on the fly
> to extract the text. Then also you have PDF's that are both, text and pictures mixed.
> At last you have compressed PDF's and non-compressed PDF's. (also depends on the 
> scanner / PDF writer)
> 
> There is a good PDF tool that allows you to save the PDF as
> readable text. (If possible, see above..)
> 
> Download the PDFTK
> http://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/
> and read the helpfile.

I tried that but it didn't seem to do anything.
0
Mark
11/10/2010 4:44:52 AM
> Nils Bödeker wrote:
> Hi
> 
> i use this:
> 
> http://www.verypdf.com/
> http://www.verypdf.com/pdf2txt/pdf2txt.htm
> 
> ttousends of PDFs... with out problems...

Thanks, I tried it and found that the free version only processes part of a given file; you have to pay to lift the limit.
0
Mark
11/10/2010 4:47:17 AM
> David Wilcockson wrote:
> 
> Try searching on "Delphi ifilter"

Thanks, I tried that, especially the code here: http://www.experts-exchange.com/Programming/Languages/Pascal/Delphi/Q_20293579.html

But for the PDF I wanted, it only output what looks like a header. It even finishes with the words
"Regards Carmel Mulhern Company Secretary"
as if this is all that we are supposed to get. It's a company's Financial Results, and I am trying to get the data from it.

Regards
Mark
0
Mark
11/10/2010 4:52:21 AM
Mark Patterson wrote:
> HI,
> 
> I've had a look round the web and drew a blank. Can anyone help?
> 

I've spent a lot of time on this and have found two commercial libraries
that I use.

You might want to take a look at these: QuickPDF, Gnostice PDF Toolkit
0
Thomas
11/10/2010 5:29:26 AM
Hi,

Did you try to uncompress the PDF first?

You can also try to convert the PDF to another (lower) version.
If you open the PDF in a text editor (Notepad), the first
characters in the file are the version number.

1) Use the tool to uncompress (to be sure).
2) Convert to version 1.3.
3) Use the tool to extract the text.

Maybe your PDF is just a picture and the text selection
tool in your PDF viewer is doing OCR after
the selection.

It is difficult to say without an example..
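
To check the version number without opening a text editor, something like the following should do. It is a minimal sketch that assumes the file starts directly with the usual %PDF-x.y header; the function name ReadPdfVersion is made up for this example.

```pascal
program PdfVersionDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

// Returns the version from a '%PDF-x.y' header, or '' if the file
// does not start with that marker.
function ReadPdfVersion(const FileName: string): string;
const
  Marker: AnsiString = '%PDF-';
var
  Stream: TFileStream;
  Header: AnsiString;
  Count: Integer;
begin
  Result := '';
  Stream := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
  try
    SetLength(Header, 8);                        // e.g. '%PDF-1.4'
    Count := Stream.Read(Header[1], Length(Header));
    SetLength(Header, Count);                    // file may be shorter than 8 bytes
  finally
    Stream.Free;
  end;
  if Copy(Header, 1, Length(Marker)) = Marker then
    Result := string(Copy(Header, Length(Marker) + 1, 3));
end;

begin
  if ParamCount = 1 then
    WriteLn('PDF version: ', ReadPdfVersion(ParamStr(1)))
  else
    WriteLn('Usage: PdfVersionDemo <file.pdf>');
end.
```

An empty result means the file does not begin with the marker, which is worth knowing before trying any of the steps above.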
0
Robert
11/10/2010 8:39:16 AM
I am using xpdf (http://www.foolabs.com/xpdf/): convert the PDF to text and then parse it...
0
Utf
11/10/2010 9:08:15 AM
Mark Patterson <> wrote in news:304270@forums.embarcadero.com:

>> David Wilcockson wrote:
>> 
>> Try searching on "Delphi ifilter"
> 
> THanks, I tried that, especially the code here:
> http://www.experts-exchange.com/Programming/Languages/Pascal/Delphi/Q_20293579.html
> 
> But for the pdf I wanted it only output what looks like a header. It
> even finishes with the words "Regards Carmel Mulhern Company
> Secretary" as if this is all that we are supposed to get. It's a
> company's Financial Results, and I am trying to get the data from it. 
> 
> Regards
> Mark

Could it be that the text you want is included as an image?
0
Christopher
11/10/2010 9:24:20 AM
Nils Bödeker wrote:

> i use this:
> 
> http://www.verypdf.com/
> http://www.verypdf.com/pdf2txt/pdf2txt.htm

Developer licence $2000.  OUCH!!!

-- 
Andy Syms
Technosoft Systems Ltd
0
Andy
11/10/2010 10:15:34 AM
> Stanko Milošev wrote:
> I am using http://www.foolabs.com/xpdf/, convert PDF to text and then parse it...

+1 

It's free, you just have to call one exe, and you'll get your text in a file. Very easy to use with Delphi.
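
Calling the exe from Delphi could look roughly like this. It is a sketch only: it assumes the pdftotext.exe that ships with xpdf, the paths are placeholders, and the helper names RunAndWait and PdfToTextCommand are made up for this example.

```pascal
program PdfToTextDemo;

{$APPTYPE CONSOLE}

uses
  Windows, SysUtils;

// Runs a command line, waits for it to finish and returns its exit code.
// Raises an exception if the process cannot be started.
function RunAndWait(const CommandLine: string): DWORD;
var
  StartupInfo: TStartupInfo;
  ProcessInfo: TProcessInformation;
  Cmd: string;
begin
  Cmd := CommandLine;
  UniqueString(Cmd);  // CreateProcess may modify the buffer it is given
  FillChar(StartupInfo, SizeOf(StartupInfo), 0);
  StartupInfo.cb := SizeOf(StartupInfo);
  StartupInfo.dwFlags := STARTF_USESHOWWINDOW;
  StartupInfo.wShowWindow := SW_HIDE;
  if not CreateProcess(nil, PChar(Cmd), nil, nil, False, 0, nil, nil,
    StartupInfo, ProcessInfo) then
    RaiseLastOSError;
  try
    WaitForSingleObject(ProcessInfo.hProcess, INFINITE);
    GetExitCodeProcess(ProcessInfo.hProcess, Result);
  finally
    CloseHandle(ProcessInfo.hThread);
    CloseHandle(ProcessInfo.hProcess);
  end;
end;

// Builds the command line: pdftotext [options] <PDF-file> <text-file>.
// -layout asks pdftotext to keep the physical page layout.
function PdfToTextCommand(const ExePath, PdfFile, TxtFile: string): string;
begin
  Result := Format('"%s" -layout "%s" "%s"', [ExePath, PdfFile, TxtFile]);
end;

begin
  // The paths below are placeholders for illustration only.
  if RunAndWait(PdfToTextCommand('C:\xpdf\pdftotext.exe',
    'C:\docs\results.pdf', 'C:\docs\results.txt')) = 0 then
    WriteLn('Text written to results.txt')
  else
    WriteLn('pdftotext reported an error');
end.
```

After that the text file can be loaded with TStringList.LoadFromFile and parsed as usual. A PDF that holds only scanned images will still give an empty or near-empty text file.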
1
Arnaud
11/10/2010 4:19:12 PM