PDF, the format. By Adobe (reference required)
A slightly Fedora-centric (my current distro) description of Python and other PDF parsing libraries:
library | where | example | capabilities | availability | platforms
poppler | yum install pypoppler | | | |
pdfresurrect | yum install | | | |
pdftools | | | | |
pypoppler is a Launchpad-hosted project and is missing many features compared to the C poppler library.
poppler itself is pretty good, but has some problems:
- text selection is broken, especially for Hebrew text.
- pdftotext (the command-line tool that ships with poppler), however, does much better at text extraction.
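Since pdftotext does better than the bindings, one option is to call it from Python as a subprocess. The following is only a minimal sketch: the helper names `build_pdftotext_cmd` and `pdf_to_text` are made up for illustration, and it assumes the `pdftotext` binary is on the PATH (on Fedora it should come from the poppler-utils package). The flags used (`-layout`, `-enc`, `-f`, `-l`, and `-` for stdout) are standard pdftotext options.

```python
import subprocess


def build_pdftotext_cmd(pdf_path, first_page=None, last_page=None, layout=True):
    """Build the pdftotext argument list (hypothetical helper, not part of poppler)."""
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if layout:
        # -layout asks pdftotext to keep the physical layout of the page.
        cmd.append("-layout")
    if first_page is not None:
        cmd += ["-f", str(first_page)]
    if last_page is not None:
        cmd += ["-l", str(last_page)]
    # A trailing "-" makes pdftotext write the text to stdout instead of a file.
    cmd += [pdf_path, "-"]
    return cmd


def pdf_to_text(pdf_path, **options):
    """Run pdftotext and return the extracted text as a str.

    Raises subprocess.CalledProcessError if pdftotext exits with an error,
    and FileNotFoundError if the binary is not installed.
    """
    result = subprocess.run(
        build_pdftotext_cmd(pdf_path, **options),
        check=True,
        stdout=subprocess.PIPE,
    )
    return result.stdout.decode("utf-8")
```

Usage would look like `text = pdf_to_text("document.pdf", first_page=1, last_page=2)`, where `document.pdf` is a placeholder file name. Whether the Hebrew output comes out in the right reading order still has to be checked per document.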
Concrete Usage Examples
TODO
Extracting Data
Libraries
http://www.757labs.com/projects/pdfresurrect/ - very basic (version 0.0.4); meant to track changes in PDF documents (the revision history embedded in the PDF).
Python
http://www.boddie.org.uk/david/Projects/Python/pdftools/ - can get the number of pages and each individual page, but has no layout engine. Individual words are easy to get; the text as the user sees it is not, at least not without further manipulation (regular expressions and worse).
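To give an idea of what "regular expressions and worse" means in practice, here is a rough sketch that pulls text fragments out of an already decoded page content stream. It does not use the pdftools API at all. It relies only on the PDF text-showing operators `Tj` and `TJ`, and both the function names and the sample stream are invented for illustration. Hex strings, nested parentheses, font encodings and positioning are all ignored, which is exactly why the result is a list of fragments in stream order rather than the text as the user sees it.

```python
import re

# A PDF literal string "( ... )" with backslash escapes such as \( and \).
# Nested unescaped parentheses and hex strings <...> are not handled.
LITERAL = re.compile(r"\((?:\\.|[^\\()])*\)")

# A text-showing operation: either "(string) Tj" or "[ ... ] TJ".
SHOW_TEXT = re.compile(
    r"(\((?:\\.|[^\\()])*\))\s*Tj"      # group 1: single string shown with Tj
    r"|\[((?:\\.|[^\]\\])*)\]\s*TJ",    # group 2: array body shown with TJ
    re.S,
)


def unescape(literal):
    """Strip the outer parentheses and undo the \\(, \\) and \\\\ escapes only."""
    return re.sub(r"\\([()\\])", r"\1", literal[1:-1])


def text_fragments(stream):
    """Return the shown strings in stream order (not in visual order)."""
    fragments = []
    for match in SHOW_TEXT.finditer(stream):
        if match.group(1) is not None:
            fragments.append(unescape(match.group(1)))
        else:
            # TJ arrays mix strings with kerning numbers; keep only the strings.
            parts = LITERAL.findall(match.group(2))
            fragments.append("".join(unescape(p) for p in parts))
    return fragments


# Made-up content stream for illustration only.
SAMPLE = r"""
BT
/F1 12 Tf
72 720 Td
(Hello) Tj
[(wor) -20 (ld)] TJ
(a \(b\)) Tj
ET
"""

print(text_fragments(SAMPLE))  # ['Hello', 'world', 'a (b)']
```

Turning such fragments into lines and paragraphs needs the positioning operators as well, which is the job a layout engine would normally do.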