[xml/sgml-pkgs] Bug#692741: Better support for pdftohtml output (specific profile?)

Mathieu Malaterre malat at debian.org
Thu Nov 8 12:33:45 UTC 2012


Package: herold
Version: 6.0.2-1
Severity: normal

It would be really nice if there was a profile for pdftohtml output. Currently pdftohtml generates something like:

<b>Scope</b><br>
TIFF describes image data that typically comes from scanners, frame grabbers,<br>and paint- and photo-retouching programs.<br>
TIFF is not a printer language or page description language. The purpose of TIFF<br>is to describe and store raster image data.<br>
A primary goal of TIFF is to provide a rich environment within which applica-<br>tions can exchange image data. This richness is required to take advantage of the<br>varying capabilities of scanners and other imaging devices.<br>
Though TIFF is a rich format, it can easily be used for simple scanners and appli-<br>cations as well because the number of required fields is small.<br>
TIFF will be enhanced on a continuing basis as new imaging needs arise. A high<br>priority has been given to structuring TIFF so that future enhancements can be<br>added without causing unnecessary hardship to developers.<br>

which gets converted into (no profile):

  <para><emphasis remap="b:86:2" role="bold">Scope</emphasis></para>
  <para> TIFF describes image data that typically comes from scanners, frame grabbers,</para>
  <para>and paint- and photo-retouching programs.</para>
  <para> TIFF is not a printer language or page description language. The purpose of TIFF</para>
  <para>is to describe and store raster image data.</para>
  <para> A primary goal of TIFF is to provide a rich environment within which applica-</para>
  <para>tions can exchange image data. This richness is required to take advantage of the</para>
  <para>varying capabilities of scanners and other imaging devices.</para>
  <para> Though TIFF is a rich format, it can easily be used for simple scanners and appli-</para>
  <para>cations as well because the number of required fields is small.</para>
  <para> TIFF will be enhanced on a continuing basis as new imaging needs arise. A high</para>
  <para>priority has been given to structuring TIFF so that future enhancements can be</para>
  <para>added without causing unnecessary hardship to developers.</para>

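This makes the result difficult to use in DocBook: every line break becomes its own <para>, so a single paragraph is split across several elements and hyphenated words are cut in two.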
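
To illustrate the kind of pre-processing such a profile could do, here is a rough Python sketch (not herold code; the function name merge_br_lines is made up for this example). It assumes, based only on the sample above, that pdftohtml writes one paragraph per physical line, that a <br> inside the line is a soft wrap, and that a hyphen directly before a <br> is a word split across lines. The last assumption will wrongly join genuinely hyphenated words that happen to fall at a line end.

```python
import re

BR = r"<br\s*/?>"  # matches <br>, <BR>, <br/> and <br />


def merge_br_lines(html_fragment):
    """Return one string per paragraph from pdftohtml-style body text.

    Assumption (from the sample in this report): each physical line is
    one paragraph and every <br> inside it is only a soft line wrap.
    Inline markup such as <b> is left untouched.
    """
    paragraphs = []
    for line in html_fragment.splitlines():
        line = line.strip()
        if not line:
            continue
        # Drop the <br> that terminates the paragraph line.
        line = re.sub(r"(?:\s*" + BR + r"\s*)+$", "", line, flags=re.IGNORECASE)
        # Rejoin words hyphenated across a soft wrap: "applica-<br>tions".
        line = re.sub(r"(?<=\w)-" + BR + r"(?=\w)", "", line, flags=re.IGNORECASE)
        # Any remaining <br> is a soft wrap: replace it with one space.
        line = re.sub(r"\s*" + BR + r"\s*", " ", line, flags=re.IGNORECASE)
        paragraphs.append(line)
    return paragraphs
```

With the sample above this yields six strings (the bold heading plus five paragraphs) instead of thirteen <para> elements, and "applica-<br>tions" comes out as "applications".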

Also, pdftohtml extracts the PDF document metadata and places it into HTML TITLE and META elements, e.g.:

<HEAD>
<TITLE>TIFF6.final.9509</TITLE>
<META http-equiv="Content-Type" content="text/html; charset=UTF-8">
<META name="generator" content="pdftohtml 0.36">
<META name="author" content="Adobe Systems Inc.">
<META name="keywords" content="TIFF,,.TIF,,TIF">
<META name="date" content="1995-09-14T14:32:50+00:00">
<META name="subject" content="TIFF 6.0">
</HEAD>

It would be really nice to have these values carried over into the DocBook <info> element!
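
As a rough idea of the mapping I have in mind, here is a Python sketch using only the standard library (again not herold code; HeadMetaParser and head_to_info are names invented for this example). The choice of target elements is my own guess at a reasonable mapping: TITLE to title, author to author/personname, keywords to keywordset/keyword, date to date, and subject to subjectset/subject/subjectterm. A META field cannot tell a person from an organisation, so orgname may be the better fit for a value like "Adobe Systems Inc.".

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class HeadMetaParser(HTMLParser):
    """Collect the TITLE text and the name/content pairs of META tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names for us.
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name"):
            self.meta[attrs["name"].lower()] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def head_to_info(html_head):
    """Build a DocBook-style <info> element (assumed mapping, see above)."""
    parser = HeadMetaParser()
    parser.feed(html_head)
    parser.close()

    info = ET.Element("info")
    if parser.title.strip():
        ET.SubElement(info, "title").text = parser.title.strip()
    if parser.meta.get("author"):
        author = ET.SubElement(info, "author")
        ET.SubElement(author, "personname").text = parser.meta["author"]
    if parser.meta.get("date"):
        ET.SubElement(info, "date").text = parser.meta["date"]
    # "TIFF,,.TIF,,TIF" contains empty entries; skip them.
    keywords = [k.strip() for k in parser.meta.get("keywords", "").split(",")]
    keywords = [k for k in keywords if k]
    if keywords:
        keywordset = ET.SubElement(info, "keywordset")
        for keyword in keywords:
            ET.SubElement(keywordset, "keyword").text = keyword
    if parser.meta.get("subject"):
        subjectset = ET.SubElement(info, "subjectset")
        subject = ET.SubElement(subjectset, "subject")
        ET.SubElement(subject, "subjectterm").text = parser.meta["subject"]
    return info
```

Calling ET.tostring(head_to_info(head), encoding="unicode") on the HEAD block above gives an <info> fragment with the title, author, date, three keywords and the subject. The generator and Content-Type entries are ignored on purpose.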

Thanks

-- System Information:
Debian Release: 6.0.6
  APT prefers stable-updates
  APT policy: (500, 'stable-updates'), (500, 'stable'), (200, 'testing'), (100, 'unstable')
Architecture: amd64 (x86_64)

Kernel: Linux 3.2.0-0.bpo.3-amd64 (SMP w/8 CPU cores)
Locale: LANG=en_US.utf8, LC_CTYPE=en_US.utf8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash

Versions of packages herold depends on:
ii  antlr3                3.2-5              language tool for constructing rec
ii  libcommons-codec-java 1.4-2              encoder and decoders such as Base6
ii  libcommons-jxpath-jav 1.3-3              manipulate javabean using XPath sy
ii  libcommons-logging-ja 1.1.1-8            commmon wrapper interface for seve
ii  libxml-commons-resolv 1.2-7~bpo60+1      XML entity and URI resolver librar
ii  libxmlgraphics-common 1.4.dfsg-4~bpo60+1 reusable components used by Batik 

herold recommends no packages.

herold suggests no packages.

-- debconf-show failed


