April 11, 2002

Wanted: A foolproof route from word processor to HTML

As I admitted at the end of my entity support test article, knowing which special characters display correctly is a hollow victory.

Sure, now we can easily use the correct typographical marks when working in our favourite HTML editor, but only a small fraction of text on the web is generated like that. The great majority originates in word processors wielded by journalists or editors, and is transformed into HTML either by the word processor itself, or via a CMS, or by an HTML “programmer”.

Even if the writer was versed in the correct use of special characters, the deck is stacked against the typography surviving the transfer to HTML. The most popular word processor in the world generates HTML so bloated that it’s practically worthless as a route to web publishing, and doesn’t convert special characters to the correct numeric entities. Copying and pasting formatted text from Word into a WYSIWYG web editor like Dreamweaver doesn’t convert characters correctly either. And few HTML coders have the time or the knowledge to convert these characters by hand.

What we need

Now, I’m lazy, and I’m writing this in the hope that someone can proffer a (preferably automated) system that ensures the preservation of special characters from the word processing environment to HTML.

This could be

  • a means of converting Word or RTF documents to clean HTML with correct numeric entities;
  • software or a macro that automatically corrects Word’s HTML;
  • an alternative word processor to Word that offers a smoother route to HTML, and can realistically be provided to one’s journalists and clients, for example a simple RTF editor;
  • an editorial workflow that describes who takes responsibility, at what stages and by what means, for the preservation of typographical quality;
  • word processor templates and/or editorial guidelines that enforce typographical standards and ease conversion to HTML.

For bonus points

  • correctly converting numbered or bulleted lists to HTML lists;
  • correctly converting Headings to HTML <h1> to <h6>, paragraphs to <p>; optionally removing empty paragraphs;
  • optionally converting Bold and Italic to <strong> and <em>, or stripping out character styles altogether;
  • correctly converting simple tables.

Note that I am not looking for a converter that attempts to mimic in HTML the look of the Word document. The HTML will be styled by the site’s style sheet. I’m just seeking to preserve special characters and hopefully also document structure. I’ll happily settle for plain text with just the special characters correctly converted.
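
To make the minimum requirement concrete, here is a rough Python sketch of the "plain text with just the special characters correctly converted" fallback. The function names are made up for illustration, and it assumes the text has already been extracted from the word processor as Unicode; it does not read Word or RTF files itself.

```python
import html


def to_numeric_entities(text: str) -> str:
    """Escape &, < and >, then replace every non-ASCII character
    with a decimal numeric character reference such as &#8220;."""
    escaped = html.escape(text, quote=False)
    # The xmlcharrefreplace error handler emits &#NNNN; for each
    # character that cannot be encoded as ASCII.
    return escaped.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


def paragraphs_to_html(plain_text: str) -> str:
    """Wrap blank-line-separated blocks in <p>, dropping empty paragraphs."""
    blocks = [block.strip() for block in plain_text.split("\n\n")]
    return "\n".join(
        f"<p>{to_numeric_entities(block)}</p>" for block in blocks if block
    )
```

This covers only characters and paragraphs; lists, headings and tables would still need a real converter.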

What I know so far

As a modest recompense, I can offer what I’ve learned so far:

  • HTMLTidy can be customised to automatically convert lots of HTML in many, many ways. I admit I haven’t attempted it yet. Could this be trained to clean Word HTML down to the structural bones and convert special characters to the correct numeric entities? Can anyone share this knowledge?
  • Dreamweaver and Textism both ruthlessly clean up Word HTML, but neither converts special characters correctly, and both fail on a great deal of Word formatting, for example lists.
  • “The Office HTML Filter is a tool you can use to remove Office-specific markup tags embedded in Office 2000 documents saved as HTML.” This adds the Export To: Compact HTML option to Word’s File menu.
  • wvWare is a set of *nix libraries that allow access to Microsoft Word files. The wvHtml utility converts Word documents into HTML 4.0.
  • NuxDocument, an extension of Eric Barroca's MSWordDocument, is a Zope product that represents generic documents by using plugins to convert native productivity suite formats to HTML. Plugins include MSOffice, OpenOffice.org, RTF and PDF.
  • David McRitchie, a Microsoft MVP, has a very good MS-HTML resource, with the emphasis on converting Excel spreadsheets to HTML. He mentions on this page, incidentally, that he’s not impressed by the Office HTML Filter mentioned above. Other Microsoft MVPs or forums are good sources of information.
  • Michael Mell is writing RTF2HTML, an extendable Python tool for converting RTF to HTML. He is eager to hear suggestions for further improvements.
  • It is possible to build a good RTF editor into the browser, although this is probably not a solution for migrating content from word processor to HTML.
  • I’ve heard many people sing the praises of the freeware HTML editor HTML-Kit. It’s very customisable, has tons of plug-ins, and comes with lots of built-in HTML Tidy functionality and presets. Maybe it does a better job of converting Word HTML?
  • A discussion on this issue, with many recommendations and links.
Posted by francois at April 11, 2002 11:51 PM

Comments

Let's not forget that other word processors, such as WordPerfect and OpenOffice, have markup-based document formats. WordPerfect has integrated with FrameMaker SGML for years, AFAIR, and OpenOffice docs are proper parseable XML.

Posted by: Jean Jordaan on April 12, 2002 10:36 AM

My own search most recently led me back to Terry Morse's Myrmidon, a Mac print driver that converts PostScript to HTML. Word to HTML is passable, and you can configure it to strip a lot of the usual rubbish that most similar programs leave in there (font styling, spacing, etc.), leaving fairly clean HTML. It's by no means perfect, or truly web ready, but a damn good start. Myrmidon is now free; the new goClick I've not yet tried.

Posted by: Sam-I-Am on April 16, 2002 05:03 AM

I'd like to know how to get valid, clean XML and XHTML from a word processor. Anyone know of a way? I know there are some commercial plug-ins that cost an arm and a leg, but I can't go that route.

Posted by: michael on April 18, 2002 09:15 PM

I have *an* answer -- forget about HTML and 'word' processors. We need WYSIWYG XML editors (like SoftQuad's XMetaL, but easier to use) that allow us to set the rules for what is allowed in a document (DTD or schema) and then give the writer an easy-to-use, word-processor-like interface to write in and apply these styles.

The beautiful thing about this is that if a user tries to put in something that you consider to be invalid, the processor will block them from doing it.

The other features you want, like versioning, comments, editorial controls, also can be built and embedded within the XML document.

So, once the document is written, a simple XSLT or translation is used to output the XML as XHTML.

What we need is to put pressure on companies like Microsoft and Adobe to start giving us structured authoring tools that allow for more control over our writers' interfaces.

Grip

Posted by: Grip on April 18, 2002 10:02 PM

Who'd a thunk it -- the biggest collection of HTML (to/from) converters... on the W3C site. Most of the information seems older than 3 years, though... Probably 404 city.

Posted by: francois on April 23, 2002 11:26 PM

What I have been doing with e-books that were posted to Usenet as RTF and DOC files is to open them with MS Works 2000 and convert them to HTML.
Works 2000 seems to produce a lot cleaner HTML code than the trash that MS Word puts out.

And it does an OK job of keeping the original formatting (curly quotes and other characters are converted to HTML entities) without bloating the code with thousands of useless tags and doubling the size of the file.

To get them to valid XHTML
I do a quick search and replace:
- replace br with br /
- throw in a few divs: div.book, div.toc, div.chapter, etc.
- replace 'most' of the p style= tags with a plain p tag
- and move the p styles to a stylesheet
- run it through Tidy, and that's about it.
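
The first and third replacements above could be scripted along these lines. This is only a sketch: the function name is invented, and it simply drops every inline style on a p tag rather than moving 'most' of them into a stylesheet, so treat it as a starting point.

```python
import re


def xhtml_fixups(html_text: str) -> str:
    """Apply two of the manual search-and-replace steps to an HTML string."""
    # <br>, <BR> or <br/> all become the XHTML form <br />.
    out = re.sub(r"<br\s*/?>", "<br />", html_text, flags=re.IGNORECASE)
    # <p style="..."> becomes a bare <p>; the style value is discarded here,
    # so copy it into a stylesheet first if you still need it.
    out = re.sub(
        r"""<p\s+style\s*=\s*("[^"]*"|'[^']*')\s*>""",
        "<p>",
        out,
        flags=re.IGNORECASE,
    )
    return out
```

A p tag that carries other attributes besides style is left untouched by this pattern, which is usually what you want for a first pass.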

If I want XML I do the above process,
and then replace the P tags with Para tags...
and use my 'standard xml-ebook.dtd',
( which is similar to the HWG Gutenberg DTDs )
one of my 'xml-ebook.css' files, then
validate it, fix the errors, re-validate it, etc....
A large book takes about an hour to get done.

This way is a lot easier.....
SoftSnow's BookProofer, HTMLBookFixer,
RtfConverter and TocGen have proved to be
pretty useful too. (freeware)

The HTMLBookFixer has a mini version of Perl built into it, and will convert plain quotes to curly quotes, amongst many other things. Then it runs Tidy for you before it's done.
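
For anyone curious what a plain-to-curly conversion involves, here is a naive Python sketch. It is not HTMLBookFixer's actual Perl code, which is not shown here; the function name and the rule (a quote opens if it starts the text or follows whitespace or an opening bracket, otherwise it closes) are assumptions for illustration.

```python
import re


def make_curly_quotes(text: str) -> str:
    """Replace straight quotes in plain text with curly ones (naive rule)."""
    # A double quote at the start, or after whitespace or an opening
    # bracket, is treated as an opening quote.
    text = re.sub(r'(^|[\s(\[{])"', "\\1\u201c", text)
    # Every remaining double quote is treated as a closing quote.
    text = text.replace('"', "\u201d")
    # Same rule for single quotes; apostrophes inside words fall
    # through to the closing form, which is the correct glyph.
    text = re.sub(r"(^|[\s(\[{])'", "\\1\u2018", text)
    text = text.replace("'", "\u2019")
    return text
```

Run it on plain text only: applied to raw HTML it would also rewrite the quotes around attribute values. It also gets leading apostrophes ('tis) and directly nested quotes wrong, which is presumably why the real tool ships separate scripts for fixing common quote errors.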

If you look at the file list, I think you can see what it can do:
Perl.exe , Perl56.dll , clean_finereader_html.pl , clean_finereader_html_auto.pl ,
clean_word_html.pl , fix_common_double_quote_errors.pl , fix_common_errors.pl ,
fix_common_single_quote_errors.pl , fix_no_open_paragraphs.pl , initial_html_clean.pl ,
join_paragraphs.pl , make_curly_quotes.pl , strip_gutenberg.pl , strip_text.pl ,
tidy_italic_tags.pl , tidy_paragraphs.pl

All of that for FREE and only
approx. 1MB with Tidy added

Sometimes I used 'Irun111' from www.pilotltd.com (also free)
to convert RTF into XML or HTML+CSS.
It works OK.

The resulting XML code still needs a little work sometimes, and I have to create a style sheet to display it. But it works OK.

from Irun111 help...
This program converts RTF documents to XML or HTML..

Since RTF and markup languages are not 100% compatible, some features of the RTF specification are excluded from IRun. Some features, though,
can be added in later versions.
Here is the list of features that are NOT included in IRun:
Embedded fonts, font extension files - Style shortcut keys - Positioned objects and frames - List override table - Compatibility options - User defined document properties - Associated character properties - Bookmarks - Indexed entries - Drawing objects (Shapes) - Table of contents entries - Embedded tables - Form fields - Revision marks - Footnotes - Endnotes

but for converting eBooks in RTF to HTML+CSS it works OK

I hope that helps....
Knot Valid

Posted by: Knot Valid on April 10, 2003 11:00 PM

I have been looking into this the past couple of days. I'm guessing some of these difficulties go away when MSFT adopts XML as its document format, but for now I have found these tools.

tidy (referenced above) does a great job at sanitizing Word HTML, even making it XHTML-compliant. I recommend it.

Here's my .tidyrc file to show you the options I've been using. Caveat: the "clean" option will rewrite the old school HTML as CSS and insert it into the document. And "write-back" will overwrite the old file so be warned: there's no rollback or undo.


clean: yes
uppercase-tags: no
uppercase-attributes: no
indent: auto
indent-spaces: 3
write-back: yes
alt-text: "navigation image"
wrap: 0
output-xhtml: yes
doctype: auto
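
Since write-back leaves no undo, one cautious way to run these options is to copy the file first. The Python sketch below assumes the tidy executable is on your PATH and that the options above are saved in a config file; the helper names and the .bak suffix are made up for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def backup(html_path: Path) -> Path:
    """Copy the file to <name>.bak so a write-back can be undone by hand."""
    bak = html_path.with_name(html_path.name + ".bak")
    shutil.copy2(html_path, bak)
    return bak


def tidy_command(config: Path, html_path: Path) -> list[str]:
    """Build the command line; -config points tidy at the options file."""
    return ["tidy", "-config", str(config), str(html_path)]


def tidy_in_place(config: Path, html_path: Path) -> int:
    """Back up the file, run tidy on it, and return tidy's exit status."""
    backup(html_path)
    # tidy exits with 0 when clean, 1 on warnings and 2 on errors.
    return subprocess.run(tidy_command(config, html_path)).returncode
```

If the result is no good, copying the .bak file back over the original restores it.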


WH2FO can take Word's over-the-top HTML (I found more than 700 lines of CSS in a file I converted) and create an XML and XSL file from it. So if you want to keep the content separate from style for later re-use, you can do that. That's where I think I'm headed, which means writing XSL files to fit the presentation options that come up.

There are quite a few tools for re-assembling HTML from the .xml and .xsl files (xsltproc, saxon, xalan), but I can't speak to how well they work without a style sheet that suits my needs.

Posted by: paul beard on July 18, 2003 07:23 PM

Microsoft has its own tool for cleaning Office garbage out of Office files:
Office 2000 HTML Filter 2.0

I mention it even though it was released in 2000.

Posted by: TK on May 19, 2004 11:02 PM

Yeah, tidy does what you want. I haven't thrown a ton of weird stuff at it, but I've seen it convert MSHTML to valid XHTML for a number of documents.

A number of free/open-source products use libtidy to perform markup transformations within the application.

Posted by: Chris Snyder on May 20, 2004 04:01 AM
