Losing styles converting HTML to Word/PDF


JonathanS1
September 8, 2006, 12:04:59
Hi, can you help with a styles issue when converting HTML to Word or PDF?

I have parts of a document stored as HTML fragments in SQL Server and am constructing a publication from these fragments. For example, this is some of the text stored in the database:

<span class="standard">CLINICAL MANAGEMENT OF THE TRANSFUSION EPISODE </span>

and

<SPAN class="criterion">A consent policy is in place, which includes an explanation of relevant legislation and best practice guidance. </SPAN>

The code to generate the final publication extracts what it needs from the database and then does the following:

ss.CssSaveMode = CssSaveMode.None
tx.Save(text, StringStreamType.HTMLFormat, ss)
text = http.HtmlDecode(text)

where tx is TXTextControl.ServerTextControl.

Then, I open the CSS file containing the class definitions and insert the contents into the variable 'text' as follows so that ALL class definitions are included within the HTML produced:

text = text.Insert(text.IndexOf("<body"), "<style type='text/css' >" & css & "</style>")

If I place a breakpoint in the code and copy the HTML produced into, for example, Dreamweaver, the document created displays perfectly using the classes inserted from the CSS file, so I know the HTML created is correct.

Now, I need to convert this output to Word or PDF (depends on what the user requested), so the code does this:

tx.ResetContents()
tx.Load(text, StringStreamType.HTMLFormat)

and then converts this HTML to, in this case, PDF:

tx.Save(pdf, BinaryStreamType.AdobePDF)

But ALL of the HTML style formatting is lost. How can I convert HTML to Word/PDF AND retain all the formatting from the HTML?

Regards,
Jonathan

Gunnar Giffey
September 11, 2006, 18:50:36
Hello Jonathan,

What does the HTML look like if you save it and load it into TX Words?
Perhaps you could send me the HTML, or attach it here, as a ZIP archive?

JonathanS1
September 19, 2006, 16:01:30
Hi Gunnar, sorry for the delay in replying - I have been on holiday.

Anyway, I tried loading the HTML created into TX Words and all the formatting is lost. However, when the same HTML is loaded into Dreamweaver, the page is correctly formatted.

I have attached the HTML file as you requested.

I hope we can get this sorted as this is the chosen method for producing many of our new publications and a great deal of work has already gone into the application.

Regards,
Jonathan

JonathanS1
September 19, 2006, 16:34:46
Gunnar, from doing some more testing, it would seem that the TX TextControl HTML filter is very fussy.

I moved the <style>...</style> tag into the <head></head> tag and removed the style type="text/css" attribute and the output formatted correctly in Word and Acrobat.

HOWEVER, a number of style attributes seem to be ignored. For example, "margin-top:10pt; margin-right:5pt; line-height: 14pt;" are totally ignored in the output. Why is this?

Where is there a list of style attributes which are ignored by the TX TextControl HTML filter?

If the HTML filter simply ignores many formatting options, this seems to me to be a major issue for TX TextControl.

Regards,
Jonathan

Gunnar Giffey
September 19, 2006, 18:18:11
Jonathan,

Thanks for the HTML sample file.
If you like, I can send you a list of supported style attributes. Please send an email to gunnar@textcontrol.com so that I can reply with the list.

Gunnar Giffey
September 25, 2006, 15:14:44
JonathanS1 wrote: "I moved the <style>...</style> tag into the <head></head> tag and removed the style type="text/css" attribute and the output formatted correctly in Word and Acrobat."

After checking with our filter department, I have news:
The style definition must be in the head tag, otherwise the styles will not be used correctly.
See http://www.w3.org/TR/1998/REC-html40-19980424/present/styles.html#h-14.2.3 for reference:
"The STYLE element allows authors to put style sheet rules in the head of the document. HTML permits any number of STYLE elements in the HEAD section of a document."
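
For readers who land on this thread later, the fix amounts to a small change in where the CSS is spliced into the HTML string before it is loaded back into the control. The following is only a rough, language-neutral sketch of that string handling, written in Python rather than the VB.NET used above. The function name embed_css_in_head is hypothetical, and it makes no TX Text Control calls. It places the style block inside the head element and omits the type attribute, as JonathanS1 reported doing. Only the head placement was confirmed as required in this thread.

```python
import re


def embed_css_in_head(html: str, css: str) -> str:
    """Return html with the CSS wrapped in a <style> block inside <head>.

    Hypothetical helper for illustration only. The type attribute is left
    off because the thread reports that removing it helped; that part was
    not confirmed as a requirement.
    """
    style_block = "<style>" + css + "</style>"

    # Preferred case: a head element exists, so insert just before </head>.
    head_close = re.search(r"</head\s*>", html, re.IGNORECASE)
    if head_close:
        i = head_close.start()
        return html[:i] + style_block + html[i:]

    # Fallback: no head element, so create one immediately before <body>.
    body_open = re.search(r"<body\b", html, re.IGNORECASE)
    if body_open:
        i = body_open.start()
        return html[:i] + "<head>" + style_block + "</head>" + html[i:]

    raise ValueError("document has neither </head> nor <body>")


if __name__ == "__main__":
    page = ('<html><head><title>Report</title></head>'
            '<body><span class="standard">CLINICAL MANAGEMENT</span></body></html>')
    print(embed_css_in_head(page, ".standard { font-weight: bold; }"))
```

The equivalent change in the original code would be to search for the closing head tag instead of "<body" when calling text.Insert.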