  1. #1
    Join Date
    Mar 2004
    Location
    New Jersey
    Posts
    793

    PHP Script to parse a .doc, .rtf or .mht file needed

    I have a client who insists on using Word to create their newsletters for their website. They send it to me and I convert it to HTML for them so we can send it out in a multipart MIME email. But I'm too busy lately to do these "conversions". I want the client to be able to take their doc file, or save it as one of the above mentioned file types, and have the server do the work.

    Does anyone know of a RELIABLE open source script to convert any of these file types to HTML so that it can easily be mailed from a PHP newsletter script I wrote for them? I've found a couple on Google, but none has worked 100%; most were written to take RTF text from a VB rich text box and turn it into HTML.

    TIA to all those who reply!

  2. #2
    Join Date
    Jul 2003
    Location
    Kuwait
    Posts
    5,099
    You have a few options here, depending on which version of Word your client is using.

    If you are lucky, they are using Word 2003. In that case, ask them to simply save the file as XML, which you can then parse easily in PHP. This reference should help you decipher the XML that Word 2003 generates.
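    To give a rough idea of what parsing that XML involves, here is a minimal sketch (in Python rather than PHP, purely for illustration). It assumes the Word 2003 layout of `w:p` paragraphs containing `w:r` runs with `w:t` text nodes under the `http://schemas.microsoft.com/office/word/2003/wordml` namespace. The function name `wordml_to_html` and the sample document are made up for this example, and all formatting (bold, tables, images) is ignored.

```python
import html
import xml.etree.ElementTree as ET

# Namespace used by Word 2003 XML documents (WordprocessingML).
NS = {"w": "http://schemas.microsoft.com/office/word/2003/wordml"}


def wordml_to_html(xml_text):
    """Return one <p> per non-empty Word paragraph, dropping all formatting."""
    root = ET.fromstring(xml_text)
    parts = []
    for para in root.iterfind(".//w:body//w:p", NS):
        # A paragraph's text is spread over one or more w:t nodes.
        text = "".join(t.text or "" for t in para.iterfind(".//w:t", NS))
        if text.strip():
            parts.append("<p>" + html.escape(text) + "</p>")
    return "\n".join(parts)


# Hypothetical, heavily trimmed stand-in for a real "Save as XML" file.
SAMPLE = """<?xml version="1.0"?>
<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
  <w:body>
    <w:p><w:r><w:t>March newsletter</w:t></w:r></w:p>
    <w:p><w:r><w:t>Sales &amp; events </w:t></w:r><w:r><w:t>this month</w:t></w:r></w:p>
  </w:body>
</w:wordDocument>"""

if __name__ == "__main__":
    print(wordml_to_html(SAMPLE))
```

    A real newsletter would also need the run properties (bold, italics, links) mapped to HTML tags, which is where the schema reference comes in.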

    There is always the option of having them save it directly as HTML, and you can then just parse out the bits that you need. Depending on the version of Word they are using, this could be an issue, since Office is notorious for generating very verbose HTML.
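    As a rough sketch of "parsing out the bits you need", the snippet below strips some of the clutter Word typically adds: conditional comments, Office-namespaced tags such as `<o:p>`, `Mso...` class names and inline `style` attributes. It is Python for illustration only, `strip_word_markup` is a made-up name, and regular expressions are a blunt tool here. A proper HTML parser or a tool like HTML Tidy would be more robust on real files.

```python
import re


def strip_word_markup(html_text):
    """Remove common Word-specific clutter from 'Save as HTML' output."""
    # Conditional comments such as <!--[if gte mso 9]> ... <![endif]-->
    cleaned = re.sub(r"<!--\[if.*?<!\[endif\]-->", "", html_text,
                     flags=re.S | re.I)
    # Office-namespaced tags such as <o:p> and </o:p> (inner text is kept)
    cleaned = re.sub(r"</?[a-z]+:[^>]*>", "", cleaned, flags=re.I)
    # class=MsoNormal style attributes, quoted or unquoted
    cleaned = re.sub(r"""\s+class=(?:"Mso[^"]*"|'Mso[^']*'|Mso[^\s>]*)""",
                     "", cleaned, flags=re.I)
    # Inline style attributes, single- or double-quoted
    cleaned = re.sub(r"""\s+style=(?:"[^"]*"|'[^']*')""", "", cleaned,
                     flags=re.I)
    return cleaned


if __name__ == "__main__":
    # Hypothetical fragment in the style Word tends to produce.
    sample = ("<!--[if gte mso 9]><xml>junk</xml><![endif]-->"
              "<p class=MsoNormal style='margin:0in'>Hello<o:p></o:p></p>")
    print(strip_word_markup(sample))
```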

    If your webserver is running Linux, you can try your hand at wv (wordview) to convert the files.

    Hope this helps.
    In order to understand recursion, one must first understand recursion.

  3. #3
    Join Date
    Nov 2001
    Location
    Vancouver
    Posts
    2,416
    In addition, there are numerous .DOC and .RTF conversion utilities out there:
    catdoc
    antiword (I used this in my mutt mailcap)
    rtf2html
    word2x
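
    For the text-extracting tools in that list, the server-side glue can be quite small. The sketch below (Python, for illustration only) assumes the converter is installed, takes the document path as its last argument and prints plain text to standard output, which is how antiword and catdoc behave by default as far as I know. The function name `doc_to_html` and its `converter` parameter are made up for this example.

```python
import html
import subprocess


def doc_to_html(path, converter=("antiword",)):
    """Run a text-extracting converter on `path` and wrap its output in <p> tags."""
    # check=True raises CalledProcessError if the converter exits non-zero.
    result = subprocess.run([*converter, path],
                            capture_output=True, text=True, check=True)
    # Treat blank lines in the extracted text as paragraph breaks.
    blocks = [b.strip() for b in result.stdout.split("\n\n")]
    return "\n".join(
        "<p>" + html.escape(" ".join(b.split())) + "</p>"
        for b in blocks if b
    )


if __name__ == "__main__":
    # Hypothetical file name; requires antiword on the PATH.
    print(doc_to_html("newsletter.doc"))
```

    This only recovers paragraphs of plain text. Anything beyond that (headings, bold, images) depends on what the chosen tool can emit.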

    If the solution is on Windows, you can use the Word API to extract text, styles etc. I haven't done this in a while, but it's feasible. More recent versions are easier to deal with than the old crap I once had to convert for a 1,000-page document.
    “Even those who arrange and design shrubberies are under
    considerable economic stress at this period in history.”

  4. #4
    Join Date
    Nov 2001
    Location
    Vancouver
    Posts
    2,416
    PS, happened to remember reading this article once upon a time:

    http://www.xml.com/pub/a/2004/12/08/word-to-xml.html

    When looking for such solutions, it's often helpful to ignore your language of choice and search for the solution first.

    You can always call a Python (or XYZ-based) tool which does the work for you, from your PHP scripts.

  5. #5
    Join Date
    Mar 2004
    Location
    New Jersey
    Posts
    793
    Wow, great replies! Thanks for your help. I have a lot of reading and testing to do.
