
Thread: RSS Help

  1. #1
    Join Date
    Aug 2007
    Location
    Oakham England
    Posts
    503

    RSS Help

    Hey,

    I have an rss feed that a script generates. From a database but one main contact in the database has HTML in it witch my RSS feed will not pick up so its not brining anything up!

    Can anyone explain how I make my RSS feed take in HTML??

    Dan
    Streama - Your WordPress Friend
    http://www.streama.co.uk

  2. #2
    Join Date
    Nov 2001
    Location
    Vancouver
    Posts
    2,422
    Witches have attacked your RSS feed?
    HTML in brine?

    Ok, off grammar and on to a solution, but only after two questions:

    1. Which element has the offending HTML? Is it the author sub-element of item? Be specific.

    2. "my RSS feed will not pick up" - means exactly what? Are we to guess this means when you try to view the RSS feed directly in a browser like Firefox, you see a blank page? An error? In either case, if you "view source" is there some XML source present?

    First I have to say that RSS is a crappy specification, and for proof of that I present to you the only exhibit in my prima facie case, exhibit A, the specification:

    http://cyber.law.harvard.edu/rss/rss.html

    Wade through that document and what you'll find is a lack of specifications more than anything. Case closed.

    (If you want to see a better specification for something similar but much more complete, read the Atom feed spec: http://www.ietf.org/rfc/rfc4287.txt. Incidentally the Atom feed spec would allow for HTML representation via a type attribute.)

    Since you've not given much to go on, let's just guess and assume it is text you are inserting from your database into an "author" sub-element, i.e. <item><author><strong>Some offending html</strong></author> etc., and that this is giving you an issue on display.

    You'll note the "spec" says the author subelement can contain an email. It says nothing about optional types / html data. Few RSS publishers actually use the <author> sub-element to contain an email, thanks to spam.

    http://cyber.law.harvard.edu/rss/rss...mentOfLtitemgt

    You have a few choices. Stripping the HTML from the input data is probably the best one. Specifically: adjust your RSS processing script to strip HTML from any field but content-related fields. Strip the HTML from the <author> sub-element, leaving only the text -- this is not that difficult.

    The second is to try wrapping the content of the offending sub-element in a CDATA section. I do not favour this because you are likely to run into browser and feed-reader incompatibility problems.

    There is a third option: use Atom instead of RSS, but this is more involved than simply running the non-content RSS element data through an HTML filter.
    “Even those who arrange and design shrubberies are under
    considerable economic stress at this period in history.”

  3. #3
    Join Date
    Aug 2007
    Location
    Oakham England
    Posts
    503
    1. It's the description, where the main text shows up.

    2. Say the database holds something like the following:

    Code:
    Hello my <u>name</u> is <b>daniel</b>
    When I run the script that makes the feed, it does not show anything at all, not even the plain text without the bold or underlining. But if the database holds...

    Code:
    Hello my name is daniel
    it will show "Hello my name is daniel...".


    Looking at the source of the XML, it just says <description></description>

    Any ideas now? I have looked at your site. Want to see the code?

    Dan

  4. #4
    Join Date
    Nov 2001
    Location
    Vancouver
    Posts
    2,422
    Simple fix: enclose all data destined for the <description> element with CDATA tags, as in the following example (the <![CDATA[ and ]]> lines are what you need to add):

    <description>
    <![CDATA[
    Hello my <u>name</u> is <b>daniel</b>
    ]]></description>

    My comment about CDATA in my first message does not apply here; the description element is one in which virtually all browsers and feed readers *expect* to see (X)HTML. CDATA in an author field, on the other hand, would likely run into trouble.

    For more information on CDATA see: http://www.w3.org/TR/REC-xml/#dt-cdsection
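
To make the CDATA wrapping concrete, here is a minimal sketch in Python (the thread's script is PHP; Python is used only so the example runs standalone, and `cdata` and `description_element` are hypothetical helper names). It also splits any literal `]]>` in the content, since that sequence would otherwise end the CDATA section early.

```python
import xml.etree.ElementTree as ET


def cdata(text: str) -> str:
    """Wrap text in a CDATA section, splitting any ']]>' inside it."""
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


def description_element(html: str) -> str:
    """Build an RSS <description> element whose content is CDATA-wrapped HTML."""
    return "<description>" + cdata(html) + "</description>"


xml = description_element("Hello my <u>name</u> is <b>daniel</b>")
print(xml)

# An XML parser hands the markup back as plain character data.
print(ET.fromstring(xml).text)
```

The feed reader, not the XML parser, is then responsible for interpreting the `<u>` and `<b>` markup.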

    Edit: be sure to run your feed through the W3C feed validator:

    http://validator.w3.org/feed/

    You might find this one runs faster though:

    http://beta.feedvalidator.org/

    The W3C validator uses the software from feedvalidator.org so you aren't missing anything by using the second link.

    It will help you uncover any other problems you may have.
    Last edited by mwatkins; 06-19-2009 at 02:26 PM.

  5. #5
    Join Date
    Jan 2006
    Location
    Athens, Greece
    Posts
    1,481
    CDATA is the way to go, but here is a strange note on the subject that I bumped into today.

    I was "parsing" a KML file (Google Earth) which had the <description> node of the item filled with HTML inside a CDATA section, and PHP simple xmlreader didn't read the content of this tag.
    I haven't investigated it yet, but it looks suspicious.

  6. #6
    Join Date
    Aug 2007
    Location
    Oakham England
    Posts
    503
    Quote Originally Posted by mwatkins
    Simple fix: enclose all data destined for the <description> element with CDATA tags, as in the following example (the <![CDATA[ and ]]> lines are what you need to add):

    <description>
    <![CDATA[
    Hello my <u>name</u> is <b>daniel</b>
    ]]></description>

    Hey,

    I have added this to my code and, looking at other people's feeds, it should work, but it doesn't! Can you take a look at my code?

    I know for a fact it's this bit; I've been doing some tests.

    PHP Code:
    while ($article = mysql_fetch_assoc($items))
    {
        $title = makeUTF($article["post_title"]);
        $url = $article["guid"];
        $desc = $article["post_content"];
        $mod_post = $article["post_modified"];
        $img_url = $article["post_img"];
        $p_id = $article["id"];

        $desc = trim(substr($desc, 0, 256));
        $len = strlen($desc) - 1;
        $x = strpos($desc, '<');
        if ($x !== false && $x < $len) $len = $x;
        if ($desc[$len] != ".")
        {
            while ($len > 20 && ord($desc[$len]) != 32) $len--;
        }
        $desc = trim(substr($desc, 0, $len)) . "...";

        $desc = makeUTF($desc);

        $rss .= "<item>\n";
        $rss .= "<title>$title</title>\n";
        $rss .= "<link>$url</link>\n";
        $rss .= "<guid>$url</guid>\n";
        $rss .= "<pubDate>$mod_post</pubDate>\n";
        $rss .= "<imagess>$img_url</imagess>\n";
        $rss .= "<description><![CDATA[$desc]]></description>\n";
        $rss .= "<p_id>$p_id</p_id>\n";
        $rss .= "</item>\n";
    }

    Dan

  7. #7
    Join Date
    Nov 2001
    Location
    Vancouver
    Posts
    2,422
    Why not comment out all the code which modifies $desc? I realize that ultimately you want to publish only a short bit of the content, but just for a test, publish the entire content.

    Does this show up properly in the feed? Let's just be sure there is only one problem, and then I'll give you my comment on the truncation code chunk you have (will have) commented out.

  8. #8
    Join Date
    Aug 2007
    Location
    Oakham England
    Posts
    503
    I have done that, which is how I know it's that code. I changed it to $descb and used $descb in the feed, and it worked! So it's something to do with shortening it.

    Now things with HTML in them just show "..."!

    Dan
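
A rough way to see where a bare "..." could come from: the sketch below ports the truncation logic from the PHP snippet in post #6 to Python (chosen only because it runs standalone). `truncate_like_snippet` is a made-up name, and the `<` and `>` comparisons are my reading of the posted code, so treat this as an approximation rather than the script itself. If the port is faithful, any content that starts with a tag is cut at position 0, leaving only the ellipsis.

```python
def truncate_like_snippet(desc: str) -> str:
    """Approximate Python port of the thread's PHP truncation logic."""
    desc = desc[:256].strip()
    if not desc:
        return "..."
    length = len(desc) - 1
    x = desc.find("<")          # like strpos($desc, '<'); -1 means not found
    if x != -1 and x < length:
        length = x              # cut at the first tag, even at position 0
    if desc[length] != ".":
        # Back up to a space, but never below 20 characters.
        while length > 20 and desc[length] != " ":
            length -= 1
    return desc[:length].strip() + "..."


# Content that opens with a tag is reduced to the ellipsis alone.
print(truncate_like_snippet("<p>Hello my <u>name</u> is <b>daniel</b></p>"))  # ...
# Content with a tag further in is cut just before that tag.
print(truncate_like_snippet("Hello my <u>name</u> is <b>daniel</b>"))         # Hello my...
```

That would match the symptom if the stored content begins with something like `<p>`, though what the database actually holds is a guess.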

  9. #9
    Join Date
    Aug 2007
    Location
    Oakham England
    Posts
    503
    Help please

  10. #10
    Join Date
    Apr 2009
    Location
    localhost
    Posts
    175
    Make sure the XML starts at the very start of the page where you output it. Otherwise it will show XML errors.

  11. #11
    Join Date
    Aug 2007
    Location
    Oakham England
    Posts
    503
    There are no errors, it just doesn't display the code.

    Dan

  12. #12
    Join Date
    Aug 2007
    Location
    Oakham England
    Posts
    503
    Still having issues.

    Dan

  13. #13
    Join Date
    Nov 2001
    Location
    Vancouver
    Posts
    2,422
    Try this: strip all HTML from "description". Something like this:

    PHP Code:
    // Drop all markup, then shorten the plain text if it is too long.
    $desc = strip_tags($article["post_content"]);
    if (strlen($desc) > 256) {
        $desc = trim(substr($desc, 0, 256));
        $desc = '<p>' . $desc . ' ...</p>';
    }

    // ... snip ...
    $rss .= "<description><![CDATA[$desc]]></description>\n";
    If that fails to deliver, I'd check out what makeUTF is doing.
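
The text-only route can also be sketched in Python, purely as a standalone illustration of the idea. Python's standard library has no `strip_tags`, so a small `HTMLParser` subclass collects the text instead; `text_snippet`, `_TextCollector` and the 256 default are illustrative choices, and this does no sanitizing (text inside <script> or <style> would be kept as text).

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def text_snippet(html: str, limit: int = 256) -> str:
    """Strip tags, collapse whitespace, and truncate on a word boundary."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    text = " ".join("".join(collector.parts).split())
    if len(text) <= limit:
        return text
    head, sep, _ = text[:limit].rpartition(" ")
    # Fall back to a hard cut if the first `limit` characters hold no space.
    return (head if sep else text[:limit]) + " ..."


print(text_snippet("Hello my <u>name</u> is <b>daniel</b>"))  # Hello my name is daniel
```

Because the cut happens after the tags are gone, the result can never contain a half-open tag.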

  14. #14
    Join Date
    Aug 2007
    Location
    Oakham England
    Posts
    503
    I need the HTML in the code though.

    Dan

  15. #15
    Join Date
    Nov 2001
    Location
    Vancouver
    Posts
    2,422
    Quote Originally Posted by Danny159
    I need the HTML in the code though.
    So your specs are:

    - publish HTML formatted text in an RSS feed using the description element
    - truncate HTML formatted text after N characters while preserving formatting.

    When making this sort of query here, it is best to sum up your requirements in the first post. I could have put you on the right track much earlier, wasting less of your time and others'.

    Your problem is unmatched tags, something I suspected you'd run into ever since you posted the code snippet. RSS is an XML file format, a format which itself pretty much depends on well-formedness, a fancy way of saying that the tags all have to match and key tags have to be there.

    In the case of your description element, you also want to be sure some tags are not there, i.e. it wouldn't make sense to ship across <script> tags or <head> tags.

    What you really need is an HTML parser, such that as you iterate through the HTML content up to N words or characters of text (not N characters of *markup*), it keeps track of the "state" of the HTML stream, where "state" here really means which tags are currently open. That way you can more easily close them. This is non-trivial stuff.

    Imagine we are truncating on 20 characters. Your crude truncation code could easily take:
    Code:
    <div><p>My name is <strong><em>Stan the Snail</em></strong>, and I live in Hollywood.</p></div>
    And turn it into:
    Code:
    <div><p>My name is <
    Text only gives you:
    Code:
    My name is Stan the
    From:
    Code:
            12345678901            23456789012345              67890
    <div><p>My name is <strong><em>Stan the Snail</em></strong>, and I live in Hollywood.</p></div>
    I can see publishing an HTML snippet in RSS only if you have total control over what HTML will be presented. If not, text only is safer and easy to implement. If you wish to pursue the HTML snippet, do some searching on "safe HTML truncation" and PHP - perhaps you'll find a solution.

    If it were me, I'd go the HTML parser route. I could give you a workable solution in Python in not too many lines but am unwilling to invest that time into a PHP solution.

    Myself: I publish full feeds mostly; when I publish snippets, I convert the XHTML to text and truncate. Simple and efficient. Mind you, your query has prompted me to write an HTML truncation routine myself.
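
Tracking open tags while counting only text characters, as described above, might look roughly like the following Python sketch built on the standard library's `html.parser`. All names here (`truncate_html`, `_Truncator`, `VOID_TAGS`) are invented for illustration. It assumes reasonably well-nested input without <script> or <style> blocks, and it does no sanitizing, so it is a starting point rather than a finished routine.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never take a closing tag (partial list, for illustration).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class _Truncator(HTMLParser):
    """Copy HTML through until the text budget runs out."""

    def __init__(self, limit):
        super().__init__()        # default convert_charrefs=True: text arrives unescaped
        self.remaining = limit    # text characters still allowed
        self.out = []             # output fragments
        self.open_tags = []       # stack of elements currently open
        self.done = False

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        self.out.append(self.get_starttag_text())
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closed tags such as <br/> are copied but never pushed.
        if not self.done:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.done or tag not in self.open_tags:
            return
        # Close everything up to and including the matching element.
        while True:
            top = self.open_tags.pop()
            self.out.append("</%s>" % top)
            if top == tag:
                break

    def handle_data(self, data):
        if self.done:
            return
        if len(data) <= self.remaining:
            self.out.append(escape(data, quote=False))
            self.remaining -= len(data)
        else:
            cut = data[: self.remaining].rstrip()
            self.out.append(escape(cut, quote=False) + "...")
            self.done = True


def truncate_html(html, limit):
    """Keep at most `limit` text characters, then close any open tags."""
    parser = _Truncator(limit)
    parser.feed(html)
    parser.close()
    closing = "".join("</%s>" % t for t in reversed(parser.open_tags))
    return "".join(parser.out) + closing
```

With the Stan the Snail example and a limit of 20, this yields `<div><p>My name is <strong><em>Stan the...</em></strong></p></div>`: the markup is not counted, and every tag that was open at the cut is closed again.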
