Thread: RSS Help
-
06-19-2009, 05:14 AM #1Web Hosting Evangelist
- Join Date
- Aug 2007
- Location
- Oakham England
- Posts
- 503
RSS Help
Hey,
I have an rss feed that a script generates. From a database but one main contact in the database has HTML in it witch my RSS feed will not pick up so its not brining anything up!
Can anyone explain how I make my RSS feed take in HTML??
Dan
Streama - Your WordPress Friend
http://www.streama.co.uk
-
06-19-2009, 12:03 PM #2Web Hosting Master
- Join Date
- Nov 2001
- Location
- Vancouver
- Posts
- 2,422
Witches have attacked your RSS feed?
HTML in brine?
Ok, off grammar and on to a solution, but only after two questions:
1. Which element has the offending HTML? Is it the author sub-element of item? Be specific.
2. "my RSS feed will not pick up" - means exactly what? Are we to guess this means when you try to view the RSS feed directly in a browser like Firefox, you see a blank page? An error? In either case, if you "view source" is there some XML source present?
First I have to say that RSS is a crappy specification, and for proof of that I present to you the only exhibit in my prima facie case, exhibit A, the specification:
http://cyber.law.harvard.edu/rss/rss.html
Wade through that document and what you'll find is a lack of specifications more than anything. Case closed.
(If you want to see a better specification for something similar but much more complete, read the Atom feed spec: http://www.ietf.org/rfc/rfc4287.txt. Incidentally the Atom feed spec would allow for HTML representation via a type attribute.)
Since you've not given much to go on, let's just guess and assume it is text you are inserting from your database into an "author" sub-element, i.e. <item><author><strong>Some offending html</strong></author> etc., and this is giving you an issue on display.
You'll note the "spec" says the author subelement can contain an email. It says nothing about optional types / html data. Few RSS publishers actually use the <author> sub-element to contain an email, thanks to spam.
http://cyber.law.harvard.edu/rss/rss...mentOfLtitemgt
You have a couple of choices. Stripping the HTML from the input data is probably the best choice. Specifically - adjust your RSS processing script to strip HTML from any field but content-related fields. Strip the HTML from the <author> sub element, leaving only the text -- this is not that difficult.
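A minimal sketch of that first option, assuming PHP 7 or later. The helper name plainField is hypothetical and not part of the original script; it strips tags from a non-content field and then escapes what is left so it cannot break the XML.

```php
<?php
// Hypothetical helper, not from the thread: reduce a database value to plain
// text that is safe to place inside a non-content RSS element.
function plainField(string $value): string
{
    // Drop every HTML tag, keeping only the text between the tags.
    $text = trim(strip_tags($value));
    // Escape &, <, > and quotes so the result cannot break the XML.
    return htmlspecialchars($text, ENT_QUOTES | ENT_XML1, 'UTF-8');
}

echo plainField('<strong>Some offending html</strong>'), "\n";
```

The escaping step matters even after stripping, because a bare ampersand in a name would still make the feed invalid.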
The second is to try using a CDATA surround within the offending subelement. I do not favour this because you are likely to run into browser and feed-reader incompatibility problems.
There is a third option: use Atom instead of RSS, but this is more involved than simply running the non-content RSS element data through an HTML filter.
“Even those who arrange and design shrubberies are under
considerable economic stress at this period in history.”
-
06-19-2009, 12:45 PM #3Web Hosting Evangelist
- Join Date
- Aug 2007
- Location
- Oakham England
- Posts
- 503
1. It's the description, where the main text shows up.
2. What happens if, in the database, you have something like the following:
Code:
Hello my <u>name</u> is <b>daniel</b>
Code:
Hello my name is daniel
Looking at the source of the XML, it just says <description></description>
Any idea now? I have looked at your site. Want to see the code?
Dan
-
06-19-2009, 02:17 PM #4Web Hosting Master
- Join Date
- Nov 2001
- Location
- Vancouver
- Posts
- 2,422
Simple fix: enclose all data destined for the <description> element in a CDATA section, as in the following example (the <![CDATA[ and ]]> markers are what you need to add):
<description>
<![CDATA[
Hello my <u>name</u> is <b>daniel</b>
]]></description>
My comment about CDATA in my first message does not apply here; the description element is one of those in which virtually all browsers and feed readers *expect* to see (X)HTML. CDATA in an author field, on the other hand, would likely run into trouble.
For more information on CDATA see: http://www.w3.org/TR/REC-xml/#dt-cdsection
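For illustration, a small PHP sketch of wrapping a description in a CDATA section. The function name cdataWrap is an assumption, not something from the original script. One detail worth handling: a CDATA section ends at the first "]]>", so content that contains that sequence has to be split across two sections.

```php
<?php
// Hypothetical helper: wrap arbitrary (X)HTML in a CDATA section.
function cdataWrap(string $html): string
{
    // A literal "]]>" would end the section early, so split it across
    // two adjacent CDATA sections.
    $safe = str_replace(']]>', ']]]]><![CDATA[>', $html);
    return '<![CDATA[' . $safe . ']]>';
}

$desc = 'Hello my <u>name</u> is <b>daniel</b>';
echo '<description>' . cdataWrap($desc) . "</description>\n";
```

This only protects the XML layer; the HTML inside the section still has to be sensible for the feed reader that renders it.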
Edit: be sure to run your feed through the W3C feed validator:
http://validator.w3.org/feed/
You might find this one runs faster though:
http://beta.feedvalidator.org/
The W3C validator uses the software from feedvalidator.org so you aren't missing anything by using the second link.
It will help you uncover any other problems you may have.
Last edited by mwatkins; 06-19-2009 at 02:26 PM.
-
06-19-2009, 03:19 PM #5Web Hosting Master
- Join Date
- Jan 2006
- Location
- Athens, Greece
- Posts
- 1,481
CDATA is the way, but here is a strange note on this subject that I bumped into today.
I was "parsing" a KML file (Google Earth) which had the <description> node of the item filled with HTML inside a CDATA section, and PHP simple xmlreader didn't read the content of this tag.
I haven't investigated it yet, but it looks suspicious.
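A sketch of one way to read such a node, assuming the reader meant here is PHP's SimpleXML extension (the post does not say for certain). With SimpleXML, a CDATA-only element can look empty when the object is dumped, so this passes LIBXML_NOCDATA to merge CDATA into ordinary text and casts the element to a string. The sample XML is made up for the example.

```php
<?php
// Made-up sample: an element whose only content is a CDATA section.
$kml = '<item><description><![CDATA[Hello <b>world</b>]]></description></item>';

// LIBXML_NOCDATA asks libxml to treat CDATA sections as plain text nodes.
$item = simplexml_load_string($kml, 'SimpleXMLElement', LIBXML_NOCDATA);

// Casting the element to string yields its text content.
$description = (string) $item->description;
echo $description, "\n";
```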
-
06-20-2009, 05:48 PM #6Web Hosting Evangelist
- Join Date
- Aug 2007
- Location
- Oakham England
- Posts
- 503
Hey,
I have added this to my code and, looking at other people's, it should work, but it doesn't! Can you take a look at my code?
I know for a fact it's this bit; I've been doing some tests.
PHP Code:
while($article = mysql_fetch_assoc($items))
{
    $title = makeUTF($article["post_title"]);
    $url = $article["guid"];
    $desc = $article["post_content"];
    $mod_post = $article["post_modified"];
    $img_url = $article["post_img"];
    $p_id = $article["id"];
    $desc = trim(substr($desc, 0, 256));
    $len = strlen($desc) -1;
    $x = strpos($desc, '<');
    if($x !== false && $x < $len) $len = $x;
    if($desc[$len] != ".")
    {
        while($len > 20 && ord($desc[$len]) != 32) $len--;
    }
    $desc = trim(substr($desc, 0, $len)) . "...";
    $desc = makeUTF($desc);
    $rss .= "<item>\n";
    $rss .= "<title>$title</title>\n";
    $rss .= "<link>$url</link>\n";
    $rss .= "<guid>$url</guid>\n";
    $rss .= "<pubDate>$mod_post</pubDate>\n";
    $rss .= "<imagess>$img_url</imagess>\n";
    $rss .= "<description><![CDATA[$desc]]></description>\n";
    $rss .= "<p_id>$p_id</p_id>\n";
    $rss .= "</item>\n";
}
-
06-20-2009, 06:26 PM #7Web Hosting Master
- Join Date
- Nov 2001
- Location
- Vancouver
- Posts
- 2,422
Why not comment out all the code which modifies $desc? I realize that ultimately you want to publish only a short bit of the content, but just for a test, publish the entire content.
Does this show up properly in the feed? Let's just be sure there is only one problem, and then I'll give you my comment on the truncation code chunk you have (will have) commented out.
-
06-20-2009, 06:32 PM #8Web Hosting Evangelist
- Join Date
- Aug 2007
- Location
- Oakham England
- Posts
- 503
-
06-22-2009, 10:06 AM #9Web Hosting Evangelist
- Join Date
- Aug 2007
- Location
- Oakham England
- Posts
- 503
Help please
-
06-22-2009, 07:35 PM #10Temporarily Suspended
- Join Date
- Apr 2009
- Location
- localhost
- Posts
- 175
Make sure the XML declaration is the very first thing in the output of the page that serves the feed, with nothing before it. Otherwise it will show XML errors.
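As a rough illustration of that point, here is a sketch in which the XML declaration is the first thing the script emits. The function rssDocument and its parameters are hypothetical names for this example, and the sample title and link are placeholders. The opening PHP tag must also be the very first bytes of the file, with no blank line or stray whitespace before it.

```php
<?php
// Hypothetical wrapper, not from the thread: put already-built <item>
// markup inside a minimal RSS 2.0 document that starts with the XML
// declaration.
function rssDocument(string $title, string $link, string $description, string $itemsXml): string
{
    $title = htmlspecialchars($title, ENT_QUOTES | ENT_XML1, 'UTF-8');
    $link = htmlspecialchars($link, ENT_QUOTES | ENT_XML1, 'UTF-8');
    $description = htmlspecialchars($description, ENT_QUOTES | ENT_XML1, 'UTF-8');

    return '<?xml version="1.0" encoding="UTF-8"?>' . "\n"
        . '<rss version="2.0">' . "\n"
        . "<channel>\n"
        . "<title>$title</title>\n"
        . "<link>$link</link>\n"
        . "<description>$description</description>\n"
        . $itemsXml
        . "</channel>\n"
        . "</rss>\n";
}

// Send the content type before any output, if headers can still be sent.
if (!headers_sent()) {
    header('Content-Type: application/rss+xml; charset=UTF-8');
}
$feed = rssDocument('Example feed', 'http://example.com/', 'Placeholder channel', '');
echo $feed;
```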
-
06-23-2009, 03:45 AM #11Web Hosting Evangelist
- Join Date
- Aug 2007
- Location
- Oakham England
- Posts
- 503
There are no errors, it just doesn't display the code.
Dan
-
06-29-2009, 10:09 AM #12Web Hosting Evangelist
- Join Date
- Aug 2007
- Location
- Oakham England
- Posts
- 503
Still having issues.
Dan
-
06-29-2009, 12:58 PM #13Web Hosting Master
- Join Date
- Nov 2001
- Location
- Vancouver
- Posts
- 2,422
Try this: strip all HTML from "description". Something like this:
PHP Code:
// Remove every tag so only plain text is left.
$desc = strip_tags($article["post_content"]);
if (strlen($desc) > 256) {
    // Keep the first 256 characters and mark the cut with an ellipsis.
    $desc = trim(substr($desc, 0, 256));
    $desc = '<p>' . $desc . ' ...</p>';
}
// ... snip ...
$rss .= "<description><![CDATA[$desc]]></description>\n";
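A variation on the snippet above, offered as a sketch rather than a drop-in replacement: it cuts at the last space before the limit so that a word is not chopped in half. The function name textSummary is hypothetical, and strlen/substr count bytes, so multibyte text would need the mb_* equivalents.

```php
<?php
// Hypothetical helper: plain-text summary of an HTML string, cut on a
// word boundary at or before $limit bytes.
function textSummary(string $html, int $limit = 256): string
{
    // Strip tags and collapse runs of whitespace into single spaces.
    $text = trim(preg_replace('/\s+/', ' ', strip_tags($html)));
    if (strlen($text) <= $limit) {
        return $text;
    }
    $cut = substr($text, 0, $limit);
    // Back up to the last space so the final word stays whole.
    $space = strrpos($cut, ' ');
    if ($space !== false) {
        $cut = substr($cut, 0, $space);
    }
    return rtrim($cut) . ' ...';
}

$summary = textSummary('<p>My name is <strong><em>Stan the Snail</em></strong>, and I live in Hollywood.</p>', 20);
echo $summary, "\n";
```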
-
06-30-2009, 04:15 AM #14Web Hosting Evangelist
- Join Date
- Aug 2007
- Location
- Oakham England
- Posts
- 503
-
06-30-2009, 11:11 PM #15Web Hosting Master
- Join Date
- Nov 2001
- Location
- Vancouver
- Posts
- 2,422
So your specs are:
- publish HTML-formatted text in an RSS feed using the description element
- truncate HTML-formatted text after N characters while preserving formatting.
When making this sort of query here, it is best that you sum up your requirements in the first post... I could have put you on the right track much earlier, wasting less of your time and of others'.
Your problem is unmatched tags, something I suspected you'd run into ever since you posted the code snippet. RSS is an XML file format, a format which itself pretty much depends on well-formedness, a fancy way of saying that the tags all have to match and key tags have to be there.
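One rough way to check a description fragment for unmatched tags before publishing it, sketched under some assumptions: the fragment is XHTML-style markup, PHP's DOM extension is available, and the function name fragmentErrors is made up for this example. It parses the fragment as XML inside a dummy root element and collects whatever libxml complains about. Named HTML entities such as &nbsp; would also be reported, since plain XML does not define them.

```php
<?php
// Hypothetical helper: return libxml's error messages for a markup
// fragment; an empty array means the tags are balanced.
function fragmentErrors(string $fragment): array
{
    // Collect errors quietly instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadXML('<root>' . $fragment . '</root>');

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = trim($error->message);
    }

    // Restore libxml's previous state.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $messages;
}

print_r(fragmentErrors('<div><p>My name is <'));
```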
In the case of your description element, you also want to be sure some tags are not there, i.e. it wouldn't make sense to ship across <script> tags, or <head> tags.
What you really need is an HTML parser, such that as you iterate through the HTML content up to N words or characters (not N characters of *markup*) it keeps track of the "state" of the HTML stream, where "state" here really means which tags it is currently inside. That way you can more easily close those tags. This is non-trivial stuff.
Imagine we are truncating on 20 characters. Your crude truncation code could easily take:
Code:
<div><p>My name is <strong><em>Stan the Snail</em></strong>, and I live in Hollywood.</p></div>
and cut it after 20 characters of markup:
Code:
<div><p>My name is <
whereas the first 20 characters of the text itself are:
Code:
My name is Stan the
Counting only the text characters, not the markup:
Code:
        12345678901            23456789012345              67890
<div><p>My name is <strong><em>Stan the Snail</em></strong>, and I live in Hollywood.</p></div>
If it were me, I'd go the HTML parser route. I could give you a workable solution in Python in not too many lines but am unwilling to invest that time into a PHP solution.
Myself: I publish full feeds mostly; when I publish snippets, I convert the XHTML to text and truncate. Simple and efficient. Mind you, your query has prompted me to write an HTML truncation routine myself.
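For readers who want to try the state-tracking idea in PHP, here is a rough sketch. It is not the routine mentioned in the post above, and it uses a simple regex tokenizer rather than a real HTML parser, so it assumes reasonably well-formed input. The function name truncateHtml and its parameters are made up for this example. It counts only text (in bytes, so multibyte text would need mb_* functions, and entities count as several characters), keeps a stack of open tags, and closes them after the cut.

```php
<?php
// Hypothetical helper: keep at most $limit bytes of text from an HTML
// fragment and close any tags that are still open at the cut.
function truncateHtml(string $html, int $limit, string $ellipsis = '...'): string
{
    // Elements that never take a closing tag.
    $void = ['br', 'hr', 'img', 'input', 'link', 'meta'];

    // Split into alternating tags and text runs, keeping the tags.
    $tokens = preg_split('/(<[^>]+>)/', $html, -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

    $open = [];   // stack of currently open tag names
    $out = '';
    $count = 0;   // bytes of text emitted so far

    foreach ($tokens as $token) {
        if ($token[0] === '<') {
            if (preg_match('#^</\s*([a-zA-Z][a-zA-Z0-9]*)#', $token, $m)) {
                // Closing tag: pop it if it matches the innermost open tag.
                if (end($open) === strtolower($m[1])) {
                    array_pop($open);
                }
            } elseif (preg_match('#^<\s*([a-zA-Z][a-zA-Z0-9]*)#', $token, $m)) {
                // Opening tag: remember it unless it is void or self-closed.
                $name = strtolower($m[1]);
                if (!in_array($name, $void, true) && substr($token, -2) !== '/>') {
                    $open[] = $name;
                }
            }
            $out .= $token;
            continue;
        }

        // Text run: emit it whole, or cut it and stop.
        $remaining = $limit - $count;
        if (strlen($token) > $remaining) {
            $out .= rtrim(substr($token, 0, $remaining)) . $ellipsis;
            break;
        }
        $out .= $token;
        $count += strlen($token);
    }

    // Close whatever is still open, innermost first.
    while ($open) {
        $out .= '</' . array_pop($open) . '>';
    }
    return $out;
}

$in = '<div><p>My name is <strong><em>Stan the Snail</em></strong>, and I live in Hollywood.</p></div>';
echo truncateHtml($in, 20), "\n";
// prints: <div><p>My name is <strong><em>Stan the...</em></strong></p></div>
```

A real parser (for example PHP's DOM extension) would cope better with sloppy markup; the tokenizer is used here only to keep the sketch short.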