Web Hosting Talk







XML Parse but with CDATA problems


paragony2k
01-11-2006, 04:49 PM
Hello,

After two weeks of trying to figure this out on my own to no avail, I present you with a problem that I am sure has a very simple solution. My server runs PHP4. I have been pulling a search feed with the parser below for a long time, and it still works with the old feed. We have changed our search feed, and the parser no longer works with the new one. This is what the new XML structure looks like:

<listing>

<title>

<![CDATA[ Website Hosting Links Best On The Web]]>

</title>

<url>

<![CDATA[ http://www.blahblahblah.com]]>

</url>

<redirect>

<![CDATA[ http://search.blahblah.com]]>

</redirect>

<description>

<![CDATA[ The best blahs and blah blah blah blah blah around. Why not blah
blah blah blah for yourself? Blah!]]>

</description>

<bid>

<![CDATA[ 0.00]]>

</bid>

</listing>


Now the only difference in the XML feed is the added <![CDATA[ ... ]]> wrapper, and I cannot figure out how to parse it with the parser I have been using for a while now, which is as follows:

<?
function get_blahblahsearch($keyword,$user_ip){

$DATA = fopen("http://www.blahblah.com/feed_string_here", "r");
$feed = '';
while (!feof($DATA)) {
$feed .= fread($DATA, 8192);
}
fclose($DATA);

$feed = utf8_decode($feed);

$n_results = 0;
while (eregi("(<TITLE>)([^<]+)",$feed,$out)){
$results[$n_results]["title"] = $out[2];
$results[$n_results]["title"] = html_entity_decode($results[$n_results]["title"]);
eregi("(<DESCRIPTION>)([^<]+)",$feed,$out);
$results[$n_results]["description"] = $out[2];
$results[$n_results]["description"] = html_entity_decode($results[$n_results]["description"]);
eregi("(<URL>)([^<]+)",$feed,$out);
$results[$n_results]["siteHost"] = $out[2];
eregi("(<REDIRECT>)([^<]+)",$feed,$out);
$results[$n_results]["link"] = urldecode($out[2]);
$feed = substr($feed,strpos($feed,$out[0])+strlen($out[0]));
$n_results += 1;
};
return $results;
};

?>
<?
$ip = getenv("REMOTE_ADDR");
$keyword = $_REQUEST["q"];


$results = get_blahblahsearch($keyword,$ip);

if ($results){
// sample output
foreach ($results as $result){
echo "<tr><td colspan=2><p align=justify>";
echo "<a href=\"" . $result['link'] . "\" target=top><font color=#336699 size=3>" . strip_tags($result['title']) . "</a></font>";
echo " - <font color=#3BA101 size=1><b>" . $result['siteHost'] . "</b></font><br>";
echo "<font color=midnightblue size=1>" . strip_tags($result['description']) . "</font><br><br>";
echo "</p>";
echo "</td><tr>";
}
}

?>

Any help at this point will be greatly appreciated.

Thank you for your time,
Daniel :gthumb:

orbitz
01-11-2006, 06:14 PM
I believe the problem is right here:
eregi("(<TITLE>)([^<]+)",$feed,$out)

Now that the feed wraps each value in CDATA, e.g. <![CDATA[ http://www.blahblahblah.com]]>, a "<" comes right after <TITLE>.
[^<]+ cannot match past that "<", so eregi stops at <TITLE>< and $out[2] is blank. Same problem with the others.
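
A minimal sketch of the failure described above. It uses preg_match() rather than eregi() (an assumption made so it runs on current PHP, where the ereg functions no longer exist), and the two sample strings are made up for illustration:

```php
<?php
// Old feed: plain text directly after the tag.
$old_feed = '<title>Website Hosting Links</title>';
// New feed: the text sits inside a CDATA section, so "<" follows "<title>".
$new_feed = '<title><![CDATA[ Website Hosting Links]]></title>';

// Same pattern as the original eregi() call, written for preg_match().
$pattern = '/(<TITLE>)([^<]+)/i';

$old_hit = preg_match($pattern, $old_feed, $old_out);
$new_hit = preg_match($pattern, $new_feed, $new_out);

// [^<]+ needs at least one character that is not "<", so the
// CDATA version produces no match at all.
var_dump($old_hit, $new_hit);
```

If whitespace or a line break separates the tag from the CDATA section, the same pattern does match, but it captures only that whitespace, which would also show up as a blank title.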

paragony2k
01-11-2006, 06:19 PM
So should I change it to:
eregi("(<TITLE><CDATA>)([^<]+)",$feed,$out)){

orbitz
01-11-2006, 06:58 PM
how about:
eregi("(<TITLE>)([^>]+)",$data,$out);

paragony2k
01-11-2006, 07:20 PM
I replaced $feed with $data, but still no results.

orbitz
01-11-2006, 08:40 PM
What do you see when you add print_r($out)?
I tested it on my work PC, and it gave some results.

paragony2k
01-11-2006, 08:46 PM
Where do I put print_r($out)?

paragony2k
01-11-2006, 08:48 PM
Thank you BTW

orbitz
01-11-2006, 11:25 PM
Right below the eregi call.

paragony2k
01-11-2006, 11:38 PM
Like this:

eregi("(<TITLE>)([^<]+)",$feed,$out)){
print_r($out);
$results[$n_results]["title"] = $out[2];
$results[$n_results]["title"] = html_entity_decode($results[$n_results]["title"]);

paragony2k
01-12-2006, 01:15 PM
No, that didn't work, but thank you.

Any suggestions on where I could find help on this issue?

orbitz
01-12-2006, 01:33 PM
print_r is just for debugging, i.e. to see what value the array $out holds.

I tested it, and it worked if I change "title" to something else. For some reason, "title" won't work.

paragony2k
01-13-2006, 11:02 AM
No, that didn't work either.

I did try this:
while (eregi("(<TITLE>(.*))([^<]+)",$feed,$out)){
$results[$n_results]["title"] = $out[2];
$results[$n_results]["title"] = html_entity_decode($results[$n_results]["title"]);

I added (.*) right after <TITLE>.
Funny thing: this is the first time I really got anything to happen, but it's still not working. I see
<![CDATA [blah blah blah blah blah blah]]> in the title bar of the web page,
and when I do a 'view source' I see all the XML info in the body of the page:

<html>
<head>
</head>

<body>
<tr><td colspan=2><p align=justify><a href="<![CDATA[blah blah blah blah]]></title>
<url><![CDATA[http://www.blah blah blah blah.com/]]></url>
<redirect><![CDATA[http://www.blah blah blah blah.com/blah blah blah blahblah blah blah blahblah blah blah blahblah blah blah blahblah blah blah blahblah blah blah blahblah blah blah blahblah blah blah blahblah blah blah blah]]></redirect>
<description><![CDATA[blah blah blah blahblah blah blah blahblah blah blah blahblah blah blah blahblah blah blah blah.]]></description>
<bid><![CDATA[0.000]]></bid>
</listing>


</font><br><br></p></td><tr></body>
</htmL>


This is so far from the desired result, but it's the first time I have seen anything happen.
I hope this helps with any suggestions.

Thank you.
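
A short sketch of why adding (.*) pulled the rest of the XML into the result: ".*" is greedy and runs as far as it can. The example uses preg_match() (an assumption; the POSIX eregi() function has no lazy quantifier) and a made-up two-field feed:

```php
<?php
$feed = '<title><![CDATA[First]]></title><url><![CDATA[http://example.com]]></url>';

// Greedy: .* runs to the LAST "]]>" in the input.
preg_match('/<title><!\[CDATA\[(.*)\]\]>/is', $feed, $greedy);
// Lazy: .*? stops at the FIRST "]]>".
preg_match('/<title><!\[CDATA\[(.*?)\]\]>/is', $feed, $lazy);

echo $greedy[1], "\n"; // title text plus the following tags and URL
echo $lazy[1], "\n";   // title text only
```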

paragony2k
01-13-2006, 04:15 PM
OK, here are the changes I have made:

while (eregi("(<title><!\[CDATA\[(.*)\]\]></title>)",$feed,$out)){
$results[$n_results]["title"] = $out[2];
$results[$n_results]["title"] = html_entity_decode($results[$n_results]["title"]);
eregi("(<description><!\[CDATA\[(.*)\]\]></description>)([^<]+)",$feed,$out);
$results[$n_results]["description"] = $out[2];
$results[$n_results]["description"] = html_entity_decode($results[$n_results]["description"]);
eregi("(<url><!\[CDATA\[(.*)\]\]></url>)([^<]+)",$feed,$out);
$results[$n_results]["siteHost"] = $out[2];
eregi("(<redirect><!\[CDATA\[(.*)\]\]></redirect>)([^<]+)",$feed,$out);
$results[$n_results]["link"] = urldecode($out[2]);
$feed = substr($feed,strpos($feed,$out[0])+strlen($out[0]));
$n_results += 1;
};
return $results;
};

?>



Notice I placed:
<title><!\[CDATA\[(.*)\]\]></title>
inside the eregi calls.

It is now reading the XML feed!!

YEA!!

But here is my new problem:
it is attaching ']]>' to the end of each string,
like this:

Best Blah Blah Blah]]> - The best blash blah blah blah blah around. If you dont here blah blah blah blah blah blah then you just havent visted us yet.]]>
- http://blahblah.com]]>

I have tried rtrim(), and that didn't seem to work for me.

Any suggestions now?
Please help out with an example.

Thank you very much for your time,
Daniel
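
The post does not show how rtrim() was called, so the cause cannot be confirmed. One possibility, sketched here under the assumption that whitespace follows the "]]>": rtrim() only strips characters at the very end of the string, so a trailing newline stops it before it reaches "]]>". The sample string is made up:

```php
<?php
// Assumed input: a newline trails the "]]>".
$raw = "Best Blah Blah Blah]]>\n";

// Default rtrim() strips whitespace only, so "]]>" stays.
$default = rtrim($raw);
// With a character list of "]" and ">", rtrim() stops at once
// because the last character is "\n", which is not in the list.
$listed = rtrim($raw, ']>');
// Stripping "]]>" together with the whitespace around it works.
$cleaned = preg_replace('/\s*\]\]>\s*$/', '', $raw);

var_dump($default, $listed, $cleaned);
```

Capturing only the text between "<![CDATA[" and "]]>" in the first place avoids the cleanup step entirely.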

error404
01-13-2006, 05:49 PM
Why are you parsing XML with regexps when PHP includes an XML module? If you don't want to hack together an expat parser, you can even use domxml, which a lot of web hosts will probably have installed. Of course, the problem is that your XML is not well-formed if you have more than one listing without a container element, so the parser will fail. Anyway, here's a parser that should work with your document, provided there's a container element or only one listing:

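<?php

// Load the feed; "test.xml" stands in for the real feed.
$data = file_get_contents("test.xml");
$listings = array();      // finished listings
$current = array();       // fields of the listing being read
$next_cdata_key = '';     // name of the element whose text comes next

function startElement($parser, $name, $attrs) {
    // assume you don't care about attributes
    global $next_cdata_key;
    switch ($name) {
        case 'LISTING':
            break;
        default:
            // remember which field the following character data belongs to
            $next_cdata_key = $name;
            break;
    }
}

function endElement($parser, $name) {
    global $listings, $current;
    // a closing </listing> completes one record
    if ($name == 'LISTING') {
        $listings[] = $current;
        $current = array();
    }
}

function cdataElement($parser, $data) {
    global $next_cdata_key, $current;
    $text = trim($data);
    // skip the whitespace-only text between elements
    if ($text === '') {
        return;
    }
    // the parser may deliver one text node in several chunks, so append;
    // initialise the key first to avoid an undefined-index notice
    if (!isset($current[$next_cdata_key])) {
        $current[$next_cdata_key] = '';
    }
    $current[$next_cdata_key] .= $text;
}

$xml_parser = xml_parser_create("UTF-8");
// case folding makes element names upper case, hence 'LISTING' above
xml_parser_set_option($xml_parser, XML_OPTION_CASE_FOLDING, true);

xml_set_element_handler($xml_parser, "startElement", "endElement");
xml_set_character_data_handler($xml_parser, "cdataElement");
if (!xml_parse($xml_parser, $data))
    echo "An error occurred: ".xml_error_string(xml_get_error_code($xml_parser))."\n";

xml_parser_free($xml_parser);

print_r($listings);

?>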
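
A small sketch of the container-element point, assuming the feed arrives as bare sibling <listing> elements with no XML declaration in front (wrapping would not work if one were present). The helper name is_well_formed and the wrapper name <listings> are made up for illustration:

```php
<?php
// Two sibling <listing> elements with no common parent.
$feed = '<listing><title><![CDATA[ First ]]></title></listing>'
      . '<listing><title><![CDATA[ Second ]]></title></listing>';

// Hypothetical helper: true if the XML parser accepts the whole document.
function is_well_formed($xml) {
    $parser = xml_parser_create('UTF-8');
    // third argument true: this is the final (and only) chunk
    return xml_parse($parser, $xml, true) === 1;
}

// A document may have only one root element, so the bare feed fails.
$bare = is_well_formed($feed);
// Wrapping the listings in a made-up container gives it a single root.
$wrapped = is_well_formed('<listings>' . $feed . '</listings>');

var_dump($bare, $wrapped);
```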

paragony2k
01-13-2006, 07:00 PM
And that works with php4?

Great!

I will give it a shot.

Thank you.

paragony2k
01-17-2006, 12:28 AM
I've decided that using regular expressions instead of the PHP XML parser is the better way for me. The PHP parser requires loading a lot more code into memory than I actually need. It also requires some extra CPU cycles. Thank you, IAN.

paragony2k
01-25-2006, 10:32 PM
ok,

works great now.

Now, how can I write a new PHP script to echo my own results using this:

<?php

class blahblahblahblah {

function blahblahblahblah() {
global $P;
$ip = $this->get_ipaddr();
#$ip = '1.1.1.1'; # test ip
$this->profile = array(
'site_name' => 'blahblahblahblah',
'site_url' => 'http://www.blahblahblahblah.com/',
'request_url' => 'http://www.blahblahblahblah.com/cgi-bin/blahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblah',
'add_url' => 'http://www.blahblahblahblah.com/?blahblahblahblah',
);
}


function parse_results(&$output) {
#echo $output; # debug
$results = array();


if(preg_match_all('{<title><!\\[CDATA\\[(.*?)\\]\\]></title>\\s*<url><!\\[CDATA\\[(.*?)\\]\\]></url>\\s*<redirect><!\\[CDATA\\[(.*?)\\]\\]></redirect>\\s*<description><!\\[CDATA\\[(.*?)\\]\\]></description>}is',$output,$matches,PREG_SET_ORDER)) {
$last_match = '';
foreach($matches as $r) {
$result = array(
'title' => utf8_encode(html_decode($r[1])),
'real_url' => html_decode($r[2]),
'follow_url' => html_decode($r[3]),
'description' => utf8_encode(html_decode($r[4])),
);
array_push($results,$result);
$last_match = &$r[0];
}
if($last_match) {
$offset = strpos($output,$last_match)+strlen($last_match);
$output = substr($output,$offset);
}
}
#var_export($results); # debug
return $results;
}

function get_ipaddr (){
if (getenv('HTTP_X_FORWARDED_FOR')) {
$ipaddress = getenv('HTTP_X_FORWARDED_FOR');
} else {
$ipaddress = getenv('REMOTE_ADDR');
}
return $ipaddress;
}
}
?>

I was able to use this to pull sponsored results as a plugin in my search engine from blahblahblah,

but I want to create results on all my PHP pages, not only my search results.

So how can I echo these results?

<?
// echo the results for blahblahblahblah from the parser above
?>
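
One way this could be approached, as a rough sketch only. The functions parse_listings and render_listings are hypothetical stand-ins, not part of the plugin above: the first reuses the same regex idea as parse_results(), and trim() replaces the html_decode() helper, which is not shown in the thread. How the feed is fetched is not shown either, so a hard-coded sample string is used:

```php
<?php
// Hypothetical stand-in for parse_results(): one array per listing.
function parse_listings($xml) {
    $pattern = '{<title><!\[CDATA\[(.*?)\]\]></title>\s*'
             . '<url><!\[CDATA\[(.*?)\]\]></url>\s*'
             . '<redirect><!\[CDATA\[(.*?)\]\]></redirect>\s*'
             . '<description><!\[CDATA\[(.*?)\]\]></description>}is';
    $results = array();
    if (preg_match_all($pattern, $xml, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $m) {
            $results[] = array(
                'title'       => trim($m[1]),
                'real_url'    => trim($m[2]),
                'follow_url'  => trim($m[3]),
                'description' => trim($m[4]),
            );
        }
    }
    return $results;
}

// Hypothetical renderer: builds the HTML and escapes every value.
function render_listings($results) {
    $html = '';
    foreach ($results as $r) {
        $html .= '<p><a href="' . htmlspecialchars($r['follow_url']) . '">'
               . htmlspecialchars($r['title']) . '</a> - '
               . htmlspecialchars($r['real_url']) . '<br>'
               . htmlspecialchars($r['description']) . "</p>\n";
    }
    return $html;
}

// Made-up sample in the same shape as the feed in the thread.
$sample = '<listing><title><![CDATA[ Example Title]]></title>'
        . '<url><![CDATA[ http://www.example.com]]></url>'
        . '<redirect><![CDATA[ http://search.example.com/r?id=1]]></redirect>'
        . '<description><![CDATA[ An example description.]]></description>'
        . '<bid><![CDATA[ 0.00]]></bid></listing>';

$results = parse_listings($sample);
echo render_listings($results);
```

With the class as posted, the equivalent would presumably be to create the object and call parse_results() on the fetched feed. The feed has to be held in a variable, because parse_results() takes its argument by reference.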