
Problem with HTML parsing
goolex 10-29-2005, 04:15 PM I have a problem parsing this information out of some HTML code:
<tr onclick='getInfo("company_name")' style="cursor:hand" onmouseover="this.style.backgroundColor='#eeeeee'" onmouseout="this.style.backgroundColor=''">
<td width=45>company_name
<td>company description
<td width=56>857,243
<td width=47>5,359
<td width=40>253
<td width=40>6,281
<td width=40>5,933
<td width=40>6,250
<td width=40>5,940
<td width=40>5,933
<td width=40>5,710
<td width=40>223
<td width=40>3.91
<td width=40>73
<td width=40>7.4
</tr>
Dan L 10-29-2005, 04:56 PM You forgot the </td> :)
goolex 10-29-2005, 11:51 PM Yes!
But I don't know why they didn't use it! :)
goolex 10-30-2005, 03:37 AM Okay, I found a few things, but I have a problem with one part of preg_match:
preg_match_all('/>([0-9]{1,3}[0-9]{1,3})/', $data[1], $matches);
[0] => Array
(
[0] => >151
[1] => >43
[2] => >24
[3] => >24
[4] => >24
[5] => >24
[6] => >24
[7] => >23
[8] => >19
)
preg_match_all('/>([0-9]{1,3}\,[0-9]{1,3})/', $data[1], $matches);
[0] => Array
(
[0] => >151,052
[1] => >3,712
[2] => >24,580
[3] => >24,579
[4] => >24,580
[5] => >24,580
[6] => >24,579
[7] => >23,410
[8] => >1,169
)
preg_match_all('/>([0-9]{1,3}\.[0-9]{1,3})/', $data[1], $matches);
[0] => Array
(
[0] => >4.99
[1] => >7.9
)
But I need this output:
151,052
3,712
43
24,580
24,579
24,580
24,580
24,579
23,410
1,169
4.99
19
7.9
I need something like this:
preg_match_all('/>([0-9]{1,3}\MMMM[0-9]{1,3})/', $data[1], $matches);
where MMMM = any character (a comma, a dot, or nothing at all)
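One way to get a single pattern for all three cases, offered as a sketch rather than a definitive answer: make the separator and the second digit group optional, so plain integers, comma-grouped numbers and decimals all match. The sample string $row below is hypothetical data shaped like the row in the first post; the real $data[1] is not shown in the thread.

```php
<?php
// Hypothetical sample row, modelled on the markup from the first post.
$row = '<td width=45>company_name<td>company description'
     . '<td width=56>857,243<td width=47>5,359<td width=40>253'
     . '<td width=40>3.91<td width=40>73<td width=40>7.4</tr>';

// After '>': 1-3 digits, then optionally one ',' or '.' followed by 1-3 digits.
// The lookahead (?=\s*<) requires the number to be the whole cell content,
// so digits inside longer text are not picked up.
preg_match_all('/>\s*([0-9]{1,3}(?:[.,][0-9]{1,3})?)(?=\s*<)/', $row, $matches);

// $matches[1] holds the captured numbers without the leading '>'.
$numbers = $matches[1];
print_r($numbers);
```

Because only one optional group is allowed, a value with two separators such as 1,234,567 is not matched; changing the trailing ? on the group to * would be one way to allow that.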
goolex 10-30-2005, 11:40 AM Nothing?
Is it impossible?
goolex 10-31-2005, 12:44 PM Still waiting...
Christopher Lee 10-31-2005, 01:04 PM This assumes ALL you have is the snippet that you provided. I saved it into a file I named: p.txt
<?php
//should work on PHP >= 4 (foreach is not available in PHP 3).
//reset variables.
$lines = array();
$line = '';
$myoutput = array();
function stripTDs($string){
//find the FIRST occurrence of '>', then take the rest of the string.
$string = stristr($string, '>');
$string = substr($string, 1); //get rid of the leading '>'.
return rtrim($string); //drop the line break that file() leaves on each line.
}
//get the contents of the file [or snippet] into an array of lines.
$lines = file('p.txt');
//run through each line.
foreach($lines as $line){
if('<td' == strtolower(substr($line, 0, 3))){
$myoutput[] = stripTDs($line);
}
}
echo '<?xml version="1.0" encoding="ISO-8859-1"?>';
?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en-US" lang="en-US">
<head>
<title> Parse Results </title>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" />
<meta name="title" content="Parse Results" />
<meta name="author" content="Christopher Lee" />
<meta name="language" content="en" />
</head>
<body>
<h1>Parse Results</h1>
<p><?php
foreach($myoutput as $p){
echo $p . '<br />';
}
?></p>
</body>
</html>
goolex 10-31-2005, 02:48 PM Thanks, but it's showing empty output! :(
It only shows this line:
Parse Results
Christopher Lee 10-31-2005, 07:23 PM Did you save the p.txt file to the same directory as this file? Is it throwing an error?
goolex 10-31-2005, 07:35 PM Yes!
I didn't get any error.
Can you tell me what text you put in p.txt?
Was it the following?
<tr onclick='getInfo("company_name")' style="cursor:hand" onmouseover="this.style.backgroundColor='#eeeeee'" onmouseout="this.style.backgroundColor=''"><td width=45>company_name <td>company description <td width=56>857,243<td width=47>5,359<td width=40>253<td width=40>6,281<td width=40>5,933<td width=40>6,250<td width=40>5,940<td width=40>5,933<td width=40>5,710<td width=40>223<td width=40>3.91<td width=40>73<td width=40>7.4</tr>
Christopher Lee 11-01-2005, 01:18 AM I believe I have discovered the problem.
My code assumed that the text input looked exactly like the input in your original post. I assume this latest post is a more exact example than your first. If that is the case, the file() function won't work well, because the whole row is one big line. Since file() creates one array element per line, you would have gotten just one element in the array, and it did not start with the '<td' string, so nothing was added to $myoutput. Is this the best representation of the structure of the document that you are attempting to parse?
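The file() limitation described above can be sidestepped by not relying on line breaks at all. A minimal sketch, assuming the whole row arrives as one string: split on every '<td' opener and keep the text after the first '>' in each piece. The helper name extractCells and the sample $html are hypothetical, not something from the thread.

```php
<?php
// Hypothetical helper: returns the text of every <td> cell in $html,
// whether or not the cells sit on separate lines or are closed with </td>.
function extractCells($html) {
    $cells = array();
    // Split on each opening "<td" (case-insensitive).
    $pieces = preg_split('/<td\b/i', $html);
    array_shift($pieces); // piece 0 is whatever precedes the first cell
    foreach ($pieces as $piece) {
        $pos = strpos($piece, '>'); // end of the <td ...> opening tag
        if ($pos === false) {
            continue; // malformed piece, skip it
        }
        $text = substr($piece, $pos + 1);
        // Cut at the next tag, if any (for example </td> or </tr>).
        $next = strpos($text, '<');
        if ($next !== false) {
            $text = substr($text, 0, $next);
        }
        $cells[] = trim($text);
    }
    return $cells;
}

// Hypothetical one-line input, shortened from the row posted above.
$html = '<tr onclick=\'getInfo("company_name")\'><td width=45>company_name '
      . '<td>company description <td width=56>857,243<td width=47>5,359'
      . '<td width=40>3.91</tr>';

$cells = extractCells($html);
print_r($cells);
```

To run this against the saved file, the string could come from file_get_contents('p.txt') instead of the inline sample. The sketch assumes no attribute value inside a <td> tag contains a '>' character.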