Web Hosting Talk







Problem with HTML parsing


goolex
10-29-2005, 04:15 PM
I have a problem parsing this information out of some HTML code:





<tr onclick='getInfo("company_name")' style="cursor:hand" onmouseover="this.style.backgroundColor='#eeeeee'" onmouseout="this.style.backgroundColor=''">
<td width=45>company_name
<td>company description
<td width=56>857,243
<td width=47>5,359
<td width=40>253
<td width=40>6,281
<td width=40>5,933
<td width=40>6,250
<td width=40>5,940
<td width=40>5,933
<td width=40>5,710
<td width=40>223
<td width=40>3.91
<td width=40>73
<td width=40>7.4
</tr>

Dan L
10-29-2005, 04:56 PM
You forgot the </td> :)

goolex
10-29-2005, 11:51 PM
Yes!
But I don't know why they didn't use it! :)

goolex
10-30-2005, 03:37 AM
Okay, I found a few things, but I have a problem with part of the preg_match pattern:

preg_match_all('/>([0-9]{1,3}[0-9]{1,3})/', $data[1], $matches);

[0] => Array
(
[0] => >151
[1] => >43
[2] => >24
[3] => >24
[4] => >24
[5] => >24
[6] => >24
[7] => >23
[8] => >19
)




preg_match_all('/>([0-9]{1,3}\,[0-9]{1,3})/', $data[1], $matches);

[0] => Array
(
[0] => >151,052
[1] => >3,712
[2] => >24,580
[3] => >24,579
[4] => >24,580
[5] => >24,580
[6] => >24,579
[7] => >23,410
[8] => >1,169
)



preg_match_all('/>([0-9]{1,3}\.[0-9]{1,3})/', $data[1], $matches);

[0] => Array
(
[0] => >4.99
[1] => >7.9
)



But I need this output:

151,052
3,712
43
24,580
24,579
24,580
24,580
24,579
23,410
1,169
4.99
19
7.9



I need something like this:
preg_match_all('/>([0-9]{1,3}\MMMM[0-9]{1,3})/', $data[1], $matches);

MMMM = any character
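
One possible pattern for the request above, as a minimal sketch rather than a definitive answer: instead of a literal "any character" placeholder, it makes the separator part optional, so plain numbers (43), comma-grouped numbers (151,052) and decimals (4.99) are all caught in a single pass and in document order. The function name extractNumbers and the $sample string are hypothetical, and the sketch assumes every wanted number directly follows a '>'.

```php
<?php
// Hypothetical helper (not from the thread): collect every number that
// directly follows a '>' in $html, in the order it appears.
function extractNumbers($html) {
    // \d+          leading digits, e.g. 857 or 253
    // (?:,\d{3})*  optional thousands groups, e.g. ,243
    // (?:\.\d+)?   optional decimal part, e.g. .91
    preg_match_all('/>\s*(\d+(?:,\d{3})*(?:\.\d+)?)/', $html, $matches);
    return $matches[1]; // group 1 holds the number without the '>'
}

// Sample cells taken from the row posted at the top of the thread.
$sample = '<td width=56>857,243<td width=47>5,359<td width=40>253<td width=40>3.91';
print_r(extractNumbers($sample));
```

Reading $matches[1] instead of $matches[0] is what drops the leading '>' seen in the earlier outputs.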

goolex
10-30-2005, 11:40 AM
Nothing?
Is it impossible?

goolex
10-31-2005, 12:44 PM
Still waiting...

Christopher Lee
10-31-2005, 01:04 PM
This assumes that ALL you have is the snippet you provided. I saved it into a file named p.txt.


<?php
// stristr() needs PHP >= 3.0.6; foreach needs PHP 4 or later.

// Reset variables.
$lines = array();
$line = '';
$myoutput = array();

// Return the text that follows the first '>' in $string,
// or an empty string if the line contains no '>'.
function stripTDs($string){
    // Find the FIRST occurrence of '>', then take the rest of the string.
    $string = stristr($string, '>');
    if ($string === false) {
        return '';
    }
    // Drop the angle bracket itself and the trailing line break.
    return trim(substr($string, 1));
}

// Get the contents of the file [or snippet] into an array of lines.
$lines = file('p.txt');

// Run through each line and keep only the ones that start with <td.
foreach($lines as $line){
    if('<td' == strtolower(substr($line, 0, 3))){
        $myoutput[] = stripTDs($line);
    }
}

echo '<?xml version="1.0" encoding="ISO-8859-1"?>';
?>

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en-US" lang="en-US">

<head>

<title> Parse Results </title>

<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" />
<meta name="title" content="Parse Results" />
<meta name="author" content="Christopher Lee" />
<meta name="language" content="en" />

</head>

<body>
<h1>Parse Results</h1>
<p><?php
foreach($myoutput as $p){
echo $p . '<br />';
}

?></p>


</body>

</html>

goolex
10-31-2005, 02:48 PM
Thanks, but it shows empty output! :(

It just shows this line:
Parse Results

Christopher Lee
10-31-2005, 07:23 PM
Did you save the p.txt file to the same directory as this file? Is it throwing an error?

goolex
10-31-2005, 07:35 PM
Yes!
I didn't get any error.
Can you say what text you put in p.txt?

The following?

<tr onclick='getInfo("company_name")' style="cursor:hand" onmouseover="this.style.backgroundColor='#eeeeee'" onmouseout="this.style.backgroundColor=''"><td width=45>company_name <td>company description <td width=56>857,243<td width=47>5,359<td width=40>253<td width=40>6,281<td width=40>5,933<td width=40>6,250<td width=40>5,940<td width=40>5,933<td width=40>5,710<td width=40>223<td width=40>3.91<td width=40>73<td width=40>7.4</tr>

Christopher Lee
11-01-2005, 01:18 AM
I believe I have discovered the problem.

My code assumed that the input text looked exactly like the input in your original post. I assume this latest post is a more exact example than your first. If so, then the file() function won't work well, because the input is one big line. Since file() returns an array with one element per line, you would have gotten just one element in the array, and because that element starts with '<tr' rather than '<td', nothing was added to $myoutput. Is this the best representation of the structure of the document that you are attempting to parse?
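
If the one-line form is the real input, one way around the file() limitation is to split on the <td> tags themselves rather than on line breaks. The following is only a sketch under that assumption: the function name extractCells and the $row string are hypothetical, and it relies on cell text never containing a '<'.

```php
<?php
// Hypothetical helper (not from the thread): return the text of every
// <td> cell in $html, whether or not the cells sit on separate lines.
function extractCells($html) {
    // Split at each opening <td ...> tag; the tag itself is discarded.
    $parts = preg_split('/<td\b[^>]*>/i', $html);
    array_shift($parts); // drop whatever came before the first <td>

    $cells = array();
    foreach ($parts as $part) {
        // Keep only the text up to the next tag (e.g. </td> or </tr>).
        $text = preg_replace('/<.*$/s', '', $part);
        $cells[] = trim($text);
    }
    return $cells;
}

// Shortened version of the one-line row from the post above.
$row = '<tr style="cursor:hand"><td width=45>company_name <td>company description <td width=56>857,243<td width=40>3.91</tr>';
print_r(extractCells($row));
```

Because the split does not depend on line breaks, the same function should handle both the multi-line snippet from the first post and the single-line version, and the missing </td> tags do not matter.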