Web Hosting Talk







Exporting HTML to CSV help needed.


KenCoble
02-28-2010, 01:57 PM
OK, so I'm scraping content for my site (it's from NHL 10 for Xbox, nothing devious). I want to have our stats scraped and then exported to CSV twice a day. If I can figure out how to do the export, I'm sure a cron job could be set up to run it twice a day.

What I have thus far is this.

<html>
<head>
<title>PHP Scrape</title>
</head>
<body>
<?php

// Read the HTML page to be processed into the $data variable
$data = file_get_contents('http://www.easportsworld.com/en_US/clubs/partial/762A0001/129082/members-list');
if ($data === false) {
    $data = ''; // fetch failed; the regex below will then report "No Matches"
}

// Commented regex to extract contents from <div class="scrolling">contents</div>
// where "contents" may contain nested <div>s.
// The regex uses PCRE's recursive (?1) subpattern syntax to recurse into group 1.
$pattern_long = '{ # recursive regex to capture contents of "scrolling" DIV
    <div\s+class="scrolling"\s*>            # match the "scrolling" class DIV opening tag
    (                                       # capture "scrolling" DIV contents into $1
      (?:                                   # non-capturing group for the * quantifier
        (?: (?!<div[^>]*>|</div>). )++      # possessively match all non-DIV tag chars
      |                                     # or
        <div[^>]*>(?1)</div>                # recursively match nested <div>xyz</div>
      )*                                    # loop as deep as necessary
    )                                       # end group 1 capture
    </div>                                  # match the "scrolling" class DIV closing tag
}six'; // single-line (dot matches all), ignore case and free spacing modes ON

// Short version of the same regex
$pattern_short = '{<div\s+class="scrolling"\s*>((?:(?:(?!<div[^>]*>|</div>).)++|<div[^>]*>(?1)</div>)*)</div>}si';

$matchcount = preg_match_all($pattern_long, $data, $matches);
// $matchcount = preg_match_all($pattern_short, $data, $matches);
echo("<pre>\n");
if ($matchcount > 0) {
    echo("$matchcount matches found.\n");
    // print_r($matches);
    for ($i = 0; $i < $matchcount; $i++) {
        echo("\nMatch #" . ($i + 1) . ":\n");
        echo($matches[1][$i]); // print 1st capture group for match number i
    }
} else {
    echo('No Matches');
}
echo("\n</pre>");
?>
</body>
</html>

The output can be seen on my personal server at http://74.117.63.249/test.php

If anyone could help me create a script that takes this output and exports it to a CSV, I would greatly appreciate it. I want to keep a historical record of our stats, so the CSV should be appended to each time the cron job runs.
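For the "append on every run" part, here is a minimal sketch in Python (the language the thread ends up using). It assumes the stats have already been extracted into rows of cell values. The function name `append_rows`, the `scraped_at` column, and the column names in the usage note are made up for illustration and are not from the original script.

```python
import csv
import os
from datetime import datetime, timezone


def append_rows(csv_path, header, rows, now=None):
    """Append rows to csv_path; write the header only if the file is new or empty."""
    # Stamp every row with the time of this run so successive runs can be told apart.
    stamp = (now or datetime.now(timezone.utc)).strftime("%Y-%m-%d %H:%M:%S")
    is_new = not os.path.exists(csv_path) or os.path.getsize(csv_path) == 0
    written = 0
    # Mode "a" appends; newline="" lets the csv module control line endings.
    with open(csv_path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        if is_new:
            writer.writerow(["scraped_at"] + list(header))
        for row in rows:
            writer.writerow([stamp] + list(row))
            written += 1
    return written
```

Something like `append_rows("stats_history.csv", ["player", "goals"], scraped_rows)` called at the end of the scraping script would then grow the file by one batch of rows per cron run.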

Host Ahead
02-28-2010, 05:35 PM
You will have to parse the HTML further. Walk down the HTML tree to extract the individual values, then concatenate them into comma-separated strings. To do this you will have to examine the HTML more closely and work out the patterns in the layout and any exceptions to them. You could try using an XML parser (if the page is valid XHTML) to parse the HTML.

Once you have that, you can append these strings to a CSV file.
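As a rough illustration of that approach, here is a sketch in Python using only the standard library. It assumes the stats sit in ordinary table markup (`<tr>` rows containing `<td>` or `<th>` cells). That is a guess about the page, not something confirmed in this thread, and the names `TableRowParser`, `html_to_rows`, and `rows_to_csv` are invented for the example.

```python
import csv
import io
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join the fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None
            self._cell = None


def html_to_rows(html):
    """Return the table rows found in html as lists of cell strings."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows


def rows_to_csv(rows):
    """Render rows as CSV text; the csv module quotes cells that contain commas."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()
```

Letting the `csv` module build each line, rather than joining cells with commas by hand, keeps values that themselves contain commas or quotes from breaking the file.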

KenCoble
03-01-2010, 01:17 PM
Don't suppose anyone could look at the source and help me along? I'm horrible at this, and even with the regex snippet it took me a while to scrape and parse what I have.

KenCoble
03-02-2010, 05:41 AM
OK, this is sorted using Python. Mods, you can close this out if you'd like.

Capricorn
03-02-2010, 11:15 AM
Host Ahead wrote: "You will have to parse the HTML further. Walk down the HTML tree to extract the individual values, then concatenate them into comma-separated strings. To do this you will have to examine the HTML more closely and work out the patterns in the layout and any exceptions to them. You could try using an XML parser (if the page is valid XHTML) to parse the HTML. Once you have that, you can append these strings to a CSV file."

I second this. It's exactly how I would approach it, though I wouldn't mess around with XHTML unless I had to. You might also want to add error checks so that you are alerted if the page format changes. I got burned when Yahoo's stock options pages changed their format last month after staying the same for years and years.
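One way such a check could look, sketched in Python. It assumes the scraper produces a header row followed by data rows. `FormatChangedError`, `validate_rows`, and the expected header passed in are hypothetical names for illustration, not part of anyone's actual script.

```python
class FormatChangedError(Exception):
    """Raised when the scraped rows no longer match the expected layout."""


def validate_rows(rows, expected_header, min_rows=1):
    """Return the data rows if the layout looks right, otherwise raise."""
    if not rows:
        raise FormatChangedError("no rows were scraped; the page layout may have changed")
    header, body = rows[0], rows[1:]
    if list(header) != list(expected_header):
        raise FormatChangedError(f"unexpected header: {header!r}")
    if len(body) < min_rows:
        raise FormatChangedError(f"only {len(body)} data rows, expected at least {min_rows}")
    for number, row in enumerate(body, start=1):
        if len(row) != len(expected_header):
            raise FormatChangedError(
                f"row {number} has {len(row)} cells, expected {len(expected_header)}"
            )
    return body
```

If the check runs before anything is written, a changed page stops the run instead of quietly appending junk to the history file. Under cron, the uncaught exception's traceback goes to the job's output, which cron normally mails to the job's owner when mail delivery is configured.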