Web Hosting Talk







View Full Version : Need some suggestions on a Mysql/data issue.


AdamD
07-08-2009, 12:55 PM
I run a fansite for an online game (Ultima online) and a fellow fansite (forums4games) lost all their data a year or so ago

One of their members scraped all the lost posts/stories via google's cache system

All the data he collected, he put into a mysql database, the data itself comes to a whopping 2.3gb.

I'm in the process of removing the excess colums, the only one I need is the raw data from the scraping itself, but unfortunatly it's the raw html page, so all the html (including css style info etc) is all in there

Can anyone recommend a good way to extract the important bits from the 58,000 or so fields? heh

Basically each field has a copy of a thread or post, which I want to save, but I obviously don't want the CSS info, or html, I just want the raw (written) stories/text.

mwatkins
07-08-2009, 01:33 PM
Depends on how complicated the underlying html is; did he scrape the entire page for each post? i.e. each post entry (row, not "field") in the MySQL table has <html><head>....</head><body>...</body></html> tags, and obviously much more?

If so, perhaps there is a div or other HTML attribute or section which can be used by a HTML parser, or perhaps even just a simple regular expression, to first trim the content in the rows down to only the post.

Following that you could then perhaps strip all the tags using a regular expression, or, if you want to try to preserve some formatting (like bold, italics and such) you could run the html through a html2text converter - you'll find these available in most languages. Or html2bbcode / vbcode:

http://www.seabreezecomputers.com/html2bbcode/


If you feel comfortable doing so, attach as a file to a post here a complete example of the scraped data for one post so we can look at the structure.

AdamD
07-08-2009, 01:41 PM
Sure thing, I've attached an example

There's three fields in the table itself, ID, Title and DATA.

The data seems to comprise of raw html, which includes the post data itself.

mwatkins
07-08-2009, 07:29 PM
Adam,

After looking at the data you provided I see it won't be as straightforward as I'd hoped; if each post was in a separate page it'd be somewhat easier to write a simplistic tool to extract the relevant data.

On the other hand, provided the example page is quite representative of the data your forum-mate has collected, this isn't that hard a task.

I got carried away here and blurted out a chunk of code which with a little further self-assembly would do most or all of what you need it to do.

I utilized the lxml library for Python - you'll need to install that if you want to use this. It is a terrific tool for Python users that makes data scraping from even not-great HTML and XML something of a breeze. I've used simple lxml xpath queries to quickly get at the relevant data.

http://codespeak.net/lxml/ - most linux and certainly FreeBSD have this in their package systems.

Honestly the whole thing came together in about 30 minutes. I had started to write out a text explanation of the sorts of things you'd have to do, but found that more long winded and tiresome than putting together some example code, which grew into working code with not much more effort.

I haven't had a good look at the code for refactoring / cleanup - but it works and I really should be doing something else so hopefully you can carry this on forward.

"""
(This code is Python, not PHP)
This is just a head start on a solution to extract posts from data
collected via Google cache.
"""

import re
from lxml import etree
from time import strptime
from datetime import datetime

html_re = re.compile(r'<!DOCTYPE.*?>', re.I | re.M)
tag_re = re.compile(r'<.*?>')

class Author:

def __init__(self, author_id, author_name):
self.author_id = author_id
self.author_name = author_name

def __str__(self):
return '%s[%s]' % (self.author_name, self.author_id)

class Post:

def __init__(self, author, post_id, post_date, post):
self.author = author
self.post_id = post_id
self.post_date = post_date
self.post = post

def __repr__(self):
return 'Post(%s on %s by %s)' % (self.post_id, self.post_date, str(self.author))

authors = {}
posts = []


def html2text(html):
return tag_re.sub('', html)

def add_author(author_id, author_name):
"""
Returns an author instance; checks first to see if already in the
dictionary (an associative array in php speak)
"""
global authors
assert None not in (author_id, author_name)
if author_id not in authors:
authors[author_id] = Author(author_id, author_name)
return authors[author_id]

def get_html(data):
"""
Strip the google cache pre-pendage; convert crnl to nl
"""
data = html_re.split(data)
assert len(data) == 2
return data[1].replace('\r\n', '\n')

def get_post_tables(html):
"""
Uses lxml xpath select to retrieve a list of tables containing post data
"""
doc = etree.HTML(html)
return [table for table in doc.xpath('//table')
if 'post' in table.get('id', '')]

def get_post_data(table):
"""(table) -> (author, post)

Extracts author and post data from td classes alt2 and alt1 respectively
"""
def get_author(td):
author_id = author_name = None
for a in td.xpath('div/a'):
if a.get('class', None) == 'bigusername':
author_id = a.get('href').split('member.php?u=')[1]
author_name = a.text
return add_author(author_id, author_name)

def parse_date(td):
"""
Example date: 15th-October-2004, 17:46
"""
text = ''.join(td.xpath( 'descendant-or-self::text()'))
date, time = text.lower().split(',')
time = time.strip()
day, month, year = date.split('-')
# strip non numeric characters i.e. th, rd off date
day = ''.join([c for c in day if c.isdigit()])
# reassemble it
sdatetime = '%02d %s %s %s' % (int(day), month, year, time)
timestruct = strptime(sdatetime, '%d %B %Y %H:%M')
return datetime(*list(timestruct)[:6])

def parse_post(td):
"""
This works on the assumption that the first two divs in any post contain
the title (if any) and the post, respectively.
"""
post_marker = 'post_message_'
post_id = None
post_title = None
post = None
title_div, post_div = td.xpath('div')[:2]
title_els = title_div.xpath('strong')
if title_els:
post_title = title_els[0].text

if post_marker in post_div.get('id', ''):
post_id = post_div.get('id').split(post_marker)[1]
post = html2text(post_div.text)
return post_id, post_title, post

author = None
post_id = None
post_date = None
post = None
for td in table.findall('tr/td'):
if post_date is None and td.get('class', None) == 'thead':
try:
post_date = parse_date(td)
except ValueError:
pass
if post_date:
continue
elif td.get('class', None) == 'alt2' and td.get('width', None):
author = get_author(td)
continue
elif td.get('class', 'None') == 'alt1' and 'post' in td.get('id', ''):
post_id, post_title, post = parse_post(td)

# comment this assert out if need be with real production data
# or revise the script to deal with it...
assert None not in (author, post_id, post_date, post)
return Post(author, post_id, post_date, post)

def get_posts_from_page(page):
global posts
for table in get_post_tables(get_html(page)):
posts.append(get_post_data(table))

if __name__ == '__main__':
sample_data = open("raw_post_data.txt", 'rb').read()
sample_data = sample_data.decode('iso-8859-1')

get_posts_from_page(sample_data)
for p in posts:
print p.post_id
print p.post_date
print str(p.author)
print p.post
print "-" * 30


If you save this, and name your sample data page "raw_post_data.txt" and run the code you'll get output such as:

547442
2004-10-15 14:51:00
Dewi Sri[558]
Are many people here using ebay?
------------------------------
547451
2004-10-15 15:32:00
BlackCat[186]
Do you use PayPal on it then?
------------------------------
547459
2004-10-15 16:02:00
Dewi Sri[558]
No I didnt use paypal so far, only transfer from my bank account. I heard some stories about paypal too, I use it only with people I already know.
------------------------------
547508
2004-10-15 17:19:00
Add[222]
Yep, use it quite a bit. Always use paypal when they accept and never had a problem, though i dont buy off people with no feedback. Got a north face goretex jacket with zip in fleece for ã100 brand new, normally ã300 in the outdoor shops, got a 512mb flash drive for ã70 a year and a half ago (when they were more expensive) 512mb sony memory card for my camera for cheap. Almost everything i buy off ebay is actually brand new, at really low prices.
------------------------------
547515
2004-10-15 17:25:00
Ebony[398]
I m too paranoid to use ebay
------------------------------
547523
2004-10-15 17:32:00
Ama[435]
I buy off ebay very rarely... i have 2 feedbacks, but i would like to use it more.
------------------------------
547528
2004-10-15 17:46:00
Dewi Sri[558]
Ebony, why are you paranoid? I think its fairly safe as you can tell by the feedback whether someone is trustworthy or not. So far I didnt have any bad experiences at all.
------------------------------
547547
2004-10-15 18:12:00
dante[463]
something representing paypal took ~ã500 unauthorised from my acc in small goes after using them first and only time. am still waiting for bank fraud team to help me settle it :/
------------------------------

A little tweaking might be needed. Adding code to store the results back into MySQL would be trivial from here.

What you'll need to do is:

- take the collection of "authors" and re-populate a forum user table; you can even save the same user ID's and names, notwithstanding any issues vBulletin may have

- loop through the collection of "pages" stored in your mysql table and apply the get_posts_from_page() function to each page.

- decide if you are ok with plain text or want to try to recover the vbCode with a html2vbcode function of some sort. Sorry, I'm not writing that one.

You don't have to become a Python guru to make use of this; even if you use this tool as is or adapted to extract all the text, you can then use any other tool you are more comfortable with to get the data back into MySQL.

AdamD
07-08-2009, 07:34 PM
Hmm wow, thanks very much!
I'll give it a try later on :)

mwatkins
07-08-2009, 07:42 PM
Let me know if you get stuck anywhere. Other than requiring lxml, there are no other dependencies. This will run on any 2.x version of Python (2.6 is current, 2.4 and 2.5 ship with most Linux; FreeBSD is at 2.6).

If you PM me a couple more sample cache pages I might find time between now and tomorrow to cook up an example database.

Again - if the data you have on hand is similar to this example, you should be able to recover much or all of the posts and user posting history.

Make better backups for next time though... that is much easier! ;)

mwatkins
07-18-2009, 12:52 AM
Adam - I sent you an updated chunk of code that recovers a substantial chunk of the vBulletin posts from google's cache. Page through the sql file I've provide you with "more" (or "less") to get a sense for what was recovered. The code isn't much different than that posted above except had to deal with the realities of the cached data - not all was useful. I added in a html to bbcode translator and a few other goodies as well.


Hope it helps.

AdamD
07-19-2009, 03:45 PM
Thankyou MWatkins, you've been a tremendous help, I can't thank you enough, really

Thanks to MWatkins, I was able to import the data into a VB forum, abeit a few issues, the data for the most part was at least readable

I then imported it all into a MYBB install and it seemed to fix quite a few issues, ironically.

The end result is people can now reference existing data without having to rely on google's cache.

mwatkins
07-19-2009, 05:52 PM
My pleasure. If you need a helping hand cleaning up any of the other remaining issues feel free to ask.