  1. #1
    Join Date: Nov 2003 | Posts: 536

    Need some suggestions on a MySQL/data issue.

    I run a fansite for an online game (Ultima Online), and a fellow fansite (forums4games) lost all their data a year or so ago.

    One of their members scraped all the lost posts/stories via Google's cache system.

    All the data he collected he put into a MySQL database; the data itself comes to a whopping 2.3 GB.

    I'm in the process of removing the excess columns; the only one I need is the raw data from the scraping itself. Unfortunately that's the raw HTML page, so all the HTML (including CSS style info etc.) is in there.

    Can anyone recommend a good way to extract the important bits from the 58,000 or so fields? heh

    Basically each field has a copy of a thread or post, which I want to save, but I obviously don't want the CSS info or HTML; I just want the raw (written) stories/text.

  2. #2
    Join Date: Nov 2001 | Location: Vancouver | Posts: 2,416
    Depends on how complicated the underlying HTML is; did he scrape the entire page for each post? I.e. does each post entry (row, not "field") in the MySQL table have <html><head>...</head><body>...</body></html> tags, and obviously much more?

    If so, perhaps there is a div or other HTML attribute or section which can be used by an HTML parser, or perhaps even just a simple regular expression, to first trim the content in the rows down to only the post.

    Following that you could then strip all the tags using a regular expression, or, if you want to try to preserve some formatting (like bold, italics and such), you could run the HTML through an html2text converter - you'll find these available in most languages. Or html2bbcode / vbCode:

    http://www.seabreezecomputers.com/html2bbcode/
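
    For the crude strip-everything case, a regex pass is only a few lines. A minimal sketch in Python (illustrative only; the sample markup here is made up, not taken from your data):

    Code:
    import re

    # Drop anything that looks like an HTML tag; all formatting is lost.
    tag_re = re.compile(r'<.*?>', re.S)

    def strip_tags(html):
        return tag_re.sub('', html)

    print strip_tags('<div class="post"><b>Hello</b> world</div>')
    # -> Hello world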


    If you feel comfortable doing so, attach a complete example of the scraped data for one post as a file to a post here, so we can look at the structure.
    Last edited by mwatkins; 07-08-2009 at 01:38 PM.
    “Even those who arrange and design shrubberies are under
    considerable economic stress at this period in history.”

  3. #3
    Join Date: Nov 2003 | Posts: 536
    Sure thing, I've attached an example.

    There are three fields in the table itself: ID, Title and DATA.

    The data seems to comprise raw HTML, which includes the post data itself.
    Attached Files

  4. #4
    Join Date: Nov 2001 | Location: Vancouver | Posts: 2,416
    Adam,

    After looking at the data you provided, I see it won't be as straightforward as I'd hoped; if each post were on a separate page, it'd be somewhat easier to write a simplistic tool to extract the relevant data.

    On the other hand, provided the example page is quite representative of the data your forum-mate has collected, this isn't that hard a task.

    I got carried away here and blurted out a chunk of code which, with a little further self-assembly, would do most or all of what you need it to do.

    I utilized the lxml library for Python - you'll need to install that if you want to use this. It is a terrific tool for Python users that makes data scraping from even not-great HTML and XML something of a breeze. I've used simple lxml xpath queries to quickly get at the relevant data.

    http://codespeak.net/lxml/ - most Linux distributions, and certainly FreeBSD, have this in their package systems.
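
    For anyone who hasn't used lxml, the flavor of an xpath query is simple. A minimal illustration, separate from the script below and using made-up markup:

    Code:
    from lxml import etree

    # etree.HTML() tolerates imperfect markup; xpath selects by tag/attribute.
    doc = etree.HTML('<table id="post123"><tr><td class="alt1">hi</td></tr></table>')
    for td in doc.xpath('//td[@class="alt1"]'):
        print td.text  # -> hi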

    Honestly, the whole thing came together in about 30 minutes. I had started to write out a text explanation of the sorts of things you'd have to do, but found that more long-winded and tiresome than putting together some example code, which grew into working code with not much more effort.

    I haven't had a good look at the code for refactoring / cleanup - but it works, and I really should be doing something else, so hopefully you can carry this forward.

    PHP Code:
    """
    (This code is Python, not PHP)
    This is just a head start on a solution to extract posts from data
    collected via Google cache.
    """

    import re
    from lxml import etree
    from time import strptime
    from datetime import datetime

    html_re = re.compile(r'<!DOCTYPE.*?>', re.S | re.M)
    tag_re = re.compile(r'<.*?>')

    class Author:

        def __init__(self, author_id, author_name):
            self.author_id = author_id
            self.author_name = author_name

        def __str__(self):
            return '%s[%s]' % (self.author_name, self.author_id)

    class Post:

        def __init__(self, author, post_id, post_date, post):
            self.author = author
            self.post_id = post_id
            self.post_date = post_date
            self.post = post

        def __repr__(self):
            return 'Post(%s on %s by %s)' % (self.post_id, self.post_date,
                                             str(self.author))

    authors = {}
    posts = []

    def html2text(html):
        return tag_re.sub('', html)

    def add_author(author_id, author_name):
        """
        Returns an author instance; checks first to see if already in the
        dictionary (an associative array in php speak)
        """
        global authors
        assert None not in (author_id, author_name)
        if author_id not in authors:
            authors[author_id] = Author(author_id, author_name)
        return authors[author_id]

    def get_html(data):
        """
        Strip the google cache pre-pendage; convert crnl to nl
        """
        data = html_re.split(data)
        assert len(data) == 2
        return data[1].replace('\r\n', '\n')

    def get_post_tables(html):
        """
        Uses lxml xpath select to retrieve a list of tables containing post data
        """
        doc = etree.HTML(html)
        return [table for table in doc.xpath('//table')
                if 'post' in table.get('id', '')]

    def get_post_data(table):
        """(table) -> (author, post)

        Extracts author and post data from td classes alt2 and alt1 respectively
        """
        def get_author(td):
            author_id = author_name = None
            for a in td.xpath('div/a'):
                if a.get('class', None) == 'bigusername':
                    author_id = a.get('href').split('member.php?u=')[1]
                    author_name = a.text
            return add_author(author_id, author_name)

        def parse_date(td):
            """
            Example date: 15th-October-2004, 17:46
            """
            text = ''.join(td.xpath('descendant-or-self::text()'))
            date, time = text.lower().split(',')
            time = time.strip()
            day, month, year = date.split('-')
            # strip non numeric characters i.e. th, rd off date
            day = ''.join([c for c in day if c.isdigit()])
            # reassemble it
            sdatetime = '%02d %s %s %s' % (int(day), month, year, time)
            timestruct = strptime(sdatetime, '%d %B %Y %H:%M')
            return datetime(*list(timestruct)[:6])

        def parse_post(td):
            """
            This works on the assumption that the first two divs in any post
            contain the title (if any) and the post, respectively.
            """
            post_marker = 'post_message_'
            post_id = None
            post_title = None
            post = None
            title_div, post_div = td.xpath('div')[:2]
            title_els = title_div.xpath('strong')
            if title_els:
                post_title = title_els[0].text

            if post_marker in post_div.get('id', ''):
                post_id = post_div.get('id').split(post_marker)[1]
                post = html2text(post_div.text)
            return post_id, post_title, post

        author = None
        post_id = None
        post_date = None
        post = None
        for td in table.findall('tr/td'):
            if post_date is None and td.get('class', None) == 'thead':
                try:
                    post_date = parse_date(td)
                except ValueError:
                    pass
                if post_date:
                    continue
            elif td.get('class', None) == 'alt2' and td.get('width', None):
                author = get_author(td)
                continue
            elif td.get('class', None) == 'alt1' and 'post' in td.get('id', ''):
                post_id, post_title, post = parse_post(td)

        # comment this assert out if need be with real production data
        # or revise the script to deal with it...
        assert None not in (author, post_id, post_date, post)
        return Post(author, post_id, post_date, post)

    def get_posts_from_page(page):
        global posts
        for table in get_post_tables(get_html(page)):
            posts.append(get_post_data(table))

    if __name__ == '__main__':
        sample_data = open("raw_post_data.txt", 'rb').read()
        sample_data = sample_data.decode('iso-8859-1')

        get_posts_from_page(sample_data)
        for p in posts:
            print p.post_id
            print p.post_date
            print str(p.author)
            print p.post
            print "-" * 30
    If you save this, name your sample data page "raw_post_data.txt", and run the code, you'll get output such as:

    547442
    2004-10-15 14:51:00
    Dewi Sri[558]
    Are many people here using ebay?
    ------------------------------
    547451
    2004-10-15 15:32:00
    BlackCat[186]
    Do you use PayPal on it then?
    ------------------------------
    547459
    2004-10-15 16:02:00
    Dewi Sri[558]
    No I didnt use paypal so far, only transfer from my bank account. I heard some stories about paypal too, I use it only with people I already know.
    ------------------------------
    547508
    2004-10-15 17:19:00
    Add[222]
    Yep, use it quite a bit. Always use paypal when they accept and never had a problem, though i dont buy off people with no feedback. Got a north face goretex jacket with zip in fleece for £100 brand new, normally £300 in the outdoor shops, got a 512mb flash drive for £70 a year and a half ago (when they were more expensive) 512mb sony memory card for my camera for cheap. Almost everything i buy off ebay is actually brand new, at really low prices.
    ------------------------------
    547515
    2004-10-15 17:25:00
    Ebony[398]
    I m too paranoid to use ebay
    ------------------------------
    547523
    2004-10-15 17:32:00
    Ama[435]
    I buy off ebay very rarely... i have 2 feedbacks, but i would like to use it more.
    ------------------------------
    547528
    2004-10-15 17:46:00
    Dewi Sri[558]
    Ebony, why are you paranoid? I think its fairly safe as you can tell by the feedback whether someone is trustworthy or not. So far I didnt have any bad experiences at all.
    ------------------------------
    547547
    2004-10-15 18:12:00
    dante[463]
    something representing paypal took ~£500 unauthorised from my acc in small goes after using them first and only time. am still waiting for bank fraud team to help me settle it :/
    ------------------------------
    A little tweaking might be needed. Adding code to store the results back into MySQL would be trivial from here.

    What you'll need to do is:

    - take the collection of "authors" and re-populate a forum user table; you can even save the same user IDs and names, notwithstanding any issues vBulletin may have

    - loop through the collection of "pages" stored in your MySQL table and apply the get_posts_from_page() function to each page (see the sketch after this list)

    - decide if you are ok with plain text or want to try to recover the vbCode with an html2vbcode function of some sort. Sorry, I'm not writing that one.
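
    A minimal sketch of that looping step, assuming the MySQLdb driver and a hypothetical cached_pages table holding the ID and DATA columns described earlier; the connection details are placeholders:

    Code:
    import MySQLdb
    from MySQLdb.cursors import SSCursor

    def process_all_pages():
        conn = MySQLdb.connect(host='localhost', user='forum',
                               passwd='secret', db='recovered')
        # SSCursor streams rows rather than loading all 2.3 GB at once
        cursor = conn.cursor(SSCursor)
        cursor.execute("SELECT ID, DATA FROM cached_pages")
        for page_id, raw_page in cursor:
            try:
                get_posts_from_page(raw_page.decode('iso-8859-1'))
            except AssertionError:
                # a page the asserts above rejected; note it and move on
                print "skipped page %s" % page_id
        cursor.close()
        conn.close()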

    You don't have to become a Python guru to make use of this; even if you use this tool as is or adapted to extract all the text, you can then use any other tool you are more comfortable with to get the data back into MySQL.
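
    And getting the results back into MySQL is, as noted above, not much more work; a sketch in the same hypothetical vein (the recovered_posts table and its columns are invented for illustration):

    Code:
    def save_posts(conn):
        # conn is an open MySQLdb connection, as in the sketch above
        cursor = conn.cursor()
        cursor.executemany(
            "INSERT INTO recovered_posts (post_id, post_date, author_id, body) "
            "VALUES (%s, %s, %s, %s)",
            [(p.post_id, p.post_date, p.author.author_id, p.post) for p in posts])
        conn.commit()
        cursor.close()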
    Last edited by mwatkins; 07-08-2009 at 07:34 PM.

  5. #5
    Join Date: Nov 2003 | Posts: 536
    Hmm wow, thanks very much!
    I'll give it a try later on.

  6. #6
    Join Date: Nov 2001 | Location: Vancouver | Posts: 2,416
    Let me know if you get stuck anywhere. Other than lxml, there are no dependencies. This will run on any 2.x version of Python (2.6 is current; 2.4 and 2.5 ship with most Linux distributions; FreeBSD is at 2.6).

    If you PM me a couple more sample cache pages I might find time between now and tomorrow to cook up an example database.

    Again - if the data you have on hand is similar to this example, you should be able to recover much or all of the posts and user posting history.

    Make better backups for next time though... that is much easier!

  7. #7
    Join Date: Nov 2001 | Location: Vancouver | Posts: 2,416
    Adam - I sent you an updated chunk of code that recovers a substantial portion of the vBulletin posts from Google's cache. Page through the SQL file I've provided you with "more" (or "less") to get a sense of what was recovered. The code isn't much different from that posted above, except it had to deal with the realities of the cached data - not all of it was useful. I added in an HTML-to-BBCode translator and a few other goodies as well.


    Hope it helps.
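
    The translator itself wasn't posted in the thread; as a hypothetical sketch only (not the code referred to above), the general shape of a simple HTML-to-BBCode pass is a series of ordered regex substitutions:

    Code:
    import re

    # Convert the common formatting tags to BBCode, then drop whatever is left.
    subs = [
        (re.compile(r'(?is)<b>(.*?)</b>'), r'[b]\1[/b]'),
        (re.compile(r'(?is)<i>(.*?)</i>'), r'[i]\1[/i]'),
        (re.compile(r'(?is)<a\s+[^>]*href="([^"]*)"[^>]*>(.*?)</a>'),
         r'[url=\1]\2[/url]'),
        (re.compile(r'(?i)<br\s*/?>'), '\n'),
    ]

    def html2bbcode(html):
        for pattern, replacement in subs:
            html = pattern.sub(replacement, html)
        return re.sub(r'(?s)<.*?>', '', html)  # strip anything unhandled

    print html2bbcode('Go <b>here</b>: <a href="http://example.com">link</a>')
    # -> Go [b]here[/b]: [url=http://example.com]link[/url]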

  8. #8
    Join Date: Nov 2003 | Posts: 536
    Thank you MWatkins, you've been a tremendous help; I can't thank you enough, really.

    Thanks to MWatkins, I was able to import the data into a vBulletin forum, albeit with a few issues; the data for the most part was at least readable.

    I then imported it all into a MyBB install, and that seemed to fix quite a few issues, ironically.

    The end result is that people can now reference the existing data without having to rely on Google's cache.

  9. #9
    Join Date: Nov 2001 | Location: Vancouver | Posts: 2,416
    My pleasure. If you need a helping hand cleaning up any of the other remaining issues feel free to ask.

