  1. #1

    Bad tags with HTMLParser in Python?

    Hey guys,

    I'm creating a little web crawler that is given a page of links, goes through each of the links on the page and retrieves the title, description and keywords for each link. I'm using the BeautifulSoup library to do the parsing of HTML, which works great. I've run into an issue though.

    The HTMLParser used by BeautifulSoup dies on pages that build tags out of string pieces, like <sc'+'ript>. The actual error:

    Code:
    HTMLParser.HTMLParseError: bad end tag: u"</sc'+'ript>", at line 106, column 46
    Here's my code so far (I know it's not good Python; I'm just starting out on the script and am very new to Python). I have a bunch of try/except blocks in there as I'm trying to debug stuff.

    Code:
    '''
    Created on Oct 22, 2009
    
    @author: Bay
    '''
    from BeautifulSoup import BeautifulSoup
    import httplib
    import urllib2
    import urlparse
    import string
    import sys
    import re
    
    url = sys.argv[1]
    
    try:
        request = urllib2.Request(''.join(url))
        opener = urllib2.build_opener()
        #page = urllib2.urlopen(''.join(url))
        request.add_header('User-Agent','revoew-crawler')
        page = urllib2.urlopen(request)
        cont = 1
    except urllib2.URLError, e:
        cont = 0
        
    if cont:
        soup = BeautifulSoup(''.join(page))
        for item in soup.fetch('a'):
            urlToCrawl = 'http://www.' + item['href']
            try:
                newRequest = urllib2.Request(urlToCrawl)
                newOpener = urllib2.build_opener()
                #page = urllib2.urlopen(''.join(url))
                newRequest.add_header('User-Agent','revoew-crawler')
                newPage = urllib2.urlopen(newRequest)
                newCont = 1
            except urllib2.URLError, e:
                newCont = 0
            
            if newCont:
                print ''
                print 'In: ' + urlToCrawl
                try:
                    newSoup = BeautifulSoup(''.join(newPage))
                    souped = 1
                except:
                    souped = 0
                    print '*** failed to process *** '
                if souped:
                    for title in newSoup.fetch('title'):
                        p = re.compile(r'<.*?>')
                        title = p.sub('',str(title))
                        print '  Title: ' + str(title)
                        for meta in newSoup.fetch('meta'):
                            try:
                                name = meta['name']
                                if name == 'description' or name == 'keywords':
                                    print '  ' + str(name) + ': ' + str(meta['content'])
                            except:
                                pass
    Any ideas how to get around those bad tags so I can parse those pages? In a test of 10 websites, I can go through 7 successfully but 3 of them die.

  2. #2
    Wait for our Python guru, mwatkins..
    He will probably come up with a solution for ya

  3. #3
    I eagerly await his response

  4. #4
    My best suggestion: drop BS[1] and go with lxml.

    http://codespeak.net/lxml/

    Here's an example - maybe see if BS can handle the following broken HTML -- if it fails on that too, then head on over and install lxml since it can manage it. In my experience lxml has handled pretty much everything I've thrown at it. Doesn't have to be full html - can be fragments too. Oh... and it's fast, too.

    (Don't feel you have to be an xpath expert to make it do stuff; there are a number of ways of traversing the parsed (x|html).)
    PHP Code:
    >>> from lxml.html import fromstring
    >>> doc = fromstring("""
        <html><title>Broken HTML</title>
        <body><p>something valid</p>
        <sc'+'ript>foo();</sc'+'ript>
        <p><a href="#">some other</a> valid content</p>
        </body></html>""")
    >>> doc
    <Element html at 289b7f5c>
    >>> doc.xpath('.//text()')
    ['Broken HTML', 'something valid', 'foo();', 'some other', ' valid content']
    >>> doc.xpath('.//title/text()')
    ['Broken HTML']
    >>> doc.xpath('.//p/text()')
    ['something valid', ' valid content']
    [1] http://www.crummy.com/software/Beaut...-problems.html
    The BS author isn't keen on moving that project forward. Use lxml.html.
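
    On the point above about not needing to be an XPath expert: lxml elements also support ElementTree-style traversal. A minimal sketch, not from the original post, run against the same `doc` parsed above:
    Code:
    # Hedged sketch of non-XPath traversal over the `doc` parsed above.
    for p in doc.findall('.//p'):      # ElementTree-style findall
        print p.text_content()         # text of the element and its descendants

    for a in doc.iter('a'):            # walk every <a> in document order
        print a.get('href')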

  5. #5
    For example, here's something that does most of what your example code snippet does:

    PHP Code:
    # python and lxml
    >>> from lxml.html import parse
    >>> doc = parse('http://www.webhostingtalk.com/showthread.php?p=6467386')
    >>> doc.xpath('.//title/text()')
    [' Bad tags with HTMLParser in Python? - Web Hosting Talk']
    >>> for meta in doc.xpath('.//meta[@name="keywords"]'):
    ...     meta.get('content', None)
    ...
    ' Bad tags with HTMLParser in Python?, Web Hosting Reviews, Forum Discussion, Windows, Unix, Dedicated Server Hosting, Colocation Servers, Hosting Providers'
    >>> meta.get('content', '').split(',')
    [' Bad tags with HTMLParser in Python?', ' Web Hosting Reviews', ' Forum Discussion', ' Windows', ' Unix', ' Dedicated Server Hosting', ' Colocation Servers', ' Hosting Providers']
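
    The same XPath expressions can be wrapped into a small helper that returns exactly the three fields the crawler wants. A sketch only; the function name and structure are made up here, only the XPath comes from the session above:
    Code:
    from urllib2 import urlopen
    from lxml.html import fromstring

    def page_summary(url):
        """Sketch: fetch a page, return its title, description and keywords."""
        conn = urlopen(url)
        doc = fromstring(conn.read(), base_url=conn.geturl())
        titles = doc.xpath('.//title/text()')
        summary = {'title': titles[0] if titles else ''}
        for name in ('description', 'keywords'):
            metas = doc.xpath('.//meta[@name="%s"]' % name)
            summary[name] = metas[0].get('content', '') if metas else ''
        return summary

    # e.g. page_summary('http://www.webhostingtalk.com/showthread.php?p=6467386')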

  6. #6
    That seems like a really good alternative and I will try that out later. Thanks and I'll let you know what I come up with

  7. #7
    I got it working somewhat; however, it dies on one test case: blogger.com.

    Here's the quick code that I'm using:

    Code:
    '''
    Created on Oct 28, 2009
    
    @author: Bay
    '''
    from lxml.html import parse
    import string
    import sys
    
    url = sys.argv[1]
    
    doc = parse(url)
    
    for item in doc.xpath('.//a'):
        newURL = 'http://' + item.get('href', None) + '/'
        print newURL
        
        try:
            newDoc = parse(newURL)
            for title in newDoc.xpath('.//title'):
                print '  ' + str(title.text)
        except IOError as e:
            print e
    And the output for blogger.com:
    Code:
    http://blogger.com/
    Error reading file 'http://blogger.com/': failed to load external entity "http://blogger.com/"
    When you go to blogger.com, it redirects to blogger.com/start, so I added a special case: if the URL is blogger.com, append /start to the string.

    The same error occurs. Any idea?

  8. #8
    Thankfully Python has a good solution for you. Rather than relying on lxml to open and retrieve the data from the url, let's switch this up a bit to use urllib2.urlopen. Why? It knows how to handle 301 redirects. Here's an interactive session which also, by the way, shows you how to get fully qualified / no relative links back:

    PHP Code:
    >>> # let's do some more Python
    >>> from urllib2 import urlopen
    >>> from lxml.html import fromstring
    >>> conn = urlopen('http://blogger.com/')
    >>> conn.geturl()
    'https://www.blogger.com/start'
    >>> doc = fromstring(conn.read(), base_url=conn.geturl())
    >>> doc.base_url
    u'https://www.blogger.com/start'
    >>> doc.make_links_absolute()
    >>> for a in doc.xpath('.//a'):
    ...     print a.items()
    ...
    [('href', 'http://www.blogger.com/forgot.g'), ('title', 'Forgot your password?')]
    [('href', 'http://help.blogger.com/bin/answer.py?answer=42054'), ('target', '_top'), ('title', 'What is "remember me"?')]
    [('href', 'https://www.blogger.com/signup.g'), ('tabindex', '0'), ('onclick', 'return false;'), ('target', '')]
    [('href', 'https://www.blogger.com/tour_start.g')]
    [('href', 'http://www.youtube.com/watch?v=BnploFsS_tY')]
    [('href', 'https://www.blogger.com/features')]
    [('href', 'http://buzz.blogger.com')]
    [('href', 'http://blogsofnote.blogspot.com/')]

    # output snipped for brevity
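
    Rolled together, that pattern looks something like the sketch below. The fetch() wrapper is a name invented for illustration, not an lxml or urllib2 function; it just follows redirects and hands back a document whose links are already absolute:
    Code:
    from urllib2 import urlopen
    from lxml.html import fromstring

    def fetch(url):
        """Sketch: open a URL (following redirects) and parse it with lxml."""
        conn = urlopen(url)                  # urllib2 follows 301/302 redirects
        doc = fromstring(conn.read(), base_url=conn.geturl())
        doc.make_links_absolute()            # uses base_url to resolve relative hrefs
        return doc

    doc = fetch('http://blogger.com/')
    for a in doc.xpath('.//a'):
        print a.get('href')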

  9. #9
    Thanks again for your help. I'm so close to having this completely done, but I've run into a final error.

    I'm essentially running the crawler, and writing the data to a file so that I can transport the information. But I'm getting this error:

    Code:
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 23-121: ordinal not in range(128)
    So it's essentially dying on funky characters in the pages being crawled. How would you suggest I get around this?

    Here's the code:

    Code:
    '''
    Created on Oct 28, 2009
    
    @author: Bay
    '''
    from lxml.html import parse
    from urllib2 import urlopen
    from lxml.html import fromstring
    import datetime
    import lxml
    import string
    import sys
    
    
    start = sys.argv[2]
    end = sys.argv[3]
    url = str(sys.argv[1]) + '?a=' + start + '&b=' + end
    
    conn = urlopen(url)
    doc = fromstring(conn.read(), base_url=conn.geturl())
    
    filename = 'crawl_'+start+'_'+end+'_'+str(datetime.datetime.today().strftime("%Y%m%dT%H%M%S"))+'.sql'
    
    file = open('/path/to/my/directory/sql/'+filename,'w')
    file.write('INSERT INTO crawler3 (link,title,description,keywords,time) VALUES')
    
    count = 0
    sqlstring = []
    for item in doc.xpath('.//a'):
        thislink = item.get('href', None)
        newURL = 'http://' + thislink + '/'
        print thislink
        try:
            newConn = urlopen(newURL)
            newDoc = fromstring(newConn.read(), base_url=newConn.geturl())
            printtitle = ''
            printdescription = ''
            printkeywords = ''
            for title in newDoc.xpath('.//title'):
                printtitle = str(title.text)
            for description in newDoc.xpath('.//meta[@name="description"]'):
                print '  Description: ' + description.get('content', None)
                printdescription = description.get('content', None)
            for keywords in newDoc.xpath('.//meta[@name="keywords"]'):
                print '  Keywords: ' + keywords.get('content', None)
                printkeywords = keywords.get('content', None)
            valueline = '(\''+thislink+'\',\''+printtitle+'\',\''+printdescription+'\',\''+printkeywords+'\',NOW()),'
            sqlstring.append(valueline)
        except IOError as e:
            pass
        except UnicodeEncodeError as uni:
            pass
        except lxml.etree.XMLSyntaxError as syntax:
            pass
    file.write(''.join(sqlstring)[:-1])
    file.close()

  10. #10
    Ack, just lost a big reply.

    Short form this time - try changing your code as follows:

    PHP Code:
    # Always decode the encoded bytestring you read into unicode, work with the
    # unicode string internally, then re-encode to a bytestring for output to a
    # file/browser/stream.

    # Make the content unicode, rather than an encoded *bytestring* as read from the URL:
    newDoc = fromstring(newConn.read().decode('utf-8'), base_url=newConn.geturl())

    # For output, re-encode the unicode string to an encoded bytestring:
    output = u''.join(sqlstring)[:-1]
    file.write(output.encode('utf-8'))

    # Avoid str(something_may_be_text); instead, if you must:
    unicode(something_may_be_text_or_null)  # i.e. ''
    # or...
    value = value if value else u''
    The problem you are going to run into is not all sites use utf-8 as their content encoding. You'll be able to rely on Content-Type headers, sometimes, but sites often lie about this. You can use a series of fall backs with try/except blocks - start with utf-8, then perhaps latin-1, then a windows encoding.
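
    A minimal sketch of that header check, assuming Python 2 and urllib2 (not code from this thread): the response headers expose the charset a site *claims*, which you can try first before falling back.
    Code:
    # Hedged sketch: try the declared Content-Type charset, then fall back.
    from urllib2 import urlopen

    conn = urlopen('http://example.com/')       # illustrative URL
    content = conn.read()

    # In Python 2 the response headers are a mimetools.Message; getparam('charset')
    # returns the declared charset, or None if the header doesn't name one.
    declared = conn.info().getparam('charset')

    try:
        text = content.decode(declared or 'utf-8')
    except (UnicodeDecodeError, LookupError):
        # The header lied or named an unknown codec; latin-1 accepts any byte.
        text = content.decode('latin-1')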

    Anyway... try making those changes noted. If that doesn't work fire me off a URL I can examine.

    PS: Unicode gets even easier with Python 3.x, but let's not go there today.

  11. #11
    I don't mean to be annoying, but I'm quite stupid when it comes to encodings.

    I'm unable to retrieve the information for differently encoded files. What would the try: except: blocks look like if I were trying to support multiple encodings?

    Here's my current code:
    PHP Code:
    '''
    Created on Oct 28, 2009

    @author: Bay
    '''
    from lxml.html import parse
    from urllib2 import urlopen
    from lxml.html import fromstring
    import datetime
    import lxml
    import string
    import sys


    start = sys.argv[2]
    end = sys.argv[3]
    url = str(sys.argv[1]) + '?a=' + start + '&b=' + end

    conn = urlopen(url)
    doc = fromstring(conn.read(), base_url=conn.geturl())

    filename = 'crawl_'+start+'_'+end+'_'+str(datetime.datetime.today().strftime("%Y%m%dT%H%M%S"))+'.sql'

    file = open('/loc/'+filename,'w')
    file.write('INSERT INTO crawler3 (link,title,description,keywords,time) VALUES')

    count = 1
    totalsites = int(end) - int(start)
    sqlstring = []
    for item in doc.xpath('.//a'):
        thislink = item.get('href', None)
        newURL = 'http://' + thislink + '/'
        print str(count) + ' out of ' + str(totalsites)
        print thislink
        try:
            newConn = urlopen(newURL)
            newDoc = fromstring(newConn.read().decode('utf-8'), base_url=newConn.geturl())
            printtitle = ''
            printdescription = ''
            printkeywords = ''
            for title in newDoc.xpath('.//title'):
                printtitle = unicode(title.text)
            for description in newDoc.xpath('.//meta[@name="description"]'):
                print '  Description: ' + description.get('content', None)
                printdescription = unicode(description.get('content', None))
            for keywords in newDoc.xpath('.//meta[@name="keywords"]'):
                print '  Keywords: ' + keywords.get('content', None)
                printkeywords = unicode(keywords.get('content', None))
            valueline = '(\''+thislink+'\',\''+printtitle+'\',\''+printdescription+'\',\''+printkeywords+'\',NOW()),'
            sqlstring.append(valueline)
        except IOError as e:
            print e
        except UnicodeEncodeError as uni:
            print uni
        except UnicodeDecodeError as de:
            print de
        except lxml.etree.XMLSyntaxError as syntax:
            print syntax
        except ValueError as valerror:
            print valerror
        except:
            pass
        count += 1
    output = u''.join(sqlstring)[:-1]
    file.write(output.encode('utf-8'))
    file.close()

  12. #12
    Check and see what the encoding is of the sites that are failing; once you have a collection of encodings, what you want to do is early on in your code block do a *nested* set of try/excepts - this is an example, not a recipe, paraphrased:

    PHP Code:
    try:
        ...decode('utf-8')
    except UnicodeDecodeError:
        try:
            ...decode('latin-1')
        except UnicodeDecodeError:
            try:
                ...decode('windows-1252')
            except UnicodeDecodeError as e:
                pass  # or raise, or whatever
    Do this before your content parsing of course. Note the order of encodings will matter and I may not have the ideal selection or order here. This can work fairly reliably only if you are not dealing with totally random world wide pages. If that is what you are dealing with... see my next suggestion.
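
    The same idea can be written as a loop over candidate encodings instead of literal nesting. A hedged sketch, not from the original post; the helper name and the encoding order are illustrative only (latin-1 goes last because it accepts any byte sequence and so never raises):
    Code:
    # Sketch: try candidate encodings in order, return unicode text.
    def decode_with_fallbacks(raw, encodings=('utf-8', 'windows-1252', 'latin-1')):
        for enc in encodings:
            try:
                return raw.decode(enc)
            except UnicodeDecodeError:
                continue
        return raw.decode('utf-8', 'replace')  # last resort: don't die mid-crawl

    # usage, inside the crawl loop:
    # text = decode_with_fallbacks(newConn.read())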

    If indeed you could be scraping any page from any country in the world, another approach worth looking at can be found at Mark Pilgrim's feedparser.org. In feedparser.py or related to it there is a routine IIRC called chardet (ah, here it is standalone: http://chardet.feedparser.org/ ) which does as good a job as is possible guessing encodings. I suspect you'll find this easy to adapt to your purpose.
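
    Worth knowing before wiring that in (a sketch of assumed usage, not from the thread): chardet.detect() returns a dict rather than a plain string, so it's the 'encoding' key you feed to decode.
    Code:
    import chardet  # standalone module from http://chardet.feedparser.org/

    raw = open('page.html', 'rb').read()   # any bytestring; filename is illustrative
    guess = chardet.detect(raw)            # e.g. {'encoding': 'utf-8', 'confidence': 0.99}
    text = raw.decode(guess['encoding'] or 'utf-8')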

  13. #13
    Quote Originally Posted by mwatkins View Post
    Check and see what the encoding is of the sites that are failing... [nested try/except example and chardet suggestion snipped]
    I honestly don't know what's going on. Whenever I detect the encoding of the page, my crawler doesn't find ANY results for keywords and descriptions, even utf-8 ones!

    Only thing I've added:
    PHP Code:
    encoding = chardet.detect(newConn.read())
    print encoding['encoding']
    newDoc = fromstring(newConn.read().decode('utf-8'), base_url=newConn.geturl())
    I'm just printing out the encoding just to see it. What's going on?

  14. #14
    The problem is not the encoding, in this case. What is happening is that the first call to .read() leaves the file pointer at the end of the file-like object, so your second call to .read() returns nothing. Grab the content once, straight away:

    PHP Code:
    content = connection.read()
    encoding = chardet.detect(content)
    doc = fromstring(content.decode(encoding['encoding']), base_url=connection.geturl())
    The exact behaviour has a lot to do with whatever underlying standard library for file access your OS supports.

    With some file-like objects (on supported systems, most real files as well as objects like StringIO) you can call seek(0L) to move the file pointer back to the start - like rewinding a reel:

    PHP Code:
    >>> from io import StringIO
    >>> sfile = StringIO('123456789')
    >>> print sfile.read()
    123456789
    >>> print sfile.read()

    >>> sfile.seek(0L)
    0L
    >>> sfile.read()
    '123456789'

  15. #15
    Quote Originally Posted by mwatkins View Post
    The problem is not the encoding, in this case... [explanation and code snipped; see the post above]
    Thank you so much! The crawler works pretty damn well and is doing its best. I may have more questions in the future, but you've been such an amazing help.

    I've also really learned to appreciate Python. It's quite an elegant language.

  16. #16
    Excellent! I'm really glad you stuck through it, and I'm sure you'll continue to do well. Call anytime if you need a hand. Cheers!
