
Thread: grep help

  1. #1

    grep help

    I have a big dump file. It consists of multiple HTML files from a site I run, combined into one text file. I am trying to use grep to extract the URLs that start with a certain prefix, one URL per line, with everything else stripped out.

    Any help would be appreciated.

    Last edited by grepnewb; 08-02-2009 at 02:57 AM.

  2. #2
    Join Date
    Jan 2008
    Might not be quite what you're looking for, as it's in PHP:

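    PHP Code:
    // collect every quoted URL under example.com/folder/ found in input.txt
    preg_match_all('~("|\')http\://(?:www\.)?example\.com/folder/(.*?)("|\')~i', file_get_contents('input.txt'), $out);

    // strip the surrounding quotes and append each URL to output.txt, one per line
    foreach ($out[0] as $url)
        file_put_contents('output.txt', str_replace(array('"', '\''), '', $url) . "\r\n", FILE_APPEND);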
    So if "input.txt" contains data like (the urls being within HTML tags):
    <a href="">test123</a><a href=''>test321</a><img src="" />
    It will extract all the URLs and put them into "output.txt" (each URL separated by a new line).
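    For comparison, here is a rough Python sketch of the same idea: find every quoted URL that starts with a given prefix and emit one per line. The function name extract_prefixed_urls, the PREFIX value and the sample markup are all made up for illustration; swap in your own prefix and read your dump file instead of the inline string.

```python
import re

def extract_prefixed_urls(text, prefix):
    """Return every quoted URL in text that starts with prefix, in order."""
    # An opening quote, then the literal prefix followed by any run of
    # characters that is not a quote, whitespace or angle bracket, then
    # the same kind of quote again (backreference \1).
    pattern = re.compile(
        r"""(["'])(%s[^"'\s<>]*)\1""" % re.escape(prefix),
        re.IGNORECASE,
    )
    return [match.group(2) for match in pattern.finditer(text)]

# PREFIX and the sample markup are assumptions for illustration only.
PREFIX = "http://example.com/folder/"
sample = (
    '<a href="http://example.com/folder/a.html">test123</a>'
    "<a href='http://example.com/folder/b.html'>test321</a>"
    '<img src="http://other.example.org/logo.png" />'
)

urls = extract_prefixed_urls(sample, PREFIX)
print("\n".join(urls))  # one URL per line, quotes already stripped
```

    Because the capture group holds only the URL, there is no separate step to strip the quotes afterwards.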
    Last edited by sam0; 08-02-2009 at 03:24 PM.

  3. #3
    grep '^' file.txt

  4. #4
    Join Date
    Nov 2001
    Originally posted by KC-JRPark:
    grep '^' file.txt
    The OP said: one url per line, with everything else stripped out.

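    Plain grep prints whole matching lines, so by itself it isn't going to do that for you (the -o option, where your grep has it, prints only the matched text). Nor is your regex going to even match what the OP is looking for.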

    sed could perhaps get you most or all of the way, but whether it works easily depends on how your HTML is formatted. If every URL sits neatly on one line, then sed can work.
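    To make the "one line" condition concrete, here is a small Python sketch that scans the input one line at a time, which is roughly how sed and grep see it. The pattern, the function name urls_line_by_line and both sample inputs are assumptions for illustration.

```python
import re

# Hypothetical pattern: an href whose value starts with an example prefix.
URL_RE = re.compile(r'href="(http://example\.com/[^"]*)"')

def urls_line_by_line(lines):
    """Collect matches while looking at only one line at a time."""
    found = []
    for line in lines:
        found.extend(URL_RE.findall(line))
    return found

# Every URL sits on a single line: all of them are found.
one_per_line = [
    '<a href="http://example.com/a">A</a>',
    '<a href="http://example.com/b">B</a>',
]

# The second URL is split by a newline: no single line holds the whole
# href="..." attribute, so a line-oriented scan never sees it.
wrapped = [
    '<a href="http://example.com/a">A</a>',
    '<a href="http://example.com/',
    'b">B</a>',
]

print(urls_line_by_line(one_per_line))
print(urls_line_by_line(wrapped))
```

    The second call returns only the first URL, which is the failure mode to watch for in a dump that was line-wrapped somewhere along the way.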

    If URLs cross line boundaries everything gets more complicated. For an example, here's a complex sed "script" (which proves sed isn't the right tool for dealing with complex parsing of HTML):

    Myself, unless the need is dirt simple, I will tend to reach for a programming language (Python first in my case) that supports extended regular expressions and the "not greedy" operator.
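    In case the "not greedy" operator is unfamiliar, here is a minimal Python illustration of why it matters for this job. The sample string is made up; the only difference between the two patterns is the ? after .* in the second one.

```python
import re

# Made-up sample: two links on the same line.
html = (
    '<a href="http://example.com/one">one</a> '
    '<a href="http://example.com/two">two</a>'
)

# Greedy: .* grabs as much as it can, so the match runs from the first
# opening quote all the way to the last double quote on the line.
greedy = re.findall(r'href="(.*)"', html)

# Non-greedy: .*? stops at the first closing quote, giving one match per link.
non_greedy = re.findall(r'href="(.*?)"', html)

print(greedy)
print(non_greedy)
```

    The greedy pattern returns a single oversized match that swallows both links, while the non-greedy one returns the two URLs separately.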

    Simple case:

    Python Code:
    import re

    # capture the URL and the link text of every absolute http(s) anchor
    ahref_re = re.compile(r'<a href="(http(?:s|)://.*?)".*?>(.*?)</a>', re.I|re.M)

    # let's parse a local copy of the "A List Apart" home page
    example_html = open('example.html').read()

    def simple_a_extractor(text):
        # skip anchors whose link text is an image rather than words
        return [
            link for link in ahref_re.findall(text)
            if 'img src' not in link[1]]

    if __name__ == '__main__':
        for link, link_text in simple_a_extractor(example_html):
            print("%-65s %s" % (link, link_text))
    Code:
    Articles
    Topics
    About
    Contact
    Contribute
    Feed
    No. <em>289</em>
    Erskine Design Redesign
    Simon Collison
    Redesigning Your Own Site
    Lea Alcantara
    Designing Through the Storm
    Walter Stevenson
    Ad via The Deck
    Maybe that's enough. If not - if the HTML is truly awful/complex (often one and the same) then we'd need to feed the HTML to a (X)HTML parser to normalize it and then use the parser to ship us the anchor links. That's beyond the scope of this discussion (so far).
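    For the record, a minimal sketch of the parser route, using Python's standard-library html.parser as a stand-in for a full normalizing (X)HTML parser. The class name AnchorCollector, the helper extract_links, the prefix and the sample markup are all hypothetical; truly broken HTML may still need a heavier, more forgiving parser.

```python
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    """Collect href values of <a> tags that start with a given prefix."""

    def __init__(self, prefix):
        super().__init__()
        self.prefix = prefix
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names for us.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.startswith(self.prefix):
                self.links.append(value)

def extract_links(html_text, prefix):
    """Feed the whole document to the parser and return the matching links."""
    parser = AnchorCollector(prefix)
    parser.feed(html_text)
    parser.close()
    return parser.links

# Made-up markup: upper-case tag, attributes spread over several lines,
# single quotes, an image inside a link, and a link to another site.
messy = """<A
  HREF='http://example.com/folder/one.html'
  class="nav">One</A>
<a href="http://example.com/folder/two.html"><img src="x.png"></a>
<a href="http://other.example.org/">elsewhere</a>"""

links = extract_links(messy, "http://example.com/folder/")
print("\n".join(links))  # one URL per line
```

    The parser works on tags rather than lines, so an anchor whose attributes are spread across several lines is handled the same as one written on a single line.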
