Web Hosting Talk







Reg exp help needed! Find every URL in a string...


monkey junkie
01-11-2006, 08:03 PM
Hello,

I hope someone can help me with this.

If I have the entire HTML contents of a website stored in a string, and I want to be able to store all the links within the HTML in an array, how would I do this in PHP?? I'm trying to figure it out, but I'm getting nowhere...

So for example if I had -


<html>
<body>
some text some text <a href="http://www.yahoo.com">a link to yahoo</a> and some
more text and <a href="http://www.google.com/about.html">another link!</a> and finally
<a href="contact.php">a link without http://www.</a>
</body>
</html>


- how would I create an array with http://www.yahoo.com and http://www.google.com/about.html and contact.php in it???

Any help greatly appreciated.

Thank you.

monkey junkie
01-11-2006, 08:17 PM
This is what I'm currently using, but it is stopping at the first link it finds!


// read our HTML page into a variable ($filename contains the HTML)

$fp = fopen($filename, "r");
$html_contents = fread($fp, filesize($filename));
fclose($fp);

// find every link in the HTML page

function get_urls($string, $strict=true) {
$types = array("href");
while(list(,$type) = each($types)) {
$innerT = $strict?'[a-z0-9:?=&@/._-]+?':'.+?';
preg_match_all ("|$type\=([\"'`])(".$innerT.")\\1|i", $string, &$matches);
$ret[$type] = $matches[2];
}

return $ret;
};

$thelinks = get_urls($html_contents);

// print the URLs we've found

$counter = 0;
while (list($key, $val) = each($thelinks)) {

echo "$key => $val[$counter]\n";
$counter++;
}
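
A note on why the loop above prints a single link: get_urls() returns an array with one key, "href", whose value is the whole list of matches. Looping over $thelinks therefore runs exactly once and prints only $val[0]. Below is a minimal sketch of one possible fix, assuming a current PHP version (where each() and the call-time &$matches are no longer available). The function name get_hrefs and the sample $html are made up for illustration.

```php
<?php
// Hypothetical helper: returns a flat list of href values found in $html.
function get_hrefs(string $html): array
{
    // Same character set as the "strict" branch in the post above.
    $pattern = '|href=(["\'`])([a-z0-9:?=&@/._-]+?)\1|i';
    preg_match_all($pattern, $html, $matches);
    return $matches[2]; // second subpattern = the URL between the quotes
}

// Shortened sample input (an assumption, not the poster's file).
$html  = '<a href="http://www.yahoo.com">a</a> <a href="contact.php">b</a>';
$links = get_hrefs($html);

// Loop over the list of URLs itself, not over the outer array.
foreach ($links as $i => $url) {
    echo "$i => $url\n";
}
```

The only real change is that the loop walks the list of matches directly instead of the one-element outer array.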

ThatScriptGuy
01-11-2006, 11:41 PM
Here is part of the code that I am using in a custom made script for a client.

preg_match("!(ht|f)tp(s?)://[a-zA-Z0-9-._]+(.[a-zA-Z0-9-._]+){2,}(/?)([a-zA-Z0-9-.?,'/\+&%\$#_]*)?[^.]!", $body, $link);

This searches $body and stores the first URL it finds (plus the captured sub-patterns) in an array called $link.
Kevin

Edit: To take all the URLs out, use preg_match_all instead of preg_match. I am using it to take one primary link out of emails piped to it. The URL is always the first one, so I always use $link[0]. I'm fairly certain that with preg_match_all, $link[0][4] would give you the fifth link in the email.
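
For anyone comparing the two functions: preg_match() stops after the first match and fills the array with that match plus its capture groups, while preg_match_all() collects every match. Here is a small sketch of the difference. It uses a deliberately simplified pattern (an assumption for illustration, not the pattern from the post) and a made-up $body.

```php
<?php
$body = 'See http://example.com/a and http://example.org/b for details.';

// Simplified URL pattern for illustration only.
$pattern = '!https?://[^\s]+!i';

// preg_match: $first[0] is the first full match; any higher indexes
// would be capture groups, not further links.
preg_match($pattern, $body, $first);

// preg_match_all: $all[0] is the list of every full match.
preg_match_all($pattern, $body, $all);

echo $first[0], "\n";   // the first URL only
echo $all[0][1], "\n";  // the second URL in the text
```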

Christopher Lee
01-12-2006, 12:27 AM
Okay, let me start off with a disclaimer: I'm not comfortable with my skills in regular expressions. Frankly, I avoid them whenever possible. That said, I found this problem really intriguing, and I'll take anything to avoid the busy work around the office. With that, here goes:

The problem:
-Pull all of the urls out of links in a document.

possible issues:
-the expression needs to handle absolute and relative links. It must also handle links to non-http services (https, ftp, mailto)
-case varies from document to document that I might scrape, so I can't just count on the link being lower case.
-people mangle links when they create them, sometimes using single quotes around the url rather than double-quotes (ex: <a href='example.com'>Example website.</a>).
-There is probably a whole bunch of junk inside the link, such as behavior (target), class, id, style, name, alt and title attributes, not to mention javascript and other stuff we don't want. That shouldn't impact our scraping for links.

So, with that in mind, I went about finding everything inside the quotes after an href of any case, with zero or more spaces before the equals sign and zero or more spaces after it. Then I grouped with parentheses so that the first subpattern would get me the meat of the url. (I'm really writing all of this out more for me than anything.)

Sorry about the chatter:

The file: (notice, I added some bad stuff to see if I could parse it successfully).

<html>
<body>
some text some text <a href="http://www.yahoo.com">a link to yahoo</a> and some
more text and <a href="http://www.google.com/about.html">another link!</a> and finally
<a href="contact.php">a link without http://www.</a> <A TARGET="_BLANK" HrEf='ALBACORE.PHP'>A new kind of fish?</A> <a href = 'badlinky.eu'>A badly formed anchor</a> <a href="mailto:Chris@example.com">EMAIL ME</a> <a id="mylinky" style="color:Red;" href="https://secure.example.com">woo security.</a>
</body>
</html>


The php:

// $str_example is assumed to hold the HTML shown above
$my_clean_array=array();

$str_regex = '/href\s*=\s*[\'|"]([^\'"]+)[\'|"]/i';

preg_match_all($str_regex, $str_example, $ar_matches, PREG_SET_ORDER);

foreach($ar_matches as $ar_value){
$my_clean_array[] = $ar_value[1];
}

print_r($my_clean_array);


Now, it's probably a waste of resources to set up another array, when all I really need to do is reference $ar_matches[x][1] when looking for the url. But it is simpler for me to see it and work with it by plopping it into a new single-dimensional array that I can do any dang thing I want with.

Again, I can't vouch for the efficiency of this solution, but it seems to be effective. Hope that helps in some small way. Also, I'm still shaky on the quotes part of the regex, but I better post this before it gets too long. :)
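
On the extra array: here is a sketch of an alternative, assuming the default PREG_PATTERN_ORDER flag is acceptable. In that mode preg_match_all() groups results by subpattern, so index 1 is already a flat list of URLs and no copying loop is needed. The sample $str_example is shortened for illustration, and the quote character class is written as ['"].

```php
<?php
// Shortened sample input (an assumption, not the full test file).
$str_example = '<a href="http://www.yahoo.com">yahoo</a> '
             . "<A TARGET=\"_BLANK\" HrEf='ALBACORE.PHP'>fish</A> "
             . "<a href = 'badlinky.eu'>bad</a>";

$str_regex = '/href\s*=\s*[\'"]([^\'"]+)[\'"]/i';

// Default flag is PREG_PATTERN_ORDER: $ar_matches[1] holds every
// first-subpattern capture (the URLs) as a one-dimensional array.
preg_match_all($str_regex, $str_example, $ar_matches);
$urls = $ar_matches[1];

print_r($urls);
```

PREG_SET_ORDER is still the handier layout when each match needs several of its subpatterns kept together.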

Christopher Lee
01-12-2006, 12:37 AM
grr, I KNOW I should have triple-checked before posting. The pipe in the character class is unnecessary and wrong, and would match <a href=|example.org|>An example</a>.


$str_regex = '/href\s*=\s*[\'"]([^\'"]+)[\'"]/i';


Sorry :/
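
A quick way to see the difference, as a minimal sketch: inside a character class the | is a literal character rather than alternation, so the first pattern accepts pipes as if they were quotes. The variable names are made up for illustration.

```php
<?php
$html = '<a href=|example.org|>An example</a>';

$with_pipe    = '/href\s*=\s*[\'|"]([^\'"]+)[\'|"]/i';
$without_pipe = '/href\s*=\s*[\'"]([^\'"]+)[\'"]/i';

// With the pipe in the class, the bogus markup is accepted.
var_dump(preg_match($with_pipe, $html, $m));   // int(1)
echo $m[1], "\n";                              // example.org

// Without it, only real quote characters delimit the URL.
var_dump(preg_match($without_pipe, $html));    // int(0)
```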

monkey junkie
01-12-2006, 07:32 AM
Thanks all for your help.

Chris: I LOVE YOU!!!

:)

astellar
01-12-2006, 07:20 PM
Quoting Christopher Lee:
grr, I KNOW I should have triple-checked before posting. The pipe in the character class is unnecessary and wrong, and would match <a href=|example.org|>An example</a>.

$str_regex = '/href\s*=\s*[\'"]([^\'"]+)[\'"]/i';

Sorry :/

Small correction:
$URL_pattern = "/\s+href\s*=\s*[\"\']?([^\s\"\']+)[\"\'\s]+/ims";
You should use /ims because HTML could be like this
<a href=
http://somedomain.info
>url title</a>
(This URL pattern used in PHP-Crawler (http://sourceforge.net/projects/php-crawler/) for example)
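
A small sketch of that pattern on the multi-line, unquoted case, with a made-up $html string. One hedged observation: as far as I can tell, the m and s modifiers do not change the result here, because the pattern contains no ^, $ or dot, and \s already matches newlines. The second call below checks that assumption by using the i modifier alone.

```php
<?php
// Made-up sample: one unquoted multi-line href, one ordinary quoted href.
$html = "<a href=\nhttp://somedomain.info\n>url title</a> "
      . "<a href=\"contact.php\">contact</a>";

// Pattern as posted above.
$URL_pattern = "/\s+href\s*=\s*[\"\']?([^\s\"\']+)[\"\'\s]+/ims";
preg_match_all($URL_pattern, $html, $found);
print_r($found[1]);

// Same pattern with only the i modifier, for comparison.
$URL_pattern_i = "/\s+href\s*=\s*[\"\']?([^\s\"\']+)[\"\'\s]+/i";
preg_match_all($URL_pattern_i, $html, $found_i);
print_r($found_i[1]);
```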

monkey junkie
01-14-2006, 10:24 PM
Quick question (I have been studying regex)...

In the code -


$str_regex = '/href\s*=\s*[\'"]([^\'"]+)[\'"]/i';


The bit ([^\'"]+) which is matching our actual URL, I don't understand this.

I thought the code above would mean NOT either ' or " at least once. I don't understand how that is referring to some characters. Does it mean anything but ' or " ?

Any help appreciated!!

Thank you.

Christopher Lee
01-14-2006, 10:43 PM
When I wrote it, I wanted a submatch (the parentheses) that matches any character except the single quote or double quote, one or more times (the + sign). Don't know if it is right or not, but it seemed to be ;)

astellar
01-15-2006, 06:10 AM
Quoting monkey junkie:
The bit ([^\'"]+) which is matching our actual URL, I don't understand this

Actually, it means: match everything until the next ' or ".
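
To make that concrete, here is a minimal sketch with a made-up attribute string. The ^ at the start of the brackets negates the class, so [^'"] is "any one character that is not a quote", and the + repeats it for as long as possible.

```php
<?php
$attr = 'href="contact.php" class="nav"';

// On its own, [^'"]+ grabs the first run of non-quote characters,
// which here is the text before the opening quote.
preg_match('/[^\'"]+/', $attr, $m);
echo $m[0], "\n";   // href=

// Placed right after the opening quote, the same class runs up to
// (but not including) the closing quote, which is the URL.
preg_match('/href="([^\'"]+)"/', $attr, $m2);
echo $m2[1], "\n";  // contact.php
```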