Web Hosting Talk







Problems with [function.file-get-contents]


hardjoko
12-18-2006, 04:51 AM
Hi, I am experimenting with PHP. Most of the time I can use file_get_contents() fine.

However, once in a while I get a message like this:
Warning: file_get_contents(http://www.animationinsider.net/forums/archive/index.php?t-15765.html) [function.file-get-contents]: failed to open stream: HTTP request failed! HTTP/1.1 406 Not Acceptable in /home/interne4/public_html/index.php on line 20

What's the problem?

I can open http://www.animationinsider.net/forums/archive/index.php?t-15765.html just fine.

Also, whatever the cause turns out to be, is there a way to ensure that the warning doesn't show up?

ub3r
12-18-2006, 05:03 AM
It looks like the remote machine has mod_security installed. I determined this by adding "?wget" to the end of the URL in a deliberate attempt to trigger mod_security. Sure enough, that triggered it, and the server sent back the 406 response.

When an admin installs mod_security from cPanel, the default configuration returns a 406 error.

The remote server likely has a mod_security rule that blocks based on the User-Agent header PHP sends with the request, most likely a rule that blocks requests whose User-Agent is blank.

If you own animationinsider.net, add this to your www/forums/archive/.htaccess file:

SecFilterEngine Off

If you don't, add the following to your php.ini (you have to be root):

user_agent="PHP-4.x"

You may also be able to add the following to your own website's .htaccess in order to change the User-Agent PHP sends with the request. I tested this on my own machine, which runs Apache 1.3 and PHP 5, and it ran just fine:

php_value user_agent "PHP-4.x"

If none of that helps, let us know.
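
For anyone who cannot edit php.ini or .htaccess, the same User-Agent setting can probably be applied from inside the script, and the warning from the original question can be kept off the page. The following is only a minimal sketch, assuming a reasonably modern PHP (7.1 or later); the helper names make_ua_context() and fetch_quietly() are made up for illustration and are not part of PHP.

```php
<?php
// Hypothetical helper: builds a stream context that sends the given User-Agent.
function make_ua_context(string $userAgent)
{
    return stream_context_create([
        'http' => ['user_agent' => $userAgent],
    ]);
}

// Hypothetical helper: returns the body, or null on failure, without printing a warning.
function fetch_quietly(string $url, string $userAgent): ?string
{
    // The @ operator suppresses the warning; the === false check still detects the failure.
    $body = @file_get_contents($url, false, make_ua_context($userAgent));
    return $body === false ? null : $body;
}
```

A call such as fetch_quietly('http://www.animationinsider.net/forums/archive/index.php?t-15765.html', 'PHP-4.x') would then return either the page or null. Suppressing the warning hides the symptom only, so the null case still needs handling.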

hardjoko
12-18-2006, 09:53 AM
It works on animationinsider. However, I still get an error message when accessing other websites, like:
http://ar.wikipedia.org/wiki/1955

Warning: file_get_contents(http://ar.wikipedia.org/wiki/1955) [function.file-get-contents]: failed to open stream: HTTP request failed! HTTP/1.0 403 Forbidden in /home/interne4/public_html/index.php on line 20

madguy24
12-18-2006, 10:03 AM
Most probably, your server has mod_security enabled. Add the lines below to your .htaccess after any existing rules, then check whether the issue is still there.

<IfModule mod_security.c>
SecFilterScanPOST Off
</IfModule>

hardjoko
12-18-2006, 10:19 AM
This is another URL I cannot get:
Warning: file_get_contents(http://en.wikipedia.org/wiki/Andy_Williams) [function.file-get-contents]: failed to open stream: HTTP request failed! HTTP/1.0 403 Forbidden in /home/interne4/public_html/index.php on line 20

madguy24
12-18-2006, 10:26 AM
If you have already disabled POST scanning in mod_security, read on. Otherwise, please do so and see whether it fixes your issue.

Did you check the logs? I mean Apache's error_log. If you have cPanel, you may be able to see the error from the panel. I believe it is not an issue with your script but a conflict with the server configuration, and that's why I recommended turning off POST scanning for the account.

hardjoko
12-18-2006, 02:45 PM
Oh, most pages are working fine though. It seems I only have problems with Wikipedia sites now.

Can you use file_get_contents("http://en.wikipedia.org/wiki/Andy_Williams") yourself?

Did you succeed?

What do you mean by disabling POST scanning in the mod_security module?

I already added:
<IfModule mod_security.c>
SecFilterScanPost
</IfModule>

Before, I also could not read http://www.animationinsider.net/forums/archive/index.php?t-15765.html

but now I can, after adding user_agent="PHP-4.x" in my .htaccess.

However, nothing I do allows me to read Wikipedia, except for the very front page.

So, file_get_contents('http://en.wikipedia.org/wiki/Andy_Williams') doesn't work, but file_get_contents('http://wikipedia.org') works just fine.

ub3r
12-18-2006, 04:56 PM
Try changing "PHP-4.x" to "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.8.1) Gecko/20061010 Firefox/2.0".

madguy24
12-18-2006, 10:03 PM
Yes, ub3r is right. It is because of robots.txt. Thanks hardjoko for that question :-) Looks like ub3r has seen that issue before?

Maddy.

hardjoko
12-19-2006, 02:44 PM
It seems to be working. Thanks.

ub3r
12-19-2006, 05:29 PM
madguy24 wrote:
> Yes, ub3r is right. It is because of robots.txt. Thanks hardjoko for that question :-) Looks like ub3r has seen that issue before?
>
> Maddy.

Actually, it's because of mod_security. robots.txt is only a file that tells robots what they should and shouldn't crawl. It does not prevent any host from accessing anything, and a robot can even ignore it if it wants to.

The Prohacker
12-19-2006, 05:50 PM
hardjoko wrote:
> However, nothing I do allows me to read Wikipedia, except for the very front page.
>
> So, file_get_contents('http://en.wikipedia.org/wiki/Andy_Williams') doesn't work, but file_get_contents('http://wikipedia.org') works just fine.

Your earlier message hinted at it:
Warning: file_get_contents(http://en.wikipedia.org/wiki/Andy_Williams) [function.file-get-contents]: failed to open stream: HTTP request failed! HTTP/1.0 403 Forbidden in /home/interne4/public_html/index.php on line 20


Take note of the HTTP/1.0. Wikipedia uses Squid proxies to load balance the site, and I have a feeling they cannot take HTTP/1.0 requests. I suspect they have been configured to require HTTP/1.1.


[root@linux cron.daily]# telnet en.wikipedia.org 80
Trying 66.230.200.100...
Connected to en.wikipedia.org (66.230.200.100).
Escape character is '^]'.
GET /wiki/Andy_Williams HTTP/1.0

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<HTML><HEAD><META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
<TITLE>ERROR: The requested URL could not be retrieved</TITLE>
<STYLE type="text/css"><!--BODY{background-color:#ffffff;font-family:verdana,sans-serif}PRE{font-family:sans-serif}--></STYLE>
</HEAD><BODY>
<H1>ERROR</H1>
<H2>The requested URL could not be retrieved</H2>
<HR noshade size="1px">
<P>
While trying to process the request:
<PRE>
GET /wiki/Andy_Williams HTTP/1.0


</PRE>
<P>
The following error was encountered:
<UL>
<LI>
<STRONG>
Invalid Request
</STRONG>
</UL>

<P>
Some aspect of the HTTP Request is invalid. Possible problems:
<UL>
<LI>Missing or unknown request method
<LI>Missing URL
<LI>Missing HTTP Identifier (HTTP/1.0)
<LI>Request is too large
<LI>Content-Length missing for POST or PUT requests
<LI>Illegal character in hostname; underscores are not allowed
</UL>
<P>Your cache administrator is <A HREF="mailto:nobody">nobody</A>.

<BR clear="all">
<HR noshade size="1px">
<ADDRESS>
Generated Tue, 19 Dec 2006 21:49:24 GMT by sq16.wikimedia.org (squid/2.6.STABLE5)
</ADDRESS>
</BODY></HTML>
Connection closed by foreign host.


And with HTTP/1.1 and the Host header:


[root@linux cron.daily]# telnet en.wikipedia.org 80
Trying 66.230.200.100...
Connected to en.wikipedia.org (66.230.200.100).
Escape character is '^]'.
GET /wiki/Andy_Williams HTTP/1.1
Host: en.wikipedia.org

HTTP/1.0 200 OK
Date: Sun, 17 Dec 2006 06:08:31 GMT
Server: Apache
X-Powered-By: PHP/5.1.2
Content-Language: en
Vary: Accept-Encoding,Cookie
Cache-Control: private, s-maxage=0, max-age=0, must-revalidate
Last-Modified: Sun, 03 Dec 2006 14:53:02 GMT
Content-Type: text/html; charset=utf-8
Age: 103615
X-Cache: HIT from sq29.wikimedia.org
X-Cache-Lookup: HIT from sq29.wikimedia.org:80
Via: 1.0 sq29.wikimedia.org:80 (squid/2.6.STABLE5)
Connection: close

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en" dir="ltr">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="keywords" content="Andy Williams,1927,1930s,1944,1947,1951,1952,1955,1956,1960,1960s" />
<link rel="shortcut icon" href="/favicon.ico" />
<link rel="search" type="application/opensearchdescription+xml" href="/w/opensearch_desc.php" title="Wikipedia (English)" />
<link rel="copyright" href="http://www.gnu.org/copyleft/fdl.html" />
<title>Andy Williams - Wikipedia, the free encyclopedia</title>
<style type="text/css" media="screen,projection">/*<![CDATA[*/ @import "/skins-1.5/monobook/main.css?32"; /*]]>*/</style>
<link rel="stylesheet" type="text/css" media="print" href="/skins-1.5/common/commonPrint.css?32" />
<!--[if lt IE 5.5000]><style type="text/css">@import "/skins-1.5/monobook/IE50Fixes.css?32";</style><![endif]-->
<!--[if IE 5.5000]><style type="text/css">@import "/skins-1.5/monobook/IE55Fixes.css?32";</style><![endif]-->
<!--[if IE 6]><style type="text/css">@import "/skins-1.5/monobook/IE60Fixes.css?32";</style><![endif]-->
<!--[if IE 7]><style type="text/css">@import "/skins-1.5/monobook/IE70Fixes.css?32";</style><![endif]-->
<!--[if lt IE 7]><script type="text/javascript" src="/skins-1.5/common/IEFixes.js?32"></script>
<meta http-equiv="imagetoolbar" content="no" /><![endif]-->

<script type= "text/javascript"><!--
var skin = "monobook";
var stylepath = "/skins-1.5";
var wgArticlePath = "/wiki/$1";
var wgScriptPath = "/w";
var wgServer = "http://en.wikipedia.org";
var wgCanonicalNamespace = "";
var wgNamespaceNumber = 0;
var wgPageName = "Andy_Williams";
var wgTitle = "Andy Williams";
var wgArticleId = "527989";
var wgIsArticle = true;
var wgUserName = null;
var wgUserLanguage = "en";
var wgContentLanguage = "en";
var wgBreakFrames = false;
var wgBreakFramesExceptions = ["babelfish.altavista.com", "translate.google.com"];
var wgCurRevisionId = "91791711";
--></script>
[snip]
<!-- Served by srv89 in 0.099 secs. --></body></html>
Connection closed by foreign host.
[root@linux cron.daily]#
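
The difference between the two telnet requests above can be sketched in PHP. Note that the first request omitted the Host header as well as using HTTP/1.0, so the transcripts alone do not show which of the two the proxy objected to. The helper names build_get_request() and make_http11_context() are hypothetical; protocol_version and header are standard options of PHP's HTTP stream context. Treat this as an illustration under those assumptions, not as a confirmed fix for the Wikipedia 403.

```php
<?php
// Hypothetical helper: builds the text of a minimal GET request, as typed into telnet above.
function build_get_request(string $host, string $path, string $version = '1.1'): string
{
    $lines = ["GET $path HTTP/$version"];
    // HTTP/1.1 requires Host; name-based virtual hosts and proxies usually want it under 1.0 too.
    $lines[] = "Host: $host";
    // Ask the server to close the connection so that reading until EOF terminates.
    $lines[] = 'Connection: close';
    // A blank line ends the header block.
    return implode("\r\n", $lines) . "\r\n\r\n";
}

// Hypothetical helper: a context that asks file_get_contents() to speak HTTP/1.1.
function make_http11_context()
{
    return stream_context_create([
        'http' => [
            'protocol_version' => 1.1,
            'header'           => "Connection: close\r\n",
        ],
    ]);
}
```

The context would be passed as the third argument, for example file_get_contents($url, false, make_http11_context()). PHP's HTTP wrapper adds the Host header on its own.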

hardjoko
12-20-2006, 08:23 AM
Okay, I did what you told me and it works.

However, I found another problem with ANY WordPress blog:

file_get_contents('http://genetips.com/cat'); doesn't work
file_get_contents('http://genetips.com/'); works
file_get_contents('http://genetips.com/2006/07/09/help-the-needy-buy-their-votes/archive.htm'); works

I wonder why?

The only difference between http://genetips.com/cat and http://genetips.com/2006/07/09/help-the-needy-buy-their-votes/archive.htm is that http://genetips.com/cat shows the WordPress 404 template. However, neither file exists. There is no file behind http://genetips.com/cat, just as there is no file behind http://genetips.com/2006/07/09/help-the-needy-buy-their-votes/archive.htm

So, how come one works and the other doesn't?

ub3r
12-20-2006, 09:36 AM
http://genetips.com/cat sends back a 404 Not Found response code. I think WordPress sends that itself, using the header() function.

hardjoko
12-20-2006, 01:08 PM
http://genetips.com/cat works fine in a browser. Why doesn't it work with file_get_contents()?

hardjoko
12-20-2006, 01:18 PM
telnet genetips.com 80
get / HTTP/1.1

root@server [/home/stliven/public_html]# telnet genetips.com 80
Trying 69.46.27.111...
Connected to genetips.com (69.46.27.111).
Escape character is '^]'.
get / HTTP/1.1
HTTP/1.1 400 Bad Request
Date: Wed, 20 Dec 2006 17:17:59 GMT
Server: Apache/1.3.36 (Unix) mod_auth_passthrough/1.8 mod_log_bytes/1.2 mod_bwlimited/1.4 PHP/4.4.4 FrontPage/5.0.2.2635.SR1.2 mod_ssl/2.8.27 OpenSSL/0.9.7a
Connection: close
Content-Type: text/html; charset=iso-8859-1

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<HTML><HEAD>
<TITLE>400 Bad Request</TITLE>
</HEAD><BODY>
<H1>Bad Request</H1>
Your browser sent a request that this server could not understand.<P>
The request line contained invalid characters following the protocol string.<P>
<P>
<HR>
<ADDRESS>Apache/1.3.36 Server at server.dealsreferals.com Port 80</ADDRESS>
</BODY></HTML>
Connection closed by foreign host.
root@server [/home/stliven/public_html]#

The Prohacker
12-20-2006, 02:12 PM
hardjoko wrote:
> telnet genetips.com 80
> get / HTTP/1.1
>
> root@server [/home/stliven/public_html]# telnet genetips.com 80
> Trying 69.46.27.111...
> Connected to genetips.com (69.46.27.111).
> Escape character is '^]'.
> get / HTTP/1.1
> HTTP/1.1 400 Bad Request
> Date: Wed, 20 Dec 2006 17:17:59 GMT
> Server: Apache/1.3.36 (Unix) mod_auth_passthrough/1.8 mod_log_bytes/1.2 mod_bwlimited/1.4 PHP/4.4.4 FrontPage/5.0.2.2635.SR1.2 mod_ssl/2.8.27 OpenSSL/0.9.7a
> Connection: close
> Content-Type: text/html; charset=iso-8859-1


For an HTTP/1.1 request you need to send the Host: header.


[root@linux ~]# telnet genetips.com 80
Trying 69.46.27.111...
Connected to genetips.com (69.46.27.111).
Escape character is '^]'.
GET / HTTP/1.1
Host: genetips.com

HTTP/1.1 200 OK
Date: Wed, 20 Dec 2006 18:09:42 GMT
Server: Apache/1.3.36 (Unix) mod_auth_passthrough/1.8 mod_log_bytes/1.2 mod_bwlimited/1.4 PHP/4.4.4 FrontPage/5.0.2.2635.SR1.2 mod_ssl/2.8.27 OpenSSL/0.9.7a
X-Powered-By: PHP/4.4.4
X-Pingback: http://genetips.com/xmlrpc.php
Status: 200 OK
Content-Type: text/html; charset=UTF-8



When you request /cat, the server returns a 404. More than likely this is WordPress sending the 404 status code back.


[root@linux ~]# telnet genetips.com 80
Trying 69.46.27.111...
Connected to genetips.com (69.46.27.111).
Escape character is '^]'.
GET /cat HTTP/1.1
Host: genetips.com

HTTP/1.1 404 Not Found
Date: Wed, 20 Dec 2006 18:10:48 GMT
Server: Apache/1.3.36 (Unix) mod_auth_passthrough/1.8 mod_log_bytes/1.2 mod_bwlimited/1.4 PHP/4.4.4 FrontPage/5.0.2.2635.SR1.2 mod_ssl/2.8.27 OpenSSL/0.9.7a
X-Powered-By: PHP/4.4.4
X-Pingback: http://genetips.com/xmlrpc.php
Status: 404 Not Found
Content-Type: text/html




[root@linux ~]# telnet genetips.com 80
Trying 69.46.27.111...
Connected to genetips.com (69.46.27.111).
Escape character is '^]'.
GET /2006/07/09/help-the-needy-buy-their-votes/archive.htm HTTP/1.1
Host: genetips.com

HTTP/1.1 200 OK
Date: Wed, 20 Dec 2006 18:12:20 GMT
Server: Apache/1.3.36 (Unix) mod_auth_passthrough/1.8 mod_log_bytes/1.2 mod_bwlimited/1.4 PHP/4.4.4 FrontPage/5.0.2.2635.SR1.2 mod_ssl/2.8.27 OpenSSL/0.9.7a
X-Powered-By: PHP/4.4.4
X-Pingback: http://genetips.com/xmlrpc.php
Status: 200 OK
Content-Type: text/html; charset=UTF-8

hardjoko
12-20-2006, 03:41 PM
All right, I'll look into it.

However, http://genetips.com/cat does show a page.

So it's kind of strange that browsers can see the page but somehow PHP cannot. What trick do browsers use to get the page anyway?

The Prohacker
12-20-2006, 04:04 PM
hardjoko wrote:
> All right, I'll look into it.
>
> However, http://genetips.com/cat does show a page.
>
> So it's kind of strange that browsers can see the page but somehow PHP cannot. What trick do browsers use to get the page anyway?

I'm sure it picks up the 404 and just dies, assuming that the 404 page wouldn't be the desired content.

hardjoko
12-20-2006, 04:12 PM
The Prohacker wrote:
> I'm sure it picks up the 404 and just dies, assuming that the 404 page wouldn't be the desired content.

What do you mean by "it"? Which one just dies?

Again, if you go to Firefox and try to access http://genetips.com/cat you'll see a page, not a 404 error message.

So what does Firefox do when it "telnets" to http://genetips.com/cat?

The Prohacker
12-20-2006, 04:24 PM
hardjoko wrote:
> What do you mean by "it"? Which one just dies?
>
> Again, if you go to Firefox and try to access http://genetips.com/cat you'll see a page, not a 404 error message.
>
> So what does Firefox do when it "telnets" to http://genetips.com/cat?


'It' being PHP :)

If you go to http://www.webhostingtalk.com/cat you get a simple HTML page telling you the file is not found. On GeneTips.com the 404 is being sent to the client along with HTML, just like here on WebHostingTalk; it's just that GeneTips's HTML is more complex :) This is just how WordPress works: when it serves its custom 404 page, it counts on the user's browser to display the HTML it is given regardless of the 404 status.

To get around this, you will more than likely need to replace file_get_contents() with something else like the Snoopy class.
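
In later PHP versions (5.2.10 and up, as far as I know) there is also a way to stay with file_get_contents(): the HTTP context option ignore_errors makes the wrapper return the body even when the status is 4xx or 5xx, and $http_response_header exposes the status line. A minimal sketch, assuming PHP 7.1 or later; parse_status_code() and fetch_with_status() are hypothetical names chosen for this example.

```php
<?php
// Hypothetical helper: extracts the numeric code from a status line such as "HTTP/1.1 404 Not Found".
function parse_status_code(string $statusLine): ?int
{
    if (preg_match('#^HTTP/\d+(?:\.\d+)?\s+(\d{3})#', $statusLine, $m)) {
        return (int) $m[1];
    }
    return null;
}

// Hypothetical helper: returns [status, body], keeping the body of error pages.
function fetch_with_status(string $url): array
{
    $context = stream_context_create([
        'http' => ['ignore_errors' => true], // do not discard the body on 4xx/5xx
    ]);
    $body = @file_get_contents($url, false, $context);
    // The HTTP wrapper fills $http_response_header in the calling scope.
    // Index 0 is the status line of the first response (before any redirects).
    $status = isset($http_response_header[0])
        ? parse_status_code($http_response_header[0])
        : null;
    return [$status, $body === false ? null : $body];
}
```

With this, a call like fetch_with_status('http://genetips.com/cat') would be expected to give back the 404 code together with the HTML of the WordPress 404 template, which is what the browser shows.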

hardjoko
12-21-2006, 03:21 AM
Wow. That's beyond my capability. I guess that question is settled.

But just curious: what is the Snoopy class?

ub3r
12-21-2006, 03:57 AM
You might also have good luck just using something like:

<?php echo shell_exec('curl http://yoursite.com/something/'); ?>

Or with a forged User-Agent header:

<?php echo shell_exec('curl --user-agent="Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.8.1.1) Gecko/20061204 Firefox/2.0.0.1" http://yoursite.com/something/'); ?>

That would show the whole page, regardless of the status code.
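
If the URL or the User-Agent string ever comes from a variable rather than being typed into the script, it is safer to quote both before handing them to the shell. A small sketch of that idea, assuming a Unix-like host with curl on the PATH; build_curl_command() is a hypothetical helper, while escapeshellarg() and the curl flags -s and -A are standard.

```php
<?php
// Hypothetical helper: builds a curl command line with safely quoted arguments.
function build_curl_command(string $url, string $userAgent): string
{
    // -s hides the progress meter; -A sets the User-Agent header.
    return 'curl -s -A ' . escapeshellarg($userAgent) . ' ' . escapeshellarg($url);
}
```

Usage would look like echo shell_exec(build_curl_command('http://yoursite.com/something/', 'Mozilla/5.0 ...')); where the User-Agent string is whatever the remote site accepts.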