Server hanging from too many httpd processes
classifieds
Mar 24th 2005, 5:38 am
Every few days my server will spawn 450 or so httpd daemons in a few minutes and effectively go offline. It requires a shutdown -r to resolve and that usually takes 45 minutes to complete.
I'm trying to track down what's causing it. I suspect that either I've got a misbehaving bot or an obnoxious email address harvester hitting it.
I've looked at the logfiles via Webalizer but don't see anything obvious, and the message log does not have any entries that look suspicious.
Any advice on how to figure out what’s causing it?
Are there apache configuration parameters that will help?
Any suggestions would be appreciated.
Regards,
-jay
J.D.
Mar 24th 2005, 6:53 am
This kind of condition may also be triggered by a bug in the code that is being executed. For example, if you have an endless loop that ties up one of the worker threads, eventually all of them will end up hanging, locking up the server. Check whether traffic in the few minutes before the server hangs is higher than when it's working all right. Also check CPU usage: if you have an endless loop that does something like string comparison, your CPUs will go to 100%.
J.D.
nullbit
Mar 24th 2005, 8:10 am
Next time it happens, it would be quicker to restart the httpd daemon instead of doing a full reboot. How to do this depends on your distro; on Red Hat/Fedora, which are the most common, it would be:
service httpd restart
When it crashes, check the end of your access and error logs to see what was hitting the server before it went down:
Access logs:
tail -n 30 /var/log/httpd/access_log
Error logs:
tail -n 30 /var/log/httpd/error_log
You might need to change the log paths to reflect your directory structure.
If you're logged in while it happens, you can run this to see which clients are hitting your box:
netstat -tpu
If it's one host causing the problem you can block it with your firewall:
iptables -I INPUT 1 -s xxx.xxx.xxx.xxx -j DROP
Replace xxx.xxx.xxx.xxx with the IP to block.
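This will also block a host on most systems, but only for services built with TCP wrappers support (a stock Apache httpd usually is not, so use the firewall rule for web traffic):
echo "ALL: xxx.xxx.xxx.xxx" >> /etc/hosts.deny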
Otherwise you will probably have to make some changes to the apache config file to limit the number of allowed processes.
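For the access log check above, a small helper along these lines could tally requests per client instead of eyeballing the last 30 lines. This is only a sketch: top_clients is a made-up name, and it assumes the common/combined log format, where the client address is the first field on each line.

```shell
#!/bin/sh
# Hypothetical helper (not a standard tool): print the N busiest client IPs
# in an Apache access log, busiest first, as "count address" pairs.
# Assumes the client address is the first whitespace-separated field.
top_clients() {
    log_file=$1
    limit=${2:-10}    # how many addresses to show, default 10
    awk '{ print $1 }' "$log_file" | sort | uniq -c | sort -rn | head -n "$limit"
}
```

Usage would be something like: top_clients /var/log/httpd/access_log 20. A single address far ahead of the rest is a candidate for the iptables rule.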
digitalpoint
Mar 24th 2005, 9:17 am
You should lower the allowed number of clients within your httpd.conf file.
For example, this would allow a maximum of 50 httpd processes to be spawned:
MaxClients 50
Whatever it's set to now, it seems your server can't handle that number, so I would lower it. Even a very high traffic site usually would be fine at 100. I run digitalpoint.com with it set to 100 and never had an issue.
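Before changing it, it may help to confirm what the directive is currently set to. A rough sketch, assuming the Fedora default config path /etc/httpd/conf/httpd.conf (adjust for your install); show_max_clients is a made-up helper name:

```shell
#!/bin/sh
# Hypothetical helper: list every active (uncommented) MaxClients line in the
# Apache config, with line numbers. The default path is an assumption based
# on Red Hat/Fedora layouts.
show_max_clients() {
    conf_file=${1:-/etc/httpd/conf/httpd.conf}
    grep -n -i -E '^[[:space:]]*MaxClients[[:space:]]' "$conf_file"
}
```

The directive can show up more than once (for example, once per MPM section), so check which section applies to your build before editing. The new value takes effect after restarting httpd.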
classifieds
Mar 24th 2005, 12:57 pm
Thanks for the suggestions.
I'll have a window later tonight to make these changes and dig into exactly what's going on.
Regards,
-jay
classifieds
Mar 25th 2005, 5:14 am
Thanks for the help :)
Here's what I discovered (and not).
1. There were a lot of "file not found" errors in the error log - *fixed*
2. My .htaccess "deny" is working well for the Nigerian 419 scammers :) I'm looking at hosts.deny and other IP level blocks to improve efficiency - my .htaccess is at 16k (even with using CIDR for the addresses).
3. There was no indication of sudden traffic from a single IP address or range of addresses - so at this point I'm going to assume that there's some buggy code somewhere. I've set up some traps to try to isolate it.
4. MaxClients was set to 400. I changed it to 100 and will lower it further if the problem shows up again.
5. Restarting the http daemon took 2 minutes instead of 45 minutes for a full reboot :)
Thanks again for the advice!
Regards,
-jay
J.D.
Mar 25th 2005, 9:25 am
5. Restarting the http daemon took 2 minutes instead of 45 minutes for a full reboot
Did you say 45 *minutes*?!
classifieds
Mar 25th 2005, 9:27 am
YES I DID :eek:
It was very frustrating.
-jay
nullbit
Mar 25th 2005, 9:33 am
45 minutes is an extremely long reboot time. You probably need to look at your init scripts and check your logs; something's causing that delay.
J.D.
Mar 25th 2005, 9:35 am
YES I DID :eek:
It was very frustrating.
If a restart/reboot takes longer than a few minutes, I usually kill the process that causes it.
J.D.
classifieds
Mar 25th 2005, 9:37 am
A normal reboot takes about 4-5 minutes.
The 45 minute reboot happens when 300-400 httpd daemons are spawned in several minutes and overload the server.
I'm still trying to determine the cause and I'm hoping that the recommendations made earlier will help mitigate the impact on the server (at least until I figure out what's causing it).
-jay
nullbit
Mar 25th 2005, 9:38 am
A normal reboot takes about 4-5 minutes.
The 45 minute reboot happens when 300-400 httpd daemons are spawned in several minutes and overload the server.
I'm still trying to determine the cause and I'm hoping that the recommendations made earlier will help mitigate the impact on the server (at least until I figure out what's causing it).
-jay
OK, 4-5 minutes is OK for a normal reboot.
Does the large spawning happen at a particular time of day, or is it totally random?
J.D.
Mar 25th 2005, 9:44 am
A normal reboot takes about 4-5 minutes.
The 45 minute reboot happens when 300-400 httpd daemons are spawned in several minutes and overload the server.
I understand. What I'm saying is that I cannot imagine a machine being out of circulation for more than 4-5 minutes, and when this kind of thing happens, I usually kill the process that causes the problem after a short timeout (typically a minute or so). The only thing to watch out for here is that if it's not httpd but something else (e.g. a DBMS), then killing the process may have repercussions for the integrity of the data the killed process was working on at the time.
J.D.
classifieds
Mar 25th 2005, 10:01 am
When it gets in this vegetative state, the response time on the SSH/telnet session is so slow that it takes 5 minutes to enter one command, and at this point it's spawned so many processes that it's difficult and slow to dig through them looking for the culprit.
As you can tell from my posts, I'm not a sys admin, nor am I a programmer (not since the early eighties anyway; I loved those old Sperry Univac DCPs!).
I appreciate your experience, insights and recommendations so please keep them coming!
-jay
nullbit
Mar 25th 2005, 10:04 am
When it gets in this vegetative state, the response time on the SSH/telnet session is so slow that it takes 5 minutes to enter one command, and at this point it's spawned so many processes that it's difficult and slow to dig through them looking for the culprit.
The top program will print out CPU/memory/etc. usage for the most active processes.
J.D.
Mar 25th 2005, 10:07 am
When it gets in this vegetative state, the response time on the SSH/telnet session is so slow that it takes 5 minutes to enter one command, and at this point it's spawned so many processes that it's difficult and slow to dig through them looking for the culprit.
That means your CPU is pegged at 100%. I think it is a bad loop somewhere in the code. Maybe not your code, but something is spinning its wheels when this happens.
J.D.
classifieds
Mar 25th 2005, 10:22 am
I'm running Linux Fedora Core 2.
Is it "top" - no arguments?
nullbit
Mar 25th 2005, 11:01 am
Yes just top. It can take arguments to customize the output (which might be useful in your case), do "man top" for more info.
classifieds
Mar 25th 2005, 11:08 am
This and the other suggestions should give me plenty to do this weekend!
Thanks again for the help.
-jay
J.D.
Mar 25th 2005, 12:13 pm
You can record top's output every 60 seconds or so, in case you want to leave it running for some time:
top -d 60 -b > top.txt
Each output will be 5-10K, so watch your drive usage if you change the timeout or want to leave it running for a while.
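If the full top output is more than you need, a lighter option is to log just the number of httpd processes with a timestamp, so you can see exactly when the count starts climbing. A rough sketch: the function names and the default log file name are made up, and it assumes the Apache processes are named httpd, as on Red Hat/Fedora.

```shell
#!/bin/sh
# Hypothetical helpers, not standard tools.

# Count how many lines on stdin exactly match the given process name.
# grep -c exits non-zero when the count is 0, hence the "|| true".
count_procs() {
    grep -c -x -- "$1" || true
}

# Append one "timestamp count" line to a log file (default: httpd-count.log).
log_httpd_count() {
    out_file=${1:-httpd-count.log}
    count=$(ps -e -o comm= | count_procs httpd)
    printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$count" >> "$out_file"
}

# To sample once a minute, something like:
#   while true; do log_httpd_count; sleep 60; done
```

Each sample is a single short line, so this can be left running for days without worrying much about disk space.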
J.D.
hulkster
Mar 30th 2005, 9:32 am
I'll echo Shawn's comment about limiting the number of httpd processes in httpd.conf - note that once that limit is hit, subsequent connections will time out, so you need to be a bit careful. One additional thing to consider is turning KeepAlive off, which means you do NOT keep a connection open (so a little more overhead for subsequent ones), but it was helpful for me when I was getting hammered.
It does sound like you have some dynamic content spinning up the CPU, though, which is also contributing to this.
J.D.
Mar 30th 2005, 5:05 pm
One additional thing to consider is turning KeepAlive off, which means you do NOT keep a connection open (so a little more overhead for subsequent ones), but was helpful for me when I was getting hammered.
There are two different things at play here, though - connections and worker threads/processes. Any HTTP server can handle more connections than the number of workers, and keeping connections alive is a good thing in general (setting up and tearing down a connection is about 3-7 packets).
On top of that, if the application has a problem (e.g. a run-away loop), turning off keep-alives *will not* help even the smallest bit, as the worker handling the request will be tied up processing the loop and the connection will be maintained while it's doing so anyway.
J.D.
hulkster
Mar 30th 2005, 5:15 pm
My experience was the infamous slashdot effect (http://www.komar.org/faq/slashdot-effect/) (which I've had 5 times on www.komar.org) where turning KeepAlive off helped a LOT since I was running out of clients even when I bumped MaxClients from 150 to 256 - the latter is the statically compiled limit, and going higher would have caused RAM issues with the CGI that everyone was running - yes, I was using mod_perl which totally ROCKS btw.
Now I was running Apache 1.3.x ... I know in Apache 2.x things are changed around a bit and I understand it has some spiffy threading architecture, so maybe this doesn't apply as much ... but for me, it made a decent difference ... and yea, my CGI did its thing and then exited ... if it sat around, that would defeat the purpose as you state above.
J.D.
Mar 30th 2005, 5:26 pm
My experience was the infamous slashdot effect
This is different from the original question. In your case the server had to handle too many small requests and most likely ran out of sockets (which is about 2-4K or so). It is easy to check how many established connections are there - netstat -an.
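To turn that netstat output into a single number, a filter like this could be used. It is only a sketch: count_established is a made-up name, and it assumes Linux-style netstat -an output, where TCP lines start with "tcp" and the state is the last column.

```shell
#!/bin/sh
# Hypothetical helper: read `netstat -an` output on stdin and print the
# number of TCP connections in the ESTABLISHED state.
count_established() {
    awk '$1 ~ /^tcp/ && $NF == "ESTABLISHED" { n++ } END { print n + 0 }'
}

# Usage:
#   netstat -an | count_established
```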
J.D.
hulkster
Mar 30th 2005, 5:32 pm
This is different from the original question. In your case the server had to handle too many small requests and most likely ran out of sockets (which is about 2-4K or so). It is easy to check how many established connections are there - netstat -an.
J.D.
That could be ... although I recall when I bumped MaxClients from 150 to 256, I got more "responsiveness", but the 1 GByte RAM system moved fairly quickly into a swapping situation which caused other problems, so my conclusion was that Apache was the throttle rather than the operating system. Yes, my situation was many small requests from a large number of IPs, and once I turned KeepAlive off, everything ran relatively smoothly, despite peak hit rates on the CGI of 18/second - not much for Google, but a lot for me! ;-)
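The RAM trade-off here can be put as rough arithmetic: MaxClients should be no larger than the RAM you can give Apache divided by the memory one httpd child uses. A sketch with illustrative numbers only; estimate_max_clients is a made-up name, and the figures are assumptions, not measurements from either server in this thread.

```shell
#!/bin/sh
# Hypothetical helper: rough upper bound for MaxClients.
#   $1 = total RAM in MB
#   $2 = RAM to keep free for the OS, database, etc. in MB
#   $3 = typical resident size of one httpd child in MB
estimate_max_clients() {
    total_mb=$1
    reserved_mb=$2
    per_child_mb=$3
    if [ "$per_child_mb" -le 0 ]; then
        echo "per-child size must be positive" >&2
        return 1
    fi
    # Integer division rounds down, which errs on the safe side.
    echo $(( (total_mb - reserved_mb) / per_child_mb ))
}

# Example with made-up numbers: 1024 MB total, 256 MB reserved,
# 20 MB per child.
#   estimate_max_clients 1024 256 20   # prints 38
```

The per-child size is the number to measure on your own box (top or ps will show it); with heavy mod_perl or PHP children, the safe value can come out well below the defaults.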