Not sure if you need to know about this error I received a few minutes ago, but here it is. It said my browser was OK, someplace in Dallas was OK, then some kind of Cloudflare thing/cache was not OK. Alright, that's all for now.

Error 502
Ray ID: 1ed68d07b7c2115f • 2015-05-28 02:32:51 UTC
Bad gateway
I can't leave people a message. It's a problem. Literally. The following error occurred: "The server responded with an error. The error message is in the JavaScript console." It took an hour to post this one.
Yeah, there's something funky going on with PHP-FPM segfaulting for some reason once in a while. Still trying to debug it, but I think it might be something in the new version of PHP, so I might just roll back to an older version.
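For anyone debugging something similar: a quick sketch of where PHP-FPM segfaults usually show up. The log paths here are assumptions (they vary by distro and pool config), not the exact setup on these servers.

```shell
# Segfaults from worker processes land in the kernel ring buffer
dmesg | grep -i 'php-fpm.*segfault'

# The FPM master also logs reaped children that died on a signal
# (path is an assumption; check error_log in your php-fpm.conf)
grep -i 'SIGSEGV\|signal 11' /var/log/php-fpm/error.log
```

Correlating the timestamps from those two sources against the access log is usually the fastest way to find which request pattern triggers the crash.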
I think it was related to the issue last week where PHP processes were trying to allocate close to 2GB of memory for nonsensical tasks and just wrecking the servers (you get 1,000 concurrent processes each trying to allocate 2GB of memory and you have a problem... lol). Anyway, it seemed to start with PHP 5.6.9, and I've rolled back all servers to 5.6.6 now. So hopefully whatever it was will stop.
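As a belt-and-suspenders measure against that failure mode, you can cap both the worker count and per-request memory in the FPM pool config. This is a hedged sketch, not the actual config from these servers; the path and values are assumptions to illustrate the arithmetic (64 workers × 256M caps the pool at ~16GB instead of 1,000 × 2GB).

```ini
; /etc/php-fpm.d/www.conf (path varies by distro)
pm = dynamic
pm.max_children = 64                   ; cap concurrent workers per box

; hard per-request memory ceiling; cannot be overridden by ini_set()
php_admin_value[memory_limit] = 256M
```

With a ceiling like that, a buggy allocation kills the one request with a fatal "Allowed memory size exhausted" error instead of letting the workers collectively swap the whole server to death.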
That wasn't even (directly) the biggest chore. lol PHP-FPM going crazy with memory allocation requests caused some memory corruption on the servers (in theory it shouldn't even be possible, but somehow it did). That caused one of the database cluster data nodes to fail, which wasn't an issue for end users since they're redundant.

But when bringing that data node back online, 2 more data nodes failed with the same sort of memory corruption issues. Also not a problem, since there's enough redundancy to handle 3 failed servers at the same time. But then when bringing those 3 data nodes online, 2 more failed... meaning 5 database cluster servers went offline concurrently with the same sort of memory corruption issues. And 5 servers going down fails the entire database cluster, because there aren't enough online servers left for a complete set of data.

Brought the entire cluster back online fairly quickly, but when all the servers came back online concurrently, I think they got internally confused about who was the nominated "president" (1 server is always nominated as the one that makes the decisions about stuff). For really no good reason it seemed like multiple data nodes were trying to be authoritative, but there wasn't a way to tell WHICH servers were the problem ones.

So then I started doing rolling restarts of the data node processes, which takes about 90 minutes per data node because we have such a massive amount of data in our databases. Normally you can do rolling restarts without end users noticing, but with multiple "presidents" it was causing funky issues like certain tables being unavailable (like threads) depending on the record IDs you were getting.

Got enough of the data nodes restarted (90 minutes per) to form at least 1 whole set of data, so now we are back online. There are still more redundant data nodes in the process of coming online, but at least we have a whole (working) set of data online now... so the site is back up. So yeah... super fun morning. lol
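The "3 failures fine, 5 failures fatal" behavior falls out of how the data is partitioned. A minimal sketch, assuming a 4-node-group × 2-replica layout like the 8-node cluster here (the node IDs and group assignments are illustrative): the cluster has a complete data set as long as every node group keeps at least one live node, and with 2 replicas per group, any 5 failures out of 8 must wipe out some group entirely.

```python
# Whether an NDB-style cluster still has a complete data set.
# Data is partitioned across node groups; each group stores replicas
# of its partition, so one survivor per group is enough.
def cluster_has_full_dataset(node_groups, failed):
    return all(any(n not in failed for n in group) for group in node_groups)

# Hypothetical layout: 4 node groups, 2 replicas each
groups = [[11, 15], [12, 16], [13, 17], [14, 18]]

# 3 failures spread across groups: every group keeps a survivor
print(cluster_has_full_dataset(groups, failed={15, 16, 17}))  # -> True

# 5 failures: by pigeonhole, at least one group loses both replicas
print(cluster_has_full_dataset(groups, failed={12, 13, 15, 16, 17}))  # -> False
```

So redundancy here isn't "any N failures survivable" but "one failure per node group survivable", which is why the fourth and fifth failures were the fatal ones.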
The 3 remaining data nodes (for redundancy) should be coming online fully shortly. I'm going to be super pissed off if they don't come online properly. haha

```
ndb_mgm> show
Cluster Configuration
---------------------
[ndbd(NDB)]     8 node(s)
id=11   @192.168.10.20  (mysql-5.6.24 ndb-7.3.9, Nodegroup: 0, *)
id=12   @192.168.10.21  (mysql-5.6.24 ndb-7.3.9, Nodegroup: 1)
id=13   @192.168.10.22  (mysql-5.6.24 ndb-7.3.9, Nodegroup: 2)
id=14   @192.168.10.23  (mysql-5.6.24 ndb-7.3.9, Nodegroup: 3)
id=15   @192.168.10.24  (mysql-5.6.24 ndb-7.3.9, Nodegroup: 0)
id=16   @192.168.10.25  (mysql-5.6.24 ndb-7.3.9, starting, Nodegroup: 0)
id=17   @192.168.10.26  (mysql-5.6.24 ndb-7.3.9, starting, Nodegroup: 0)
id=18   @192.168.10.27  (mysql-5.6.24 ndb-7.3.9, starting, Nodegroup: 0)
```