11/18/2010 edit: after more troubleshooting on this box, we learned more about what was happening. While the post below stands as a list of valid items to check, our issues turned out to be more complex and may have stemmed from a different root cause. See this post, as well as all three posts in the series, for some gotchas we have run into while running lots of databases on Solaris zones.
I just made my first post to forums.oracle.com (finally saw something go across my RSS feed that I felt I could respond to) and I thought I would reprise it here.
We use Sun M5000s with 256 GB of memory for our non-production servers, so we load them up with a lot of databases. Some are separated out into Solaris zones, but some zones still run a lot of smaller databases behind a single listener. Our app admins reported that at times they could not restart the application: Glassfish would not connect, even from an admin console test. Here are the two things we found and addressed to fix this.
First thing to check for …
By default, the queue the listener keeps for incoming connection requests is very small. It is enough for most database servers, so no one ever talks about it or changes the default. But when you run this many databases AND the app uses connection pooling, application startup (with multiple clustered app servers each spinning up the default number of threads in every pool, all at once) tends to exhaust the listener queue. From Oracle's doc:
10.3.2 Handling Large Volumes of Concurrent Connection Requests
If you expect the listener to handle large volumes of concurrent connection requests, then you can specify a listener queue size for its TCP/IP or IPC listening endpoints. To specify the listener queue size, specify the QUEUESIZE parameter at the end of the protocol address with its value set to the expected number of concurrent requests, similar to Example 10-2.
Example 10-2 listener.ora File with Queue Size
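The example itself is not reproduced in the excerpt; a listener.ora address with QUEUESIZE set looks something like this (the hostname, port, and value of 20 here are illustrative placeholders, not our production settings):

```
LISTENER=
  (DESCRIPTION=
    (ADDRESS=(PROTOCOL=tcp)(HOST=sales-server)(PORT=1521)(QUEUESIZE=20)))
```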
The default number of concurrent connection requests is operating system-specific. The defaults for TCP/IP on the Solaris Operating System and Windows follow:
* Solaris Operating System: 5
* Windows XP Professional SP2: 10
* Windows 2003 Server Enterprise Edition: 200
In conjunction with this, a connection storm can also exhaust the TCP/IP stack's own listen backlog, so there are some kernel settings you should have the server admin increase (at least, we experienced this on Solaris). Here's an article that explains how to test for this and, if you see evidence of it, which Solaris settings to change.
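On Solaris, the check and the tuning can be sketched with stock commands (this assumes the Solaris 10-style ndd interface and the tcpListenDrop counters; run as root, and treat the value as an example):

```
# Look for listen-queue overflows in the TCP counters
netstat -s -P tcp | grep -i listendrop

# Inspect the current TCP listen backlog cap, then raise it
ndd /dev/tcp tcp_conn_req_max_q
ndd -set /dev/tcp tcp_conn_req_max_q 1024
```

Note that ndd changes do not survive a reboot; the setting also needs to go into your system startup configuration to be permanent.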
We set our QUEUESIZE to 512 (on all database hosts) and our Solaris tcp_conn_req_max_q to 1024. That resolved the "connection storm" issue caused by app-side connection pooling.
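The connection-storm effect is easy to reproduce outside of Oracle with a plain TCP socket. This is a minimal illustration (not Oracle code): a server listens with a tiny backlog, analogous to the default QUEUESIZE of 5, and never accepts, while a burst of "pooled" clients connects all at once. Exact success/failure counts are OS-dependent, but once the backlog fills, later connects stall or time out, which is what a listener drop looks like from the app side.

```python
import socket
import threading

# Server with a tiny listen backlog, analogous to QUEUESIZE=5.
# It deliberately never calls accept(), so the backlog fills up.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(5)
port = server.getsockname()[1]

results = {"ok": 0, "failed": 0}
lock = threading.Lock()

def connect_once():
    """One 'pool thread' trying to open a connection."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(0.5)  # a stalled connect shows up as a timeout
    try:
        s.connect(("127.0.0.1", port))
        with lock:
            results["ok"] += 1
    except OSError:
        with lock:
            results["failed"] += 1
    finally:
        s.close()

# Simulate clustered app servers starting their pools simultaneously.
threads = [threading.Thread(target=connect_once) for _ in range(200)]
for t in threads:
    t.start()
for t in threads:
    t.join()
server.close()

print(results)
```

On most systems the first handful of connects succeed (they fit in the backlog) and many of the rest time out, even though nothing is wrong with the network or the server process itself.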
Second thing to check for …
We thought the first problem had recurred after we fixed it with QUEUESIZE/tcp_conn_req_max_q, but it was a different problem with the same symptoms. Here's what it was …
We use SAN/Fibre Channel disk for our database storage. However, we have an NFS mount which we use solely for RMAN backups and restores. Our Unix admins are awesome and have tuned that NFS within an inch of its life, and we found it was too fast. (Meaning, we had to stop being stupid and not run 3 TB multi-channel restores and backups at the same time.) NFS sits on top of the TCP/IP stack, so if you share the pipe in and out of the box (10 Gb for us, so no bottleneck there) with NFS and you really hit it hard, you can exhaust the TCP/IP stack again. That manifests as listener drop errors without any "listendrops" showing up in netstat (see the article above) and without any errors reaching your listener log, because the connection never gets that far.