For the last few months, our T1000s have been randomly hanging in the morning. The only solution was to reset the machine so it would start working again.
I tried using scat to analyze the core dump, but there was simply too much information to sift through: we run 9 zones on these machines, and at the time of the hang there were over 3k processes. The next step was to send the core dump to Sun, and they quickly found the problem using scat!
Basically, we had some NFS threads stuck in top_end_sync() and top_begin_sync(), and they stayed stuck until we sent a break and ran sync, at which point everything obviously stopped.
I was directed to the following bug:
6710329 – ufs top_begin_sync, top_end_sync hang
The fix had already been released in patch 139483-05, which has since been obsoleted by 139555-08. I was at the kernel revision below that one, which means I could have been close to fixing it accidentally in the next patch cycle.
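If you want to check a box against that patch level, here is a minimal sketch in Python. It assumes that `uname -v` on Solaris 10 reports the kernel patch in the form `Generic_139555-08`; the helper names `parse_kernel_patch` and `has_fix` are made up for illustration. It only gives a yes/no answer when the kernel is on the 139555 patch ID, and deliberately refuses to guess for any other ID, since revisions of different patch IDs are not directly comparable.

```python
import re

# Kernel patch that carries the fix, per the post (139555-08 obsoletes 139483-05).
FIXED_PATCH_ID = 139555
FIXED_REVISION = 8


def parse_kernel_patch(uname_v):
    """Return (patch_id, revision) from a string like 'Generic_139555-08', else None.

    The 'Generic_NNNNNN-NN' layout is an assumption about `uname -v` output.
    """
    match = re.fullmatch(r"Generic_(\d{6})-(\d{2})", uname_v.strip())
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def has_fix(uname_v):
    """True or False on the 139555 patch line; None when it cannot be decided."""
    parsed = parse_kernel_patch(uname_v)
    if parsed is None:
        return None
    patch_id, revision = parsed
    if patch_id != FIXED_PATCH_ID:
        # Different kernel patch ID: revisions are not comparable, so do not guess.
        return None
    return revision >= FIXED_REVISION
```

On the machine itself you could pass in `os.uname().version`, which should carry the same string as `uname -v`. A `None` result means you need to look the kernel patch up by hand.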