My database crashed today. My ‘production’ database that is. The one I rely on for product demonstrations, that thing I do that Quest gives me money for. Pretty funny considering in the last few weeks my blogs have included such gems as:
- Are You a Real DBA – I argue that you are indeed NOT if you have not went through the crash and recovery nightmare
- What Databases Should You Back Up – I argue that any database you depend on should be backed up, production or not.
The cynics out there are thinking
Jeff, you crashed your database on purpose, just to give you an excuse for another blog post!
1 – I am not that smart, and
B – It crashed 15 minutes before I started a customer call.
Actually, it crashed several hours before that, the night before to be precise. I just didn’t notice it until I happened to need the database.
A Real DBA Would Have Caught This Immediately in the Alert Log
A half-decent DBA would have checked the alert log as soon as they noticed the instance was hosed. A DBA intern would have asked “Hey, isn’t there some log somewhere we can check to see what happened?” Yes, of course there is. Here’s the documentation for those developers that are morbidly curious.
Too Smart for My Own Good?
When I tried to connect to the database, I was told there was nothing handling that service. Well thank you Mr. Listener. I just assumed my database had shut down for some reason, or maybe the virtual image had a hiccup (this is pretty much what actually happened).
Instead of checking the alert log, I logged in as SYS to SQLPLUS and did a shutdown..except it was already shutdown. And then I did a STARTUP and got the ORA-03113.
I had never seen this before, what the heck?
So I go to Google, which takes me to OTN
First, check for network problems and review the SQL*Net setup. Also, look in the alert.log file for any errors.
I skimmed right over Mr or Missus 845470 advice who is labeled as a ‘Newbie.’
I am an Oracle ACE, surely I know better than this ‘n00b’!
Ten minutes into the situation, and I am still overlooking the obvious. Instead, I think – “Hey wait, I have backups! I’ll just restore my database and figure out what happened later.”
Guess what? Didn’t work.
I was able to restore and recover the database.
RMAN>ALTER DATABASE OPEN
“Sorry Sucker! We gots a ORA-03113 for you!”
OK, that’s not what Oracle told me, but it’s close. At this point, I was frustrated enough to start over and just do what I should have done an hour earlier.
Hey, ya know, maybe I’ll check the Alert.log and see what it says.
Tip #27 – Have the Alert.Log directory bookmarked so you can find it 🙂
It took me about 0.3 seconds to figure out what had happened once I saw the log. The database had ran out of space. No longer able to archive the redo logs, the database hung. Around the same time, the Virtual Machine ran out of memory, guess why? That’s right, the entire virtual disk was full. I’m not exactly sure if the memory was the chicken or the egg, but it didn’t really matter.
I did end up needing to recovery the database to make up for the crash. Once I freed up disk space, the database opened almost immediately.
Here’s what I saw in the log:
ALTER DATABASE OPEN
LGWR: STARTING ARCH PROCESSES
Thu May 19 11:11:02 2011
ARC0 started with pid=20, OS id=2912
ARC0: Archival started
LGWR: STARTING ARCH PROCESSES COMPLETE
ARC0: STARTING ARCH PROCESSES
Thu May 19 11:11:03 2011
ARC1 started with pid=21, OS id=2916
Thu May 19 11:11:03 2011
ARC2 started with pid=22, OS id=2920
ARC1: Archival started
ARC2: Archival started
ARC1: Becoming the ‘no FAL’ ARCH
ARC1: Becoming the ‘no SRL’ ARCH
ARC2: Becoming the heartbeat ARCH
Thu May 19 11:11:03 2011
ARC3 started with pid=23, OS id=2924
Errors in file c:appadministratordiagrdbmsorclorcltraceorcl_arc2_2920.trc:
ORA-19504: failed to create file "C:APPADMINISTRATORPRODUCT11.2.0DBHOMEDATABASEARCHIVEARC0000001100_0664818017.0001"
ORA-27044: unable to write the header block of file
OSD-04008: WriteFile() failure, unable to write to file
O/S-Error: (OS 112) There is not enough space on the disk.
This Will Probably Never Happen To You
Most systems have more than 1 disk. Most systems have enough space that this is never an issue. Most systems have a process in place to warn when space is running low. Most DBAs actually check or read their Alert Logs on a regular basis.
However, if you ever come across a really mundane error message, before you jump in and start doing stuff, please just check the darn logs first.
I will give myself some credit (my ego demands constant feeding). I didn’t get flustered. I knew where to look when I wanted to look. And I knew how to fix the issue. Next time…there won’t be a next time.
Yeah! You definitely pointed out the issues regarding ORA-03113, I check the alert log file and found the same as you discuss, It is pretty well
I solve this issue by increasing the archived destination space as well as restart the archiving process, but unfortunately , It may not work several times, I need to recreated control file or may be rewind back the database.
If any suggestion, You are most welcome
What a bummer about the crash. I wish I knew how to speed up my website, but I’m not overly tech savvy. I just know what seems to work. 🙁
Only a bummer in the sense that I ignored everything I had learned over the past decade of working with databases. It actually turned out for the best as it ‘learned me’ to not forget about the basics of trouble-shooting and not try to out-smart the situation.
Great story .. and what`s better – it helped me 🙂
Thanks Angela! You just made my day 🙂
I can see that this DB is on Windows, but Mike Smithers has a post on XE / Ubuntu where he described how to add some desktop notification messages to the database startup script.
you are a lucky man if you have still enough space for the alert.log. With the diag dest on the same disk as archivelogs, you can get a _really_ full disk and therefore even no information in the logs.
Maybe a small comment about ‘real dba’: I guess a real dba knows the ORA-3113 at least so far to know it’s only a symptom of another issue behind.
If it had been that bad, I think I would have checked the disk sooner. I have been fighting disk space on this image for awhile now, I have no idea why that wasn’t the first thing I checked.
About the 3113 – definitely. I’m spoiled in such that I expect those errors to be daisy-chained together.
So the question remains… Are *you* a real DBA? Maybe you are now after this incident. 😉
Well, I did stay at a Holiday Inn last night! I’ve recovered a few production databases over the years. But, it’s not my day job anymore, so I don’t think I can say yes.
“Becoming the heartbeat” < I am putting this on a t-shirt right now.
We always "know better" than others, don't we?
Great post; glad you shared it.
Yeah, I was reading through the nitty-gritty steps of the instance coming up, and then I got to that line. I’m pretty sure I’ve read that several times before, but I hadn’t paid attention to it. Pretty much says it all!
I always enjoy reading about Chet’s battles in production over at oraclenerd and didn’t think it was quite fair of me to not own up and confess my sins.
Loving my data,