NASA Operations as a Role Model

I have always been fascinated with space, space travel, the stars, etc. I always wanted to be a pilot, but alas my vision does not permit that. I still have a love for all things fast! Because of my love of space, I have always been interested in NASA, the space shuttle, etc. My dad is also into that stuff, so he partly drove my interest. Still, I would never have thought to draw parallels between the technical work I do and NASA.

A few years back Gus got me thinking about technical operations: how a data center is run, how a corporate network is run. He always made the same comparison: run the network like NASA runs a space mission. Granted, I will be the first person to say what I do is not NASA, but it is the process that we are trying to mimic, not space travel!

Now at first I was like, how do you compare the two? The answer is that it is all in the procedure. It doesn’t matter what you are doing (putting a person into space or launching a website); it is the operational theory that ensures that things go right. Sound far-fetched? Maybe. Maybe you have to manage some sort of technical operations to understand.

The basis for this theory as I see it is the following:

1. Never assume anything

2. Have a minute by minute plan for all tasks

3. Have contingencies for everything (when possible). Also note points of no return, rollbacks, etc.

4. If a mistake happens, isolate it and ensure it never happens again.

There are a ton more things to go over, but sitting here those are the biggest ones that come to mind. It boils down to never leaving anything to chance.

This type of thinking got me (kicking and screaming sometimes) to put together firm policies and procedures to keep our network operating. Every time an issue comes up, the question is: how do we prevent this from happening again? I am learning that Technical Operations is really a way of thinking, not just a job!

I am thinking about it more today because I am in the middle of reading “Failure Is Not an Option” by Gene Kranz. Gus gave me the book over a year ago. I started to read it but got sidetracked. For some reason I sat down to read it today and I finished half of the book. It is a fascinating read. I don’t know why I stopped reading it last summer. It makes me want to continue my efforts at “operationalizing” things at work. And everyone wonders why I am such a pain in the ass about documenting everything. There is a method to my madness! It is all about the process. No one person should be the only holder of any bit of information. That is why the NASA guys would pretend people had accidents during simulations, to ensure that if the real thing happened, everything would still work normally! Like I said, a fascinating read. Back to the book for me!

Storage

Jayson is pushing an online disk storage system. He is thinking NAS. This would solve a problem we have: one night’s tape backup takes so long that it gets in the way of the SQL Server database backups for the next day. We just don’t have enough time in the day to run the amount of tape we need. If we back up to a disk array and then go to tape weekly, we can speed up both backups and restores. From an engineering standpoint it is a great idea.
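The backup-window problem above comes down to simple arithmetic: data size divided by sustained throughput has to fit in the hours available. Here is a rough Python sketch of that check. Every number in it (800 GB a night, an 8 hour window, 15 MB/s for tape and 60 MB/s for disk) is a made-up placeholder, not a measurement from our environment, and the function names are just for illustration.

```python
def backup_hours(data_gb: float, throughput_mb_per_s: float) -> float:
    """Hours needed to move data_gb at a sustained throughput (1 GB = 1024 MB)."""
    if throughput_mb_per_s <= 0:
        raise ValueError("throughput must be positive")
    return data_gb * 1024 / throughput_mb_per_s / 3600


def fits_window(data_gb: float, throughput_mb_per_s: float, window_hours: float) -> bool:
    """True if the backup finishes inside the available window."""
    return backup_hours(data_gb, throughput_mb_per_s) <= window_hours


# Hypothetical figures, for illustration only.
NIGHTLY_GB = 800
WINDOW_HOURS = 8

for label, rate_mb_per_s in (("tape", 15), ("disk array", 60)):
    hours = backup_hours(NIGHTLY_GB, rate_mb_per_s)
    verdict = "fits" if fits_window(NIGHTLY_GB, rate_mb_per_s, WINDOW_HOURS) else "overruns"
    print(f"{label}: {hours:.1f} h, {verdict} the {WINDOW_HOURS} h window")
```

Plugging in real sizes and measured throughput would show how much headroom a disk-first backup actually buys before it goes into a budget request.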

We toyed with it last year, but never got to budgeting for it. Now we think we really need it. I have to see if we can get the money for it. Jay is going to do some more research for me and then we will come up with options.

Travel Day Complete

I am done with my day of travel. Amtrak sucked. The trains were older and in worse shape than normal. When I asked the conductor, he said that the normal trains were in use on the Acela route since the Acela trains were down for maintenance. Something about the “elite” route getting the good trains! The train was 20 minutes late going up, and 45 minutes late coming home. I should be used to it by now, but I am not. On a plus note, Amtrak’s automated IVR was awesome. I understand why they won all sorts of awards for the voice response system. Hey, I deal with telecom, network and VoIP gear most of my day, so hearing a good IVR actually interests me now. Some people look at nice cars, I admire IVRs. No one ever said I was normal.

We got 10 users moved downstairs. Kai had issues with the other group of 10 they wanted to move. Turns out the electrical guys didn’t plug in a conduit, so a bank of stations didn’t have power. That set us back enough that we didn’t have time to finish moving users today. They should still meet the Friday deadline to have half the users downstairs.

I also dealt with issues around upgrading our cage at our data center and canceling circuits in NYC, plus design woes about upgrading the circuits in our call center. I have been busy with our telecom provider trying to iron out all the issues. More meetings with them tomorrow!

AD is New Again

We had problems with one of our Active Directory servers the other day. No big deal, since we had more than two of them. A redundant box died. We rebuilt it, but that got us looking at our other AD servers and wondering if we wanted to rebuild them too. I tinkered with the new box today, and looked at our AD layout to see how we can juggle servers a bit.

This change in boxes also got me to start our project of moving our servers to a new IP block. This is mainly for tidiness. Our other office, the call center, is newer and has a better internal IP layout. We are just bringing our office up to the same standards as the call center.

New Technologies

We have been really busy recently. Normal day-to-day work has been keeping some other projects on hold. One is the replacement of some servers in our data center.

Other projects include trying to clean up our notification system. I need a day or two of just doing that, and I cannot find the time.

Jayson, Kai, and I also have several “new technologies” that we want to try out to see if we can use them for the office. Many of them we think would be great, but they are months if not a year or so away from seeing the light of day. Because the lead time may be so long, we don’t actually look at some of this stuff. We need to, because that is how we advance our office. I want to find time to make sure we all can test what we think will be helpful. On that list we have SharePoint portal services for an intranet site, Novell eDirectory as a possible Active Directory replacement, ZENworks to deploy software, SMS for the same purpose, several SUSE-based projects, and more.

Documentation

I try to keep good documentation on what goes on in the office. Gus says it is good. I always think it can be better. The problem with documentation is that it gets stale quickly if you are in a very dynamic environment like the one I work in.

We have a document we aptly call “What Runs Where?” It is a list of all tasks and jobs that run in our production environment. The problem I keep having is that no one updates the damn thing. We tell everyone: update the file. No one does. Today will be a day to bust heads and get the doc updated. I hope it will work. I need certain information in this document to be complete so I can work on a new notification project I am planning.

After that I really should update our network Visio maps. Then our standards documents, disaster recovery, etc. You get the picture. All these documents are really detailed, but after a few weeks the details tend to change. And if you haven’t ever done documentation before, it is really boring. Except Visio. I find working in Visio to be therapeutic. Not sure why, but I do. Call me crazy…

In The Pool Monday, And Some Work

Monday Jayson came over and we moved lots of data around at work. We finished moving all user data off of a file server. To do this we first had to replicate it to another server, which I did earlier in the week. Then we had to disable all network accounts so we could refresh any files that had changed since the replication. For that we needed no one logged into the system, so that no files would be locked. Once that was done, we manually went in and changed each user’s home and roaming profile directories. Once those were changed, we tested the changes and re-enabled all the accounts.
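The “refresh any files that had changed” step is the part most worth scripting. Below is a minimal Python sketch of the idea, not the tool that was actually used for the move. It assumes plain file copies are good enough (it does not carry over permissions or ACLs), that nobody is logged in so nothing is locked, and the function name is made up for illustration. A file gets copied if it is missing on the new server, or if the source copy is newer or a different size.

```python
import os
import shutil
from pathlib import Path


def refresh_changed_files(src_root, dst_root):
    """Copy files that are missing from dst_root or have changed in src_root.

    Returns the sorted relative paths (with forward slashes) of copied files.
    """
    src_root, dst_root = Path(src_root), Path(dst_root)
    copied = []
    for dirpath, _dirnames, filenames in os.walk(src_root):
        for name in filenames:
            src = Path(dirpath) / name
            rel = src.relative_to(src_root)
            dst = dst_root / rel
            if dst.exists():
                s, d = src.stat(), dst.stat()
                # Skip files that are not newer and have the same size.
                if s.st_mtime_ns <= d.st_mtime_ns and s.st_size == d.st_size:
                    continue
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)  # copy2 also preserves the modification time
            copied.append(rel.as_posix())
    return sorted(copied)
```

You would call it with the old and new share paths and review the returned list before re-enabling accounts. An empty list on a second run is a cheap sanity check that the two copies agree.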

We will keep the old data and the old file server around for a week or so to make sure we have no issues with the changeover. We have been using the new server for a few weeks with other data on it, so I am not that worried. Still, you can’t be too careful with other people’s data.

Once the work was done, John and Dave came over and we all went to the pool in my building. We hung out there for a while and then went out to eat at Outback. Mmm, Outback. I haven’t gone there two days in a row in ages. Can’t do that too often, but once in a while is nice. I didn’t even order a meal. I got stuffed on appetizers. It was great.

Dave, John, & Jayson went back to John’s to play Halo. I had to do my laundry, and my cold was making me feel like crap, so I stayed away from a late night of blowing stuff up.

Monday was productive and fun at the same time. Well not at the same time, but it was productive for some of the day and fun the other part:)

Remote Power Control

I think we designed our data center pretty well. Keith and I have spent hours over the past 2 years toiling over different ways to make it better. One area where I think we need improvement is the ability to power cycle a server remotely. We can view what is on our KVM remotely. I can go into the BIOS or configure a network card even if the computer is not on the internet. This is thanks to a backup dial-up connection and an IP KVM. The only thing I cannot do is hit the power button on a computer if it crashes. That would be the ultimate ability. Besides changing tapes, I would have no other reason to go to our data center on a daily basis.

This would also save us money. We pay to have someone go to our cage and do what is called “remote hands”.

3TB, Still Building

Jayson started building a 3.x terabyte disk array Friday afternoon. It began to initialize Friday. As of last night it was only 53% complete. I will give it a few more days before I get worried. 3 terabytes is a lot of drive space. We currently buy Promise SATA and IDE RAID arrays. For the money, they are awesome. You get an unbelievable amount of storage space for so little, and the speed is not far off that of their more expensive SCSI disk cousins. These arrays actually use a SCSI backplane back to the servers they connect to.

In the past year I have gone from a technology snob who only wanted nice SCSI disk arrays from the likes of HP & Dell to a huge proponent of Serial ATA. I have to say Promise makes a good product. The only downside may be that we go through drives quickly. Not sure if it is coincidence or what we do, but I have had 4 drives go bad over the past year across a total of 5 disk arrays. Now when you consider that each array has 15 drives, it may not seem like a lot. I may be overreacting. In fact I think I am, but I have other SCSI arrays with no problems. It isn’t an issue yet, but I am keeping my eye on the situation.
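To put a number on that, here is the back-of-the-envelope math as a quick Python sketch. It assumes all 5 arrays were fully populated with 15 drives each and in service for the whole year; the function name is made up for illustration.

```python
def annual_failure_rate(failed_drives: int, arrays: int, drives_per_array: int,
                        years: float = 1.0) -> float:
    """Fraction of the drive population that failed per year."""
    total_drives = arrays * drives_per_array
    if total_drives <= 0 or years <= 0:
        raise ValueError("need a positive drive count and time span")
    return failed_drives / (total_drives * years)


# 4 failures across 5 arrays of 15 drives (75 drives total) in one year.
rate = annual_failure_rate(failed_drives=4, arrays=5, drives_per_array=15)
print(f"{rate:.1%}")  # prints 5.3%
```

Roughly one drive in nineteen per year is worth watching, and tracking the same figure for the SCSI arrays would make the comparison fair.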

All in all, SATA or IDE RAID is solid and more than half the cost of its SCSI equivalent.

Work, It Was Hot

I got a lot done at work today. I set up 4 websites. I configured IIS, our Cisco LocalDirectors, etc. Now all I need to do is get the SSL IDs for them. Jayson got cracking on the 3 new servers we just got in. 1 is done, and another is partially done. We also got through a bunch of support tickets. All in all a good day, except for the heat.

We fled the heat to the server / telecom room for a lot of the day. It was just too hot, and I couldn’t sit in my office all day knowing AC was 20 feet from me. I still ended up out in the heat for at least half the day.

Most of the office left early to play softball. We stayed and finished up some work. I am not ready to play softball yet.

Check out the Mo-blog. One of our developers, James, went out and bought a portable pool. They filled it with water and I got some pictures of people sitting in the office with their feet in the pool. It is really funny. Yes, it was that hot…