Patience Wins the Battle of the Bug
Have you had a system bug drive you up the wall, across the ceiling and then back down again? Have you stared at a screen over and over hoping to find the cause of your angst? If you answered “yes” to either of these questions, you will have no trouble understanding what I just went through while cloning some systems on AWS.
To ease the deployment of Red Hat Enterprise Linux systems on AWS, I developed a set of scripts that I put on every system I build. They configure the systems as I would like, and they also dynamically set networking and security parameters at boot. This allows my applications to adjust a bit more easily to changing server names and addresses (often a problem in cloud environments).
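The scripts themselves aren't shown here, but a boot-time step of the kind described might look something like this sketch. It assumes the EC2 instance metadata service (the original, unauthenticated IMDSv1, which matches the RHEL 6 era) and a hypothetical output file that application start-up scripts could source:

```shell
#!/bin/sh
# Hypothetical sketch: capture the instance's current name and address
# at boot so applications can adjust to values that change across
# clones and reboots. MD is the standard EC2 metadata endpoint; the
# output path is an assumption for illustration.
MD=http://169.254.169.254/latest/meta-data

INSTANCE_HOSTNAME=$(curl -s "$MD/local-hostname")
INSTANCE_IP=$(curl -s "$MD/local-ipv4")

# Record the values where other start-up scripts can source them.
printf 'INSTANCE_HOSTNAME=%s\nINSTANCE_IP=%s\n' \
    "$INSTANCE_HOSTNAME" "$INSTANCE_IP" > /etc/sysconfig/instance-identity
```

A script like this would run early in the boot sequence (from rc.local or an init script on RHEL 6), before anything that depends on the recorded values.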
I had a need for multiple systems of one type that I had previously configured. I wanted to clone a few of the systems using AWS AMIs. I kicked off the process on AWS, made my images and then launched the new instances. Everything looked great — until I rebooted them. I couldn’t connect. Since this was EC2, I did not have the option of investigating through a console. Amazon’s EC2 interface indicated that nothing was wrong with the systems. My lack of an SSH connection was telling me otherwise. Interestingly, I could see that my scripts were running, because DNS on Route 53 was correctly being updated on each reboot. Still, though, a system that I cannot SSH into is just about useless to me.
I redid the cloning a few times with the same unfortunate results. I had the machine in different states as I cloned it. Nothing fixed the problem. I then thought that since my scripts change security and networking parameters, they must be breaking something. I took them all off and cloned the system. A few hours later, I eliminated them as the culprit when SSH still failed. At this point, after a bit more troubleshooting, I gave up. By now, I had spent a few days tearing my hair out over this. I stopped trying to clone my configured systems and implemented a workaround.
A few months later, I needed to clone again. I ran into the same problem. But this time, I wasn’t going to let it defeat me. After a few hours of investigating, I took a look at the SSH configuration file, sshd_config, on the initial system launch. After all, SSH was what really wasn’t working, so I focused my energy there. At the end of the file I found:
permitrootlogin without-passwordUseDNS no
Wait a minute. That’s not valid! Two directives had been run together onto one line — there needs to be a newline between them! I changed it to:
permitrootlogin without-password
UseDNS no
I rebooted and everything worked, including SSH. I rebooted again and still no problems. I still do not know what causes this to happen after the initial reboot, but I was thrilled to figure out what the problem was. It seems to happen on about 90 percent of RHEL 6.4 clones. Now, my scripts clean this up on launch and I do not have anything to worry about. It took a while to get to the bottom of this, but there was a great sense of satisfaction when I did.
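The launch-time cleanup described above could be sketched along these lines. The file path, the demo input and the exact substitution are assumptions based on the corrupted line quoted earlier; a real version would point at /etc/ssh/sshd_config rather than a sample file:

```shell
#!/bin/sh
# Hypothetical sketch of a launch-time cleanup: if "UseDNS" has been
# glued onto the end of the PermitRootLogin line, split it back into
# two directives. The path is parameterized so this demo is harmless.
CONF=${1:-sshd_config.sample}

# Demo input reproducing the corrupted line found in the cloned AMI.
printf 'permitrootlogin without-passwordUseDNS no\n' > "$CONF"

# Insert a newline between the run-together directives; a file that
# is already correct is left untouched (GNU sed).
sed -i 's/\(without-password\)\(UseDNS\)/\1\n\2/' "$CONF"

# Show the repaired file: the two directives, each on its own line.
cat "$CONF"
```

Running a check like this on every launch costs almost nothing and would have saved the days of hair-tearing described above.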