Failure is not an option: Ideas

This page is a repository for ideas to develop in the book.

  1. Cover Art
    World Trade Center Disaster
    Fire in a server room


  2. Books to read, things to do
    1. Interview the IT departments of some of the companies that were affected by 9/11.  See this list.
    2. Go read
      1. Fire in the Computer Room, What Now?
      2. Try Automating UNIX and Linux Administration
      3. Microsoft has an interesting page about reliable computing.  I should probably add something about Microsoft systems.
    3. Read articles on Wikipedia
      1. Fault Tolerance

    4. Read pages on the internet
      1. Fault handling at EventHelix
    5. Interview Edmond
      1. We've used Hunt Engineering for datacenter mechanical
        and electrical engineering. I haven't personally worked with them, but
        the results and feedback from coworkers were very good.

        Members mailing list
    6. Reread


      DeMarco, Tom. Controlling Software Projects: Management, Measurement and Estimation.
      Englewood Cliffs, NJ: Prentice-Hall, 1982.
      Dorset House, New York, 1995. ISBN 0-932633-34-X

  3. Management
    How about a CISO? (16-Dec-2005 SANS)
     --State of Information Security 2005 Report Finds Security-Related
    Events on the Rise
    (12 December 2005)
    The State of Information Security 2005 report from CIO Magazine and
    PricewaterhouseCoopers found that security-related events have increased
    22.4 percent since last year. Just 37 percent of the companies
    responding to the survey have established a security plan; twenty-four
    percent plan to implement one in the next year. The number of
    organizations with a CISO or CIO rose from 31 percent last year to 40
    percent this year. Among organizations with a chief information
    security officer (CISO) or Chief Security Officer (CSO), 62 percent have
    security plans in place. The study surveyed more than 8,200 IT security
    executives in 63 countries around the world.
    [Editor's Note (Schultz): The fact that only 37 percent of the companies
    that responded to this survey have a security plan is not a very good
    sign. I fear that Donn Parker may have been right when he asserted that
    the practice of information security is more like "folk art" than
    anything else. ]
    1. S&M
      1. Goals of measurement
        1. What works and what doesn't
        2. capacity planning
        3. identify systems for sunsetting

      2. Things to measure
        • uptime
        • software defect density
        • hardware reliability
        • machines ready to sunset
        • Access logs
        • Error logs
      3. Value of processes
      4. Uses and abuses of the data
  4. Human Relations
    1. A quote from SANS:
      --Engineer Indicted for Alleged Theft of Trade Secrets
      (23 December 2005)
      An engineer has been indicted for alleged theft of trade secrets.
      Suibin Zhang allegedly downloaded proprietary files from Marvell
      Semiconductors, Inc, after accepting a position with Broadcom, a Marvell
      competitor. Zhang had access to the Marvell data because his former
      employer, Netgear Inc., was a Marvell customer. Zhang then allegedly
      loaded the files onto a Broadcom-issued laptop and emailed some trade
      secrets to other Broadcom employees. Zhang entered a not guilty plea
      and was released on a US$500,000 bond. If convicted on all counts,
      Zhang faces a maximum jail sentence of 75 years and a fine of in excess
      of US$2 million.
      [Editor's Note (Honan) Most companies access policy disables user
      accounts for employees who have left the company. This is an example
      of how that policy should be extended to include external users from
      partner companies or suppliers with employees who have access to
      sensitive data.]
  5. Facilities
    1. Site location - don't go in a high rise!  Don't go in a basement!  Ground floor or second floor is best.
    2. Physical security.  Floor-to-true-ceiling walls; be wary of hanging ceilings.  Access controls.  Personnel security for both you and your vendors.  Facilities clowns
    3. Small onsite staff for repairs and installations (and deinstallations), larger ops staff elsewhere
    4. Disaster recovery site
  6. Technical stuff
    1. Design for reliability
      1. Predicting reliability
      2. Reducing failure through redundant systems
      3. Reducing failure through highly reliable systems - internal redundancy
      4. Reducing failure through increasing quality
      5. Improving MTTR
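      The reliability items above can be made concrete with two standard formulas: steady-state availability is MTBF / (MTBF + MTTR), and N independent redundant units, any one of which is enough, give 1 - (1 - A)^N. The following is a minimal sketch, not a definitive model. The function names are hypothetical, and the independence assumption is often optimistic, since shared power or cooling breaks it.

```shell
#!/bin/sh
# availability MTBF MTTR
# Steady-state availability = MTBF / (MTBF + MTTR).
# Both arguments must be in the same unit (for example, hours).
availability() {
    awk -v mtbf="$1" -v mttr="$2" 'BEGIN { printf "%.6f\n", mtbf / (mtbf + mttr) }'
}

# redundant_availability A N
# Availability of N units in parallel when any single unit is sufficient:
# 1 - (1 - A)^N. Assumes failures are independent (an assumption, not a given).
redundant_availability() {
    awk -v a="$1" -v n="$2" 'BEGIN { printf "%.6f\n", 1 - (1 - a) ^ n }'
}

# Example:
#   availability 1000 10            # one unit, MTBF 1000 h, MTTR 10 h
#   redundant_availability 0.99 2   # two such units in parallel
```

      The second function shows why improving MTTR and adding redundancy are complementary: each raises A by a different route.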
    2. The goal: "lights out" computing
    3. logging - syslog-ng, log file processing, mod_log_spread.  What doesn't work.
    4. disaster recovery site
    5. distributed database
    6. redundant SSH daemons on different ports (see port scrambling)
    7. Scripting and script cookbook
      1. who's hitting the webserver?
        [root@angel root]# for i in $(fgrep -h finao /var/log/httpd/access_log* | awk '{print $1}' | sort -u); do
        > host "$i"
        > done
        Host not found: 3(NXDOMAIN) domain name pointer domain name pointer
        Host not found: 3(NXDOMAIN) domain name pointer domain name pointer domain name pointer domain name pointer domain name pointer
        [root@angel root]#
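        A possible cookbook variant of the loop above counts requests per client address instead of resolving each one, so the busiest clients are listed first. This is a sketch that assumes the client address is the first whitespace-separated field of each log line (as in Common Log Format). The function name top_clients and its pattern argument are hypothetical.

```shell
#!/bin/sh
# top_clients PATTERN
# Reads access-log lines on stdin, keeps those containing the fixed string
# PATTERN, and prints "count address" lines sorted by descending count
# (ties are ordered by address).
top_clients() {
    grep -F -- "$1" |
        awk '{ hits[$1]++ } END { for (ip in hits) print hits[ip], ip }' |
        sort -k1,1nr -k2,2
}

# Example (log path taken from the session above):
#   cat /var/log/httpd/access_log* | top_clients finao
```

        Reading from stdin keeps the function usable on rotated or compressed logs piped in from other tools.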
      2. Populating the equipment database
    8. Configuration management
      1. Problem ticket system
      2. Change control using RCS, CVS or perversion
        1. Set up repositories for system software and for each project
        2. check in code
        3. move code between environments
        4. check out code to dev, int, load, test, production environments using tags
        5. reverting
      3. fallback planning
    9. security
      1. The philosophy of security
        1. Separating identity, authentication, and authorization
          1. The roles of kerberos, LDAP and NIS
          2. Active Directory

      2. port scrambling
      3. Reliable software (This is misplaced)
        1. Windows just doesn't cut it (From 16-Dec-2005 SANS mailing list):
           --Versions of Windows Server 2003, Windows XP Receive Common Criteria
          Certification at EAL 4+
          (14 December 2005)
          Six versions of Microsoft Windows Server 2003 and two versions of
          Microsoft Windows XP have earned Evaluation Assurance Level (EAL) 4+ of
          the Common Criteria. Meeting the standards set by the Common Criteria
          is necessary to win federal contracts that involve dealing with
          classified information.

          [Editors' Note (Schultz): Achieving EAL 4+ certification is no small
          feat. Microsoft has truly made a lot of progress when it comes to
          security in its operating systems.
          (Guest Editor (Donald Smith): Microsoft windows evaluation was against
          the CAPP. From:
          "The CAPP provides for a level of protection which is appropriate for
          an assumed non-hostile and well managed user community requiring
          protection against threats of inadvertent or casual attempts to breach
          the system security. The profile is not intended to be applicable to
          circumstances in which protection is required against determined
          attempts by hostile and well funded attackers to breach system
          (Multiple): When a government agency says a product meets a high
          security standard, and that is a product in which dangerous flaws are
          continuously discovered and for which the vendor chooses not to release
          an existing patch while exploits for the flaw are circulating on the
          Internet, perhaps the standard (Common Criteria) is part of the problem,
          and should be reconsidered.]
          Uptime (time between boots) graphs:
          Commercial Ventilation and Vacuum (home of Failure is Not an Option): running Linux and Apache
          Real Networks: running Linux and coyote
          California Institute of Technology: running Linux and AOLserver/3.3.1+ad13
          Hewlett-Packard: running HP-UX and Apache (impressive)
          National Cash Register (NCR): underwhelming
          Microsoft, MacWorld Expo, My server world: Windows Server 2003 and IIS


        2. Kernel panics, blue screens of death, kernel logging
      4. Access control
        1. Bastion host
        2. Authentication server
          1. NIS
          2. Kerberos
          3. LDAP
          4. Active Directory?
        3. Public keys
      5. SELinux
      6. file system security and how to test it
      7. Application security
    10. non-stop Software updates
      1. testing
    11. Monitoring
      1. SNMP
      2. Customer view monitoring
      3. Internal only monitoring
      4. PC vs Web-based vs VT100 based (or thick client, thin client, anorexic client)
      5. The "known problem" problem
    12. Backups
      1. What if somebody steals a backup?
      2. Backups on tape vs. sent electronically
      3. Read-only, read-mostly, and read/write data.  Metadata
      4. Recovery
      5. Special considerations for the registry in MS Windows NT, 2000, Server 2003, Longhorn/Vista
    13. Load balancers
      1. Dedicated load balancers
        1. keepalived
        2. Linux Virtual Server
      2. Application based load balancers
      3. Hot standbys
      4. Warm standbys
      5. Cold standbys
    14. file systems
      1. unmount /boot; mount /bin, /sbin, /usr read-only; mount /etc, /home, /var, /tmp read-write
      2. chroot jail
      3. ext3 vs. ext2 vs. GFS
      4. The problems with NFS
      5. locking files against simultaneous access.
      6. Moving data: rsync, scp
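      One way to check the read-only layout from item 1 is to scan the mount table and report any filesystem that is supposed to be read-only but is mounted read-write. This is a sketch that assumes the Linux /proc/mounts column order (device, mount point, type, options). The function name is hypothetical. It reports only mount points that appear in the table, so directories that are not separate filesystems need a different check.

```shell
#!/bin/sh
# writable_mounts MOUNTPOINT...
# Reads a mount table in /proc/mounts format on stdin and prints each of the
# named mount points whose option list contains "rw".
writable_mounts() {
    awk -v want="$*" '
        BEGIN {
            n = split(want, list, " ")
            for (i = 1; i <= n; i++) check[list[i]] = 1
        }
        ($2 in check) {
            m = split($4, opts, ",")
            for (j = 1; j <= m; j++)
                if (opts[j] == "rw") { print $2; break }
        }'
}

# Example:
#   writable_mounts /bin /sbin /usr < /proc/mounts
```

      Empty output means every named filesystem found in the table is mounted read-only.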
    15. Hardware
      1. White box vs. a brand name (whatever happened to DEC?)
      2. Dual power supplies - get short power cables
      3. RAID
      4. Remote management hardware - BMC cards (IPMI), DRAC
      5. smartd
    16. Networking
      1. Security
      2. VPNs and LANs
      3. asymmetric routing
      4. Utilities
        1. DNS
        2. NTP
        3. DHCP
        4. arpwatch
      5. Fiber optic vs UTP
    17. Testing
      1. Test plans and objectives
      2. Regression testing
      3. Failure testing
      4. Test environments
        1. dev
        2. int
        3. test
        4. load
        5. production
        6. ancillary
      5. Virtual machines, IP address aliases
      6. post release testing
    18. Operations
      1. Policies, procedures, and documentation
      2. Drills
      3. training
      4. Common failure modes
        1. Power problems
        2. cooling problems
        3. IP addressing problems
      5. Equipment database
        1. location
        2. Update from DNS, arp
        3. Inventory control
        4. pointer to documentation
        5. How to deal with exceptions (one-offs), multiple IP addresses, change of equipment, virtual machines
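        For the "Update from DNS, arp" item, a first pass at populating the equipment database could turn the ARP cache into one record per host. This is a sketch that assumes the Linux net-tools `arp -an` output format, for example `? (192.0.2.10) at 00:16:3e:aa:bb:01 [ether] on eth0`. The function name and the CSV layout are hypothetical.

```shell
#!/bin/sh
# arp_to_csv
# Reads `arp -an` output on stdin and prints "ip,mac,interface" lines.
# Entries without a resolved MAC address (shown as <incomplete>) are skipped.
arp_to_csv() {
    awk '$3 == "at" && $4 ~ /^[0-9A-Fa-f:]+$/ {
        ip = $2
        gsub(/[()]/, "", ip)      # strip the parentheses around the address
        print ip "," tolower($4) "," $NF
    }'
}

# Example:
#   arp -an | arp_to_csv > equipment.csv
```

        The ARP cache only covers hosts on the local segment that were seen recently, so this complements rather than replaces the DNS and inventory sources.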
      6. Machine room organization
        1. Air flow
        2. Racks
        3. Cable Management
    19.  Cable Management
    20. A case study
      1. Design
      2. Implementation
      3. testing
  7. Structure of the book
    1. Acknowledgements page and resources page.
    2. Page last modified JavaScript
    3. Navigation links (how to do that?  Use a program?)
    4. HTML to DocBook translator?
    5. Cartoons:
      • Who knows what clowns facilities is letting into the data closets? (A clown entering a room labeled "data closet"... and he's carrying a hacksaw)
      • Measuring user satisfaction is always a good idea (a sysadmin has his feet on his desk and is talking on the phone about something he read on Slashdot. Meanwhile, his users are approaching with torches, pitchforks, shovels and boy do they look mad!)
      • Test your backups before your customer's computer fails (One man is chasing another with an ax while in the background is a computer with wisps of smoke coming from it)
      • Cable management is useful (A rack with cables going every which way. In the middle is a bubble coming from the mass with the word "help?")
      • Reliability testing (a devil pulling a wire from the back of a computer)
      • Lights out computing. (Before and after: A person grinning broadly reaching out for a light switch and standing next to a rack full of wires. Hidden in the rack is a cat. After, the room is pitch dark except for the eyes of the sysadmin, the teeth of the sysadmin, and the eyes of the cat).
      • The medieval model of computing: in a clearing is a castle, surrounded by a moat, with high walls, battlements, a high tower, with a stout door, and a princess in the tower.
      • "Good morning, your network is about to go down"  As the sun is rising, a huge earth scooping machine is about to take a slice out of the ground.
Customer Service
Facilities clowns
Customer service is considered A Good Thing
You never know what clowns facilities is going to let into your data closets
Cable Management
Failure testing
How well do your systems work when failures occur?
Lights Out Computing
The Medieval model of computing security
Network Reliability
Nobody worries about Backups until you need them
Good morning, your network is about to go down.

$Log: ideas.html,v $
Revision  2006/10/01 23:36:20  cvsuser
Initial checkin to CVS
Revision 1.2  2006/09/20 21:22:45  jeffs
Added the all_files link to a PHP script which generates a list of all files in this directory
for search engines

Revision 1.1 2006/01/05 06:02:19 jeffs
Initial revision