We are experimenting with "Jidoka" to improve reliability. In Toyota's car manufacturing version of Jidoka, any worker on the assembly line who notices a problem can push a button to stop the line. Then, they fix whatever is causing a quality problem. Similarly, we created a list of "Must be fixed today" tickets. Any team member can add something to that list, where it automatically becomes the top priority of the entire team.
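To make the rule concrete, here is a minimal sketch of how such a list could behave. It is an illustration, not how Assembla's ticket tool is implemented: the `Ticket` and `Board` classes and the `must_fix_today` flag are hypothetical names invented for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Ticket:
    title: str
    must_fix_today: bool = False  # hypothetical flag: anyone on the team may set it
    done: bool = False


@dataclass
class Board:
    tickets: List[Ticket] = field(default_factory=list)

    def add(self, title: str, must_fix_today: bool = False) -> Ticket:
        """Add a ticket; any team member may mark it 'must fix today'."""
        ticket = Ticket(title, must_fix_today)
        self.tickets.append(ticket)
        return ticket

    def line_stopped(self) -> bool:
        """The 'line' is stopped while any must-fix-today ticket is still open."""
        return any(t.must_fix_today and not t.done for t in self.tickets)

    def next_ticket(self) -> Optional[Ticket]:
        """Return the ticket the team should work on next."""
        open_tickets = [t for t in self.tickets if not t.done]
        # sorted() is stable, so tickets keep their original order within each group.
        ordered = sorted(open_tickets, key=lambda t: not t.must_fix_today)
        return ordered[0] if ordered else None
```

With this model, an escalated ticket jumps ahead of everything that was already on the board, and normal work resumes only once it is marked done.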
At Assembla, we have to meet a high standard of reliability, because if the site has problems, my mobile phone (that's where the company phone number goes after hours, and people do call) rings next to my bed, and after I get up and deal with the problem, my wife makes me sleep on the couch. Fortunately, we have reached a high level of reliability on the core repository systems - up all but a few hours in the last year - and we have never lost any data.
However, I will tell you a secret. The Assembla application is quite complicated, and it has bugs. It also has a lot of different servers and processes, and some of them are new. Sometimes the new stuff doesn't quite connect, especially after a new release. Sometimes a server is overloaded, or a queue doesn't process, or a process doesn't queue. When this happens, people are inconvenienced, and we need to fix the problem as quickly as possible. For example, after the release we did early this week, our FTP publishing tool stopped publishing.
We used this incident to test our Jidoka-style response. Our customer service rep started getting complaints, and he escalated them onto the "Must fix today" list.
How did we do? Not so well. It took two days to find and fix the problem with the FTP tool. We assigned this as a full-time job to one person, the programmer who wrote the feature originally, and kept going with all of our other jobs. It turned out that we needed admin help, since it wasn't really a code bug. If we had really wanted to get this fixed in one day, we would have stopped work on everything else and put more hands on the problem.
What about fixing problems BEFORE the deployment? Can't we do root cause analysis, and use the "five whys", and improve our testing and quality control in the development phase, and eliminate this whole class of problems? Yes, to some extent. That's the value of getting the developers involved in a deployment problem, and motivating them to make sure it doesn't happen again. But in my experience, a lot of problems show up only when you go to production deployment, and they make up a larger share of the total as fewer bugs come out of development. For example, this FTP problem originated not with a development or code problem, but with a deployment problem - a configuration wasn't set to pass the commit messages into the correct queue.
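One way to catch this class of problem is a check that runs right after each deployment and compares the live configuration against the routes the application depends on. The sketch below is only an illustration under stated assumptions: the event name `commit_message`, the queue name `ftp_publish`, and the flat dictionary layout are all hypothetical, not Assembla's actual configuration.

```python
from typing import Dict, List

# Hypothetical: the queue each event type must be routed to in production.
REQUIRED_ROUTES: Dict[str, str] = {"commit_message": "ftp_publish"}


def missing_routes(
    config: Dict[str, str],
    required: Dict[str, str] = REQUIRED_ROUTES,
) -> List[str]:
    """Return one human-readable problem per route that is absent or wrong."""
    problems = []
    for event, queue in required.items():
        actual = config.get(event)  # None when the route was never configured
        if actual != queue:
            problems.append(f"{event}: expected queue {queue!r}, found {actual!r}")
    return problems
```

An empty result means the deployed configuration matches. Anything else is a candidate for the "Must fix today" list before a customer has to report it.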
It is a goal of Jidoka to force people to go back and fix the root cause, so similar problems will not happen again. We can achieve that goal through team incentives. If we force people to stop whatever they are doing (which is very irritating) and work on the problem until it is fixed, they will be motivated to make sure that it doesn't happen again. It's similar to the motivation provided by my lovely wife when she makes me sleep on the couch with my phone. That is another reason that I am leaning toward stopping ALL other work in the development and deployment team while we are working on a "Fix today" problem. It is very inefficient, but it provides the right incentives for fixing the root cause.
We have to get better at this process. We will practice. On the other hand, I would not go as far as this site, which recommends "Employees should have at least twelve hours of training in Root Cause Analysis, four hours in Process Mapping and six months of intensive practice before Jidoka is employed." That sort of bureaucratic response always makes me laugh - and I built a whole company around supporting 6-sigma, which is riddled with this crap. Maybe that's why I don't work there anymore. If the process makes sense, you should understand it the first time someone suggests it to you, and if it is useful, you will be motivated to use it and work on it, with a lot less than six months of "intensive practice". I am open to suggestions.