What I learned from a 7-year rewrite
As of Sept 12th, 2012, the new Simulink and Stateflow editors are available to the public: www.mathworks.com/downloads
Different people account for the timeline differently, but to an approximation the rewrite took about 7 years and occupied a number of developers ranging from a low of 4 to about 18 at maximum intensity. The resulting software unifies and replaces the entire front ends of two separate diagram editing platforms: Simulink (22 years old, millions of lines and test points, 100K+ customers), and Stateflow (16 years old, many hundreds of thousands of lines, 50K+ customers). So for almost one third of Simulink’s history and almost half of Stateflow’s, their front ends have been under rewrite. It was a massive project internally known as the “Unified Editors.”
We expect them to be a major success with users. They certainly represent a total overhaul both architecturally and interactively. They’re a nice piece of technology. And I don’t say that because I helped lead it—it’s rare for me to be able to see anything but flaws in projects I’m involved in. They have been in the hands of pre-release customers for some time, and are receiving very favorable feedback.
It is unusual for rewrites of this magnitude to succeed. MathWorks management deserves a great deal of credit for allowing the project to converge instead of losing faith and cutting it off midway. That said, I will never in my career undertake another project in the same way I did this one. The Unified Editors have shaped me as much as I’ve shaped them.
Here is a laundry list of what a 7-year rewrite has taught me. I expect to write more on a bunch of these items in follow-up pieces. But for now, here is a pure dump of what I know now that I didn’t know when I started (none of which younger me would have taken on faith). And neither should you. It’s worth noting that other people involved in the project have different views and took different lessons. I was a chief initiator, the technical lead, an individual contributor, and one of three development managers for the project. And, of course, 7 years ago we promised the work in 2 years. So go ahead with your 2-year rewrite and we’ll compare notes in 2019.
Estimation
- You can’t estimate anything longer than 6 months
- You can’t estimate anything that isn’t broken into specific, 1-week tasks
- A task that just says “Implement X” is not yet understood
- Putting large-scale items on a long timeline is fun but useless
- Team members have a better assessment of readiness than direct management
- Outside observers may have a better assessment of readiness than team members
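To make the first few rules concrete, here is a minimal sketch of a plan checker. Everything in it is my own assumption for illustration: the `Task` type, the `plan_warnings` function, the 5-working-day and 130-working-day thresholds standing in for "1 week" and "6 months", and the crude pattern used to spot an "Implement X" title. It also treats the tasks as sequential work for one person, which real plans rarely are.

```python
import re
from dataclasses import dataclass

# Hypothetical thresholds: one working week, and roughly six months of working days.
MAX_TASK_DAYS = 5
MAX_HORIZON_DAYS = 130


@dataclass
class Task:
    title: str
    estimate_days: float


def plan_warnings(tasks):
    """Return a list of warnings for a plan that breaks the estimation rules."""
    warnings = []
    for task in tasks:
        if task.estimate_days > MAX_TASK_DAYS:
            warnings.append(f"{task.title!r}: longer than one week, break it down")
        # Crude heuristic: a title that is only "Implement <one word>" is too vague.
        if re.fullmatch(r"implement\s+\S+", task.title.strip(), flags=re.IGNORECASE):
            warnings.append(f"{task.title!r}: too vague to estimate")
    # Assumption: tasks are done one after another by a single person.
    total = sum(task.estimate_days for task in tasks)
    if total > MAX_HORIZON_DAYS:
        warnings.append(f"plan totals {total:g} days, beyond a 6-month horizon")
    return warnings


if __name__ == "__main__":
    plan = [
        Task("Implement undo", 15),
        Task("Add keyboard shortcut for zoom-to-fit", 2),
    ]
    for line in plan_warnings(plan):
        print(line)
```

A check like this can't make an estimate right. All it does is flag the parts of a plan that nobody understands yet.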
Big bang vs. staged delivery
- Big bang looks good because of early underestimation
- Things will end up taking as long as incremental staged delivery anyway
- Big bang happens because of doubts about sustained institutional investment in long programs
- Staged delivery can be cut off at any point when more important pressures arise, leaving a program half-complete
- It’s easy to know you’re converging when you are converging
- It’s impossible to tell if you will converge if you are not yet converging
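The last two points can be sketched in code. This is only an illustration under my own assumptions: that you sample the count of open issues at regular intervals, and that "converging" means the count fell in every one of the last few intervals. The function name `convergence_status` and the `window` parameter are hypothetical. Note what the function never returns: a verdict that the project will not converge.

```python
def convergence_status(open_issues, window=4):
    """Classify a series of open-issue counts sampled at regular intervals.

    Returns "converging" if the count dropped in each of the last `window`
    intervals, and "unknown" otherwise. There is deliberately no
    "not converging" answer: a flat or rising count tells you nothing
    about whether convergence will eventually arrive.
    """
    if len(open_issues) < window + 1:
        return "unknown"
    recent = open_issues[-(window + 1):]
    deltas = [later - earlier for earlier, later in zip(recent, recent[1:])]
    if all(delta < 0 for delta in deltas):
        return "converging"
    return "unknown"


if __name__ == "__main__":
    print(convergence_status([120, 130, 125, 110, 90, 70]))
    print(convergence_status([50, 60, 75, 90, 110]))
```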
Rewrites & backwards compatibility
- A large system has more behavior than anyone thinks it does (maybe 10-100X)
- Everything is the way it is for a reason
- All the absurdities are that way because they needed to be that way at some point
- People write what they can get away with
- All quirks are baked in as assumptions to other existing systems
- Avoid replacing successful legacy systems
- Develop something else instead
- Think creatively about how not to do a rewrite
- A new product is 10-100X easier than a replacement for a large legacy system
- Write something new that can gradually come to eclipse the feature set of the old
- Backwards compatibility is a drag on developers, products, and quality (but may be necessary for customer/business reasons) (Look at what Apple gets away with. Nobody loves them for their lack of commitment to backwards compatibility. People love Apple for the products and technology that ditching old standards permits them to produce.)
- In a project designed as a new framework plus a port of a legacy system to that framework, production of the framework is 10-100X easier than the port
- It’s hard not to consider them 50/50 in planning, but they’re not
Architecture
Sitting on top of legacy systems instead of cleaning them up has pros and cons:
PROS
- You don’t disturb anybody working in that codebase
- You don’t regress existing functionality
- You decouple shipping schedules
- You rely on no other teams for deliverables and they don’t rely on you
CONS
- You are subject to all the vagaries of the existing codebase
- Rather than smooth out rough patches, you make new code rough to conform to them
- At the end, you still have all the cleanup to do
- You do not have to communicate with other teams, so you have to force yourself to (we didn’t)
Team
- The team has a more realistic assessment of readiness than management
- Unrealistic targets are really demoralizing and demotivating
- Missing targets, realistic or not, is demoralizing
- Protracted stabilization is soul-crushing
- Customer exposure is a big morale boost
Scope
- Scope should be aggressively minimized
- Features that management believes in more than developers do are demoralizing
- Cutting features is great, the more the better
- Minimum viable product considerations are very hard to evaluate when replacing an existing system
Performance
- Modernizing an old codebase will require more memory
- Dedicated performance engineers really help
- Performance, especially of interactions, is very hard to lock down
Testing
- Test coverage of the existing system will not be good enough
- Passing the old tests is essential, but doesn’t indicate anything about the quality of the new work
- The failures of the new system will be very different and the existing tests cover mostly the old failures
Refactoring
- It really makes a difference
- People don’t want to work in a dirty environment
- The team knows what isn’t working and needs support to be allowed to fix it properly
- Done properly it does make remaining work go faster (can make the difference between converging and diverging)
Full-stack Iterations
- You must eventually stop rejecting and throwing out iterations and settle on one to prepare for shipment
- The key to the utility of iterations is quantity and speed
- Anything that slows iterations down, or raises the cost of producing or discarding them, is getting in the way
- Never ship features based on an iteration that may not be the final one
Semi-related features delivered on the way
- They are a distraction
- They are never excellent features because they aren’t what the team really means to do
- The work required to bring them to shipping state and maintain them during the main effort is a huge distraction
- You do learn a little about existing systems and what bringing them to production quality entails
- They remain a huge drag on attention and resources even after the primary shipment because they need to be ported
- It takes organizational support not to demand them from a long program
Clients
- Having clients too early is deadly
- Mismatched schedules and requirements will warp growing systems
- The integrity of the framework is compromised as shortcuts are taken to satisfy immediate needs of clients out of the appropriate construction sequence
- You will never feel ready for clients even when you are
- At the point the work is ready, turning away clients is destructive
- It takes three clients to sufficiently drive generalization of a framework
Stabilization
- You have to turn it on before it’s ready in order to get it ready
- The issues involved in really running a new system in production cannot be simulated
- You should not plan to turn on for the first time and ship in the same release
Requirements
- Shifting requirements are a reality
- But in-flight design changes must be minimized
- Off-the-cuff choices need to be weighed for their cost against leaving things the way they are
- Complex systems need a design document for developers, testers, doc, and usability to work off jointly
- It must be written at a fairly low level; a high-level one doesn’t specify anything sufficiently
Prototypes, walkabouts, demo nights
- Never ship features based on early iterations (did I already say that?)
- Prototypes always appear closer to ship-readiness than they are in reality
- Prototype code must be kept out of the production stream
- That requires development procedures that enable it
- Prototype code that leaks into production will cause problems for a very long time
- Customer exposure is a big morale boost and mitigates risk
- Walkabouts have to be well-managed and infrequent so as not to leave people waiting and uncertain
- Some people’s work shows more easily than others’
- Some people like this kind of exposure more than others
- It’s good for upper management to meet the team and talk to them
- There is a danger of pressure to change design on-the-fly at these events
- Pressure to show work for a deadline leads to shortcuts
- That’s OK in prototype code, not in production
- The deadline for something to be shown in a demo/walkabout must be decoupled from the deadline for submission to a production stream (they may have very little to do with one another)
- But it’s nearly impossible for a viewer of the work to understand that it is nowhere near complete
That’s all I can think of at the moment. There’s a lot I want to write more about. I can’t wait to apply what I’ve learned to making more software, better, faster.