What I Learned From a 7-Year Rewrite

As of Sept 12th, 2012, the new Simulink and Stateflow editors are available to the public: www.mathworks.com/downloads

Different people account for the timeline differently, but to an approximation the rewrite took about 7 years and occupied anywhere from a low of 4 developers to about 18 at maximum intensity. The resulting software unifies and replaces the entire front ends of two separate diagram editing platforms: Simulink (22 years old, millions of lines and testpoints, 100K+ customers), and Stateflow (16 years old, many hundreds of thousands of lines, 50K+ customers). So for almost one third of Simulink’s and nearly half of Stateflow’s history, their front ends have been under rewrite. It was a massive project internally known as the “Unified Editors.”

We expect them to be a major success with users. They certainly represent a total overhaul both architecturally and interactively. They’re a nice piece of technology. And I don’t say that because I helped lead it—it’s rare for me to be able to see anything but flaws in projects I’m involved in. They have been in the hands of pre-release customers for some time, and are receiving very favorable feedback.

It is unusual for projects of this magnitude to succeed. Experienced project managers have told me they have never seen it happen. MathWorks management deserves a great deal of credit for allowing the project to converge instead of losing faith and cutting it off midway. That said, I will never in my career undertake another project in the same way I did this one. The Unified Editors have shaped me as much as I’ve shaped them.

Here is a laundry list of what a 7-year rewrite has taught me. I expect to write more on a bunch of these items in follow-up pieces. But for now, here is a pure dump of what I know now that I didn’t know when I started (none of which younger me would have taken on faith). And neither should you. It’s worth noting that other people involved in the project have different views and took different lessons. I was a chief initiator, the technical lead, an individual contributor, and one of three development managers for the project. And, of course, 7 years ago we promised the work in 2 years. So go ahead with your 2-year rewrite and we’ll compare notes in 2019.

Estimation

  • You can’t estimate anything longer than 6 months
  • You can’t estimate anything that isn’t broken into specific 1-week tasks
  • Tasks that say only “Implement X” are not yet understood
  • Putting large-scale items on a long timeline is fun but useless
  • Team members have a better assessment of readiness than direct management
  • Outside observers may have a better assessment of readiness than team members

Big bang vs. staged delivery

  • Big bang looks good because of early underestimation
  • Things will end up taking as long as incremental staged delivery anyway
  • Big bang happens because of doubts about sustained institutional investment in long programs
  • Staged delivery can be cut off at any point when more important pressures arise, leaving a program half-complete
  • It’s easy to know you’re converging when you are converging
  • It’s impossible to tell if you will converge if you are not yet converging

Rewrites & backwards compatibility

  • A large system has more behavior than anyone thinks it does (maybe 10-100X)
  • Everything is the way it is for a reason
  • All the absurdities are that way because they needed to be that way at some point
  • People write what they can get away with
  • All quirks are baked in as assumptions to other existing systems
  • Avoid replacing successful legacy systems
  • Develop something else instead
  • Think creatively about how not to do a rewrite
  • A new product is 10-100X easier than a replacement for a large legacy system
  • Write something new that can gradually come to eclipse the feature set of the old
  • Backwards compatibility is a drag on developers, products, and quality (but may be necessary for customer/business reasons) (Look at what Apple gets away with. Nobody loves them for their lack of commitment to backwards compatibility. People love Apple for the products and technology that ditching old standards permits them to produce.)
  • When a system is designed as a new framework plus a port of a legacy system onto that framework, building the framework is 10-100X easier than the port
  • It’s hard not to consider them 50/50 in planning, but they’re not

Architecture

Sitting on top of legacy systems instead of cleaning them up has several characteristics:

PROS

  • You don’t disturb anybody working in that codebase
  • You don’t regress existing functionality
  • You decouple shipping schedules
  • You rely on no other teams for deliverables and they don’t rely on you

CONS

  • You are subject to all the vagaries of the existing codebase
  • Rather than smooth out rough patches, you make new code rough to conform to them
  • At the end, you still have all the cleanup to do
  • You do not have to communicate with other teams, so you have to force yourself to (we didn’t)

Team

  • The team has a more realistic assessment of readiness than management
  • Unrealistic targets are really demoralizing and demotivating
  • Missing targets, realistic or not, is demoralizing
  • Protracted stabilization is soul-crushing
  • Customer exposure is a big morale boost

Scope

  • Scope should be aggressively minimized
  • Features that management believes in more than developers do are demoralizing
  • Cutting features is great, the more the better
  • Minimum viable product considerations are very hard to evaluate when replacing an existing system

Performance

  • Modernizing an old codebase will require more memory
  • Dedicated performance engineers really help
  • Performance, especially of interactions, is very hard to lock down

Testing

  • Test coverage of the existing system will not be good enough
  • Passing the old tests is essential, but doesn’t indicate anything about the quality of the new work
  • The failures of the new system will be very different and the existing tests cover mostly the old failures

Refactoring

  • It really makes a difference
  • People don’t want to work in a dirty environment
  • The team knows what isn’t working and needs support to be allowed to fix it properly
  • Done properly it does make remaining work go faster (can make the difference between converging and diverging)

Full-stack Iterations

  • You must eventually stop rejecting and throwing out iterations and settle on one to prepare for shipment
  • Key to utility of iterations is quantity and speed
  • Anything that slows iterations down, or raises the cost of producing or discarding them, is getting in the way
  • Never ship features based on an iteration that may not be the final one

Semi-related features delivered on the way

  • They are a distraction
  • They are never excellent features because they aren’t what the team really means to do
  • The work required to bring them to shipping state and maintaining them during the main effort is a huge distraction
  • There is a little bit learned about existing systems and what bringing them to production quality entails
  • They remain a huge drag on attention and resources even after the primary shipment because they need to be ported
  • It takes organizational support not to demand them from a long program

Clients

  • Having clients too early is deadly
  • Mismatched schedules and requirements will warp growing systems
  • The integrity of the framework is compromised as shortcuts are taken to satisfy immediate needs of clients out of the appropriate construction sequence
  • You will never feel ready for clients even when you are
  • At the point the work is ready, turning away clients is destructive
  • It takes three clients to sufficiently drive generalization of a framework

Stabilization

  • You have to turn it on before it’s ready in order to get it ready
  • The issues involved in really running a new system in production cannot be simulated
  • You should not plan to turn on for the first time and ship in the same release

Requirements

  • Shifting requirements are a reality
  • But in-flight design changes must be minimized
  • Choices made off the cuff need to be weighed for their cost against leaving things the way they are
  • Complex systems need a design document for developers, testers, doc, and usability to work off jointly
  • It must be written at a fairly low level; a high-level document doesn’t specify anything sufficiently

Prototypes, walkabouts, demo nights

  • Never ship features based on early iterations (did I already say that?)
  • Prototypes always appear closer to ship-readiness than they are in reality
  • Prototype code must be kept out of the production stream
  • That requires development procedures that enable it
  • Prototype code that leaks into production will cause problems for a very long time
  • Customer exposure is a big morale boost and mitigates risk
  • Walkabouts have to be well-managed and infrequent so as not to leave people waiting and uncertain
  • Some people’s work shows more easily than others’
  • Some people like this kind of exposure more than others
  • It’s good for upper management to meet the team and talk to them
  • There is a danger of pressure to change design on-the-fly at these events
  • Pressure to show work for a deadline leads to shortcuts
  • That’s OK in prototype code, not in production
  • The deadline for something to be shown in a demo/walkabout must be decoupled from the deadline for submission to a production stream (they may have very little to do with one another)
  • But it’s nearly impossible for a viewer of the work to understand that it is nowhere near complete

That’s all I can think of at the moment. There’s a lot I want to write more about. I can’t wait to apply what I’ve learned to making more software, better, faster.

Don’t write “smart,” “fast,” or “safe” code

This is really about naming. Never name anything “smartXXX,” “fastYYY,” or “safeZZZ.” I see it happen all the time.

The crux of the biscuit is that once you’ve called a function, method, class, or what-have-you “smartXXX,” “fastYYY,” or “safeZZZ,” why the hell would I ever use ordinary XXX, YYY, or ZZZ? Am I supposed to prefer “dumb,” “slow,” or “dangerous” code? If the version you’ve produced is purely superior and otherwise functionally equivalent to the existing code, get rid of the existing stuff. If it differs in some meaningful way, it can’t ONLY be that it differs in being smarter, faster, or safer. Otherwise, you’d purge the other one. Tell us something useful in its name about why the old one continues to exist. If your new one is “safe” because it catches exceptions and deals with them somehow where the old one allowed them to pass upward, don’t call it “safeFoo,” call it “FooNoThrow.”
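
As a minimal sketch of the “FooNoThrow” idea in JavaScript (both function names below are made up for illustration and come from no real library), the name of the variant states the one behavior that actually differs:

```javascript
// Hypothetical example: parseConfig and parseConfigNoThrow are
// illustrative names, not part of any real library.

// The original: a SyntaxError from JSON.parse propagates to the caller.
function parseConfig(text) {
  return JSON.parse(text);
}

// Not "safeParseConfig". The name says what is different:
// this variant never throws, and returns `fallback` when the
// text is not valid JSON.
function parseConfigNoThrow(text, fallback = {}) {
  try {
    return JSON.parse(text);
  } catch (err) {
    return fallback;
  }
}
```

A caller choosing between the two now knows the trade: one reports bad input by throwing, the other swallows it and hands back a default. Neither is the “dumb” or “dangerous” one.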

Mouse Button Drag Tracking in JavaScript

(Interactive demo: an HTML5 Canvas with an event log; code adapted from Jan Wolter)

One of the first obstacles you’ll encounter when trying to use HTML5 Canvas for interactive graphics experiments is that mouse button drag tracking is badly broken. First, there isn’t uniformity across browsers even for the numbering of the mouse buttons. Second, once the mouse leaves the canvas element itself, you get no more mouse events, so you won’t even hear about a mouseup event that ends a drag if it happens outside the canvas.
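
One common way around the second problem is to start the drag on the canvas but listen for the move and up events on the document while the button is held. The sketch below shows that pattern only; it is not necessarily the approach this post goes on to take. The function `trackDrag` and its callback names are hypothetical, and the code assumes the standard DOM numbering where `event.button === 0` is the primary button, so it does nothing about the cross-browser numbering problem.

```javascript
// Sketch only: trackDrag and its callback names are hypothetical.
// `target` is where drags start (e.g. a canvas element);
// `root` is where move/up events are watched during a drag (e.g. document).
function trackDrag(target, root, { onStart, onMove, onEnd }) {
  function handleMove(event) {
    onMove(event.clientX, event.clientY);
  }

  function handleUp(event) {
    // The drag is over: stop listening on the root.
    root.removeEventListener("mousemove", handleMove);
    root.removeEventListener("mouseup", handleUp);
    onEnd(event.clientX, event.clientY);
  }

  target.addEventListener("mousedown", (event) => {
    // Assumes standard DOM numbering, where 0 is the primary button.
    if (event.button !== 0) return;
    onStart(event.clientX, event.clientY);
    // Listen on the root, not the target, so events that happen
    // outside the target are still delivered during the drag.
    root.addEventListener("mousemove", handleMove);
    root.addEventListener("mouseup", handleUp);
  });
}
```

In a page this would be called as `trackDrag(canvas, document, { onStart, onMove, onEnd })`, where `canvas` is the canvas element. Passing the two event targets in as arguments keeps the sketch independent of any particular page.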

A Fluid Simulation in JavaScript for the HTML5 Canvas

I wrote a small iPhone app a little while ago called FireWater, for which I developed a simple pressure-based fluid simulation.

I thought it would be fun to port it to the HTML5 Canvas in JavaScript to learn the Canvas myself and teach a little about writing simulations.

If your browser supports the HTML5 Canvas element (Chrome, Safari, Firefox, Opera, IE9) you should see the canvas above filled with black. Once you have a black canvas, click and drag on it to see it in action.

The GitHub repository for the source can be found here.

Hello, again

UpFork!
I have decided to start writing here again. I will be writing about software development. I hope there may be some things of interest. That’s my son Henry there in the banner in a ball pit. I won’t be writing about him, though.

(I’ll start by reposting a few articles I wrote a long time ago…)

Picking a platform

I want to experiment in making an online diagramming tool.

So step one is picking a platform in which to implement it. Normally this is the kind of decision that would paralyze me for weeks of delicious research and tinkering. Not anymore. As Jason Fried reminds us, decisions are temporary. Just make a choice. Change it later if it needs changing. You’re just as likely to make a bad choice after wasting two weeks fretting about it. Keep moving.
