Tuesday, May 16, 2017

Thoughts on Software Development: A recent interview

I had a very productive discussion with a few consultants from India recently. The topic of choice was Software Development: specifically, how best to run teams and a good software development process. Here's a quick summary of what we discussed. Hopefully, this will be helpful to a broad audience:

Q1: Team Structure: How best to structure a team, what are the roles and responsibilities?
A: Largely, teams in tech are structured with two leaders: a technical lead (TL) and a manager. The role of the technical lead is managing the technical aspects of product development (code reviews, commits, design, testing, client team management and tough debugging). The role of the manager is people management (specifically, and preferably, practicing "servant leadership"). The manager is responsible for morale, happiness, compensation, budgeting etc. but is not responsible for the technical direction of the team.

In some cases, it makes sense for the technical lead to also be the manager of the team, but in practice this usually means that one of the two roles becomes secondary to the other. In many cases, the team suffers slightly as a result.

Within management, there are two schools of thought: manager-leads and TL-leads. In the manager-leads scenario, the manager has the upper hand and decides the prioritization of tasks (this leads to conflicts with technical direction, because business priorities take precedence over technical quality). In the TL-leads scenario, the manager is responsible for resourcing of projects but the TL leads project execution and budgeting. Such a scenario leads to better incentives around code quality and execution but might lead to slips on the budget end.

Q2: So which should I choose? Manager-leads or TL-leads?
A: Manager-leads is practiced by several teams in the industry (Amazon being one of them). While business results are excellent, there is quite a bit of churn on the team. Churn is bad because it makes producing long-term, stable infrastructure much harder: when people keep leaving, nobody understands the system well enough.

TL-leads is the philosophy followed by Google (and other similar companies). They are OK throwing a little more money and time into problems as long as the results are excellent. Many other companies cannot afford this style of management and they resource a lot more conservatively.

Q3: What's the team composition like? How many people should be put onto a team?
A: A typical team can vary in size but it's important to understand certain ratios. Micromanagement is a big problem in the tech industry - there's always a lot of work and never enough time. In such a situation, an idle manager just slows down projects by constantly asking for updates and communicating them throughout the org.

A healthy ratio of managers to engineers is 1:7 or wider: with that many reports, the manager really doesn't have the time or the ability to micromanage the execution of a project. For similar reasons, a TL is better off with five or more engineers. The absence of micromanagement makes engineers happier and empowers them by giving them more freedom of choice during the execution phase of a project.

Q4: What about Product Vision? What is the role of a Product Manager?
A: Product Management is a glue role. Product Managers do not have engineering reports. They aren't directly involved in the execution phase of a project. A typical ratio for Product Managers is 1 PM for 20 Engineers (or around 1 PM for 4 teams). The reason for this large leverage is that execution takes a lot longer than ideation. By having a PM responsible for 4 teams or the work of 20 Engineers, the PM has enough leverage to keep multiple projects running simultaneously. The PM doesn't necessarily have to track the progress of every project personally (possibly only the key ones matter), but the PM does have to answer day-to-day questions from the engineers to guide the execution of the project.

Q5: What are success metrics for a team? What are the KPIs of an engineer?
A: You get what you incentivize. At Google, individual engineers are incentivized to produce business impact. This means moving customer and product metrics and placing the improved metrics in context. For example, if an engineer improved suggestion quality by 2%, is that a large change or a small change? (At Google, it might be a large change if most improvements to suggestion quality are in the 0.1% range.) Producing business impact is rewarded disproportionately at each level. This leads to lots of launches and several failures, but also a general bias towards producing new things.
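To make "placing the metric in context" concrete, here is a minimal sketch that compares a change against the size of a typical change. The function name `relative_impact` is made up for illustration, and the 2% and 0.1% figures are just the numbers from the example above, not real data.

```python
def relative_impact(change_pct: float, typical_change_pct: float) -> float:
    """Return how many times larger a change is than a typical change.

    Hypothetical helper: both arguments are percentage improvements
    to the same metric (e.g. suggestion quality).
    """
    if typical_change_pct <= 0:
        raise ValueError("typical_change_pct must be positive")
    return change_pct / typical_change_pct


# Numbers from the example above: a 2% improvement where most
# improvements are in the 0.1% range.
ratio = relative_impact(2.0, 0.1)
print(f"This change is about {ratio:.0f}x a typical improvement")
```

The raw number (2%) says very little on its own; the ratio against the usual improvement is what tells a reviewer whether the change is large.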

Such a system disincentivizes routine maintenance work and, as a result, product iteration and maintenance aren't as glamorous. Other important metrics are code craftsmanship and leadership.

LinkedIn follows a similar structure: the key performance scopes are "Leadership", "Execution" and "Craftsmanship". In both these systems, individual productivity metrics like "code coverage", "lines of code written", "commits made" and "bugs fixed" are de-emphasized because these numbers are easy to game.

Q6: On to general execution: What does the software development lifecycle look like?
A: At a high level, the software execution lifecycle is split into 4 major phases: Design, Build, Run, Maintain. Engineers are expected to design systems, to commit the infrastructure and code that creates the system, to run it and stabilize it in production (with monitoring and alerting) and then to maintain their code. This last part is crucial. In a healthy software ecosystem, lots of things have dependencies on each other and things break all the time; having a sense of ownership of the code you've written promotes code maintenance over time and prevents the code rot that sets in when "it's someone else's problem".

Q7: Practical matters: What makes it into a software release? How do you decide what to keep and what to cut?
A: The current best practices in software development are continuous integration and continuous deployment. Continuous integration means that every code commit is tested for correctness by running all tests on a machine that is separate and isolated from an engineer's development environment. This ensures that local dependencies don't leak inadvertently into the build / test / run ecosystem. In addition, Build / CI systems like Jenkins and others allow running the tests across a wide variety of environments / devices / versions etc. This ensures that the system meets build and code quality standards across the entire range of target platforms.

A continuous deployment system takes the "continuous integration" concept a step further: once a code commit passes continuous integration, it's automatically deployed to production. Such a style of development requires strict control of code check-ins. New code is always developed behind a feature flag and the flag stays turned off until the code is known to be stable. During the stabilization phase of the code, the feature flag may only be enabled for a subset of users, and continuous integration should run the tests with the flag both on and off to ensure that they pass in both scenarios.
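As a rough sketch of what "enabled for a subset of users" can look like in code, here is a small percentage-based feature flag. Everything in it (the `FeatureFlag` class, the `render_suggestions` function, the hashing scheme) is a hypothetical illustration, not the API of any particular flag system.

```python
import hashlib


class FeatureFlag:
    """Hypothetical feature flag with a percentage rollout."""

    def __init__(self, name: str, rollout_percent: int = 0):
        if not 0 <= rollout_percent <= 100:
            raise ValueError("rollout_percent must be between 0 and 100")
        self.name = name
        self.rollout_percent = rollout_percent  # 0 means off for everyone

    def is_enabled(self, user_id: str) -> bool:
        # Hash the flag name and user id so each user lands in a stable
        # bucket from 0 to 99; the same user always gets the same answer.
        key = f"{self.name}:{user_id}".encode("utf-8")
        bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100
        return bucket < self.rollout_percent


def render_suggestions(user_id: str, flag: FeatureFlag) -> str:
    # New code lives behind the flag; the old path stays intact.
    if flag.is_enabled(user_id):
        return "new ranking"
    return "old ranking"


# Tests should exercise both paths: flag fully off and flag fully on.
off = FeatureFlag("new_ranking", rollout_percent=0)
on = FeatureFlag("new_ranking", rollout_percent=100)
print(render_suggestions("user-1", off), "/", render_suggestions("user-1", on))
```

Hashing the user id (rather than picking users at random on each request) is one possible design choice: it keeps a given user's experience stable while the rollout percentage is raised gradually.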

In a world of continuous integration and continuous deployment, the concept of a release is made very cheap: if it's built and the tests pass, it's ready for release. If a feature is released and found to be unstable, the feature flag is just turned off and the rest of the binary is expected to run without issues. What makes it into a "cut" is now simply: "Is the code committed, are the tests there and do we have enough confidence in the code to turn the feature flag on?"
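The "cut" question above boils down to three yes/no checks per feature. Here is a minimal sketch of that decision, assuming a made-up `Feature` record and `flags_for_release` helper (neither comes from a real release tool):

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Feature:
    """Hypothetical record of a feature's release readiness."""
    name: str
    committed: bool      # is the code committed?
    tests_present: bool  # are the tests there?
    confident: bool      # do we trust it enough to turn the flag on?


def flags_for_release(features: List[Feature]) -> Dict[str, bool]:
    # Every feature ships in the binary; only the flag state differs.
    return {
        f.name: f.committed and f.tests_present and f.confident
        for f in features
    }


flags = flags_for_release([
    Feature("new_ranking", committed=True, tests_present=True, confident=True),
    Feature("dark_mode", committed=True, tests_present=True, confident=False),
])
print(flags)

# If a released feature turns out to be unstable, the rollback is a
# flag flip rather than a new release.
flags["new_ranking"] = False
```

Note that nothing is removed from the build in either case: a feature that isn't ready simply ships with its flag off.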

Overall, this was quite a stimulating process. Hopefully, despite the length, this post will provide value to you as well.

Cheers!
Divye