Saturday, October 17, 2020

Expectations of an Engineering Manager

This is a beautiful article that very succinctly and accurately defines management responsibilities:

The key principles being:
  • Support the members of your team and help them grow.
  • Follow along on deliveries: set quality standards, and make sure the team has the support they need and upper management the feedback they need [through you and without you].
  • Keep up a constant practice of creating, improving, or eliminating team and company processes.

There are some issues with the article as well: for example, I don't fully agree with its reliance on Andy Grove's definition of management. It's a reasonable definition, don't get me wrong, and it's even valuable in certain ways; I just don't fully subscribe to it.

Andy Grove defined a manager's output as the sum of the output of their team and the output of the teams under their influence. This is great if the goal of your organization's managers is to play politics at promotion time (hey, I influenced this guy, this is my project! this win is from my team! etc.). It's not so great if the goal is to objectively measure a manager's performance with respect to the items under their control and the work they do on a daily basis.

For example, a team or organization can be highly effective and yet have a terrible manager or leader who pits team members against each other. A prima donna manager can "influence" teams simply by being a pain; that's typically not what's desired. A manager (for better or for worse) is an institutional position. It's institutional because where there's a team, there needs to be a goal, someone who sets the goalposts, and a referee who calls the goal when it's achieved. It's inevitable. That's the process nature of management => things need to happen, and there needs to be one or more people who can take responsibility for the goal and the process. That's the manager. However, unless the manager directly (and positively) contributed to the goal, the process, or the execution, taking credit for the team's work feels a bit disingenuous. Similarly, a great manager joining a team that needs a turnaround can't be adequately measured by total team output; the change in the team's output is the valid benchmark.

My preferred benchmark of a manager's work is what I call effectiveness. Effectiveness is defined in terms of the actions of the manager with respect to their team.

  • How effective is the manager in growing the team or growing the org?
  • Is hiring well thought out?
  • Are the roles clear before a hire?
  • Are new hires well integrated?
  • Are existing team members aware of their growth charts?
  • Does the team interact well with each other?
  • Are process blockers simplified or removed?
  • Are decisions made quickly?
  • What is the quality of those decisions?
  • Are non-consensus decisions documented?
  • Is communication within the team smooth?
  • Are processes within the team smooth?
  • Does the work of the team get adequately documented and promoted in the org?
  • Is the team correctly benchmarked for effectiveness against its peers?
  • How accommodating is the manager of diverse perspectives?
  • How much autonomy does the team have? Is there micromanagement?
  • Is the manager inclusive of opinions that diverge from their own?
  • Are the right members of the team rewarded?
  • For team members who are struggling, is corrective action taken?
  • How is that corrective action communicated and managed?
  • Are performance conversations continuous and regular?
  • Does the manager actively and regularly seek feedback about their own performance?
  • Does the manager provide opportunities to lead?
  • Do team members feel they have enough scope to stretch them in their current roles?
  • Does the manager provide technical guidance and support? [in my view, a delicate question]

The team's eventual output is a derivative of a well-run team, and facilitative management trounces management by control and process any day. If we agree with that, a manager should be assessed by their team, their peers, and their own managers on the questions listed above: how well does the manager facilitate the work of the company? The questions are specific and positive, and a negative answer to any of them points to something concrete the manager can do to be more effective. There are probably more questions that can be added to the list, and I'd love to hear some from you. If you come across this, please leave a comment!

Tuesday, September 29, 2020

Pearls of wisdom: taxing into prosperity

For a nation to try and tax itself into prosperity is like a man standing in a bucket trying to lift himself up by the handle -- Winston Churchill

Saturday, June 27, 2020

Checked exceptions break composition

A.K.A. Always Throw RuntimeExceptions or Their Subclasses

A typical Java or C++ function can come with an exception specification: a method can declare that it throws exceptions of a particular type (eg. IOException, std::bad_alloc etc.), and clients need to handle that exception with a try-catch block (or, for Java's checked exceptions, declare it in their own signatures). This seems good at the outset, till we spend time thinking through what it does to the type of the function.

A typical function in a happy-go-lucky world either succeeds or fails with an exception because of something beyond its control. If it succeeds, it returns a value of the declared return type (let's call it SuccessValueType). If it fails (eg. a file read error or a memory allocation error), it throws the exception and the error-handling parts of the code run. In type terms, the return type of the function is Either<SuccessValueType, RuntimeExceptionType> (where RuntimeExceptionType is an implicit secondary return type of the function). If all functions agree that RuntimeExceptionType is the implicit secondary return type, functions and try-catch blocks compose beautifully. This is because every function call site becomes an implicit early-return point with a valid return value from the function. As a corollary, every try-catch block wrapping the function makes few assumptions about the kinds of exceptions it's likely to receive, which builds flexibility into the code's evolution.

Here's an example:
Function 1 => calls => Function 2 followed by Function 3; both Function 2 and 3 can only throw RuntimeExceptions
If either of these functions throws, the RuntimeException propagates as an "early return" from Function 1 without any changes. You can stack as many layers of nesting as you like, and the return types and early-return behavior remain compatible (because all the functions agree that RuntimeExceptionType is an implicit return type).
The application then adds error-handling code close to the top level of the processing hierarchy and presents the error to the user (as a form of recovery), or retries, or notifies an engineer to take a look. If we need to add additional context to the exception, a try-catch block can be introduced at any level to attach context information to the exception and rethrow the RuntimeException. This introduction of an intermediate try-catch is a purely local change that composes well with try-catch blocks further up the stack (removing a try-catch composes just as well). Adding new libraries or call paths to the code remains a purely local operation and does not affect the type hierarchy or the error-handling try-catch structure.
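
To make that concrete, here's a minimal Java sketch of this composition (all the names here, Pipeline, fetchData, parseData, are illustrative, not from any particular codebase):

    // A minimal sketch of RuntimeException-only composition.
    // All names (Pipeline, fetchData, parseData) are illustrative.
    public class Pipeline {

        // "Function 2": only ever throws RuntimeExceptions.
        static String fetchData(String source) {
            if (source.isEmpty()) {
                throw new RuntimeException("no source configured");
            }
            return "raw:" + source;
        }

        // "Function 3": also only throws RuntimeExceptions.
        static String parseData(String raw) {
            return raw.trim().toUpperCase();
        }

        // "Function 1": composes 2 and 3 with zero exception plumbing.
        // Every call site is an implicit early-return point; the optional
        // local try-catch just attaches context and rethrows.
        static String function1(String source) {
            try {
                return parseData(fetchData(source));
            } catch (RuntimeException e) {
                throw new RuntimeException("function1 failed for: " + source, e);
            }
        }

        public static void main(String[] args) {
            try {
                System.out.println(function1("users.csv"));
            } catch (RuntimeException e) {
                // Single top-level handler: log, show the user, or retry.
                System.err.println("pipeline failed: " + e.getMessage());
            }
        }
    }

Note that deleting the try-catch inside function1 changes nothing for its callers; that's the composability the paragraph above describes.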

Contrast this with what happens when a checked exception is introduced. The function type changes from bivalent to trivalent: Either<SuccessValueType, RuntimeExceptionType, CheckedExceptionType>. Note that avoiding RuntimeExceptionType is not possible (otherwise you'd have code littered with redundant, meaningless bad_alloc, IOException and similar declarations). With a trivalent return type, we have two options:

1. Convert the function back to a bivalent return type by introducing a try-catch block, catching the checked exception, and rethrowing it as a RuntimeException.
2. Propagate the checked exception and ask our clients to update their code.

(1) is of course the reasonable thing to do. It's a local operation, client code doesn't have to change, and we're back to dealing with only a single type of failure (either the function succeeds or it fails with a RuntimeException).
(2) is a world of pain. In this world, every new introduction of a checked exception means that significant chunks of the program have to change to include the new checked exception type.
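
Option (1) is a purely local wrap-and-rethrow. A sketch, using the JDK's real UncheckedIOException wrapper (readConfig is an illustrative name; Files.readString needs Java 11+):

    import java.io.IOException;
    import java.io.UncheckedIOException;
    import java.nio.file.Files;
    import java.nio.file.Path;

    public class ReadConfig {
        // Option (1): catch the checked exception at the source and rethrow
        // it unchecked; callers see only the bivalent type again.
        static String readConfig(Path path) {
            try {
                return Files.readString(path);  // declares "throws IOException"
            } catch (IOException e) {
                throw new UncheckedIOException(e);  // JDK RuntimeException wrapper
            }
        }
    }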

Going back to our original example:
If Function 1 => calls => Function 2 followed by Function 3, and both of them throw checked exceptions of different types, the return type of Function 1 becomes Either<SuccessValueType, RuntimeExceptionType, F2CheckedExceptionType, F3CheckedExceptionType>: essentially, the union of all the checked exceptions shows up in the return type signature. As we keep adding more nested functions, this type list keeps expanding.

In practical terms, this means that the developer adds "throws F2CheckedExceptionType, F3CheckedExceptionType, ..." to each caller in order to get the functions to compose, and all the try-catch blocks bloat to handle all the possible failure cases. Beyond small codebases this is completely infeasible, because the signature changes and try-catch handlers keep propagating throughout the codebase. This hurts dev velocity.
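
A sketch of what this does to signatures (F2CheckedException and F3CheckedException are hypothetical checked types, standing in for the ones above):

    // Hypothetical checked exception types, standing in for
    // F2CheckedExceptionType and F3CheckedExceptionType above.
    class F2CheckedException extends Exception {}
    class F3CheckedException extends Exception {}

    public class Blowout {
        static void function2() throws F2CheckedException { /* ... */ }
        static void function3() throws F3CheckedException { /* ... */ }

        // The union of callee exceptions leaks into the caller's signature...
        static void function1() throws F2CheckedException, F3CheckedException {
            function2();
            function3();
        }

        // ...and keeps leaking upward: adding a new checked type to any
        // callee forces edits to this entire chain of declarations.
        static void function0() throws F2CheckedException, F3CheckedException {
            function1();
        }
    }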

From a recovery perspective, these checked exceptions are typically handled just one level up the call stack, at the lowest level of library code (to avoid the exception signature blowout), and a local resolution is applied (retry a few times, then fail). This is rarely the optimal recovery: for an out-of-disk-space error, say, a batch processing application might prefer an immediate crash while a streaming application might prefer a continuous retry. Without propagating the error all the way up to the application, this choice of recovery can't be made reasonably, and the only way out is to pass configuration down the stack to control the behavior... a gargantuan mess.

In the RuntimeException-only world, the retry configuration stays at the top level, where things can be handled based on the execution environment.
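
A sketch of that top-level choice (the ExecutionMode enum and run wrapper are illustrative, not a real framework API):

    public class TopLevelRecovery {
        enum ExecutionMode { BATCH, STREAMING }

        // All recovery policy lives at the top, chosen per environment;
        // nothing below this level needs retry configuration plumbed in.
        static void run(ExecutionMode mode, Runnable job) {
            while (true) {
                try {
                    job.run();
                    return;
                } catch (RuntimeException e) {
                    if (mode == ExecutionMode.BATCH) {
                        throw e;  // batch: fail fast, let the scheduler rerun
                    }
                    // streaming: log and keep retrying
                    System.err.println("transient failure, retrying: " + e);
                }
            }
        }
    }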

In summary, as a practical matter, professional software engineers should ensure that their functions only throw unchecked exceptions (RuntimeExceptions or their subclasses). Checked exceptions are actively harmful to dev velocity in large codebases and should be avoided. Google avoids this tar pit by banning exceptions from its C++ code entirely (for historical reasons), LinkedIn and Pinterest actively use RuntimeException-based Java codebases, and you should encourage this too.

Wednesday, January 15, 2020

IO numbers that everyone should know

In the Numbers Every Programmer Should Know, one set of numbers that I've always found missing was IO numbers (HDD vs SSD, random and sequential reads/writes). I found a really good source on StackExchange for these numbers and, for the sake of posterity, I'm documenting them here (for me and for you):

  • SSD | HDD Sequential Read/Write : 700 MB/s+ | 115 MB/s (6x diff) 
  • SSD | HDD Random Read 512KB : 160 MB/s | 39 MB/s (4x diff)
  • SSD | HDD Random Write 512KB : 830 MB/s | 57 MB/s (14x diff)
  • SSD | HDD Random Read 4KB : 27 MB/s | 0.5 - 1.5 MB/s (17x diff)
  • SSD | HDD Random Write 4KB : 135 - 177 MB/s | 0.7 MB/s (192x+ diff!)

The bottom line is that unless you're thrashing the HDD with lots of 4KB random writes, the HDD should not be tapped out till about 30+ MB/s (and an SSD should be just fine till about 150 - 300 MB/s). If you're seeing an HDD tapped out at 3 MB/s, you're either not writing sequentially or your write block size is too small. If you're seeing an SSD tapped out at < 100 MB/s, it's almost certainly a software bug and not an IO limitation. In either case, the basic norm holds: if you can, always use SSDs; they usually save you money in CPU time.
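
If you want to sanity-check your own disk against these numbers, a crude sequential-write benchmark is easy to sketch in Java (the file name and sizes here are arbitrary, and the OS page cache can inflate the result, so larger totals give more honest numbers):

    import java.io.IOException;
    import java.io.OutputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;

    public class DiskBench {
        public static void main(String[] args) throws IOException {
            Path target = Path.of("bench.tmp");  // arbitrary scratch file
            byte[] block = new byte[1 << 20];    // 1 MB blocks => sequential writes
            int blocks = 512;                    // 512 MB total

            long start = System.nanoTime();
            try (OutputStream out = Files.newOutputStream(target)) {
                for (int i = 0; i < blocks; i++) {
                    out.write(block);
                }
            }
            double seconds = (System.nanoTime() - start) / 1e9;
            System.out.printf("sequential write: %.1f MB/s%n", blocks / seconds);
            Files.delete(target);
        }
    }

Shrink the block size to 4KB and write at random offsets and you should see the throughput collapse toward the random-write numbers above, especially on an HDD.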