Skip to content

Instantly share code, notes, and snippets.

@64J0
Created June 12, 2025 22:36
Show Gist options
  • Save 64J0/ebb9e9ec60c4b00ea3f055c757dd878f to your computer and use it in GitHub Desktop.
Save 64J0/ebb9e9ec60c4b00ea3f055c757dd878f to your computer and use it in GitHub Desktop.
[Management] Good operations teams

Good operations team

Operations teams are vital to keeping a software system running smoothly. A good operations team typically is responsible for the following, and more:

  • Monitoring the health of the system and quickly restoring service if it goes into a bad state
  • Tracking down the cause of problems, such as system failures or degraded performance
  • Keeping software and platforms up to date, including security patches
  • Keeping tabs on how different systems affect each other, so that a problematic change can be avoided before it causes damage
  • Anticipating future problems and solving them before they occur (e.g., capacity planning)
  • Establishing good practices and tools for deployment, configuration management, and more
  • Performing complex maintenance tasks, such as moving an application from one platform to another
  • Maintaining the security of the system as configuration changes are made
  • Defining processes that make operations predictable and help keep the production environment stable
  • Preserving the organization’s knowledge about the system, even as individual people come and go

Good operability means making routine tasks easy, allowing the operations team to focus their efforts on high-value activities, Data systems can do various things to make routing tasks easy, including:

  • Providing visibility into the runtime behavior and internals of the system, with good monitoring
  • Providing good support for automation and integration with standard tools
  • Avoiding dependency on individual machines (allowing machines to be taken down for maintenance while the system as a whole continues running uninterrupted)
  • Providing good documentation and an easy-to-understand operational model (“If I do X, Y will happen”)
  • Providing good default behavior, but also giving administrators the freedom to override defaults when needed
  • Self-healing where appropriate, but also giving administrators manual control over the system state when needed
  • Exhibiting predictable behavior, minimizing surprises

Reference

  • [1] Designing Data-Intensive Applications
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment