
Update/organize train deployment and related policy documentation
Closed, Invalid (Public)

Description

In general, deployment documentation right now is a mess.

Several large pages are redundant with one another and slightly out of sync, navigation is difficult, and important details of policy are hard to find. There's also not really a single clear entry point for new deployers.

We should consolidate a number of pages under a more coherent structure, make sure everything actually reflects current practice, and improve the navigation aids. This applies to the procedural train docs as well as to descriptions of how deployments are structured overall and how backports are to be conducted.

Structural improvements and onboarding

We want to get more people confident deploying backports, as well as aware of the ways they are affected by the train process. To that end:

  • There should probably be an overall /Deployments portal, replacing the current calendar location
  • All the deployment docs should actually live under /Deployments
  • Calendar should probably move to /Deployments/Calendar
    • Projects that reference /Deployments will need updating:
      • Jouncebot parses Deployments
      • Do a codesearch for other stuff, ask around
  • There should be a /Deployments/Training entrypoint for new folks
  • We should establish a clear training process.
    • Open to anyone who:
      • Is in NDA / WMF / WMDE LDAP groups.
      • Has shell access.
      • Has received log triage training. (Details here could be worked out, but knowing how to deal with logs needs to be part of knowing how to deploy.)
    • Put this on the staff calendar, and offer invites: "Message me your email associated with your LDAP and I'll add you to the invite."
      • Trainer will check that people meet requirements.

Policy change tweaks

  • Holding the train
    • Mention client errors and the limit of 1k in a 12-hour period before it's a UBN (Unbreak Now)
    • Client errors < 100 / hour
    • Specific error budget - 2 or more times in a version?
    • Define "new" with regard to errors
  • Heterogeneous deployment/Train_deploys
    • Mention client error dashboard
    • Client errors < 100 / hour
    • Define "new" with regard to errors
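
As a rough illustration of how the draft thresholds above could be checked, here is a minimal Python sketch. It is not existing tooling: the function name, the "ok" / "hold" / "ubn" labels, and the choice to treat reaching a limit as crossing it are all assumptions made for the example. It also does nothing about the still-undefined notion of "new" errors.

```python
# Hypothetical sketch of the draft thresholds; none of this is existing tooling.
HOURLY_LIMIT = 100      # from "Client errors < 100 / hour"
UBN_LIMIT = 1000        # from "1k limit in a 12 hour period"
UBN_WINDOW_HOURS = 12


def classify_client_errors(hourly_counts):
    """Classify a series of hourly client error counts (oldest first).

    Returns "ubn" if the last 12 hours total reaches the 1k limit,
    "hold" if the most recent hour reaches the hourly limit,
    and "ok" otherwise. The labels are invented for this example.
    """
    window = hourly_counts[-UBN_WINDOW_HOURS:]
    if sum(window) >= UBN_LIMIT:
        return "ubn"
    if window and window[-1] >= HOURLY_LIMIT:
        return "hold"
    return "ok"


if __name__ == "__main__":
    print(classify_client_errors([50] * 12))
    print(classify_client_errors([0] * 11 + [150]))
    print(classify_client_errors([90] * 12))
```

Under these assumptions, a steady 90 errors per hour stays below the hourly figure but still exceeds 1k over 12 hours, so the two numbers would need to be reconciled when the policy text is written.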

Event Timeline

cc: @thcipriani, @dancy if there are specifics I'm forgetting here.

LGTM! I'll mention what I voiced in our meeting: "new" is the term I struggle with.

brennen triaged this task as High priority. Feb 3 2021, 11:49 PM
brennen updated the task description.
brennen moved this task from Backlog to Doing on the User-brennen board.
brennen renamed this task from "Update train policy documentation" to "Update/organize train deployment and related policy documentation". Feb 5 2021, 9:31 PM
brennen updated the task description.

Unlicking this cookie for the moment, as my good intentions got mugged by reality.

brennen moved this task from Backlog to Done or Declined on the User-brennen board.

This task as written no longer reflects reality.

At this writing:

  • We have a longstanding training process for new deployers, albeit a sparsely attended one. Training is quite a bit less necessary than it used to be, since...
  • Backports and train deployment have been simplified dramatically via scap backport and the recent scap train.
  • We've recently done some work on updating deployment docs to reflect these things. Imperfect, but on the whole it's in better shape than it was a while ago. My hope is that we continue to reduce the amount of documentation needed to cover a process that's getting less painful with time.
  • Moving the Deployment calendar seems like more hassle than it's worth.

There's some policy stuff about error budgets muddled in here, but I'd like to re-approach that if necessary under its own ticket. I think we're talking seriously about handing off some responsibility for deployment operations and error triage to the teams actually responsible for the code, and that'll probably be an appropriate time to think about this stuff.