
An Important Lesson in Getting Code to Production

Any scrum team knows that the target is to get stories to done within the sprint.  Teams establish a Definition of Done (DoD), which lists everything the team needs to complete before a story can be considered done.

Often teams that are new to scrum will follow many of the ceremonies but still operate within a traditional waterfall context, meaning they won’t release to production until everything is done.  Today I will share a painful story about why getting stories to “Done” is not enough, and why scrum teams need to actually release code to production more often than once at the end of the project.


The Background

I was working with a team at a large company to upgrade some of their systems.  The company historically used a waterfall process, but was starting to transition to scrum.  Initially the team was told that since this was an upgrade it would be run as a waterfall project, but as we dove in we saw how well it would fit an agile methodology.

Below were some key characteristics for the project:

  • There were 2 key drivers for the upgrade: End of support for the version of the operating system and concerns over performance during heavy usage periods
  • There was a “hard” delivery date needed as a result of the 2 key drivers above, which was ~1 year out
  • The project included the following: an OS upgrade, a database upgrade, splitting a single server hosting the application and database into 2 servers, building BCP and HA into both the application and database servers, replacing third-party software with in-house software and replacing the current job scheduler with another (among other items)
  • Additional scope items were added throughout the project and deemed “necessary”

The Plan

So even though we were told this project was not a good fit for scrum, the team decided to use scrum anyway.  Below was the basic approach:

  • There were 13 different applications on these servers.  The team used each application as an Epic.  Since the team did not have automated testing in place, it was not feasible to completely regression test every single system.  Instead, they evaluated the testing strategy and created stories around each major testing flow that they would perform.
  • We got together with the SMEs and used Planning Poker to estimate the size of each story (testing flow), comparing against our baseline.  After estimating each story (testing flow), we then compared the total rolled-up points for each epic (application) to see whether they were relatively comparable, which they generally were.
  • We then worked with the team to estimate their velocity, at which point we created a release plan.  Since we were given the end date, we knew that we could fit 13 3-week sprints (a quick sketch of this arithmetic follows below).
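
For readers newer to release planning, here is a minimal sketch of the arithmetic involved.  The numbers are purely illustrative; the project’s actual backlog size and velocity are not reproduced here.

    # Illustrative numbers only -- the real backlog and velocity were different.
    weeks_until_hard_date = 39     # calendar weeks available (hypothetical)
    sprint_length_weeks = 3
    sprints_available = weeks_until_hard_date // sprint_length_weeks    # -> 13

    total_story_points = 260       # rolled-up estimate across all epics (hypothetical)
    estimated_velocity = 20        # story points per sprint (hypothetical)
    sprints_needed = -(-total_story_points // estimated_velocity)       # ceiling division -> 13

    print(f"Sprints available: {sprints_available}")
    print(f"Sprints needed:    {sprints_needed}")
    print("Release plan fits:", sprints_needed <= sprints_available)

If the sprints needed exceed what the calendar allows, the conversation shifts to cutting scope or adding releases rather than hoping the estimates are wrong.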

I recommended that instead of doing one large production release, we break the project up into multiple smaller production releases along the way.  We received the buy-in to work in sprints, but we were told that we could not break up the releases and would release everything to production at once.

And this was where I failed as a ScrumMaster. When I analyzed the effort, I came to the following conclusions:

  • We were going to be standing up a new production server alongside the existing production server, with the goal of moving all of the applications over and eventually retiring the old server.
  • There was quite a bit of risk with this project.  The code base was developed on the assumption that the application and database would be on the same server.  We were changing a lot (including application and database versions).  Additionally, these systems were critical to the business and the clients.
  • We had an entirely new system that we could release to and run in parallel, without affecting our current production system.

So I proposed that we prioritize the backlog and pull a system that was low risk from an impact perspective, but contained as many of the characteristics of the other systems as possible (such as programming languages).  Once we finished all of the stories for this application, we would release it to the new production system and stop using the current production system.  This would accomplish several things:

  • It would allow us to test our release process.  Since there were a lot of new things with this system, it would be great to actually do a release and get quick feedback on anything that we had to adjust.
  • The risk of impacting our current production system was low, given that this was a completely new system.  If the release did not go well, we would gain valuable knowledge and could simply restore the previous version of the application.
  • If the application that we released had issues, we could learn from them and apply changes to the other systems.  For example, if the way we were connecting from the new application server to the new database was not allowed in production (but was allowed in our test environment), one of the few ways to catch it would be to actually deploy to that production environment.
  • If the release went well, then we would begin reducing the load on our existing production system.  If you remember back, one of the drivers for this project was the fear that the current production system could not keep up during peak times.  Moving applications off one at a time would reduce that risk and, in effect, soften the “hard” date.

Even after laying out these items (and several others), the organization just was not ready to separate from “the way they had always done things”.  They decided to mitigate some of the risk by doing the single big bang release over a 3-day weekend, ensuring that no one on the team would sleep or get to enjoy the weekend.

The Problems

“In software, if something is painful, do it more often.”

The main fear that I could identify was that the company was not structured to do frequent releases.  Release cycles were long, the release process (to production and between the various test environments) was manual, and the business was expecting a long UAT (User Acceptance Testing) phase at the end instead of several smaller, incremental UATs throughout.


Problem 1: Maintenance, Merging & Rework

  • You can imagine that 13 different applications have a lot of code.  They also most likely required changes for reasons outside of our effort.  This meant that as our project got further along, we had to constantly merge code in from the other projects.  Not only did this take time away from us, but we also had the added complexity of changing code to work on the new servers and operating systems.  The code we were merging in was designed for the old servers/systems, so these were complicated merges.
  • Since we were operating in sprints, we focused on developing and testing in small cycles, with the goal of getting each story (testing flow) to done.  This code then sat in version control, and after performing a merge we needed to do some additional testing, which again took time away.

Problem 2: Not Knowing What We Don’t Know

  • One of the powers of getting to “done” is that there should be fewer surprises.  When we estimate that we are 80% done with something, it is just an estimate.  Especially if we have not completed testing or other items, we don’t know if there are hidden issues that might mean we are actually only 40% done.
  • Even if we complete all of our testing, until we have gone through all of our steps and are using a system in production, we won’t know if it is truly working.  There could be many reasons for this, including our test environments not being identical to our production environment.

Problem 3: Release Planning

  • Since the company did not have release automation in place, releases and everything related to them were done manually.  This meant that as time went on, the “release plan” continued to grow and had to constantly be maintained.  One can imagine the number of mistakes in this plan, many of which were not caught because of infrequent releases (a sketch of the alternative we argued for follows below).
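
To make the contrast concrete, below is a minimal sketch of the direction we were arguing for: keep the release steps as data and drive them from a script, so every environment runs the exact same sequence and mistakes surface early.  The step names and commands are hypothetical placeholders, not the project’s actual release plan.

    import subprocess
    import sys

    # Hypothetical release steps -- the real plan was a long, manually
    # maintained document.  Keeping the steps as data means test, UAT and
    # production all run the exact same sequence.
    RELEASE_STEPS = [
        ("stop application services",  ["echo", "stopping app services"]),
        ("deploy application build",   ["echo", "copying new build"]),
        ("run database migrations",    ["echo", "applying schema changes"]),
        ("start application services", ["echo", "starting app services"]),
        ("run smoke tests",            ["echo", "hitting health-check endpoint"]),
    ]

    def run_release():
        for name, command in RELEASE_STEPS:
            print(f"==> {name}")
            result = subprocess.run(command)
            if result.returncode != 0:
                # Fail fast so the back-out plan can be invoked immediately.
                print(f"Step failed: {name}", file=sys.stderr)
                sys.exit(1)
        print("Release completed")

    if __name__ == "__main__":
        run_release()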

Problem 4: Back Out Planning

  • An important aspect of any release plan is the ability to back out the changes and revert to the pre-release state.  Although the team had planned for this, everyone “knew” that this release just had to work, so there was not much focus on making sure the system could be reverted (a simple sketch of the idea follows below).
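
A back-out plan does not need to be elaborate to be useful.  Here is a minimal sketch of the core idea: preserve whatever is currently live before overwriting it, so reverting is a mechanical copy rather than a scramble.  The paths and function names are hypothetical.

    import shutil
    from pathlib import Path

    # Hypothetical paths -- the point is only that the previous release is
    # preserved *before* the new one goes live.
    LIVE_DIR = Path("/opt/app/current")
    BACKUP_DIR = Path("/opt/app/previous")

    def deploy(new_build: Path) -> None:
        """Deploy a new build, keeping the prior release for a quick back-out."""
        if LIVE_DIR.exists():
            if BACKUP_DIR.exists():
                shutil.rmtree(BACKUP_DIR)
            shutil.copytree(LIVE_DIR, BACKUP_DIR)   # preserve what is live today
            shutil.rmtree(LIVE_DIR)
        shutil.copytree(new_build, LIVE_DIR)

    def back_out() -> None:
        """Revert to the previously deployed release."""
        if not BACKUP_DIR.exists():
            raise RuntimeError("No previous release preserved; cannot back out")
        if LIVE_DIR.exists():
            shutil.rmtree(LIVE_DIR)
        shutil.copytree(BACKUP_DIR, LIVE_DIR)

The smaller and more frequent the release, the simpler this safety net can stay.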


The Reality

So after various bumps in the road, many of which we were able to react to and overcome as a result of utilizing scrum, the release weekend was finally upon the team.  They stocked up on Red Bull, coffee and Mountain Dew and planted themselves together in a “command center” (a large training room).  It had been a long road to get to this point.  The team expected to have some issues, but based on the testing and several dry-run releases to a production-like environment, they expected to be able to get the release out.  Below are a few of the key items that occurred:

  • As expected, the team ended up working all 3 days that weekend to try and get the release out.  The first night some of the team left because they were waiting on other team members’ work, but the second day was an all-nighter for the whole team.
  • At the end of the second day there were still several issues with the system.  The team decided to continue pushing forward with the release.
  • Team members had a few hours of sleep, then continued working the third day until ~2 am.
  • The first morning “live” on the new system, I ended up having to pull all of the team together.  We took over a large training room so that we could work together and efficiently address the growing list of defects.
  • The team put in an average of 80 hours each during the first week after the release (on top of the long 3-day release weekend).  This included pulling in several people who were not on the project to assist.
  • The second and third week saw less time spent, but the team was still putting in ~60 hours each week.
  • It took several months and many maintenance efforts in order to get the system stable and to close out the majority of defects.
  • In total, there were 400+ production defects from this release, many of which were critical and had to be resolved very quickly by a team that was also sleep deprived.
  • Thanks to a great team, the impact to the business and the clients was minimal (financially and in missed SLAs), which was simply amazing given the number of critical defects.
  • Before the release it looked like the project was going to come in under budget; it ended up going 30%+ over budget after all of the extra time spent following the release.


The Lessons

There were many lessons to learn after this effort.  Once the critical bugs were resolved, management began trying to figure out just what went wrong.  Of course there were many failure points that contributed, and each person/group blamed other groups to protect themselves.  A few key items included the following:

  • Just about any project can operate using scrum.  The riskier the project, the more using an incremental and iterative approach makes sense.
  • Getting code to meet the Definition of Done (DoD) is one thing, but it is difficult to truly know if everything is really “Done” until the software is running in production.
  • Think outside the box.  Just because your organization has not done it before does not mean you shouldn’t try.  Releasing the new applications to production on the new server one at a time is a great example of this.
  • Releasing often reduces risk.  Not only does it make sure that you have a great release process (and hopefully show the need for automation), but it also provides feedback that truly lets you know whether things are working.
  • It is essential to spend time developing a strategy for backing out code if there are issues.  The smaller the release, the easier the back out plan is.
  • Frequent releases (even if code is released in a turned-off state) reduce the ongoing maintenance burden of merging code from other branches/streams.
  • A “hard” date often doesn’t have to be a hard date.  In this case, we could have reduced the number of applications running on the previous system to hedge our risk during peak times.

After the project was finished and we were holding a retrospective, one manager said it best… “Even though we had a lot of issues with this effort, I truly believe operating with Scrum saved us.  If we had treated this as a waterfall project, we would have never hit our date and I fear there would have been a much larger client and financial impact.”


