315: Emotions Are A Pendulum
Steph talks about starting a new project and identifying "focused" tests while Chris shares his latest strategy for managing flaky tests. They also ponder the squishy "it depends" side of software and respond to a listener question about testing all commits in a pull request.
This episode is brought to you by ScoutAPM (https://scoutapm.com/bikeshed). Give Scout a try for free today and Scout will donate $5 to the open source project of your choice when you deploy.
rspec-retry (https://github.com/NoRedInk/rspec-retry)
Cassidy Williams - It Depends - GitHub Universe 2021 (https://www.youtube.com/watch?v=aMWh2uLO9OM)
Say No To More Process (https://thoughtbot.com/blog/say-no-to-more-process-say-yes-to-trust)
StandardRB (https://github.com/testdouble/standard)
Become a Sponsor (https://thoughtbot.com/sponsorship) of The Bike Shed!
Transcript:
CHRIS: My new computer is due on the fourth. I'm so close.
STEPH: On the fourth?
CHRIS: On the fourth.
STEPH: That's so exciting.
CHRIS: And I'm very excited. But no, I don't want to upgrade any software on this computer anymore. Never again shall I update a piece of software on this computer.
STEPH: [laughs]
CHRIS: This is its final state. And then I will take its soul and move it into the new computer, and we'll go from there. [chuckles]
STEPH: Take its soul. [laughs]
CHRIS: Hello and welcome to another episode of The Bike Shed, a weekly podcast from your friends at thoughtbot about developing great software. I'm Chris Toomey.
STEPH: And I'm Steph Viccari.
CHRIS: And together, we're here to share a bit of what we learn along the way. So, Steph, what's new in your world?
STEPH: Hey, Chris. Let's see. It's been kind of a busy week. It's been a busy family week. Utah, my dog, hasn't been feeling well as you know because you and I have chatted off-mic about that a bit. So he is still recovering from something, I don't know what. He's still on most days his normal captain chaos self, but then other days, he's not feeling well. So I'm just keeping a close eye on him. And then I also got some other family illnesses going on. So it has been a busy family week for sure.
On the more technical project side, I am wrapping up my current project. So I have one more week, and then I will shift into a new project, which I'm very excited about. And you and I have chatted about this several times. So there's always just that interesting phase where you're trying to wrap up and hand things off and then accomplish last-minute wishlist items for a project before then you start with a new one. So I am currently in that phase.
CHRIS: How long were you on this project for?
STEPH: It'll be a total of I think eight months.
CHRIS: Eight months, that's healthy. That's a bunch. It's always interesting to be on a project for that long but then not longer. There were plenty of three and four-month projects that I did. And you can definitely get a large body of work done. You can look back at it and proudly stare at the code that you have written. But that length of time is always interesting to me because you end up really...for me, when I've had projects that went that long but then not longer, I always found that to be an interesting breaking point. How are you feeling moving on from it? Are you ready for something new? Are you sad to be moving on? Do you feel attached to things?
STEPH: It's always a mix. I'm definitely attached to the team, and then there are always lots of things that I'd still love to work on with that team. But then, I am also excited to start something new. That's why I love this role of consulting because then I get to hop around and see new projects and challenges and work with new people. I'm thinking seven to eight months might be a sweet spot for me in terms of the length of a project. Because I find that first month with a project, I'm really still ramping up, I'm getting comfortable, I'm getting in the groove, and I'm contributing within a short amount of time. But I still feel like that first month; I’m getting really comfortable with this new environment that I'm in. And so then I have that first month.
And then, at six months, I have more of heads-down time. And I get to really focus and work with a team. And then there's that transition period, and it's nice to know when that's coming up for several weeks, so then I have a couple of weeks to then start working on that transition phase. So eight months might be perfect because then it's like a month for onboarding, ramping up, getting comfortable. And then six months of focus, and then another month of just focusing on what needs to be transitioned so then I can transition off the team.
CHRIS: All right. Well, now we've defined it - eight months is the perfect length of a project.
STEPH: That's one of the things I like about the Boost team is because we typically have longer engagements. So that was one of the reasons when we were splitting up the teams in thoughtbot that I chose the Boost team because I was like, yeah, I like the six-month-plus project.
Speaking of that wishlist, there are little things that I've wanted to make improvements on but haven't really had time to do. There's one that's currently on my mind that I figured I'd share with you in case you have thoughts on it. But I am a big proponent of using the RSpec focus filter for when running tests. So that way, I can just prefix a context it block or describe block with F, and then RSpec I can just run all the tests. But RSpec will only run the tests that I've prefixed with that F focus command., and I love it.
But we are running into some challenges with it because right now, there's nothing that catches that in a pull request. So if you commit that focus filter on some of your tests, and then that gets pushed up, if someone doesn't notice it while reviewing your pull request, then that gets merged into main. And all of the tests are still green, but it's only a subset of the tests that are actually running. And so it's been on my mind that I'd love something that's going to notice that, that's going to catch it, something that is not just us humans doing our best but something that's automated that's going to notice it for us. And I have some thoughts. But I'm curious, have you run into something like this? Do you have a way that you avoid things like that from sneaking into the main branch?
CHRIS: Interestingly, I have not run into this particular problem with RSpec, and that's because of the way that I run RSpec tests. I almost never use the focus functionality where you actually change the code file to say, instead of it, it is now fit to focus that it.
I tend to lean into the functionality where RSpec you can pass it the line number just say, file: and then line number. And RSpec will automatically figure out which either spec or context block or entire file. And also, I have Vim stuff that allows me to do that very easily from the file. It's very rare that I would want to run more than one file.
So basically, with that, I have all of the flexibility I need. And it doesn't require any changes to the file. So that's almost always how I'm working in that mode. I really love that. And it makes me so sad when I go to JavaScript test runners because they don't have that.
That said, I've definitely felt a very similar thing with ESLint and ESLint yelling at me for having a console.log. And I'm like, ESLint, I'm working here. I got to debug some stuff, so if you could just calm down for a minute. And what I would like is a differentiation between these are checks that should only run in CI but definitely need to run in CI. And so I think an equivalent would be there's probably a RuboCop rule that says disallow fit or disallow any of the focus versions for RSpec. But I only want those to run in CI.
And this has been a pain point that I felt a bunch of times. And it's never been painful enough that I put in the effort to fix it. But I really dislike particularly that version of I'm in my editor, and I almost always want there to be no warnings within the editor. I love that TypeScript or ESLint, or other things can run within the editor and tell me what's going on. But I want them to be contextually aware. And that's the dream I've yet to get there.
STEPH: I like the idea of ESLint having a work mode where you're like, back off, I am in work mode right now. [chuckles] I understand that I won't commit this.
CHRIS: I'm working here. [laughter]
STEPH: And I like the idea of a RuboCop. So that's where my mind went initially is like, well, maybe there's a custom cop, or maybe there's an existing one, and I just haven't noticed it yet. But so I'm adding a rule that says, hey, if you do see an fcontext, fdescribe, ffit, something like that, please fail. Please let us know, so we don't merge this in. So that's on my wishlist, not my to-don't list. That one is on my to-do list.
CHRIS: I'm also intrigued, though, because the particular failure mode that you're describing is you take what is an entire spec suite, and instead, you focus down to one context block within a given file. So previously, there were 700 specs that ran, and now there are 12. And that's actually something that I would love for Circle or whatever platform you're running your tests on to be like, hey, just as a note, you had been slowly creeping up and had hit a high watermark of roughly 700 specs. And then today, we're down to 12. So either you did some aggressive grooming, or something's wrong. But a heuristic analysis of like, I know sometimes people delete specs, and that's a thing that's okay but probably not this many. So maybe something went wrong there.
STEPH: I feel like we're turning CI into this friend at the bar that's like, "Hey, you've had a couple of drinks. I just wanted to check in with you to make sure that you're good." [laughs]
CHRIS: Yes.
STEPH: "You've had 100 tests that were running and now only 50. Hey, friend, how are you? What's going on?"
CHRIS: "This doesn't sound like you. You're normally a little more level-headed." [laughs] And that's the CI that is my friend that keeps me honest. It's like, "Wait, you promised never to overspend anymore, and yet you're overspending." I'm like, "Thank you, CI. You're right; I did say I want the test to pass."
STEPH: [laughs] I love it. I'll keep you posted if I figure something out; if I either turn CI into that friend, that lets me know when my behavior has changed in a concerning way, and an intervention is needed. Or, more likely, I will see if there's a RuboCop or some other process that I can apply that will check for this, which I imagine will be fast. I mean, we're very mindful about ensuring our test suite doesn't slow down as we're running it. But I'm just thinking about this out loud. If we add that additional cop, I imagine that will be fast. So I don't think that's too much of an overhead to add to our CI process.
CHRIS: If you've already got RuboCop in there, I'm guessing the incremental cost of one additional cop is very small. But yeah, it is interesting. That general thing of I want CI to go fast; I definitely feel that feel. And we're slowly creeping up on the project I'm working on. I think we're at about somewhere between five to six minutes, but we've gotten there pretty quickly where not that long ago; it was only three minutes.
We're adding a lot of features specs, and so they are definitely accruing slowdowns in our CI. And they're worth it; I think, because they're so valuable. And they test the whole integration of everything, but it's a thing that I'm very closely watching. And I have a long list of things that I might pursue when I decide it's time for CI to get a haircut, as it were.
STEPH: I have a very hot tip for a way to speed up your test, and that is to check if any of your tests have a very long sleep in them. That came up recently [chuckles] this week where someone was working in a test and found some relic that had been added a while back that then wasn't caught. And I think it was a sleep 30. And they were like, "Hey, I just sped up our test by 30 seconds." I was like, ooh, we should grep now to see if there's anything else like that. [laughs]
CHRIS: Oh, I love the sentence we should grep now. [laughter] The correct response to this is to grep immediately. I thought you were going to go with the pro tip of you can just focus down to one context block. And then the specs will run so much faster because you're ignoring most of them, but we don't want to do that. The sleep, though, that's a pro tip. And that does feel like a thing that there could be a cop for, like, never sleep more than...frankly, let's try not to sleep at all but also, add a sleep in our specs. We can sleep in life; it's important, but anyway. [chuckles]
STEPH: [laughs] That was the second hot tip, and you got it.
CHRIS: Lots of hot tips. Well, I'm going to put this in the category of good idea, terrible idea. I won't call it a hot tip. It's a thing we're trying. So much as we have tried to build a spec suite that is consistent and deterministic and tells us only the truth, feature specs, even in our best efforts, still end up flaking from time to time. We'll have feature specs that fail, and then eventually, on a subsequent rerun, they will pass. And I am of the mindset that A, we should try and look into those and see if there is a real cause to it. But sometimes, just the machinery of feature specs, there's so much going on there. We've got the additional overhead of we're running it within a JavaScript context. There's just so much there that...let me say what I did, and then we can talk more about the context.
So there's a gem called RSpec::Retry. It comes from the wonderful folks over at NoRedInk, a well-known Elm shop for anyone out there in the Elm world. But RSpec::Retry does basically what it says in the name. If the spec fails, you can annotate specs. In our case, we've only enabled this for the feature specs. And you can tell it to retry, and you can say, "Retry up to this many times," and et cetera, et cetera. So I have enabled this for our feature specs. And I've only enabled it on CI. That's an important distinction. This does not run locally.
So if you run a feature spec and it fails locally, that's a good chance for us to intervene and look at whether or not there's some flakiness there. But on CI, I particularly don't want the case where we have a pull request, everything's great, and we merge that pull request, and then the subsequent rebuild, which again, as a note, I would rather that Circle not rebuild it because we've already built that one. But that is another topic that I have talked about in the past, and we'll probably talk about it again in the future.
But setting that aside, Circle will rebuild on the main branch when we merge in, and sometimes we'll see failures there. And that's where it's most painful. Like, this is now the deploy queue. This is trying to get this out into whatever environment we're deploying to. And it is very sad when that fails. And I have to go in and manually say, hey, rebuild. I know that this works because it just worked in the pull request, and it's the same commit hash. So I know deterministically for reasons that this should work. And then it does work on a rebuild.
So we introduced RSpec::Retry. We have wrapped it around our feature specs. And so now I believe we have three possible retries. So if it fails once, it'll try it again, and then it'll try it a third time. So far, we've seen each time that it has had to step in; it will pass on the subsequent run. But I don't know; there was some very gentle pushback or concerns; let’s call them when I introduced this pull request from another developer on the team, saying, "I don't know, though, I feel like this is something that we should solve at the root layer. The failures are a symptom of flaky tests, or inconsistency or et cetera, and so I'd rather not do this." And I said, "Yeah, I know. But I'm going to merge it," and then I merged it.
We had a better conversation about that. I didn't just broadly overrule. But I said, "I get it, but I don't see the obvious place to shore this up. I don't see where we're doing weird inconsistent things in our code. This is just, I think, inherent complexity of feature specs." So I did it, but yeah, good idea, terrible idea. What do you think, Steph? Maybe terrible is too strong of a word. Good idea, mediocre idea.
STEPH: I like the original branding. I like the good idea, terrible idea. Although you're right, that terrible is a very strong branding. So I am biased right now, so I'm going to lead in answering your question by stating that because our current project has that problem as well where we have these flaky tests. And it's one of those that, yes, we need to look at them. And we have fixed a large number of them, but there are still more of them.
And it becomes a question of are we actually doing something wrong here that then we need to fix? Or, like you said, is it just the nature of these features-specs? Some of them are going to occasionally fail. What reasonable improvements can we make to address this at the root cause? I'm interested enough that I haven't heard of RSpec::Retry that I want to check it out because when you add that, you annotate a test. When a test fails, does it run the entire build, or will it rerun just that test? Do you happen to know?
CHRIS: Just the test. So it's configured as in a round block on the feature specs. And so you tell it like, for any feature spec, it's like config.include for feature specs RSpec::Retry or whatever. So it's just going to rerun the one feature spec that failed when and if that happens. So it's very, very precise as well in that sense where when we have a failure merging into the main branch, I have to rebuild the whole thing. So that's five or six minutes plus whatever latency for me to notice it, et cetera, whereas this is two more seconds in our CI runtime. So that's great. But again, the question is, am I hiding? Am I dealing with the symptoms and not the root cause, et cetera?
STEPH: Is there a report that's provided at the end that does show these are the tests that failed and we had to rerun them?
CHRIS: I believe no-ish. You can configure it to output, but it's just going to be outputting to standard out, I believe. So along with the sea of green dots, you'll see had to retry this one. So it is visible, but it's not aggregated. And the particular thing is there's the JUnit reporter that we're using. So the XML common format for this is how long our tests took to run, and these ones passed and failed.
So Circle, as a particular example, has platform-level insights for that kind of stuff. And they can tell you these are your tests that fail most commonly. These are the tests that take the longest run, et cetera. I would love to get it integrated into that such that retried and then surface this to Circle. Circle could then surface it to us. But right now, I don't believe that's happening. So it is truly I will not see it unless I actively go search for it. To be truly honest, I'm probably not doing that.
STEPH: Yeah, that's a good, fair, honest answer. You mentioned earlier that if you want a test to retry, you have to annotate the test. Does that mean that you get to highlight specific tests that you're marking those to say, "Hey, I know that these are flaky. I'm okay with that. Please retry them." Or does it apply to all of them?
CHRIS: I think there are different ways that you can configure it. You could go the granular route of we know this is a flaky spec, so we're going to only put the retry logic around it. And that would be a normal RSpec annotation sort of tagging the spec, I think, is the terminology there. But we've configured it globally for all feature specs. So in a spec support file, we just say config.include Rspec::Retry where type is a feature. And so every feature spec now has the possibility to retry. If they pass on the first pass, which is the hope most of the time, then they will not be tried. But if they don't, if they fail, then they'll be retried up to three times or up to two additional times, I think is the total.
STEPH: Okay, cool. That's helpful. So then I think I have my answer. I really think it's a good idea to automate retrying tests that we have identified that are flaky. We've tried to address the root, and our resolution was this is fine. This happens sometimes. We don't have a great way to improve this, and we want to keep the test. So we're going to highlight that this test we want to retry. And then I'm going to say it’s not a great idea to turn it on for all of them just because then I have that same fear about you're now hiding any flaky tests that get introduced into the system. And nobody reasonably is going to go and read through to see which tests are going to get retried, so that part makes me nervous.
CHRIS: I like it. I think it's a balanced and reasonable set of good and terrible idea. Ooh, it's perfect. I don't think we've had a balanced answer on that yet.
STEPH: I don't think so.
CHRIS: This is a new outcome for this segment. I agree. Ideally, in my mind, it would be getting into that XML format, the output from the tests, so that we now have this artifact, we can see which ones are flaky and eventually apply effort there. What you're saying feels totally right of we should be more particular and granular. But at the same time, the failure mode and the thing that I'm trying, I want to keep deploys going. And I only want to stop deploys if something's really broken.
And if a spec retries, then I'm fine with it is where I've landed, particularly because we haven't had any real solutions where there was anything weird in our code. Like, there's just flakiness sometimes. As I say it, I feel like I'm just giving up. [laughs] And I can hear this tone of stuff's just hard sometimes, and so I've taken the easy way out. And I guess that's where I'm at right now. But I think what you're saying is a good, balanced answer here. I like it. I don't know if I'm going to do anything about it, but...[laughter]
STEPH: Well, going back to when I was saying that I'm biased, our team is feeling this pain because we have flaky tests. And we're creating tickets, and we're trying to do all the right things. We create a ticket. We have that. So it's public. So people know it's been acknowledged. If someone's working on it, we let the team know; hey, I'm working on this. So we're not duplicating efforts. And so, we are trying to address all of them. But then some of them don't feel like a great investment of our time trying to improve.
So that's what I really do like about the RSpec::Retry is then you can still have a resolution. Because it's either right now your resolution is to fix it or to change the code, so then maybe you can test it in a different way. There's not really a good medium step there. And so the retry feels like an additional good outcome to add to your tool bag to say, hey, I've triaged this, and this feels reasonable that we want to retry this.
But then there's also that concern of we don't want to hide all of these flaky tests from ourselves in case we have done it and there is an opportunity for us to improve it. So I think that's what I do really like about it because right now, for us, when a test fails, we have to rerun the entire build, and that's painful. So if tests are taking about 20 minutes right now, then one spec fails, and then you have to wait another 20 minutes.
CHRIS: I would have turned this on years ago with a 20-minute build time. [chuckles]
STEPH: [laughs] Yeah, you're not wrong. But also, I didn't actually know about RSpec::Retry until today. So that may be something that we introduce into our application or something that I bring up to the team to see if it's something that we want to add. But it is interesting that initial sort of ooh kind of feeling that the team will give you introducing because it feels bad. It feels wrong to be like, hey, we're just going to let these flaky tests live on, and we're going to automate retrying them to at least speed us up. And it's just a very interesting conversation around where we want to invest our time and between the risk and pay off.
And I had a similar experience this week where I had that conversation, but this one was more with myself where I was working through a particular issue where we have a state in the application where something weird was done in the past that led us to a weird state. And so someone raised a very good question where it's like, well, if what you're saying is technically an impossible state, we should make it impossible, like at the database layer. And I love that phrase. And yet, there was a part of me that was like, yes, but also doing that is not a trivial investment. And we're here because of a very weird thing that happened before.
It felt one of those interesting, like, do we want to pursue the more aggressive, like, let's make this impossible for the future? Or do we want to address it for now and see if it comes back up, and then we can invest more time in it? And I had a hard time walking myself through that because my initial response was, well, yeah, totally, we should make it impossible. But then I walked through all the steps that it would take to make that happen, and it was not very trivial.
And so it was one of those; it felt like the change that we ended up with was still an improvement. It was going to prevent users from seeing an error. It was still going to communicate that this state is an odd state for the application to be in. But it didn't go as far as to then add in all of the safety measures. And I felt good about it. But I had to convince myself to feel good about it.
CHRIS: What you're describing there, the whole thought sequence, really feels like the encapsulation of it depends. And that being part of the journey of learning how to do software development and what it means. And you actually shared a wonderful video with me yesterday, and it was Cassidy Williams at GitHub Universe. And it was her talking to her younger self, and just it depends, and it was so true. So we will include a link to that in the show note because that was a wonderful thing for you to share. And it really does encapsulate this thing. And from the outside, before I started doing software development, I'm like, it's cool. I'm going to learn how to sling code and fix the stuff and hack, and it'll be great, and obvious, and correct, and knowable. And now I'm like, oh man, squishy nonsense. That's all it is.
STEPH: [laughs]
CHRIS: Fun squishy, and I like it. It's so good. But it depends. Exactly that one where you're like, I know that there's a way to get to correctness here but is it worth the effort? And looping back to...I'm surprised at the stance that I've taken where I'm just like, yeah, I'm putting in RSpec::Retry. This feels like the right thing. I feel good about this decision. And so I've tried to poke at it a tiny bit.
And I think what matters to me deeply in a list of priorities is number one correctness. I care deeply that our system behaves correctly as intended and that we are able to verify that. I want to know if the system is not behaving correctly. And that's what we've talked about, like, if the test suite is green, I want to be able to deploy. I want to feel confident in that.
Flaky specs exist in this interesting space where if there is a real underlying issue, if we've architected our system in a way that causes this flakiness and that a user may ever experience that, then that is a broken system. That is an incorrect system, and I want to resolve that. But that's not the case with what we're experiencing. We're happy with the architecture of our system. And when we're resolving it, we're not even really resolving them. We're just rerunning manually at this point. We're just like, oh, that spec flaked. And there's nothing to do here because sometimes that just happens. So we're re-running manually. And so my belief is if I see all green, if the specs all pass, I know that I can deploy to production.
And so if occasionally a spec is going to flake and retrying it will make it pass (and I know that pass doesn't mean oh, this time it happened to pass; it's that is the correct outcome) and we have a false negative before, then I'm happy to instrument the system in a way that hides that from me because, at this point, it does feel like noise. I'm not doing anything else with the failures when we were looking at them more pointedly. I'm not resolving those flaky specs. There are no changes that we've made to the underlying system. And they don't represent a failure mode or an incorrectness that an end-user might see. So I honestly want to paper over and hide it from myself. And that's why I've chosen this. But you can see I need to defend my actions here because I feel weird. I feel a little off about this. But as I talk through it, that is the hierarchy. I care about correctness.
And then, the next thing I care about is maintaining the deployment pipeline. I want that to be as quick and as efficient as possible. And I've talked a bunch about explorations into the world of observability and trying to figure out how to do continuous deployment because I think that really encourages overall better engineering outcomes. And so first is correctness. Second is velocity. And flaky specs impact velocity heavily, but they don't actually impact correctness in the particular mode that we're experiencing them here. They definitely can. But in this case, as I look at the code, I'm like, nah, that was just noise in the system. That was just too much complexity stacked up in trying to run a feature spec that simulates a browser and a user clicking in JavaScript and all this stuff and the things. But again, [laughs] here I am. I am very defensive about this apparently.
STEPH: Well, I can certainly relate because I was defending my answer to myself earlier. And it is really interesting what you're pointing out. I like how you appreciate correctness and then velocity, that those are the two things that you're going after. And flaky tests often don't highlight an incorrect system. It is highlighting that maybe our code or our tests are not as performant as we would like them to be, but the behavior is correct. So I think that's a really important thing to recognize.
The part where I get squishy is where we have encountered on this project some flaky tests that did highlight that we had incorrect behavior, and there's only been maybe one or two. It was rare that it happened, but it at least has happened once or twice where it highlighted something to us that when tests were run...I think there's a whole lot of context. I won't get into it. But essentially, when tests were being run in a particular way that made them look like a flaky test, it was actually telling us something truthful about the system, that something was behaving in a way that we didn't want it to behave. So that's why I still like that triage that you have to go through.
But I also agree that if you're trying to get out at a deploy, you don't want to have to deal with flaky tests. There's a time to eat your vegetables, and I don't know if it's when you've got a deploy that needs to go out. That might not be the right time to be like, oh, we've got a flaky test. We should really address this. It's like, yes; you should note to yourself, hey, have a couple of vegetables tomorrow, make a ticket, and address that flaky test but not right now. That's not the time. So I think you've struck a good balance. But I also do like the idea of annotating specific tests instead of just retrying all of them, so you don't hide anything from yourself.
CHRIS: Yeah. And now that I'm saying it and now that I'm circling back around, what I'm saying is true of everything we've done so far. But it is possible that now this new mode that the system behaves in where it will essentially hide flaky specs on CI means that any new flaky regressions, as it were, will be hidden from us.
And thus far, almost all or I think all of the flakiness that we've seen has basically been related to timeouts. So a different way to solve this would potentially be to up the Capybara wait time. So there are occasionally times where the system's churning through, and the various layers of the feature specs just take a little bit longer. And so they miss...I forget what it is, but it's like two seconds right now or something like that. And I can just bump that up and say it's 10 seconds. And that's a mode that if eventually, the system ends in the state that we want, I'm happy to wait a little longer to see that, and that's fine.
But there are...to name some of the ways that flaky tests can actually highlight truly incorrect things; race conditions are a pretty common one where this behaves fine most of the time. But if the background job happens to succeed before the subsequent request happens, then you'll go to the page. That's a thing that a real user may experience, and in fact, it might even be more likely in production because production has differential performance characteristics on your background jobs versus your actual application. And so that's the sort of thing that would definitely be worth keeping in mind.
Additionally, if there are order issues within your spec suite if the randomize...I think actually RSpec::Retry wouldn't fix this, though, because it's going to retry within the same order. So that's a case that I think would be still highlighted. It would fail three times and then move on. But those we should definitely deal with. That's a test-related thing. But the first one, race conditions, that's totally a thing. They come up all the time. And I think I've potentially hidden that from myself now. And so, I might need to lock back what I said earlier because I feel like it's been true thus far that that has not been the failure mode, but it could be moving forward. And so I really want to find out if we got flaky specs. I don't know; I feel like I've said enough about this. So I'm going to stop saying anything new. [laughs] Do you have any other thoughts on this topic?
STEPH: Our emotions are a pendulum. We swing hard one way, and then we have to wait till we come back and settle in the middle. But there's that initial passion play where you're really frustrated by something, and then you swing, and you settle back towards something that's a little more neutral.
CHRIS: I don't trust anyone who pretends like their opinions never change. It doesn't feel like a good way to be.
STEPH: Oh, I hope that...Do people say that? I hope that's not true. I hope we are all changing our opinions as we get more information.
CHRIS: Me too.
Mid-roll Ad
And now a quick break to hear from today's sponsor, Scout APM.
Scout APM is leading-edge application performance monitoring that's designed to help Rails developers quickly find and fix performance issues without having to deal with the headache or overhead of enterprise platform feature bloat. With a developer-centric UI and tracing logic that ties bottlenecks to source code, you can quickly pinpoint and resolve those performance abnormalities like N+1 queries, slow database queries, memory bloat, and much more.
Scout's real-time alerting and weekly digest emails let you rest easy knowing Scout's on watch and resolving performance issues before your customers ever see them. Scout has also launched its new error monitoring feature add-on for Python applications. Now you can connect your error reporting and application monitoring data on one platform.
See for yourself why developers call Scout their best friend and try our error monitoring and APM free for 14 days; no credit card needed. And as an added-on bonus for Bike Shed listeners, Scout will donate $5 to the open-source project of your choice when you deploy. Learn more at scoutapm.com/bikeshed. That's scoutapm.com/bikeshed.
CHRIS: Well, shifting only ever so slightly because it turns out it's a very related question, but we have a listener question. As always, thank you so much to everyone who sends in listener questions. We really appreciate them. And today's question comes from Mikhail, and he writes in, "Regarding the discussion in Episode 311 on requiring commits merged to be tested, I have a question on how you view multi commit PRs. Do you think all the commits in a PR should be tested or only the last one? If you test all commits in a PR, do you have any good tips on setups for that? Would you want all commits to pass all tests? For one, it helps a lot when using Git bisect. It is also a question of keeping the history clean and understandable.
As a background on the project I currently work on, we have the opinion that all commits should be tested and working. We have now decided on single commit PRs only since this is the only way that we can currently get the setup reasonably on our CI. I would like to sometimes make PRs with more than one commit since I want to make commits as small as possible. In order to do that, we would have to find a way to make sure all commits in the PR are tested. There seems to be some hacky ways to accomplish this, but there is not much talk about it. Also, we are strict in requiring a linear history in all our projects. Kind regards, Mikhail." So, Steph, what do you think?
STEPH: I remember reading this question when it came in. And I have an experience this week that is relevant to this mainly because I had seen this question, and I was thinking about it. And off the cuff, I haven't really thought about this. I haven't been very concerned about ensuring every single commit passes because I want to ensure that, ultimately, the final commit that I have is going in.
But I also rarely have more than one commit in a PR. So that's often my default mode. There are a couple of times that I'll have two, maybe three commits, but I think that's pretty rare for me. I'll typically have just one commit. So I haven't thought about this heavily. And it's not something that frankly I've been concerned about or that I've run into issues with.
From their perspective about using Git bisect, I could see how that could be troublesome, like if you're looking at a commit and you realize there's a particular commit that's already merged and that fails. The other area that I could think of where this could be problematic is if you're trying to roll back to a specific commit. And if you accidentally roll back to a commit that is technically broken, but you didn't know that because it was not the final commit as was getting tested on CI, that could happen. I haven't seen that happen. I haven't experienced it. So while that does seem like a legitimate concern, it's also one that I frankly just haven't had.
But because I read this question from this person earlier this week, I actually thought about it when I was crafting a PR that had several commits in it, which is kind of unusual for me since I'm usually one or two commits in a PR. But for this one, I had several because we use standard RB in our project to handle all the formatting. And right now, we have one of those standard to-do files because we added it to the project. But there are still a number of manual fixes that need to be applied.
So we just have this list of files that still need to be formatted. And as someone touches that file, we will format it, and then we'll take it out of that to-do list. So then standard RB will include it as it's linting all of our files. And I decided to do that for all of our spec files. Because I was like, well, this was the safest chunk of files to format that will require the least amount of review from folks. So I just want to address all of them in one go. But I separated the more interesting changes into different commits just to make others aware of, like, hey, this is something standard RB wants.
And it was interesting enough that I thought I would point it out. So my first commit removed all the files from that to-do list, but then my other commits are the ones that made actual changes to some of those files that needed to be corrected. So technically, one or two of my middle commits didn't pass the standard RB linting. But because CI was only running that final commit, it didn't notice that.
And I thought about this question, and so I intentionally went back and made sure each of those commits were correct at that point in time. And I feel good about that. But I still don't feel the need to add more process around ensuring each commit is going to be green. I think I would lean more in favor of let's keep our PR small to one or two commits. But I don't know; it’s something I haven't really run into. It's an interesting question. How about you? What are your experiences, or what are your thoughts on this, Chris?
CHRIS: When this question came through, I thought it was such an interesting example of considering the cost of process changes. And to once again reference one of our favorite blog posts by German Velasco, the Say No to More Process post, which we will, of course, link in the show notes. This is such a great example of there was likely a small amount of pain that was felt at one point where someone tried to run git bisect. They ran into a troublesome commit, and they were like, oh no, this happened. We need to add processes, add automation, add control to make sure this never happens again.
Personally, I run git bisect very rarely. When I do, it's always a heroic moment just to get it started and to even know which is the good and which is the bad. It's always a thing anyway. So it would be sad if I ran into one of these commits. But I think this is a pretty rare outcome. I think in the particular case that you're talking about, there's probably a way to actually tease that apart. I think it sounds like you fixed those commits knowing this, maybe because you just put it in your head. But the idea that the process that this team is working on has been changed such that they only now allow single commit PRs feels like too much process in my mind. I think I'm probably 80%, maybe 90% of the time; it’s only a single commit in a PR for me.
But occasionally, I really value having the ability to break it out into discrete steps, like these are all logically grouped in one changeset that I want to send through. But they're discrete steps that I want to break apart so that the team can more easily review it so that we have granular separation, and I can highlight this as a reference. That's often something that I'll do is I want this commit to standalone because I want it to be referenced later on. I don't want to just fold it into the broader context in which it happened, but it's pretty rare. And so to say that we can't do that feels like we're adding process where it may not be worth it, where the cost of that process change is too high relative to the value that we're getting, which is speculatively being able to run git bisect and not hit something problematic in the future.
There's also the more purist, dogmatic view of well, all commits should be passing, of course. Yeah, I totally agree with that. But what's it worth to you? How much are you willing to spend to achieve that goal? I care deeply about the correctness of my system but only the current correctness. I don't care about historical correctness as much, some. I think I'm diminishing this more than I mean to. But really back to that core question of yes, this thing has value, but is it worth the cost that we have to pay in terms of process, in terms of automation and maintenance of that automation over time, et cetera or whatever the outcome is? Is it worth that cost? And in this case, for me, this would not be worth the cost. And I would not want to adopt a workflow that says we can only ever have single commit PRs, or all commits must be run on CI or any of those variants.
STEPH: This is an interesting situation where I very much agree with everything you're saying. But I actually feel like what Mikhail wants in this world; I want it too. I think it's correct in the way that I do want all the commits to pass, and I do want to know that. And I think since I do fall into the default, like you mentioned, 80%, 90% of my PRs are one commit. I just already have that. And the fact that they're enforcing that with their team is interesting. And I'm trying to think through why that feels cumbersome to enforce that.
And I'm with you where I'll maybe have a refactor commit or something that goes before. And it's like, well, what's wrong with splitting that out into a separate PR? What's the pain point of that? And I think the pain point is the fact that one, you have two PRs that are stacked on each other. So you have the first one that you need to get reviewed, and then the second one; there’s that bit of having to hop between the two if there's some shared context that someone can't just easily review in one pull request.
But then there's also, as we just mentioned, there's CI that has to run. And so now it's running on both of them, even though maybe that's a good thing because it's running on both commits. I like the idea that every commit is tested, and every commit is green. But I actually feel like it's some of our other processes that make it cumbersome and hard to get there. And if CI did run on every commit, I think it would be ideal, but then we are increasing our CI time by running it on every commit.
And then it comes down to essentially what you said, what's the risk? So if we do merge in a commit that doesn't work or has something that's failing about it but then the next commit after that fixes it, what's the risk that we're going to roll back to that one specific commit that was broken? If that's a high risk for you and your team, then adding this process is probably the really wise thing to do because you want to make sure the app doesn't go down for users. That's incredibly important. If that's not a high risk for your team, then I wouldn't add the process.
CHRIS: Yeah, I totally agree. And to clarify my stances, for me, this change, this process change would not be worth the trade-off. I love the idea. I love the goal of it. But it is not worth the process change, and that's partly because I haven't particularly felt the pain. CI is not an inexhaustible resource I have learned. I'm actually somewhat proud our very small team that is working on the project that we're working on; we just recently ran out of our CI budget, and Circle was like, "Hey, we got to charge you more." And I was like, "Cool, do that." But it was like, there is cost both in terms of the time, clock time, and each PR running and all of those. We have to consider all of these different things.
And hopefully, we did a useful job of framing the conversation, because as always, it depends, but it depends on what. And in this case, there's a good outcome that we want to get to, but there's an associated cost. And for any individual team, how you weigh the positive of the outcome versus how you weigh the cost will alter the decision that you make. But that's I think, critically, the thing that we have to consider.
I've also noticed I've seen this conversation play out within teams where one individual may acutely feel the pain, and therefore they're anchored in that side. And the cost is irrelevant to them because they're like, I feel this pain so acutely, but other people on the team aren't working in that part of the codebase or aren't dealing with bug triage in the same way that that other developer is. And so, even within a team, there may be different levels of how you measure that. And being able to have meaningful conversations around that and productively come to a group decision and own that and move forward with that is the hard work but the important work that we have to do.
STEPH: Yeah. I think that's a great summary; it depends. On that note, shall we wrap up?
CHRIS: Let's wrap up. The show notes for this episode can be found at bikeshed.fm.
STEPH: This show is produced and edited by Mandy Moore.
CHRIS: If you enjoyed listening, one really easy way to support the show is to leave us a quick rating or even a review in iTunes, as it really helps other folks find the show.
STEPH: If you have any feedback for this or any of our other episodes, you can reach us at @_bikeshed or reach me on Twitter @SViccari.
CHRIS: And I'm @christoomey
STEPH: Or you can reach us at
[email protected] via email.
CHRIS: Thanks so much for listening to The Bike Shed, and we'll see you next week.
All: Byeeeeeeeeee!
Announcer: This podcast was brought to you by thoughtbot. thoughtbot is your expert design and development partner. Let's make your product and team a success.