Should we keep pitest/descartes in our build?

Hi devs,

Current situation:

  • It seems XWiki committers don’t care that much about our pitest build failures. Past records show that the build can stay broken with pitest failures for more than a month without anyone caring much, and we even release with failing pitest jobs. Note that this is not the case for coverage failures, which get fixed much faster.
  • It seems I’m mostly the only one spending the time to fix the issues, and I’m a bit fed up with doing it. A lot of the time (but not always) it’s about dealing with flickers (i.e. reporting them on GitHub issues for the descartes project, adding a TODO in the pom and lowering the threshold).
  • I’m not convinced that the pros we get (a small increase in test quality) are worth the effort. I think that in theory it’s interesting, but I have the feeling that most of our tests are written well enough that the coverage percentage and mutation percentage are similar. There could be some modules where that’s not the case, but we’re not improving them with the “break-on-threshold” strategy anyway.
  • The test coverage metric/threshold is easier and faster to compute and doesn’t suffer from flickering. I wonder whether it isn’t enough for us.

Thus, I’m questioning our usage of pitest/descartes. I like the concept, and I like that we’re one of the few projects pushing it beyond test coverage; I think it would be worth it if we didn’t have the flickers.

In addition, the pitest/descartes projects are not very active and I doubt anyone is going to fix these issues any time soon, which means we would need to do the work ourselves (and I don’t see us having the motivation or time for it) or keep lowering the thresholds.

I’m not proposing anything at this stage, just wanted to check your POV on the topic.

Thanks
-Vincent

I did find a few bugs in the past thanks to pitest, but very few to be honest, and it does not feel like it’s worth the time spent on it.

I find this statement a bit unfair. It’s probably true for reporting pitest/descartes bugs recently (I did report quite a few of them, but not very recently), but not for fixing things on our side.

Regarding the flickers, I’m wondering if the default Pitest behavior is more stable than the Descartes one.

Continuing to use Descartes is not worth the trouble IMO.

I find the tool interesting personally, but mostly for improving the way tests are written in a module: I use it once in a while just to read the report and see which cases are badly handled by the tests.

Now, I personally think that all the energy we currently invest in trying to reproduce flickers and figure out why our builds are erroring on the CI could instead be used to fix the mutation score if we had a stable CI.

So in short:

  1. I’m -1 to remove the tool from our pom
  2. I’m +1 to not enforce a passing mutation coverage threshold for our releases (which is actually already the case)
  3. I’m -0 to remove our Jenkins job that checks this: we can just remove the job from the recommended view but keep it, since it’s still informative
  4. I’m +1 to reconsider it once we manage to get a stable CI

Not really. It’s flickering; there isn’t much we can do about it except fork pitest and descartes and fix the code, and I doubt we’ll ever do that.

Again, the problem is not the mutation score; that’s good and I would keep it. The issue is the flickers.

With your arguments, we would also drop the coverage check (JaCoCo) because we have CI stability issues. That would be a bad idea. The two are not mutually exclusive.

Hi Simon,

I don’t agree with this logic, because having failing jobs that we don’t fix is not a good idea. In addition, it shows us and the rest of the world that we’re not serious about it and that we don’t care. It’s disheartening for our morale :slight_smile:

Also, on principle, I don’t agree with having failing quality checks for things we don’t want to fix.

So I wouldn’t like to keep the job active (it also uses CI resources). I’d like to at least disable it for the moment, and we can revisit in 6 months.

For the pom, I also don’t like it: it’s a bit like having commented-out code in the sources, it’s just bad. If we don’t use it then we shouldn’t have it; it makes the pom larger and more complex than it should be. I’m not going to fight this point for now, so I’ll leave it. We’ll revisit it in 6 months and see how often we’ve executed it. Note that we could also easily turn the config into a command-line version that we document somewhere for you (since you seem to want to run it); see the sketch below. We don’t need to have it in the POM.
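
For reference, here is roughly what such a documented command could look like. This is a minimal sketch, not our actual setup: the exact goal and options would need to be checked, and running with the descartes engine would still require the eu.stamp-project:descartes plugin dependency to be declared somewhere (e.g. a small profile), since plugin dependencies can’t be passed on the command line.

    # Illustrative only: compile and run the tests, then run pitest's
    # mutationCoverage goal on the current module. The HTML report ends up
    # under target/pit-reports/.
    mvn clean test org.pitest:pitest-maven:mutationCoverage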

Thanks

Actually, keeping it in the pom.xml is bad, because the thresholds for each module will be wrong and will become more and more wrong as time passes. It’s like having wrong javadoc that doesn’t match the code; it’s worse than not having it.

So I’m actually against keeping the thresholds in the pom.xml if we don’t consider a passing descartes build a must-have for the release.
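
To be concrete, the kind of per-module config I mean looks roughly like this (an illustrative sketch, not our real pom; the version is a placeholder):

    <!-- Illustrative sketch: pitest-maven with descartes as the mutation engine
         and a hard-coded per-module threshold that fails the build when the
         mutation score drops below it. This threshold is the value that goes
         stale and gets lowered every time a flicker hits. -->
    <plugin>
      <groupId>org.pitest</groupId>
      <artifactId>pitest-maven</artifactId>
      <configuration>
        <mutationEngine>descartes</mutationEngine>
        <mutationThreshold>70</mutationThreshold>
      </configuration>
      <dependencies>
        <dependency>
          <groupId>eu.stamp-project</groupId>
          <artifactId>descartes</artifactId>
          <!-- placeholder version -->
          <version>1.2.4</version>
        </dependency>
      </dependencies>
    </plugin>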

@surli Please reply quickly, because right now we have commons and platform jobs failing for pitest, so we cannot release anything, including Monday’s release. And no, I don’t agree with keeping the checks and saying we don’t care that the build fails. It’s either all or nothing. We also need to rebuild faith in our builds and CI.

Also, don’t worry: this is all under SCM, so we won’t lose anything in case we want to come back to it one day :slight_smile:

Well, according to your first post this wasn’t a proposal to do anything :slight_smile:
But anyway, OK to remove the check from CI and the thresholds.

Right :slight_smile: That was 10 days ago, and we now need to do something since we need to release a final version, and it would be nicer to have this sorted out ASAP.

I think I misunderstood this sentence originally, sorry about that. On my side, I don’t think we would do that. I think it’s just that there’s not enough value provided by pitest/descartes compared to the cost of using them, and that’s mostly because of the pitest/descartes flickers. Without the flickers, I think they would just barely provide enough value.

For me, they have never been very useful: they just told me what I already knew wasn’t tested, and running the coverage in my IDE already points me to lots of places to improve. My personal opinion is that pitest/descartes becomes interesting only when you consider that you’ve done the best possible job using coverage (i.e. 100% coverage) and you want to verify that this is indeed true (or not), or when you use it in your IDE as a guide to help you write better tests from the beginning.

I haven’t followed this topic closely enough to have a strong opinion. Mutation testing is an interesting concept. The issue I have with the integration of pitest in the XWiki build is that you need to remember or find the commands to run it. I mean, I can remember mvn clean install -Pquality (for JaCoCo), but having to use a different command and parameters for the mutation coverage meant I always forgot about it. And the fact that the computed mutation coverage wasn’t always precise (sometimes even far off) lowered my trust in the tool. I mean, when I see that the mutation score has decreased, I’m not confident whether it’s due to my code or to some bug in pitest/descartes.

Thanks,
Marius

Done, see https://jira.xwiki.org/browse/XCOMMONS-1960