Proposal to adjust testing to run on PGO builds only and not test on OPT builds

I would like to propose that we do not run tests on linux64-opt, windows7-opt, and windows10-opt.

Why am I proposing this:
1) The test regressions found on trunk are mostly on debug, and in fewer cases on PGO.  There are no unique regressions found in the last 6 months (all the data I looked at) that are exclusive to OPT builds.
2) On mozilla-beta, mozilla-release, and ESR, we only build and test PGO builds; we do not run tests on plain OPT builds.
3) This will reduce the jobs we run by about 16%, which in turn reduces CPU time, money spent, turnaround time, intermittents, and complexity of the taskgraph.
4) PGO builds are very similar to OPT builds, but we add flags to generate profile data and make small adjustments to build scripts behind the MOZ_PGO flag in-tree; then we launch the browser, collect data, and repack our binaries for faster performance.
5) We ship PGO builds, not OPT builds.

What are the risks associated with this?
1) Try server build times will increase, as we will be testing on PGO instead of OPT.
2) We could miss a regression that only shows up on OPT, but since we only ship PGO and do not build OPT once we leave central, this is a very low risk.

I would like to hear any concerns you might have on this or other areas which I have overlooked.  Assuming there are no risks which block this, I would like to have a decision by January 11th, and make the adjustments on January 28th when Firefox 67 is on trunk.
0
jmaher
1/3/2019 4:17:33 PM
mozilla.dev.platform

Can we set it up so we can manually run tests on opt builds; but they
aren't by default?

I've had many instances where opt (and pgo) fail; but I can't
reproduce a test failure locally and can only do it on try. Letting me
run that test on the opt build will save the additional pgo build time
(both the cloud-cost time and the developer turn-around time.)

-tom

> _______________________________________________
> dev-platform mailing list
> dev-platform@lists.mozilla.org
> https://lists.mozilla.org/listinfo/dev-platform
0
Tom
1/3/2019 4:26:01 PM
On 03/01/2019 16:17, jmaher wrote:
> What are the risks associated with this?
> 1) try server build times will increase as we will be testing on PGO instead of OPT
> 2) we could miss a regression that only shows up on OPT, but if we only ship PGO and once we leave central we do not build OPT, this is a very low risk.

Couldn't we leave opt enabled for try and just stop running it on 
integration/central branches? That would allow faster/cheaper try but 
preserve the benefits you list above without any additional increase in 
risk compared to today. I do wonder how that would interact with 
artifact builds though; maybe it would be worth running opt *builds* 
just not opt *tests* (which I think is your proposal anyway).
0
James
1/3/2019 4:36:08 PM
Artifact builds don’t work with PGO, do they? When I do `-p all` on an artifact try push I get busted PGO builds (for example: https://treeherder.mozilla.org/#/jobs?repo=try&revision=7f8ead55ca97821c60ef38af4dec01b8bff0fdf3&selectedJob=219655864). What's needed to make it work? Requiring a full build for frontend-only changes would increase turnaround time and cut into the resource savings in (3).

Brian

0
Brian
1/3/2019 4:43:42 PM
CC Callek

How will this interact with the "shippable builds" project that Callek posted
about a while back? My understanding is there's a high probability PGO is
going away. Would it make sense to wait for that project to wrap up?

-Andrew

0
Andrew
1/3/2019 4:44:29 PM
Would this apply to talos as well? I’ve wondered before if we should care at all about opt-only talos regressions for platforms where we ship PGO. IME quite a number of talos changes (both improvements and regressions) only show up on one or the other, so dropping one would simplify things.

Brian

0
Brian
1/3/2019 4:48:50 PM
On Thu, Jan 3, 2019 at 8:43 AM Brian Grinstead <bgrinstead@mozilla.com>
wrote:

> Artifact builds don’t work with PGO, do they? When I do `-p all` on an
> artifact try push I get busted PGO builds (for example:
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=7f8ead55ca97821c60ef38af4dec01b8bff0fdf3&selectedJob=219655864).
> What's needed to make it work? Requiring a full build for frontend-only
> changes would increase the turnaround time and resource savings in (3).
>

I can partly address this.  There are two things at play (at least):

1) automation builds need a special configuration piece in place to
properly support artifact builds.  Almost certainly that's not in place for
PGO builds, since it's such an unusual thing to do: "you want to pack PGO
binaries into a development build... why?"  But there's really no reason we
can't do that in automation so I've filed
https://bugzilla.mozilla.org/show_bug.cgi?id=15175323 for these things.
It's not high priority but we might as well capture the request; in
general, we always want try pushes to succeed with sensible results if we
can arrange it.

2) locally, we need to teach the artifact code to sniff whatever mozconfig
options say "I'm doing PGO" and fetch the right binaries based on that.  I
think that enabling PGO locally is a little delicate, and I know that
chmanchester (and others?) is working hard to make this more robust, so
perhaps this is easy or becomes easy soon.  I've filed
https://bugzilla.mozilla.org/show_bug.cgi?id=1517532 to track this.

If I'm wrong about the feasibility of these things, please update the
tickets!

Best,
Nick
0
Nicholas
1/3/2019 5:41:24 PM
On 03/01/2019 16:17, jmaher wrote:
> I would like to propose that we do not run tests on linux64-opt, windows7-opt, and windows10-opt.
> 
> Why am I proposing this:
> 1) All test regressions that were found on trunk are mostly on debug, and in fewer cases on PGO.  There are no unique regressions found in the last 6 months (all the data I looked at) that are exclusive to OPT builds.
> 2) On mozilla-beta, mozilla-release, and ESR, we only build/test PGO builds, we do not run tests on plan OPT builds
> 3) This will reduce the jobs (about 16%) we run which in turn reduces, cpu time, money spent, turnaround time, intermittents, complexity of the taskgraph.
> 4) PGO builds are very similar to OPT builds, but we add flags to generate profile data and small adjustments to build scripts behind MOZ_PGO flag in-tree, then we launch the browser, collect data, and repack our binaries for faster performance.
> 5) We ship PGO builds, not OPT builds
> 
> What are the risks associated with this?
> 1) try server build times will increase as we will be testing on PGO instead of OPT
> 2) we could miss a regression that only shows up on OPT, but if we only ship PGO and once we leave central we do not build OPT, this is a very low risk.

It's not just tryserver build times. Presumably this will also tend to 
increase the time between a patch landing on inbound or autoland and any 
resulting test failures showing up.

This seems like a negative in that it means more patches are likely to 
have landed on top of the regressing one in the meantime, potentially 
complicating backouts, and the original developer may be less likely to 
still be around for a quick investigation/fix.

How long does it typically take for full PGO test results to be 
available for a push? How does that compare to full Opt test results? 
ISTM that if the increase is quite marginal, this may be OK, but if the 
latency becomes substantially greater, there will be a continual cost in 
increased developer and/or sheriff pain.

JK
0
Jonathan
1/3/2019 5:51:42 PM
I should say that the shippable build proposal (
https://groups.google.com/d/msg/mozilla.dev.planning/JomJmzGOGMY/vytPViZBDgAJ)
doesn't seem to intersect negatively with this.

And in fact I think these two proposals complement each other quite nicely.

Additionally I have no concerns over this work taking place prior to my
work being complete.

On the specific proposal front, I can envision us allowing tests to be run
on non-pgo builds via triggers (so never by default, but always
backfillable/selectable) should someone need to try and bisect an issue
that is discovered... I'm not sure if the code maintenance burden is worth
it for the benefit but I don't hold a strong opinion there.
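
A minimal sketch of the "never by default, but always selectable" idea, written as plain Python rather than against the real taskgraph code. The task records, the labels, and the `runs_by_default` and `select_tasks` helpers are all hypothetical and only illustrate the policy:

```python
# Hypothetical task records; real taskgraph tasks carry far more detail.
TASKS = [
    {"label": "build-linux64/opt", "kind": "build", "build_type": "opt"},
    {"label": "test-linux64/opt-mochitest-1", "kind": "test", "build_type": "opt"},
    {"label": "test-linux64/pgo-mochitest-1", "kind": "test", "build_type": "pgo"},
    {"label": "test-linux64/debug-mochitest-1", "kind": "test", "build_type": "debug"},
]


def runs_by_default(task):
    """Assumed policy: everything runs by default except tests on opt builds."""
    return not (task["kind"] == "test" and task["build_type"] == "opt")


def select_tasks(tasks, requested=()):
    """Return labels to schedule: the defaults plus any explicitly requested."""
    requested = set(requested)
    return [
        task["label"]
        for task in tasks
        if runs_by_default(task) or task["label"] in requested
    ]


default_labels = select_tasks(TASKS)
with_opt_labels = select_tasks(TASKS, requested=["test-linux64/opt-mochitest-1"])
print(default_labels)
print(with_opt_labels)
```

In this sketch the opt build itself still runs by default, so an opt test can be added later without waiting for a new build; whether that matches the eventual design is an open question in this thread.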

~Justin Wood (Callek)

0
Justin
1/3/2019 6:07:01 PM
On 01/03/2019 09:51 AM, Jonathan Kew wrote:
> On 03/01/2019 16:17, jmaher wrote:
>>
>> What are the risks associated with this?
>> 1) try server build times will increase as we will be testing on PGO 
>> instead of OPT
>> 2) we could miss a regression that only shows up on OPT, but if we 
>> only ship PGO and once we leave central we do not build OPT, this is 
>> a very low risk.
>
> It's not just tryserver build times. Presumably this will also tend to 
> increase the time between a patch landing on inbound or autoland and 
> any resulting test failures showing up.
>
> This seems like a negative in that it means more patches are likely to 
> have landed on top of the regressing one in the meantime, potentially 
> complicating backouts, and the original developer may be less likely 
> to still be around for a quick investigation/fix.
>
> How long does it typically take for full PGO test results to be 
> available for a push? How does that compare to full Opt test results? 
> ISTM that if the increase is quite marginal, this may be OK, but if 
> the latency becomes substantially greater, there will be a continual 
> cost in increased developer and/or sheriff pain. 

Good points, but given that most failures will show up on debug builds, it 
seems like a more relevant metric is the difference between time(Opt) vs 
min(time(debug), time(PGO)). Though debug builds may run slow enough 
that it boils down to what you said?

Looking at Windows 64-bit jobs from a random push ( 
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=63027ff03effb04ed4bf53bbb0c9aa1bad4b4c9b 
), I see:

pgo: build=119min + Wd1=15min
opt: build=55min + Wd1=13min
debug: build=46min + Wd1=22min

So by that, you get opt and debug Wd1 results back at the same time 
(67-68min) and pgo Wd1 results take twice as long (134min). I imagine 
there are much slower test jobs that make this situation cloudier, but 
assuming the general picture holds, then it seems like opt is mostly 
redundant with debug.
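
The arithmetic above can be sketched in a few lines of Python. The durations are the ones from the sample push; the assumption that a test starts the moment its build finishes (no queueing or scheduling delay) is a simplification, and the names are only for illustration:

```python
# Build and Wd1 test durations in minutes, copied from the sample push above.
JOBS = {
    "pgo": {"build": 119, "wd1": 15},
    "opt": {"build": 55, "wd1": 13},
    "debug": {"build": 46, "wd1": 22},
}


def time_to_result(job):
    """Minutes from push to a finished Wd1 run, assuming the test starts
    as soon as its build finishes (no queueing or scheduling delay)."""
    return job["build"] + job["wd1"]


totals = {name: time_to_result(job) for name, job in JOBS.items()}

# Earliest Wd1 signal today (opt is tested) versus with opt tests removed,
# i.e. time(Opt) compared against min(time(debug), time(PGO)).
first_with_opt = min(totals.values())
first_without_opt = min(totals["debug"], totals["pgo"])

for name, minutes in totals.items():
    print(f"{name}: {minutes} min")
print(f"first result with opt tests: {first_with_opt} min")
print(f"first result without opt tests: {first_without_opt} min")
```

For this one job the earliest result arrives at the same time with or without opt tests; a slower test suite would change the picture.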

The majority of your currently opt-triggered backouts will still happen, 
just using debug results now. This is assuming debug normally catches a 
superset of the problems that opt would, which is asserted in #1 of 
jmaher's post.

+1 from me for killing off opt tests.

0
Steve
1/3/2019 6:16:14 PM
On 01/03/2019 10:07 AM, Justin Wood wrote:
> on the specific proposal front I can envision us allowing tests to be run
> on non-pgo builds via triggers (so never by default, but always
> backfillable/selectable) should someone need to try and bisect an issue
> that is discovered... I'm not sure if the code maintenance burden is worth
> it for the benefit but I don't hold a strong opinion there.

Is it a lot of maintenance? We have this for some other jobs 
(linux64-shell-haz is the one I'm most familiar with, but it's a 
standalone job so doesn't have non-toolchain graph dependencies). I get 
quite a bit of value out of the resulting faster hack-try-debug cycles; 
I would imagine it to be at least as useful to have a turnaround time of 
1 hour for opt vs 2 hours for pgo.

0
Steve
1/3/2019 6:22:16 PM
On 03/01/2019 18:16, Steve Fink wrote:

> Good points, but given that most failures will show up debug builds, it 
> seems like a more relevant metric is the difference between time(Opt) vs 
> min(time(debug), time(PGO)). Though debug builds may run slow enough 
> that it boils down to what you said?
> 
> Looking at Windows 64-bit jobs from a random push ( 
> https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=63027ff03effb04ed4bf53bbb0c9aa1bad4b4c9b 
> ), I see:
> 
> pgo: build=119min + Wd1=15min
> opt: build=55min + Wd1=13min
> debug: build=46min + Wd1=22min
> 
> So by that, you get opt and debug Wd1 results back at the same time 
> (67-68min) and pgo Wd1 results take twice as long (134min). I imagine 
> there are much slower test jobs that make this situation cloudier, but 
> assuming the general pictures holds then it seems like opt is mostly 
> redundant with debug.

I think a good rule of thumb is that debug tests are about twice as slow 
as opt, with the same chunking. So for a test job taking closer to an 
hour on opt (which some do), you can easily wait 45 minutes longer for 
debug results than for opt results. We could of course chunk more, but there's 
overhead there that would eat some of the regained capacity.
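
As a rough illustration, taking the rule of thumb at face value (a debug test run takes twice as long as the same run on opt) and reusing the build times from the sample push quoted above, the extra wait for debug results grows with the length of the test job. The constants and the `debug_lag` helper are assumptions for this sketch, not measured data:

```python
OPT_BUILD_MIN = 55    # opt build time from the sample push quoted above
DEBUG_BUILD_MIN = 46  # debug build time from the same push
DEBUG_SLOWDOWN = 2.0  # rule of thumb: debug tests take about 2x as long as opt


def debug_lag(opt_test_min):
    """Minutes that debug results arrive after opt results for one test job,
    ignoring queueing and assuming identical chunking."""
    opt_done = OPT_BUILD_MIN + opt_test_min
    debug_done = DEBUG_BUILD_MIN + DEBUG_SLOWDOWN * opt_test_min
    return debug_done - opt_done


for opt_test_min in (13, 30, 60):
    lag = debug_lag(opt_test_min)
    print(f"opt test {opt_test_min} min -> debug results {lag:.0f} min later")
```

Under these assumptions a short job barely differs, while a job that takes an hour on opt leaves debug results roughly 50 minutes behind.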

I wonder if an alternative would be running opt+debug on integration 
branches and pgo+debug on central. That would have the obvious 
disadvantage that pgo-only failures would be caught much later, but it 
would keep current end-to-end times for integration and slightly better 
capacity savings. I don't know how common pgo-only failures are compared 
to other things that we are only catching on central.
0
James
1/3/2019 6:28:16 PM
I don't think it's much burden, but code complexity adds up, and it comes
down to a matter of "how useful is this really?" Even if the maintenance
burden is low, it is still a tradeoff. I'm just saying I suspect it's
possible to do this, but I'm not sure it is useful in the end (and I'm not
looking to make the call on that).

~Justin Wood (Callek)

0
Justin
1/3/2019 6:36:50 PM
On Thu, Jan 3, 2019 at 7:22 PM Steve Fink <sfink@mozilla.com> wrote:

> I get
> quite a bit of value out of the resulting faster hack-try-debug cycles;
> I would imagine it to be at least as useful to have a turnaround time of
> 1 hour for opt vs 2 hours for pgo.
>

+1. The past week I've been Try-debugging (1) an intermittent Talos orange
(affected only Win64 opt and pgo, bug 1516679) and (2) an intermittent dt8
orange (affected only Win32 opt and pgo, bug 1516967). This was a pretty
annoying process, but pgo builds would have made this much worse. I'd
really appreciate it if we considered keeping "opt" as an optional
configuration for these use cases - it will save some people a lot of time.

Thanks,
Jan



0
Jan
1/3/2019 6:39:36 PM
Thank you Joel for writing up this proposal!

Are you also proposing that we stop the linux64-opt and win64-opt builds as
well, except for leaving them as an available option on try? If we're not
testing them on integration or release branches, there doesn't seem to be
much purpose in doing the builds.

0
Chris
1/3/2019 9:46:06 PM
+1.

The goal of shippable builds is twofold:

1. to make sure opt builds+tests, or similar (artifact builds?) answer the
question "is my commit good?" as fast as possible, and
2. to make sure shippable builds+tests answer the question "are these
binaries correct and ready to ship, if we decide to ship this revision?"

I agree that we should run a full suite of tests against shippable builds,
which probably includes things like performance testing.
We still need some class of builds+tests that answer the question "is my
commit good?" quickly. If debug builds are sufficient for the most part,
and opt builds+tests on try fill in the gaps, then yes. (That appears to be
what this thread is largely about.) If not, I could see us having at least
some subset of tests running against opt or artifact builds.

If we switch talos to PGO now, we'll probably switch them to shippable
builds at some point in the near future.


0
Aki
1/3/2019 11:31:12 PM
Nicholas Alexander wrote on 03.01.19 18:41:

> 1) automation builds need a special configuration piece in place to
> properly support artifact builds.  Almost certainly that's not in place for
> PGO builds, since it's such an unusual thing to do: "you want to pack PGO
> binaries into a development build... why?"  But there's really no reason we
> can't do that in automation so I've filed
> https://bugzilla.mozilla.org/show_bug.cgi?id=15175323 for these things.

This is actually: https://bugzilla.mozilla.org/show_bug.cgi?id=1517533

Thanks for filing those bugs.

-- 
Henrik Skupin
Senior Software Engineer
Mozilla Corporation
0
Henrik
1/4/2019 8:37:43 AM
On Thu, Jan 3, 2019 at 1:47 PM Chris AtLee <catlee@mozilla.com> wrote:

> Thank you Joel for writing up this proposal!
>
> Are you also proposing that we stop the linux64-opt and win64-opt builds as
> well, except for leaving them as an available option on try? If we're not
> testing them on integration or release branches, there doesn't seem to be
> much purpose in doing the builds.
>

One reason we might not want to stop producing opt builds: we produce
artifact builds against opt (and debug, with --enable-debug in the local
mozconfig).  It'll be very odd to have --enable-artifact-build and
_require_ --enable-pgo or whatever it is in the local mozconfig.

I expect that these opt build platforms will be relatively inexpensive to
preserve, because step one (IIUC) of pgo is to build the same source files
as the opt builds.  So with luck we get sccache hits between the jobs.
Perhaps somebody with more knowledge of pgo and sccache can confirm or
refute that assertion?

Nick
0
Nicholas
1/4/2019 4:56:52 PM
On Fri, Jan 4, 2019 at 11:57 AM Nicholas Alexander
<nalexander@mozilla.com> wrote:
> One reason we might not want to stop producing opt builds: we produce
> artifact builds against opt (and debug, with --enable-debug in the local
> mozconfig).  It'll be very odd to have --enable-artifact-build and
> _require_ --enable-pgo or whatever it is in the local mozconfig.

This seems reasonable.  (I'm in agreement with the people upthread
that think we should have opt testing, but regardless of that
particular outcome, not requiring people to put goo in their
mozconfigs seems like a noble goal.)

> I expect that these opt build platforms will be relatively inexpensive to
> preserve, because step one (IIUC) of pgo is to build the same source files
> as the opt builds.  So with luck we get sccache hits between the jobs.
> Perhaps somebody with more knowledge of pgo and sccache can confirm or
> refute that assertion?

PGO uses different compilation flags than a normal opt build in both
the profiling and the profile use phases (for instrumentation, etc.),
so I'd assume that opt builds and PGO builds would not share compiled
objects.
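
To illustrate why differing flags would prevent shared cache hits, here is a minimal sketch. It assumes a compiler cache that keys each object on the source text plus the full flag list. This is a simplification and not sccache's actual key derivation; the `cache_key` helper and the GCC/Clang-style flag lists are hypothetical examples, not the flags our builds use.

```python
import hashlib


def cache_key(source: str, flags: list[str]) -> str:
    """Hypothetical cache key: hash of the source text plus every compiler flag."""
    h = hashlib.sha256()
    h.update(source.encode("utf-8"))
    for flag in flags:
        h.update(b"\0")  # separator so adjacent flags cannot run together
        h.update(flag.encode("utf-8"))
    return h.hexdigest()


source = "int main(void) { return 0; }"

# Assumed, illustrative flag sets for the three kinds of build.
opt_flags = ["-O2"]
pgo_generate_flags = ["-O2", "-fprofile-generate"]  # instrumented phase
pgo_use_flags = ["-O2", "-fprofile-use"]            # profile-use phase

keys = {
    "opt": cache_key(source, opt_flags),
    "pgo-generate": cache_key(source, pgo_generate_flags),
    "pgo-use": cache_key(source, pgo_use_flags),
}

# Same source, three different flag sets: count how many distinct keys result.
distinct_keys = len(set(keys.values()))
print(distinct_keys)
```

Under that assumption, identical source compiled with each flag set lands in a separate cache slot, so an opt build would not warm the cache for either PGO phase.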

-Nathan
0
Nathan
1/4/2019 5:04:08 PM
Thanks everyone for your comments on this.  It sounds like, from a practical standpoint, this is not a desirable change until we can get the runtimes of PGO builds on try and in integration to be less than debug build times.

A few common responses:
* artifact opt builds on try are fast for quick iterations, a must have
* can we do artifact builds for PGO? (thanks :nalexander for bug 1517533 and bug 1517532)
* what about talos?  We need to investigate this more; I have always argued against PGO-only for talos, but maybe we can revisit that (bug 1514829)
* do we turn off builds as well?  I had proposed turning off just the tests; if we decide to turn off talos it would make sense to turn off builds.

Thanks all for the quick feedback.  When the bugs in this thread are further along, or if I see another simpler solution for reducing the duplication, I will follow up.
0
jmaher
1/4/2019 8:24:06 PM
>* do we turn off builds as well?  I had proposed just the tests, if we decide to turn off talos it would make sense to turn off builds.

Would turning off opt builds cause problems if you want to mozregression
an opt build?  And would this be an issue?  (obviously it might be for
opt-only failures, or trying to verify if a regression identified in
mozregression for PGO was a PGO bug or not, though that could be checked
at the cost of a build or 4 even if we don't build opt, probably).

-- 
Randell Jesup, Mozilla Corp
remove "news" for personal email
0
Randell
1/7/2019 9:39:09 PM
Earlier today I landed a fix for bug 1517532 that will mean that an
artifact build with MOZ_PGO set will pull artifacts from an automation pgo
build. As a result, artifact pgo builds as triggered by a "-p all
--artifact..." will succeed now as well (and consume pgo'd artifacts).

If we end up wanting to turn off opt builds in automation after all we may
be able to pull artifacts from pgo builds for local artifact builds by
default. The behavior of the compiled code shouldn't be different -- this
probably wouldn't matter to people developing front end code locally.

Chris

On Fri, Jan 4, 2019 at 12:25 PM jmaher <joel.maher@gmail.com> wrote:

> thanks everyone for your comments on this.  It sounds like from a
> practical standpoint until we can get the runtimes of PGO builds on try and
> in integration to be less than debug build times this is not a desirable
> change.
>
> A few common responses:
> * artifact opt builds on try are fast for quick iterations, a must have
> * can we do artifact builds for PGO? (thanks :nalexander for bug 1517533
> and bug 1517532)
> * what about talos?  we need to investigate this more, I have always
> argued against pgo only for talos, but maybe we can revisit that (bug
> 1514829)
> * do we turn off builds as well?  I had proposed just the tests, if we
> decide to turn off talos it would make sense to turn off builds.
>
> Thanks all for the quick feedback, when the bugs in this thread are
> further along, or if I see another simpler solution for reducing the
> duplication, I will follow up.
> _______________________________________________
> dev-platform mailing list
> dev-platform@lists.mozilla.org
> https://lists.mozilla.org/listinfo/dev-platform
>
0
Chris
1/9/2019 7:36:06 AM
Following up on this, thanks to Chris we have fast artifact builds for PGO, so the time to develop and use try server is in parity with current opt solutions for many cases (front end development, most bisection cases).

I have also looked in depth at what the impact on the integration branches would be.  In the data set from July-December (H2 2018) there were 11 instances of tests that we originally only scheduled in the OPT config and we didn't have PGO or Debug test jobs to point out the regression (this is due to scheduling choices).  The worst-case scenario is finding the regression on PGO up to 1 hour later 11 times, or roughly 2x/month.  Backfilling to find the offending patch, as we do now 24% of the time, would take similar time.  In fact, running the OPT jobs on Debug instead would result in the same time for all 11 instances (due to more chunks on debug and similar runtimes).  In short, little to no impact.
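
As a quick check of the arithmetic above, here is a minimal sketch using only the figures quoted in this post (11 instances, a six-month window, up to 1 hour of extra delay each). The variable names are illustrative.

```python
# Figures taken from the paragraph above; names are illustrative only.
instances = 11          # regressions first caught only by OPT jobs
months = 6              # July-December 2018
max_delay_hours = 1.0   # worst-case extra time to catch each one on PGO

per_month = instances / months
worst_case_total_hours = instances * max_delay_hours

print(f"{per_month:.2f} late detections per month")
print(f"{worst_case_total_hours:.0f} hours of worst-case delay over {months} months")
```

That gives about 1.83 late detections per month, which is where the "roughly 2x/month" figure comes from, and at most 11 hours of added delay across the whole six months.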

Lastly there was a pending question about talos.  There is an edge case where we can see a regression on talos that is PGO, but it is unrelated to the code and just a side effect of how PGO works.  I looked into that in https://bugzilla.mozilla.org/show_bug.cgi?id=1514829.  I found that if we didn't get opt alerts that we would not have missed any regressions.  Furthermore, for the regressions, for the ones that were pgo only regressions (very rare) there were many other regressions at the same time (say a build change, or test change, etc.) and usually these were accepted changes, backed out, or investigated on a different test or platform.  In the past when we have determined a regression is a PGO artifact we have resolved it as WONTFIX and moved on.

Given this summary, I feel that most concerns around removing testing for OPT are addressed.  I would also like to extend the proposal to remove the OPT builds since no unit or perf tests would run on there.

As my original timeline is not realistic, I would like to see if there are comments until next Wednesday, January 23rd; then I can follow up on remaining issues or work towards ensuring we start the process of making this happen and what the right timeline is.
0
jmaher
1/17/2019 4:42:37 PM
On 17/01/2019 16:42, jmaher wrote:
> Following up on this, thanks to Chris we have fast artifact builds for PGO, so the time to develop and use try server is in parity with current opt solutions for many cases (front end development, most bisection cases).

Even as someone not making frequent changes to compiled code I 
occasionally want to both rebuild and run tests on opt (e.g. because 
some test changes also require changes to moz.build files that could 
break the build in a way that isn't caught by an artifact build). In 
this case adding an extra hour of end-to-end time on try is a pretty 
serious regression.

For my specific use case it might be enough if we could schedule 
artifact builds for PGO and full builds for debug. But I suspect it's 
going to work better for more people — and save more resources overall — 
to simply keep the default try configuration as-is and just turn off 
non-PGO opt builds (or at least tests) on integration branches / central.
0
James
1/17/2019 5:04:04 PM
Hi Joel,

Can you say more about this point in your original email: "3) This will
reduce the jobs (about 16%) we run which in turn reduces, cpu time, money
spent, turnaround time, intermittents, complexity of the taskgraph." It
seems to me that if we remove non-PGO opt builds even on Try, we might use
more cpu time because there are so many Try pushes requesting opt builds.
Do we have data on this?

Thanks,
Jan

On Thu, Jan 17, 2019 at 5:45 PM jmaher <joel.maher@gmail.com> wrote:

> Following up on this, thanks to Chris we have fast artifact builds for
> PGO, so the time to develop and use try server is in parity with current
> opt solutions for many cases (front end development, most bisection cases).
>
> I have also looked in depth at what the impact on the integration branches
> would be.  In the data set from July-December (H2 2018) there were 11
> instances of tests that we originally only scheduled in the OPT config and
> we didn't have PGO or Debug test jobs to point out the regression (this is
> due to scheduling choices).  Worse case scenario is finding the regression
> on PGO up to 1 hour later 11 times or roughly 2x/month.  Backfilling to
> find the offending patch as we do now 24% of the time would be similar
> time.  In fact running the OPT jobs on Debug instead would result in same
> time for all 11 instances (due to more chunks on debug and similar
> runtimes).  In short, little to no impact.
>
> Lastly there was a pending question about talos.  There is an edge case
> where we can see a regression on talos that is PGO, but it is unrelated to
> the code and just a side effect of how PGO works.  I looked into that in
> https://bugzilla.mozilla.org/show_bug.cgi?id=1514829.  I found that if we
> didn't get opt alerts that we would not have missed any regressions.
> Furthermore, for the regressions, for the ones that were pgo only
> regressions (very rare) there were many other regressions at the same time
> (say a build change, or test change, etc.) and usually these were accepted
> changes, backed out, or investigated on a different test or platform.  In
> the past when we have determined a regression is a PGO artifact we have
> resolved it as WONTFIX and moved on.
>
> Given this summary, I feel that most concerns around removing testing for
> OPT are addressed.  I would also like to extend the proposal to remove the
> OPT builds since no unit or perf tests would run on there.
>
> As my original timeline is not realistic, I would like to see if there are
> comments until next Wednesday- January 23rd, then I can follow up on
> remaining issues or work towards ensuring we start the process of making
> this happen and what the right timeline is.
0
Jan
1/17/2019 5:52:10 PM
Thanks for asking Jan.  I think 16% is the maximum we can save.  In talking
with a few more people, I think a middle of the road proposal would be to:
Turn off linux64/windows7/windows10 opt builds+tests on autoland and
mozilla-inbound.  Leave them on for mozilla-central and try.

This allows try to be faster as needed, continues to offer peace of mind
by running the tests on m-c (and sheriffs can backfill if needed), and
removes confusion about building/testing locally vs try.  This would be
similar to what we already see, where many people only test opt on try and
land, and if a pgo test regresses we would need to back out.

Are there any concerns with this latest proposal?


On Thu, Jan 17, 2019 at 12:52 PM Jan de Mooij <jdemooij@mozilla.com> wrote:

> Hi Joel,
>
> Can you say more about this point in your original email: "3) This will
> reduce the jobs (about 16%) we run which in turn reduces, cpu time, money
> spent, turnaround time, intermittents, complexity of the taskgraph." It
> seems to me that if we remove non-PGO opt builds even on Try, we might use
> more cpu time because there are so many Try pushes requesting opt builds.
> Do we have data on this?
>
> Thanks,
> Jan
>
> On Thu, Jan 17, 2019 at 5:45 PM jmaher <joel.maher@gmail.com> wrote:
>
>> Following up on this, thanks to Chris we have fast artifact builds for
>> PGO, so the time to develop and use try server is in parity with current
>> opt solutions for many cases (front end development, most bisection cases).
>>
>> I have also looked in depth at what the impact on the integration
>> branches would be.  In the data set from July-December (H2 2018) there were
>> 11 instances of tests that we originally only scheduled in the OPT config
>> and we didn't have PGO or Debug test jobs to point out the regression (this
>> is due to scheduling choices).  Worse case scenario is finding the
>> regression on PGO up to 1 hour later 11 times or roughly 2x/month.
>> Backfilling to find the offending patch as we do now 24% of the time would
>> be similar time.  In fact running the OPT jobs on Debug instead would
>> result in same time for all 11 instances (due to more chunks on debug and
>> similar runtimes).  In short, little to no impact.
>>
>> Lastly there was a pending question about talos.  There is an edge case
>> where we can see a regression on talos that is PGO, but it is unrelated to
>> the code and just a side effect of how PGO works.  I looked into that in
>> https://bugzilla.mozilla.org/show_bug.cgi?id=1514829.  I found that if
>> we didn't get opt alerts that we would not have missed any regressions.
>> Furthermore, for the regressions, for the ones that were pgo only
>> regressions (very rare) there were many other regressions at the same time
>> (say a build change, or test change, etc.) and usually these were accepted
>> changes, backed out, or investigated on a different test or platform.  In
>> the past when we have determined a regression is a PGO artifact we have
>> resolved it as WONTFIX and moved on.
>>
>> Given this summary, I feel that most concerns around removing testing for
>> OPT are addressed.  I would also like to extend the proposal to remove the
>> OPT builds since no unit or perf tests would run on there.
>>
>> As my original timeline is not realistic, I would like to see if there
>> are comments until next Wednesday- January 23rd, then I can follow up on
>> remaining issues or work towards ensuring we start the process of making
>> this happen and what the right timeline is.
0
Joel
1/18/2019 9:36:01 PM