2018.03.12 Let's Encrypt Wildcard Certificate Encoding Issue

During final tests for the general availability of wildcard certificate
support, the Let's Encrypt operations team issued six test wildcard
certificates under our publicly trusted root:

https://crt.sh/?id=353759994
https://crt.sh/?id=353758875
https://crt.sh/?id=353757861
https://crt.sh/?id=353756805
https://crt.sh/?id=353755984
https://crt.sh/?id=353754255

These certificates contain a subject common name that includes a “*.”
label encoded as an ASN.1 PrintableString, which does not allow the
asterisk character, violating RFC 5280.
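To make the violation concrete: PrintableString permits only letters, digits,
space, and a small set of punctuation, so a common name beginning with "*."
has to use another string type (such as UTF8String). Below is a minimal,
illustrative Go sketch of the kind of check a linter performs, re-decoding a
certificate's raw subject and flagging PrintableString attributes that carry
disallowed characters. This is an editorial illustration, not Let's Encrypt's
actual tooling.

// lintsubject.go - decode a certificate's subject and flag any attribute
// that was encoded as an ASN.1 PrintableString but contains characters
// (such as '*') that PrintableString does not allow.
package main

import (
    "crypto/x509"
    "encoding/asn1"
    "encoding/pem"
    "fmt"
    "log"
    "os"
)

// attributeTypeAndRawValue mirrors pkix.AttributeTypeAndValue but keeps the
// raw ASN.1 element, so the string tag that was actually used is visible.
type attributeTypeAndRawValue struct {
    Type  asn1.ObjectIdentifier
    Value asn1.RawValue
}

// The "SET" suffix tells encoding/asn1 to decode this slice as an ASN.1 SET OF.
type attributeSET []attributeTypeAndRawValue

// isPrintable reports whether b is in the PrintableString character set:
// letters, digits, space, and ' ( ) + , - . / : = ?  (note: no asterisk).
func isPrintable(b byte) bool {
    switch {
    case 'a' <= b && b <= 'z', 'A' <= b && b <= 'Z', '0' <= b && b <= '9':
        return true
    }
    switch b {
    case ' ', '\'', '(', ')', '+', ',', '-', '.', '/', ':', '=', '?':
        return true
    }
    return false
}

func main() {
    raw, err := os.ReadFile(os.Args[1]) // PEM or DER certificate
    if err != nil {
        log.Fatal(err)
    }
    if block, _ := pem.Decode(raw); block != nil {
        raw = block.Bytes
    }
    cert, err := x509.ParseCertificate(raw)
    if err != nil {
        log.Fatal(err)
    }
    // Re-decode the raw subject so the original string tags are preserved.
    var subject []attributeSET
    if _, err := asn1.Unmarshal(cert.RawSubject, &subject); err != nil {
        log.Fatal(err)
    }
    for _, rdn := range subject {
        for _, atv := range rdn {
            if atv.Value.Class != asn1.ClassUniversal || atv.Value.Tag != asn1.TagPrintableString {
                continue
            }
            for _, b := range atv.Value.Bytes {
                if !isPrintable(b) {
                    fmt.Printf("attribute %v: PrintableString contains disallowed byte %q\n", atv.Type, b)
                }
            }
        }
    }
}

Run against a DER or PEM download of one of the certificates above, it should
report the offending common-name attribute.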

We became aware of the problem on 2018-03-13 at 00:43 UTC via the linter
flagging in crt.sh [1]. All six certificates have been revoked.

The root cause of the problem is a Go language bug [2] that has been resolved
in Go v1.10 [3], which we were already planning to deploy soon. We will fix
the issue by upgrading to Go v1.10 before proceeding with our wildcard
certificate launch plans.

We employ a robust testing infrastructure, but there is always room for
improvement and sometimes bugs slip through our pre-production tests. We’re
fortunate that the PKI community has produced some great testing tools that
sometimes catch things we don’t. In response to this incident we are planning
to integrate additional tools into our testing infrastructure and improve our
test coverage of multiple Go versions.
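As an illustration of the kind of cross-version regression test that could
back this up (a sketch only, not a description of the actual test suite),
here is a Go test that generates a throwaway certificate with a wildcard
common name and fails if the CN is emitted as a PrintableString:

package issuance_test

import (
    "bytes"
    "crypto/ecdsa"
    "crypto/elliptic"
    "crypto/rand"
    "crypto/x509"
    "crypto/x509/pkix"
    "math/big"
    "testing"
    "time"
)

// TestWildcardCommonNameEncoding issues a throwaway self-signed certificate
// with a wildcard common name and fails if the CN was encoded as an ASN.1
// PrintableString (universal tag 0x13), which RFC 5280 does not allow for '*'.
// Running this under each supported Go toolchain should have caught the bug.
func TestWildcardCommonNameEncoding(t *testing.T) {
    cn := "*.test.invalid"
    key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
    if err != nil {
        t.Fatal(err)
    }
    tmpl := &x509.Certificate{
        SerialNumber: big.NewInt(1),
        Subject:      pkix.Name{CommonName: cn},
        NotBefore:    time.Now(),
        NotAfter:     time.Now().Add(time.Hour),
    }
    der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
    if err != nil {
        t.Fatal(err)
    }
    cert, err := x509.ParseCertificate(der)
    if err != nil {
        t.Fatal(err)
    }
    // DER prefix of a PrintableString holding exactly our CN: tag, short length, value.
    printable := append([]byte{0x13, byte(len(cn))}, cn...)
    if bytes.Contains(cert.RawSubject, printable) {
        t.Errorf("common name %q was encoded as PrintableString", cn)
    }
}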

[1] https://crt.sh/

[2] https://github.com/golang/go/commit/3b186db7b4a5cc510e71f90682732eba3df72fd3

[3] https://golang.org/doc/go1.10#encoding/asn1
josh
3/13/2018 2:35:30 AM

 > During final tests for the general availability of wildcard 
certificate support, the Let's Encrypt operations team issued six test 
wildcard certificates under our publicly trusted root:
 >
 > https://crt.sh/?id=353759994
 > https://crt.sh/?id=353758875
 > https://crt.sh/?id=353757861
 > https://crt.sh/?id=353756805
 > https://crt.sh/?id=353755984
 > https://crt.sh/?id=353754255
 >
Somebody noticed at 
https://community.letsencrypt.org/t/acmev2-and-wildcard-launch-delay/53654/62 
that the certificate for *.api.letsencrypt.org (apparently currently in 
use), issued by "TrustID Server CA A52" (IdenTrust), seems to have the 
same problem:
https://crt.sh/?id=8373036&opt=cablint,x509lint
Tom
3/13/2018 8:33:11 AM
On Tuesday, March 13, 2018 at 3:33:50 AM UTC-5, Tom wrote:
> > During final tests for the general availability of wildcard 
> certificate support, the Let's Encrypt operations team issued six test 
> wildcard certificates under our publicly trusted root:
>  >
>  > https://crt.sh/?id=353759994
>  > https://crt.sh/?id=353758875
>  > https://crt.sh/?id=353757861
>  > https://crt.sh/?id=353756805
>  > https://crt.sh/?id=353755984
>  > https://crt.sh/?id=353754255
>  >
> Somebody noticed at 
> https://community.letsencrypt.org/t/acmev2-and-wildcard-launch-delay/53654/62 
> that the certificate for *.api.letsencrypt.org (apparently currently in 
> use), issued by "TrustID Server CA A52" (IdenTrust), seems to have the 
> same problem:
> https://crt.sh/?id=8373036&opt=cablint,x509lint

I think it's just a coincidence that we got a wildcard cert from IdenTrust a long time ago and it happens to have the same encoding issue that we ran into. I notified IdenTrust in case they haven't fixed the problem since then.
josh
3/13/2018 1:46:43 PM
The fact that this mis-issuance occurred does raise a question for the
community.

For quite some time, it has been repeatedly emphasized that maintaining a
non-trusted but otherwise identical staging environment and practicing all
permutations of tests and issuances -- especially involving new
functionality -- on that parallel staging infrastructure is the mechanism
by which mis-issuances such as those mentioned in this thread may be
avoided within production environments.

Let's Encrypt has been a shining example of best practices up to this point
and has enjoyed the attendant minimization of production issues (presumably
as a result of exercising said best practices).

Despite that, however, either the test cases which resulted in these
mis-issuances were not first executed on the staging platform or did not
result in the mis-issuance there.  A reference was made to a Go lang
library error / non-conformance being implicated.  Were the builds for
staging and production compiled on different releases of Go lang?

Certainly, I think these particular mis-issuances do not significantly
affect the level of trust which should be accorded to ISRG / Let's Encrypt.

Having said that, however, it is worth noting that in a fully new and novel
PKI infrastructure, it seems likely -- based on recent inclusion / renewal
requests -- that such a mis-issuance would recently have resulted in a
disqualification of a given root / key with guidance to cut a new root PKI
and start the process over.

I am not at all suggesting consequences for Let's Encrypt, but rather
raising a question as to whether that position on new inclusions / renewals
is appropriate.  If these things can happen in a celebrated best-practices
environment, can they really in isolation be cause to reject a new
application or a new root from an existing CA?

Another question this incident raised in my mind pertains to the parallel
staging and production environment paradigm:  If one truly has the 'courage
of conviction' of the equivalence of the two environments, why would one
not perform all tests in ONLY the staging environment, with no tests and
nothing other than production transactions on the production environment?
That tests continue to be executed in the production environment while
holding to the notion that a fully parallel staging environment is the
place for tests seems to signal that confidence in the staging environment
is -- in some measure, however small -- limited.


On Tue, Mar 13, 2018 at 8:46 AM, josh--- via dev-security-policy <
dev-security-policy@lists.mozilla.org> wrote:

> On Tuesday, March 13, 2018 at 3:33:50 AM UTC-5, Tom wrote:
> > > During final tests for the general availability of wildcard
> > certificate support, the Let's Encrypt operations team issued six test
> > wildcard certificates under our publicly trusted root:
> >  >
> >  > https://crt.sh/?id=353759994
> >  > https://crt.sh/?id=353758875
> >  > https://crt.sh/?id=353757861
> >  > https://crt.sh/?id=353756805
> >  > https://crt.sh/?id=353755984
> >  > https://crt.sh/?id=353754255
> >  >
> > Somebody noticed at
> > https://community.letsencrypt.org/t/acmev2-and-wildcard-launch-delay/53654/62
> > that the certificate for *.api.letsencrypt.org (apparently currently in
> > use), issued by "TrustID Server CA A52" (IdenTrust), seems to have the
> > same problem:
> > https://crt.sh/?id=8373036&opt=cablint,x509lint
>
> I think it's just a coincidence that we got a wildcard cert from IdenTrust
> a long time ago and it happens to have the same encoding issue that we ran
> into. I notified IdenTrust in case they haven't fixed the problem since
> then.
Matthew
3/13/2018 8:13:03 PM
On Tue, Mar 13, 2018 at 4:13 PM, Matthew Hardeman via dev-security-policy <
dev-security-policy@lists.mozilla.org> wrote:

> I am not at all suggesting consequences for Let's Encrypt, but rather
> raising a question as to whether that position on new inclusions / renewals
> is appropriate.  If these things can happen in a celebrated best-practices
> environment, can they really in isolation be cause to reject a new
> application or a new root from an existing CA?
>

While I certainly appreciate the comparison, I think it's apples and
oranges when we consider both the nature and degree, nor do I think it's
fair to suggest "in isolation" is the right comparison.

I'm sure you can agree that incident response is defined by both the nature
and severity of the incident itself, the surrounding ecosystem factors
(i.e. was this a well-understood problem), and the detection, response, and
disclosure practices that follow. A system that does not implement any
checks whatsoever is, I hope, something we can agree is worse than a system
that relies on human checks (and virtually indistinguishable from no
checks), and that both are worse than a system with incomplete technical
checks.

I do agree with you that I find it challenging how the staging environment
was tested - failure to have robust profile tests in staging, for example,
is what ultimately resulted in Turktrust's notable misissuance of
unconstrained CA certificates. Similarly, given the wide availability of
certificate linting tools - such as ZLint, x509Lint, (AWS's) certlint, and
(GlobalSign's) certlint - there's no dearth of open tools and checks. Given
the industry push towards integration of these automated tools, it's not
entirely clear why LE would invent yet another, but it's also not reasonable
to require that LE use something 'off the shelf'.

I'm hoping that LE can provide more details about the change management
process and how, in light of this incident, it may change - both in terms
of automated testing and in certificate policy review.


> Another question this incident raised in my mind pertains to the parallel
> staging and production environment paradigm:  If one truly has the 'courage
> of conviction' of the equivalence of the two environments, why would one
> not perform all tests in ONLY the staging environment, with no tests and
> nothing other than production transactions on the production environment?
> That tests continue to be executed in the production environment while
> holding to the notion that a fully parallel staging environment is the
> place for tests seems to signal that confidence in the staging environment
> is -- in some measure, however small -- limited.


That's ... just a bad conclusion, especially for a publicly-trusted CA :)
Ryan
3/13/2018 9:02:06 PM
On Tuesday, March 13, 2018 at 2:02:45 PM UTC-7, Ryan Sleevi wrote:
> availability of certificate linting tools - such as ZLint, x509Lint,
> (AWS's) certlint, and (GlobalSign's) certlint - there's no dearth of
> open tools and checks. Given the industry push towards integration of
> these automated tools, it's not entirely clear why LE would invent yet
> another, but it's also not reasonable to require that LE use
> something 'off the shelf'.

We are indeed planning to integrate GlobalSign's certlint and/or zlint into
our existing cert-checker pipeline rather than build something new. We've
already started submitting issues and PRs, in order to give back to the
ecosystem:

https://github.com/zmap/zlint/issues/212
https://github.com/zmap/zlint/issues/211
https://github.com/zmap/zlint/issues/210
https://github.com/globalsign/certlint/pull/5

If your question is why we wrote cert-checker rather than use something
off-the-shelf: cablint / x509lint weren't available at the time we wrote
cert-checker. When they became available we evaluated them for production
and/or CI use, but concluded that the complex dependencies and the difficulty
of productionizing them in our environment outweighed the extra confidence we
expected to gain, especially given that our certificate profile at the time
was very static. A system improvement we could have made here would have been
to set "deploy cablint or its equivalent" as a blocker for future certificate
profile changes. I'll add that to our list of items for remediation.
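As a rough sketch of what wiring zlint into a pipeline like this might look
like - assuming zlint's Go API (zlint.LintCertificate over a zcrypto-parsed
certificate) and the summary flags on its result set, and not describing
cert-checker itself:

// Sketch only: parse a freshly issued certificate with zcrypto and run the
// registered zlint lints over it, refusing to proceed when errors are found.
package main

import (
    "fmt"
    "log"
    "os"

    "github.com/zmap/zcrypto/x509" // zlint lints zcrypto-parsed certificates
    "github.com/zmap/zlint/v3"
)

func main() {
    der, err := os.ReadFile(os.Args[1]) // DER-encoded certificate
    if err != nil {
        log.Fatal(err)
    }
    cert, err := x509.ParseCertificate(der)
    if err != nil {
        log.Fatal(err)
    }
    results := zlint.LintCertificate(cert)
    for name, r := range results.Results {
        fmt.Printf("%s: %v\n", name, r.Status)
    }
    if results.ErrorsPresent || results.FatalsPresent {
        log.Fatal("lint errors present; certificate profile needs attention")
    }
}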
jsha
3/13/2018 10:19:16 PM
On Tue, Mar 13, 2018 at 4:02 PM, Ryan Sleevi <ryan@sleevi.com> wrote:

>
>
> On Tue, Mar 13, 2018 at 4:13 PM, Matthew Hardeman via dev-security-policy
> <dev-security-policy@lists.mozilla.org> wrote:
>
>> I am not at all suggesting consequences for Let's Encrypt, but rather
>> raising a question as to whether that position on new inclusions /
>> renewals
>> is appropriate.  If these things can happen in a celebrated best-practices
>> environment, can they really in isolation be cause to reject a new
>> application or a new root from an existing CA?
>>
>
> While I certainly appreciate the comparison, I think it's apples and
> oranges when we consider both the nature and degree, nor do I think it's
> fair to suggest "in isolation" is the right comparison.
>

I thought I recalled a recent case in which a new root/key was declined
with the sole unresolved (and unresolvable, save for new key generation,
etc.) matter precluding the inclusion being a prior mis-issuance of test
certificates, already revoked and disclosed.  Perhaps I am mistaken.



>
> I'm sure you can agree that incident response is defined by both the
> nature and severity of the incident itself, the surrounding ecosystem
> factors (i.e. was this a well-understood problem), and the detection,
> response, and disclosure practices that follow. A system that does not
> implement any checks whatsoever is, I hope, something we can agree is worse
> than a system that relies on human checks (and virtually indistinguishable
> from no checks), and that both are worse than a system with incomplete
> technical checks.
>
>
I certainly concur with all of that, which is part of the basis on which I
form my own opinion that Let's Encrypt should not suffer any consequence of
significance beyond advice along the lines of "make your testing environment
and procedures better".


> I do agree with you that I find it challenging how the staging environment
> was tested - failure to have robust profile tests in staging, for example,
> is what ultimately resulted in Turktrust's notable misissuance of
> unconstrained CA certificates. Similarly, given the wide availability of
> certificate linting tools - such as ZLint, x509Lint, (AWS's) certlint, and
> (GlobalSign's) certlint - there's no dearth of open tools and checks. Given
> the industry push towards integration of these automated tools, it's not
> entirely clear why LE would invent yet another, but it's also not
> reasonable to require that LE use something 'off the shelf'.
>

I'm very interested in how the testing occurs in terms of procedures.  I
would assume, for example, that no test transaction of any kind would ever
be "played" against a production environment unless that same exact test
transaction had already been "played" against the staging environment.
With respect to this case, were these wildcard certificates requested and
issued against the staging system with materially the same test transaction
data, and if so was the encoding incorrect?  If these were not performed
against staging, what was the rational basis for executing a new and novel
test transaction against the production system first?  If they were
performed AND if they did not encode incorrectly, then what was the
disparity between the environments which led to this?  (The implication
being that some sort of change management process needs to be revised to
keep the operating environments of staging and production better
synchronized.)  If they were performed and were improperly encoded on the
staging environment, then one would presume that the erroneous result was
missed by the various automated and manual examinations of the results of
the tests.

As you note, it's unreasonable to require the use of any particular
implementation of any particular tool. But insofar as the other tools catch
things that the LE-developed tools clearly did not catch here, it would
appear that LE needs to better test their testing mechanisms. While it may
not be necessary for them to incorporate the competing tools into the live
issuance pipeline, it would seem advisable for Let's Encrypt to pass the
output of tests within their staging environment (the certificates) through
these various other testing tools as part of a post-staging-deployment
testing phase. It would seem logical to take the best-of-breed tools and
stack them up, whether automatically or manually, and waterfall the final
output of a full suite of test scenarios against the post-deployment state
of the staging environment, with a view to identifying discrepancies between
the LE tools' opinions and the external tools' opinions and reconciling
those, rejecting invalid determinations as appropriate.


>
> I'm hoping that LE can provide more details about the change management
> process and how, in light of this incident, it may change - both in terms
> of automated testing and in certificate policy review.
>
>
>> Another question this incident raised in my mind pertains to the parallel
>> staging and production environment paradigm:  If one truly has the
>> 'courage
>> of conviction' of the equivalence of the two environments, why would one
>> not perform all tests in ONLY the staging environment, with no tests and
>> nothing other than production transactions on the production environment?
>> That tests continue to be executed in the production environment while
>> holding to the notion that a fully parallel staging environment is the
>> place for tests seems to signal that confidence in the staging environment
>> is -- in some measure, however small -- limited.
>
>
> That's ... just a bad conclusion, especially for a publicly-trusted CA :)
>
>
I certainly agree it's possible that I've reached a bad conclusion there,
but I would like to better understand how specifically?  Assuming the same
input data set and software manipulating said data, two systems should in
general execute identically.  To the extent that they do not, my initial
position would be that a significant failing of change management of
operating environment or data set or system level matters has occurred.  I
would think all of those would be issues of great concern to a CA, if for
no other reason than that they should be very very rare.
Matthew
3/13/2018 10:27:08 PM
On Tuesday, March 13, 2018 at 2:02:45 PM UTC-7, Ryan Sleevi wrote:
> I'm hoping that LE can provide more details about the change management
> process and how, in light of this incident, it may change - both in terms
> of automated testing and in certificate policy review.

Forgot to reply to this specific part. Our change management process starts
with our SDLC, which mandates code review (typically dual code review), unit
tests, and where appropriate, integration tests. All unit tests and
integration tests are run automatically with every change, and before every
deploy. Our operations team checks the automated test status and will not
deploy if the tests are broken. Any configuration changes that we plan to
apply in staging and production are first added to our automated tests.

Each deploy then spends a period of time in our staging environment, where
it is subject to further automated tests: periodic issuance testing, plus
performance, availability, and correctness monitoring equivalent to our
production environment. This includes running the cert-checker software I
mentioned earlier. Typically our deploys spend two days in our staging
environment before going live, though that depends on our risk evaluation,
and hotfix deploys may spend less time in staging if we have high confidence
in their safety. Similarly, any configuration changes are applied to the
staging environment before going to production. For significant changes we
do additional manual testing in the staging environment. Generally this
testing means checking that the new change was applied as expected, and that
no errors were produced. We don't rely on manual testing as a primary way of
catching bugs; we automate everything we can.

If the staging deployment or configuration change doesn't show any problems,
we continue to production. Production has the same suite of automated live
tests as staging. And similar to staging, for significant changes we do
additional manual testing. It was this step that caught the encoding issue,
when one of our staff used crt.sh's lint tool to double check the test
certificate they issued.

Clearly we should have caught this earlier in the process. The changes we
have in the pipeline (integrating certlint and/or zlint) would have
automatically caught the encoding issue at each stage in the pipeline: in
development, in staging, and in production.
jsha
3/13/2018 10:50:50 PM
On Tuesday, March 13, 2018 at 23:51:01 UTC+1 js...@letsencrypt.org wrote:

> Clearly we should have caught this earlier in the process. The changes we
> have in the pipeline (integrating certlint and/or zlint) would have
> automatically caught the encoding issue at each stage in the pipeline: in
> development, in staging, and in production.

So, to clarify that I understand this: the same problem was in the staging
environment, and there were also certificates with the illegal encoding
issued in staging, but you didn't notice them because no one manually
validated them with the crt.sh lint?

Or are there differences between staging and production?
josef
3/14/2018 10:18:15 AM
> So, to clarify that I understand this: the same problem was in the staging environment, and there were also certificates with the illegal encoding issued in staging, but you didn't notice them because no one manually validated them with the crt.sh lint?

That's correct.

> Or are there differences between staging and production?

Yep, there are differences, though of course we try to keep them to a minimum. The most notable is that we don't use trusted keys in staging. That means staging can only submit to test CT logs, and is therefore not picked up by crt.sh.
jsha
3/14/2018 6:17:31 PM
On Tuesday, March 13, 2018 at 4:27:23 PM UTC-6, Matthew Hardeman wrote:
> I thought I recalled a recent case in which a new root/key was declined
> with the sole unresolved (and unresolvable, save for new key generation,
> etc.) matter precluding the inclusion being a prior mis-issuance of test
> certificates, already revoked and disclosed.  Perhaps I am mistaken.

I haven't seen this directly addressed. I'm not sure what incident you are
referring to, but I'm fairly sure that the mis-issuance that needed new keys
was for certificates that were issued for domains that weren't properly
validated.

In the case under discussion in this thread, all the mis-issued certificates
are only mis-issued due to encoding issues. The certificates are for
sub-domains of randomly generated subdomains of aws.radiantlock.org (which,
according to whois, is controlled by Let's Encrypt). I presume these domains
are created specifically for testing certificate issuance in the production
environment in a way that complies with the BRs.

To put it succinctly, the issue you are referring to is about issuing
certificates for domains that aren't authorized (whether for testing or
not), rather than creating test certificates.

-- Tom Prince
Tom
3/14/2018 6:35:25 PM
On Tue, Mar 13, 2018 at 6:27 PM Matthew Hardeman <mhardeman@gmail.com>
wrote:

> Another question this incident raised in my mind pertains to the parallel
>>> staging and production environment paradigm:  If one truly has the
>>> 'courage
>>> of conviction' of the equivalence of the two environments, why would one
>>> not perform all tests in ONLY the staging environment, with no tests and
>>> nothing other than production transactions on the production environment?
>>> That tests continue to be executed in the production environment while
>>> holding to the notion that a fully parallel staging environment is the
>>> place for tests seems to signal that confidence in the staging
>>> environment
>>> is -- in some measure, however small -- limited.
>>
>>
>> That's ... just a bad conclusion, especially for a publicly-trusted CA :)
>>
>>
> I certainly agree it's possible that I've reached a bad conclusion there,
> but I would like to better understand how specifically?  Assuming the same
> input data set and software manipulating said data, two systems should in
> general execute identically.  To the extent that they do not, my initial
> position would be that a significant failing of change management of
> operating environment or data set or system level matters has occurred.  I
> would think all of those would be issues of great concern to a CA, if for
> no other reason than that they should be very very rare.
>

I get the impression you may not have run complex production systems,
especially distributed systems, or spent much time with testing
methodology, given statements such as “courage of your conviction.”

No testing system is going to be perfect, and there’s a difference between
designed redundancy and unnecessary testing.

For example, even if you had 100% code coverage through tests, there are
still things that are possible to get wrong - you could test every line of
your codebase and still fail to properly handle IDNs, for example, or, as
other CAs have shown, ampersands.

It’s foolish to think that a staging environment will cover every possible
permutation - even if you solved the halting problem, you will still have
issues with, say, solar radiation induced bitflips, or RAM heat, or any
number of other issues. And yes, these are issues still affecting real
systems today, not scary stories we tell our SREs to keep them up at night.

Look at any complex system - avionics, military command-and-control,
certificate authorities, modern scalable websites - and you will find
systems designed with redundancy throughout, to ensure proper functioning.
It is the madness of inexperience to suggest that somehow this redundancy
is unnecessary or somehow a black mark - the Sean Hannity approach of “F’
it, we’ll do it live” is the antithesis of modern and secure design. The
suggestion that this is somehow a sign of insufficient testing or design
is, at best, naive, and at worst, detrimental towards discussions of how to
improve the ecosystem.

Ryan
3/14/2018 6:57:54 PM
This incident, and the resulting action to "integrate GlobalSign's certlint
and/or zlint into our existing cert-checker pipeline", has been documented
in bug 1446080 [1].

This is further proof that pre-issuance TBS certificate linting (either by
incorporating existing tools or using a comprehensive set of rules) is a
best practice that prevents misissuance. I don't understand why all CAs
aren't doing this.

- Wayne

[1] https://bugzilla.mozilla.org/show_bug.cgi?id=1446080
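For illustration, one common shape for such a pre-issuance gate is to
serialize the to-be-issued certificate under a throwaway key, lint that
artifact, and only sign with the CA key if the lints pass. A minimal Go
sketch follows, with lintDER as a hypothetical placeholder for whichever
linter is integrated; this is not a description of any particular CA's
implementation.

package issuance

import (
    "crypto"
    "crypto/ecdsa"
    "crypto/elliptic"
    "crypto/rand"
    "crypto/x509"
    "errors"
)

// lintDER is a hypothetical hook for whichever linting engine is integrated
// (cablint, x509lint, zlint, ...); it returns the problems it found.
func lintDER(der []byte) []error {
    return nil // plug a real linter in here
}

// lintThenIssue encodes the certificate twice: once under a throwaway key so
// the result can be linted exactly as it will be serialized, and, if the
// lints pass, once more under the real CA key. In practice the throwaway key
// should match the CA key's algorithm so the to-be-signed bytes are identical.
func lintThenIssue(template, issuer *x509.Certificate, subjectPub crypto.PublicKey, caKey crypto.Signer) ([]byte, error) {
    throwaway, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
    if err != nil {
        return nil, err
    }
    testDER, err := x509.CreateCertificate(rand.Reader, template, issuer, subjectPub, throwaway)
    if err != nil {
        return nil, err
    }
    if problems := lintDER(testDER); len(problems) > 0 {
        return nil, errors.New("pre-issuance lint failed; refusing to sign")
    }
    return x509.CreateCertificate(rand.Reader, template, issuer, subjectPub, caKey)
}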
Wayne
3/15/2018 7:04:53 PM
On 15/03/2018 at 20:04, Wayne Thayer wrote:
> This incident, and the resulting action to "integrate GlobalSign's certlint
> and/or zlint into our existing cert-checker pipeline", has been documented
> in bug 1446080 [1].
> 
> This is further proof that pre-issuance TBS certificate linting (either by
> incorporating existing tools or using a comprehensive set of rules) is a
> best practice that prevents misissuance. I don't understand why all CAs
> aren't doing this.
> 
> - Wayne
> 
> [1] https://bugzilla.mozilla.org/show_bug.cgi?id=1446080
> 

Should another bug be opened for the certificate issued by IdenTrust 
with apparently the same encoding problem?

https://crt.sh/?id=8373036&opt=cablint,x509lint

Does Mozilla expect the revocation of such certificates?

https://groups.google.com/d/msg/mozilla.dev.security.policy/wqySoetqUFM/l46gmX0hAwAJ
Tom
3/15/2018 7:22:43 PM
On Thu, Mar 15, 2018 at 12:22 PM, Tom via dev-security-policy <
dev-security-policy@lists.mozilla.org> wrote:

> Should another bug be opened for the certificate issued by IdenTrust with
> apparently the same encoding problem?
>
> https://crt.sh/?id=8373036&opt=cablint,x509lint

Yes - this is bug 1446121
(https://bugzilla.mozilla.org/show_bug.cgi?id=1446121).

> Does Mozilla expect the revocation of such certificates?
>
> https://groups.google.com/d/msg/mozilla.dev.security.policy/wqySoetqUFM/l46gmX0hAwAJ

Yes, within 24 hours per BR 4.9.1.1 (9): "The CA is made aware that the
Certificate was not issued in accordance with these Requirements or the
CA's Certificate Policy or Certification Practice Statement;"

Mozilla requires adherence to the BRs, and the BRs require CAs to comply
with RFC 5280.

- Wayne
Wayne
3/15/2018 7:58:51 PM
Please also put this certificate on that list:
https://crt.sh/?id=181538497&opt=cablint,x509lint

Best Regards, 
Jozsef
Jozsef
3/16/2018 3:41:26 PM