storage: tls error: bad record MAC #1581

Closed
mengzhuo opened this issue Sep 14, 2019 · 7 comments
Labels
api: storage - Issues related to the Cloud Storage API.
needs more info - This issue needs more information from the customer to proceed.
status: investigating - The issue is under investigation, which is determined to be non-trivial.

@mengzhuo

I found it in the Go build farm:
Error: writeSnapshot: local error: tls: bad record MAC

I think this is caused by the "finish_write_snapshot_to_gcs" step, while the coordinator is trying to write the snapshot to GCS.

golang/go#31217
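
For context, the failing step is essentially a GCS object upload, roughly like this (a minimal sketch for illustration, not the coordinator's actual writeSnapshot code; the bucket and object names are made up):

package main

import (
	"context"
	"io"
	"log"
	"os"

	"cloud.google.com/go/storage"
)

func main() {
	ctx := context.Background()
	client, err := storage.NewClient(ctx)
	if err != nil {
		log.Fatalf("storage.NewClient: %v", err)
	}
	defer client.Close()

	// Hypothetical bucket/object names, just to show the shape of the call.
	wr := client.Bucket("snapshots-bucket").Object("go-snapshot.tar.gz").NewWriter(ctx)

	f, err := os.Open("go-snapshot.tar.gz")
	if err != nil {
		log.Fatalf("open: %v", err)
	}
	defer f.Close()

	// In our case the "local error: tls: bad record MAC" surfaces somewhere
	// in this upload path, either while streaming the body or when Close
	// flushes the upload.
	if _, err := io.Copy(wr, f); err != nil {
		log.Fatalf("upload: %v", err)
	}
	if err := wr.Close(); err != nil {
		log.Fatalf("close: %v", err)
	}
}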

Full log:
      rev: e7e2b1c2b91320ef0ddf025d330061d56115dd53
 buildlet: http://ls-3a3k reverse peer ls-3a3k/129.226.132.234:39892 for host type host-linux-mipsle-mengzhuo
  started: 2019-09-11 23:53:42.184067314 +0000 UTC m=+58.957360392
    ended: 2019-09-12 03:57:57.089056773 +0000 UTC m=+14713.862349853
  success: false

Events:
  2019-09-11T23:53:42Z checking_for_snapshot 
  2019-09-11T23:53:42Z finish_checking_for_snapshot after 191.2ms
  2019-09-11T23:53:42Z get_buildlet 
  2019-09-11T23:53:42Z wait_static_builder host-linux-mipsle-mengzhuo
  2019-09-11T23:53:42Z waiting_machine_in_use 
  2019-09-12T03:48:32Z finish_wait_static_builder after 3h54m49.4s; host-linux-mipsle-mengzhuo
  2019-09-12T03:48:32Z clean_buildlet http://ls-3a3k reverse peer ls-3a3k/129.226.132.234:39892 for host type host-linux-mipsle-mengzhuo
  2019-09-12T03:48:32Z finish_clean_buildlet after 443ms; http://ls-3a3k reverse peer ls-3a3k/129.226.132.234:39892 for host type host-linux-mipsle-mengzhuo
  2019-09-12T03:48:32Z finish_get_buildlet after 3h54m50s
  2019-09-12T03:48:32Z using_buildlet ls-3a3k
  2019-09-12T03:48:32Z write_version_tar 
  2019-09-12T03:48:32Z get_source 
  2019-09-12T03:48:33Z finish_get_source after 0s
  2019-09-12T03:48:33Z write_go_src_tar 
  2019-09-12T03:50:00Z finish_write_go_src_tar after 1m27.4s
  2019-09-12T03:50:00Z make_and_test 
  2019-09-12T03:50:00Z make src/make.bash
  2019-09-12T03:56:10Z finish_make after 6m10.2s; src/make.bash
  2019-09-12T03:56:10Z clean_for_snapshot 
  2019-09-12T03:56:11Z finish_clean_for_snapshot after 178.3ms
  2019-09-12T03:56:11Z write_snapshot_to_gcs 
  2019-09-12T03:56:11Z fetch_snapshot_reader_from_buildlet 
  2019-09-12T03:56:11Z finish_fetch_snapshot_reader_from_buildlet after 345.6ms
  2019-09-12T03:57:56Z finish_write_snapshot_to_gcs after 1m45.7s; err=local error: tls: bad record MAC
  2019-09-12T03:57:56Z finish_make_and_test after 7m56.3s; err=writeSnapshot: local error: tls: bad record MAC

Build log:
linux-mips64le-mengzhuo at e7e2b1c2b91320ef0ddf025d330061d56115dd53

:: Running /tmp/workdir-host-linux-mipsle-mengzhuo/go/src/make.bash with args ["/tmp/workdir-host-linux-mipsle-mengzhuo/go/src/make.bash"] and env ["LANG=en_US.UTF-8" "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin" "GO_BUILDER_ENV=host-linux-mipsle-mengzhuo" "WORKDIR=/tmp/workdir-host-linux-mipsle-mengzhuo" "HTTPS_PROXY=http://127.0.0.1:8123" "HTTP_PROXY=http://127.0.0.1:8123" "USER=root" "HOME=/root" "GO_STAGE0_NET_DELAY=800ms" "GO_STAGE0_DL_DELAY=300ms" "GOROOT_BOOTSTRAP=/tmp/workdir-host-linux-mipsle-mengzhuo/go1.4" "GO_BUILDER_NAME=linux-mips64le-mengzhuo" "GO_BUILDER_FLAKY_NET=1" "GOROOT_BOOTSTRAP=/usr/lib/golang" "GOMIPS64=hardfloat" "GOBIN=" "TMPDIR=/tmp/workdir-host-linux-mipsle-mengzhuo/tmp" "GOCACHE=/tmp/workdir-host-linux-mipsle-mengzhuo/gocache"] in dir /tmp/workdir-host-linux-mipsle-mengzhuo/go/src

Building Go cmd/dist using /usr/lib/golang.
Building Go toolchain1 using /usr/lib/golang.
Building Go bootstrap cmd/go (go_bootstrap) using Go toolchain1.
Building Go toolchain2 using go_bootstrap and Go toolchain1.
Building Go toolchain3 using go_bootstrap and Go toolchain2.
Building packages and commands for linux/mips64le.
---
Installed Go for linux/mips64le in /tmp/workdir-host-linux-mipsle-mengzhuo/go
Installed commands in /tmp/workdir-host-linux-mipsle-mengzhuo/go/bin


Error: writeSnapshot: local error: tls: bad record MAC
@jeanbza
Contributor

jeanbza commented Sep 14, 2019

Sorry, just trying to better understand what you're seeing here: are you using cloud.google.com/go/storage? If so, could you provide a small sample of your code, the error you're seeing, and which operation in your code causes the error?

@jeanbza added the api: storage, needs more info, and status: investigating labels on Sep 14, 2019
@jeanbza
Contributor

jeanbza commented Sep 14, 2019

cc @odeke-em for http / tls expertise.

@mengzhuo
Author

@jadekler it's the Go team using GCS, not me, but the code is here:

https://github.com/golang/build/blob/01fd29966998a0a3ecd5d721de6bcde3ea9b9a6f/cmd/coordinator/coordinator.go#L2161

@odeke-em
Contributor

odeke-em commented Sep 16, 2019

Thank you for filing this issue @mengzhuo and welcome to the Google Cloud Go project!
Thank you for the ping @jadekler!

@mengzhuo, how often can you reproduce this problem? Is it just a flake or does it happen consistently? I ask because that error indicates an integrity failure at the TLS record layer: the MAC on a record received from the server doesn't match what the client computed.

Google's servers appear to negotiate TLS 1.2, as advertised in https://cloud.google.com/load-balancing/docs/ssl-policies-concepts and as you can see by examining the connection when we do a server hello to googleapis.com:443; https://cloud.google.com/security/encryption-in-transit/#footnote4 also notes that TLS 1.3 is only being experimentally rolled out, just for Gmail. Go's crypto/tls, meanwhile, now seems to default to TLS 1.3.

Did this error only start happening recently? If you are using Go 1.13 directly, rather than a tip build that already includes @FiloSottile's "TLS 1.3 always on" commit (golang/go@a8c2e5c#diff-0977a3b43ae53839ce25d758f24138f2),
you can try invoking GODEBUG=tls13=0 go run main.go
with this repro on your builder to investigate what's going wrong and whether it is just a mal-adjustment in Go's new TLS 1.3 support.

package main

import (
	"crypto/tls"
	"log"
)

func main() {
	conn, err := tls.Dial("tcp", "googleapis.com:443", &tls.Config{
		// TODO: please properly load your TLS client configuration here.
		// Just using InsecureSkipVerify for easy testing.
		InsecureSkipVerify: true,
	})
	if err != nil {
		log.Fatalf("Failed to make request: %v", err)
	}
	defer conn.Close()

	// Write something to the connection and surface any TLS error.
	if _, err := conn.Write([]byte("POST / HTTP/1.1\r\nPath: /upload/\r\n\r\n")); err != nil {
		log.Fatalf("Failed to write to connection: %v", err)
	}
}
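
To confirm which protocol version actually gets negotiated with googleapis.com, you can also print the connection state right after the dial, something like this (a small variation on the repro above, for illustration only):

package main

import (
	"crypto/tls"
	"log"
)

func main() {
	conn, err := tls.Dial("tcp", "googleapis.com:443", &tls.Config{
		// Just using InsecureSkipVerify for easy testing.
		InsecureSkipVerify: true,
	})
	if err != nil {
		log.Fatalf("Failed to dial: %v", err)
	}
	defer conn.Close()

	// Report the negotiated protocol version and cipher suite.
	state := conn.ConnectionState()
	switch state.Version {
	case tls.VersionTLS13:
		log.Println("negotiated TLS 1.3")
	case tls.VersionTLS12:
		log.Println("negotiated TLS 1.2")
	default:
		log.Printf("negotiated TLS version: %#x", state.Version)
	}
	log.Printf("cipher suite: %#x", state.CipherSuite)
}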

If this issue can be consistently reproduced, please try explicitly capping the client's TLS version at 1.2 with MaxVersion: tls.VersionTLS12; here is the full example:

package main

import (
	"crypto/tls"
	"log"
)

func main() {
	conn, err := tls.Dial("tcp", "googleapis.com:443", &tls.Config{
		// TODO: please properly load your TLS client configuration here.
		// Just using InsecureSkipVerify for easy testing.
		InsecureSkipVerify: true,
		MaxVersion:         tls.VersionTLS12,
	})
	if err != nil {
		log.Fatalf("Failed to make request: %v", err)
	}
	defer conn.Close()

	// Write something to the connection and surface any TLS error.
	if _, err := conn.Write([]byte("POST / HTTP/1.1\r\nPath: /upload/\r\n\r\n")); err != nil {
		log.Fatalf("Failed to write to connection: %v", err)
	}
}

If that makes a difference, we'll know this is an issue with Go's TLS 1.3 implementation (or its interaction with Google's servers).

I'll also kindly ping @agl since this involves an interaction with Google's TLS.

@odeke-em
Contributor

@mengzhuo actually, since this involves running a lot of code on your builder, a quicker way to verify whether the problem is the interaction between Go's TLS 1.3 and Google's TLS 1.2 is to apply this patch to your builder code in cmd/coordinator/gce.go:

diff --git a/cmd/coordinator/gce.go b/cmd/coordinator/gce.go
index e4e702d..809c647 100644
--- a/cmd/coordinator/gce.go
+++ b/cmd/coordinator/gce.go
@@ -12,6 +12,7 @@ package main
 
 import (
 	"context"
+	"crypto/tls"
 	"encoding/json"
 	"errors"
 	"fmt"
@@ -44,6 +45,7 @@ import (
 	"golang.org/x/oauth2/google"
 	compute "google.golang.org/api/compute/v1"
 	"google.golang.org/api/googleapi"
+	"google.golang.org/api/option"
 )
 
 func init() {
@@ -137,20 +139,34 @@ func initGCE() error {
 	cfgDump, _ := json.MarshalIndent(buildEnv, "", "  ")
 	log.Printf("Loaded configuration %q for project %q:\n%s", *buildEnvName, buildEnv.ProjectName, cfgDump)
 
+	opts := []option.ClientOption{
+		// Force TLS 1.2 in the HTTP client because of issues:
+		// * https://github.com/golang/go/issues/31217
+		// * https://github.com/googleapis/google-cloud-go/issues/1581
+		// in which there might be a bad interaction with Go's TLS v1.3 and Google's TLS v1.2.
+		option.WithHTTPClient(&http.Client{
+			Transport: &http.Transport{
+				TLSClientConfig: &tls.Config{
+					MaxVersion: tls.VersionTLS12,
+				},
+			},
+		}),
+	}
+
 	ctx := context.Background()
 	if *mode != "dev" {
-		storageClient, err = storage.NewClient(ctx)
+		storageClient, err = storage.NewClient(ctx, opts...)
 		if err != nil {
 			log.Fatalf("storage.NewClient: %v", err)
 		}
 
-		metricsClient, err = monapi.NewMetricClient(ctx)
+		metricsClient, err = monapi.NewMetricClient(ctx, opts...)
 		if err != nil {
 			log.Fatalf("monapi.NewMetricClient: %v", err)
 		}
 	}
 
-	dsClient, err = datastore.NewClient(ctx, buildEnv.ProjectName)
+	dsClient, err = datastore.NewClient(ctx, buildEnv.ProjectName, opts...)
 	if err != nil {
 		if *mode == "dev" {
 			log.Printf("Error creating datastore client: %v", err)

or, for the full file (which is big), see below:

// Copyright 2015 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.

// +build go1.13
// +build linux darwin

// Code interacting with Google Compute Engine (GCE) and
// a GCE implementation of the BuildletPool interface.

package main

import (
	"context"
	"crypto/tls"
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"io/ioutil"
	"log"
	"net/http"
	"os"
	"path"
	"sort"
	"strconv"
	"strings"
	"sync"
	"time"

	"cloud.google.com/go/compute/metadata"
	"cloud.google.com/go/datastore"
	"cloud.google.com/go/errorreporting"
	monapi "cloud.google.com/go/monitoring/apiv3"
	"cloud.google.com/go/storage"
	"golang.org/x/build/buildenv"
	"golang.org/x/build/buildlet"
	"golang.org/x/build/cmd/coordinator/spanlog"
	"golang.org/x/build/dashboard"
	"golang.org/x/build/gerrit"
	"golang.org/x/build/internal/buildgo"
	"golang.org/x/build/internal/buildstats"
	"golang.org/x/build/internal/lru"
	"golang.org/x/oauth2"
	"golang.org/x/oauth2/google"
	compute "google.golang.org/api/compute/v1"
	"google.golang.org/api/googleapi"
	"google.golang.org/api/option"
)

func init() {
	buildlet.GCEGate = gceAPIGate
}

// apiCallTicker ticks regularly, preventing us from accidentally making
// GCE API calls too quickly. Our quota is 20 QPS, but we temporarily
// limit ourselves to less than that.
var apiCallTicker = time.NewTicker(time.Second / 10)

func gceAPIGate() {
	<-apiCallTicker.C
}

// Initialized by initGCE:
var (
	buildEnv *buildenv.Environment

	dsClient       *datastore.Client
	computeService *compute.Service
	gcpCreds       *google.Credentials
	errTryDeps     error // non-nil if try bots are disabled
	gerritClient   *gerrit.Client
	storageClient  *storage.Client
	metricsClient  *monapi.MetricClient
	inStaging      bool                   // are we running in the staging project? (named -dev)
	errorsClient   *errorreporting.Client // Stackdriver errors client
	gkeNodeIP      string

	initGCECalled bool
)

// oAuthHTTPClient is the OAuth2 HTTP client used to make API calls to Google Cloud APIs.
// It is initialized by initGCE.
var oAuthHTTPClient *http.Client

func initGCE() error {
	initGCECalled = true
	var err error

	// If the coordinator is running on a GCE instance and a
	// buildEnv was not specified with the env flag, set the
	// buildEnvName to the project ID
	if *buildEnvName == "" {
		if *mode == "dev" {
			*buildEnvName = "dev"
		} else if metadata.OnGCE() {
			*buildEnvName, err = metadata.ProjectID()
			if err != nil {
				log.Fatalf("metadata.ProjectID: %v", err)
			}
		}
	}

	buildEnv = buildenv.ByProjectID(*buildEnvName)
	inStaging = (buildEnv == buildenv.Staging)

	// If running on GCE, override the zone and static IP, and check service account permissions.
	if metadata.OnGCE() {
		projectZone, err := metadata.Get("instance/zone")
		if err != nil || projectZone == "" {
			return fmt.Errorf("failed to get current GCE zone: %v", err)
		}

		gkeNodeIP, err = metadata.Get("instance/network-interfaces/0/ip")
		if err != nil {
			return fmt.Errorf("failed to get current instance IP: %v", err)
		}

		// Convert the zone from "projects/1234/zones/us-central1-a" to "us-central1-a".
		projectZone = path.Base(projectZone)
		buildEnv.Zone = projectZone

		if buildEnv.StaticIP == "" {
			buildEnv.StaticIP, err = metadata.ExternalIP()
			if err != nil {
				return fmt.Errorf("ExternalIP: %v", err)
			}
		}

		if !hasComputeScope() {
			return errors.New("coordinator is not running with access to read and write Compute resources. VM support disabled")
		}

		if value, err := metadata.ProjectAttributeValue("farmer-run-bench"); err == nil {
			*shouldRunBench, _ = strconv.ParseBool(value)
		}
	}

	cfgDump, _ := json.MarshalIndent(buildEnv, "", "  ")
	log.Printf("Loaded configuration %q for project %q:\n%s", *buildEnvName, buildEnv.ProjectName, cfgDump)

	opts := []option.ClientOption{
		// Force TLS 1.2 in the HTTP client because of issues:
		// * https://github.com/golang/go/issues/31217
		// * https://github.com/googleapis/google-cloud-go/issues/1581
		// in which there might be a bad interaction with Go's TLS v1.3 and Google's TLS v1.2.
		option.WithHTTPClient(&http.Client{
			Transport: &http.Transport{
				TLSClientConfig: &tls.Config{
					MaxVersion: tls.VersionTLS12,
				},
			},
		}),
	}

	ctx := context.Background()
	if *mode != "dev" {
		storageClient, err = storage.NewClient(ctx, opts...)
		if err != nil {
			log.Fatalf("storage.NewClient: %v", err)
		}

		metricsClient, err = monapi.NewMetricClient(ctx, opts...)
		if err != nil {
			log.Fatalf("monapi.NewMetricClient: %v", err)
		}
	}

	dsClient, err = datastore.NewClient(ctx, buildEnv.ProjectName, opts...)
	if err != nil {
		if *mode == "dev" {
			log.Printf("Error creating datastore client: %v", err)
		} else {
			log.Fatalf("Error creating datastore client: %v", err)
		}
	}

	// don't send dev errors to Stackdriver.
	if *mode != "dev" {
		errorsClient, err = errorreporting.NewClient(ctx, buildEnv.ProjectName, errorreporting.Config{
			ServiceName: "coordinator",
		})
		if err != nil {
			// don't exit, we still want to run coordinator
			log.Printf("Error creating errors client: %v", err)
		}
	}

	gcpCreds, err = buildEnv.Credentials(ctx)
	if err != nil {
		if *mode == "dev" {
			// don't try to do anything else with GCE, as it will likely fail
			return nil
		}
		log.Fatalf("failed to get a token source: %v", err)
	}
	oAuthHTTPClient = oauth2.NewClient(ctx, gcpCreds.TokenSource)
	computeService, _ = compute.New(oAuthHTTPClient)
	errTryDeps = checkTryBuildDeps()
	if errTryDeps != nil {
		log.Printf("TryBot builders disabled due to error: %v", errTryDeps)
	} else {
		log.Printf("TryBot builders enabled.")
	}

	if *mode != "dev" {
		go syncBuildStatsLoop(buildEnv)
	}

	go gcePool.pollQuotaLoop()
	go createBasepinDisks(context.Background())
	return nil
}

func checkTryBuildDeps() error {
	if !hasStorageScope() {
		return errors.New("coordinator's GCE instance lacks the storage service scope")
	}
	if *mode == "dev" {
		return errors.New("running in dev mode")
	}
	wr := storageClient.Bucket(buildEnv.LogBucket).Object("hello.txt").NewWriter(context.Background())
	fmt.Fprintf(wr, "Hello, world! Coordinator start-up at %v", time.Now())
	if err := wr.Close(); err != nil {
		return fmt.Errorf("test write of a GCS object to bucket %q failed: %v", buildEnv.LogBucket, err)
	}
	if inStaging {
		// Don't expect to write to Gerrit in staging mode.
		gerritClient = gerrit.NewClient("https://go-review.googlesource.com", gerrit.NoAuth)
	} else {
		gobotPass, err := metadata.ProjectAttributeValue("gobot-password")
		if err != nil {
			return fmt.Errorf("failed to get project metadata 'gobot-password': %v", err)
		}
		gerritClient = gerrit.NewClient("https://go-review.googlesource.com",
			gerrit.BasicAuth("git-gobot.golang.org", strings.TrimSpace(string(gobotPass))))
	}

	return nil
}

var gcePool = &gceBuildletPool{}

var _ BuildletPool = (*gceBuildletPool)(nil)

// maxInstances is a temporary hack because we can't get buildlets to boot
// without IPs, and we only have 200 IP addresses.
// TODO(bradfitz): remove this once fixed.
const maxInstances = 190

type gceBuildletPool struct {
	mu sync.Mutex // guards all following

	disabled bool

	// CPU quota usage & limits.
	cpuLeft   int // dead-reckoning CPUs remain
	instLeft  int // dead-reckoning instances remain
	instUsage int
	cpuUsage  int
	addrUsage int
	inst      map[string]time.Time // GCE VM instance name -> creationTime
}

func (p *gceBuildletPool) pollQuotaLoop() {
	if computeService == nil {
		log.Printf("pollQuotaLoop: no GCE access; not checking quota.")
		return
	}
	if buildEnv.ProjectName == "" {
		log.Printf("pollQuotaLoop: no GCE project name configured; not checking quota.")
		return
	}
	for {
		p.pollQuota()
		time.Sleep(5 * time.Second)
	}
}

func (p *gceBuildletPool) pollQuota() {
	gceAPIGate()
	reg, err := computeService.Regions.Get(buildEnv.ProjectName, buildEnv.Region()).Do()
	if err != nil {
		log.Printf("Failed to get quota for %s/%s: %v", buildEnv.ProjectName, buildEnv.Region(), err)
		return
	}
	p.mu.Lock()
	defer p.mu.Unlock()
	for _, quota := range reg.Quotas {
		switch quota.Metric {
		case "CPUS":
			p.cpuLeft = int(quota.Limit) - int(quota.Usage)
			p.cpuUsage = int(quota.Usage)
		case "INSTANCES":
			p.instLeft = int(quota.Limit) - int(quota.Usage)
			p.instUsage = int(quota.Usage)
		case "IN_USE_ADDRESSES":
			p.addrUsage = int(quota.Usage)
		}
	}
}

func (p *gceBuildletPool) SetEnabled(enabled bool) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.disabled = !enabled
}

func (p *gceBuildletPool) GetBuildlet(ctx context.Context, hostType string, lg logger) (bc *buildlet.Client, err error) {
	hconf, ok := dashboard.Hosts[hostType]
	if !ok {
		return nil, fmt.Errorf("gcepool: unknown host type %q", hostType)
	}
	qsp := lg.CreateSpan("awaiting_gce_quota")
	err = p.awaitVMCountQuota(ctx, hconf.GCENumCPU())
	qsp.Done(err)
	if err != nil {
		return nil, err
	}

	deleteIn, ok := ctx.Value(buildletTimeoutOpt{}).(time.Duration)
	if !ok {
		deleteIn = vmDeleteTimeout
	}

	instName := "buildlet-" + strings.TrimPrefix(hostType, "host-") + "-rn" + randHex(7)
	instName = strings.Replace(instName, "_", "-", -1) // Issue 22905; can't use underscores in GCE VMs
	p.setInstanceUsed(instName, true)

	gceBuildletSpan := lg.CreateSpan("create_gce_buildlet", instName)
	defer func() { gceBuildletSpan.Done(err) }()

	var (
		needDelete   bool
		createSpan   = lg.CreateSpan("create_gce_instance", instName)
		waitBuildlet spanlog.Span // made after create is done
		curSpan      = createSpan // either instSpan or waitBuildlet
	)

	log.Printf("Creating GCE VM %q for %s", instName, hostType)
	bc, err = buildlet.StartNewVM(gcpCreds, buildEnv, instName, hostType, buildlet.VMOpts{
		DeleteIn: deleteIn,
		OnInstanceRequested: func() {
			log.Printf("GCE VM %q now booting", instName)
		},
		OnInstanceCreated: func() {
			needDelete = true

			createSpan.Done(nil)
			waitBuildlet = lg.CreateSpan("wait_buildlet_start", instName)
			curSpan = waitBuildlet
		},
		OnGotInstanceInfo: func() {
			lg.LogEventTime("got_instance_info", "waiting_for_buildlet...")
		},
	})
	if err != nil {
		curSpan.Done(err)
		log.Printf("Failed to create VM for %s: %v", hostType, err)
		if needDelete {
			deleteVM(buildEnv.Zone, instName)
			p.putVMCountQuota(hconf.GCENumCPU())
		}
		p.setInstanceUsed(instName, false)
		return nil, err
	}
	waitBuildlet.Done(nil)
	bc.SetDescription("GCE VM: " + instName)
	bc.SetOnHeartbeatFailure(func() {
		p.putBuildlet(bc, hostType, instName)
	})
	return bc, nil
}

func (p *gceBuildletPool) putBuildlet(bc *buildlet.Client, hostType, instName string) error {
	// TODO(bradfitz): add the buildlet to a freelist (of max N
	// items) for up to 10 minutes since when it got started if
	// it's never seen a command execution failure, and we can
	// wipe all its disk content? (perhaps wipe its disk content
	// when it's retrieved, like the reverse buildlet pool) But
	// this will require re-introducing a distinction in the
	// buildlet client library between Close, Destroy/Halt, and
	// tracking execution errors.  That was all half-baked before
	// and thus removed. Now Close always destroys everything.
	deleteVM(buildEnv.Zone, instName)
	p.setInstanceUsed(instName, false)

	hconf, ok := dashboard.Hosts[hostType]
	if !ok {
		panic("failed to lookup conf") // should've worked if we did it before
	}
	p.putVMCountQuota(hconf.GCENumCPU())
	return nil
}

func (p *gceBuildletPool) WriteHTMLStatus(w io.Writer) {
	fmt.Fprintf(w, "<b>GCE pool</b> capacity: %s", p.capacityString())
	const show = 6 // must be even
	active := p.instancesActive()
	if len(active) > 0 {
		fmt.Fprintf(w, "<ul>")
		for i, inst := range active {
			if i < show/2 || i >= len(active)-(show/2) {
				fmt.Fprintf(w, "<li>%v, %s</li>\n", inst.name, friendlyDuration(time.Since(inst.creation)))
			} else if i == show/2 {
				fmt.Fprintf(w, "<li>... %d of %d total omitted ...</li>\n", len(active)-show, len(active))
			}
		}
		fmt.Fprintf(w, "</ul>")
	}
}

func (p *gceBuildletPool) String() string {
	return fmt.Sprintf("GCE pool capacity: %s", p.capacityString())
}

func (p *gceBuildletPool) capacityString() string {
	p.mu.Lock()
	defer p.mu.Unlock()
	return fmt.Sprintf("%d/%d instances; %d/%d CPUs",
		len(p.inst), p.instUsage+p.instLeft,
		p.cpuUsage, p.cpuUsage+p.cpuLeft)
}

// awaitVMCountQuota waits for numCPU CPUs of quota to become available,
// or returns ctx.Err.
func (p *gceBuildletPool) awaitVMCountQuota(ctx context.Context, numCPU int) error {
	// Poll every 2 seconds, which could be better, but works and
	// is simple.
	for {
		if p.tryAllocateQuota(numCPU) {
			return nil
		}
		select {
		case <-time.After(2 * time.Second):
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

func (p *gceBuildletPool) HasCapacity(hostType string) bool {
	hconf, ok := dashboard.Hosts[hostType]
	if !ok {
		return false
	}
	numCPU := hconf.GCENumCPU()
	p.mu.Lock()
	defer p.mu.Unlock()
	return p.haveQuotaLocked(numCPU)
}

// haveQuotaLocked reports whether the current GCE quota permits
// starting numCPU more CPUs.
//
// precondition: p.mu must be held.
func (p *gceBuildletPool) haveQuotaLocked(numCPU int) bool {
	return p.cpuLeft >= numCPU && p.instLeft >= 1 && len(p.inst) < maxInstances && p.addrUsage < maxInstances
}

func (p *gceBuildletPool) tryAllocateQuota(numCPU int) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.disabled {
		return false
	}
	if p.haveQuotaLocked(numCPU) {
		p.cpuUsage += numCPU
		p.cpuLeft -= numCPU
		p.instLeft--
		p.addrUsage++
		return true
	}
	return false
}

// putVMCountQuota adjusts the dead-reckoning of our quota usage by
// one instance and cpu CPUs.
func (p *gceBuildletPool) putVMCountQuota(cpu int) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.cpuUsage -= cpu
	p.cpuLeft += cpu
	p.instLeft++
}

func (p *gceBuildletPool) setInstanceUsed(instName string, used bool) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.inst == nil {
		p.inst = make(map[string]time.Time)
	}
	if used {
		p.inst[instName] = time.Now()
	} else {
		delete(p.inst, instName)
	}
}

func (p *gceBuildletPool) instanceUsed(instName string) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	_, ok := p.inst[instName]
	return ok
}

func (p *gceBuildletPool) instancesActive() (ret []resourceTime) {
	p.mu.Lock()
	defer p.mu.Unlock()
	for name, create := range p.inst {
		ret = append(ret, resourceTime{
			name:     name,
			creation: create,
		})
	}
	sort.Sort(byCreationTime(ret))
	return ret
}

// resourceTime is a GCE instance or Kube pod name and its creation time.
type resourceTime struct {
	name     string
	creation time.Time
}

type byCreationTime []resourceTime

func (s byCreationTime) Len() int           { return len(s) }
func (s byCreationTime) Less(i, j int) bool { return s[i].creation.Before(s[j].creation) }
func (s byCreationTime) Swap(i, j int)      { s[i], s[j] = s[j], s[i] }

// cleanUpOldVMs loops forever and periodically enumerates virtual
// machines and deletes those which have expired.
//
// A VM is considered expired if it has a "delete-at" metadata
// attribute having a unix timestamp before the current time.
//
// This is the safety mechanism to delete VMs which stray from the
// normal deleting process. VMs are created to run a single build and
// should be shut down by a controlling process. Due to various types
// of failures, they might get stranded. To prevent them from getting
// stranded and wasting resources forever, we instead set the
// "delete-at" metadata attribute on them when created to some time
// that's well beyond their expected lifetime.
func (p *gceBuildletPool) cleanUpOldVMs() {
	if *mode == "dev" {
		return
	}
	if computeService == nil {
		return
	}

	// TODO(bradfitz): remove this list and just query it from the compute API?
	// http://godoc.org/google.golang.org/api/compute/v1#RegionsService.Get
	// and Region.Zones: http://godoc.org/google.golang.org/api/compute/v1#Region

	for {
		for _, zone := range buildEnv.ZonesToClean {
			if err := p.cleanZoneVMs(zone); err != nil {
				log.Printf("Error cleaning VMs in zone %q: %v", zone, err)
			}
		}
		time.Sleep(time.Minute)
	}
}

// cleanZoneVMs is part of cleanUpOldVMs, operating on a single zone.
func (p *gceBuildletPool) cleanZoneVMs(zone string) error {
	// Fetch the first 500 (default) running instances and clean
	// those. We expect that we'll be running many fewer than
	// that. Even if we have more, eventually the first 500 will
	// either end or be cleaned, and then the next call will get a
	// partially-different 500.
	// TODO(bradfitz): revisit this code if we ever start running
	// thousands of VMs.
	gceAPIGate()
	list, err := computeService.Instances.List(buildEnv.ProjectName, zone).Do()
	if err != nil {
		return fmt.Errorf("listing instances: %v", err)
	}
	for _, inst := range list.Items {
		if inst.Metadata == nil {
			// Defensive. Not seen in practice.
			continue
		}
		var sawDeleteAt bool
		var deleteReason string
		for _, it := range inst.Metadata.Items {
			if it.Key == "delete-at" {
				if it.Value == nil {
					log.Printf("missing delete-at value; ignoring")
					continue
				}
				unixDeadline, err := strconv.ParseInt(*it.Value, 10, 64)
				if err != nil {
					log.Printf("invalid delete-at value %q seen; ignoring", *it.Value)
					continue
				}
				sawDeleteAt = true
				if time.Now().Unix() > unixDeadline {
					deleteReason = "delete-at expiration"
				}
			}
		}
		isBuildlet := strings.HasPrefix(inst.Name, "buildlet-")

		if isBuildlet && !sawDeleteAt && !p.instanceUsed(inst.Name) {
			createdAt, _ := time.Parse(time.RFC3339Nano, inst.CreationTimestamp)
			if createdAt.Before(time.Now().Add(-3 * time.Hour)) {
				deleteReason = fmt.Sprintf("no delete-at, created at %s", inst.CreationTimestamp)
			}
		}

		// Delete buildlets (things we made) from previous
		// generations. Only deleting things starting with "buildlet-"
		// is a historical restriction, but still fine for paranoia.
		if deleteReason == "" && sawDeleteAt && isBuildlet && !p.instanceUsed(inst.Name) {
			if _, ok := deletedVMCache.Get(inst.Name); !ok {
				deleteReason = "from earlier coordinator generation"
			}
		}

		if deleteReason != "" {
			log.Printf("deleting VM %q in zone %q; %s ...", inst.Name, zone, deleteReason)
			deleteVM(zone, inst.Name)
		}

	}
	return nil
}

var deletedVMCache = lru.New(100) // keyed by instName

// deleteVM starts a delete of an instance in a given zone.
//
// It either returns an operation name (if delete is pending) or the
// empty string if the instance didn't exist.
func deleteVM(zone, instName string) (operation string, err error) {
	deletedVMCache.Add(instName, token{})
	gceAPIGate()
	op, err := computeService.Instances.Delete(buildEnv.ProjectName, zone, instName).Do()
	apiErr, ok := err.(*googleapi.Error)
	if ok {
		if apiErr.Code == 404 {
			return "", nil
		}
	}
	if err != nil {
		log.Printf("Failed to delete instance %q in zone %q: %v", instName, zone, err)
		return "", err
	}
	log.Printf("Sent request to delete instance %q in zone %q. Operation ID, Name: %v, %v", instName, zone, op.Id, op.Name)
	return op.Name, nil
}

func hasScope(want string) bool {
	// If not on GCE, assume full access
	if !metadata.OnGCE() {
		return true
	}
	scopes, err := metadata.Scopes("default")
	if err != nil {
		log.Printf("failed to query metadata default scopes: %v", err)
		return false
	}
	for _, v := range scopes {
		if v == want {
			return true
		}
	}
	return false
}

func hasComputeScope() bool {
	return hasScope(compute.ComputeScope) || hasScope(compute.CloudPlatformScope)
}

func hasStorageScope() bool {
	return hasScope(storage.ScopeReadWrite) || hasScope(storage.ScopeFullControl) || hasScope(compute.CloudPlatformScope)
}

func readGCSFile(name string) ([]byte, error) {
	if *mode == "dev" {
		b, ok := testFiles[name]
		if !ok {
			return nil, &os.PathError{
				Op:   "open",
				Path: name,
				Err:  os.ErrNotExist,
			}
		}
		return []byte(b), nil
	}

	r, err := storageClient.Bucket(buildEnv.BuildletBucket).Object(name).NewReader(context.Background())
	if err != nil {
		return nil, err
	}
	defer r.Close()
	return ioutil.ReadAll(r)
}

// syncBuildStatsLoop runs forever in its own goroutine and syncs the
// coordinator's datastore Build & Span entities to BigQuery
// periodically.
func syncBuildStatsLoop(env *buildenv.Environment) {
	ticker := time.NewTicker(5 * time.Minute)
	for {
		ctx, cancel := context.WithTimeout(context.Background(), 10*time.Minute)
		if err := buildstats.SyncBuilds(ctx, env); err != nil {
			log.Printf("buildstats: SyncBuilds: %v", err)
		}
		if err := buildstats.SyncSpans(ctx, env); err != nil {
			log.Printf("buildstats: SyncSpans: %v", err)
		}
		cancel()
		<-ticker.C
	}
}

// createBasepinDisks creates zone-local copies of VM disk images, to
// speed up VM creations in the future.
//
// Other than a list call, this is a no-op unless new VM images were
// added or updated recently.
func createBasepinDisks(ctx context.Context) {
	if !metadata.OnGCE() || (buildEnv != buildenv.Production && buildEnv != buildenv.Staging) {
		return
	}
	for {
		t0 := time.Now()
		bgc, err := buildgo.NewClient(ctx, buildEnv)
		if err != nil {
			log.Printf("basepin: NewClient: %v", err)
			return
		}
		log.Printf("basepin: creating basepin disks...")
		err = bgc.MakeBasepinDisks(ctx)
		d := time.Since(t0).Round(time.Second / 10)
		if err != nil {
			basePinErr.Store(err.Error())
			log.Printf("basepin: error creating basepin disks, after %v: %v", d, err)
			time.Sleep(5 * time.Minute)
			continue
		}
		basePinErr.Store("")
		log.Printf("basepin: created basepin disks after %v", d)
		return
	}
}

@FiloSottile
Contributor

bad record MAC is a very internal error, which usually means network corruption of some sort (like a proxy changing the contents of the TCP stream), or a severe issue in the TLS stack.

The most useful thing for debugging the latter would be an issue on golang.org/issues with a PCAP and the result of setting tls.Config.KeyLogWriter during that network capture.
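
For anyone setting that up, KeyLogWriter can be wired into an HTTP client roughly like this (a minimal, debugging-only sketch; the key log file path and the request are just examples, and the logged session secrets are what let Wireshark decrypt the accompanying PCAP):

package main

import (
	"crypto/tls"
	"io/ioutil"
	"log"
	"net/http"
	"os"
)

func main() {
	// Debugging only: TLS session secrets written to this file allow
	// Wireshark (via SSLKEYLOGFILE or the "(Pre)-Master-Secret log
	// filename" setting) to decrypt the captured traffic.
	keyLog, err := os.Create("/tmp/tls-keys.log")
	if err != nil {
		log.Fatalf("create key log: %v", err)
	}
	defer keyLog.Close()

	client := &http.Client{
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{
				KeyLogWriter: keyLog,
			},
		},
	}

	resp, err := client.Get("https://www.googleapis.com/")
	if err != nil {
		log.Fatalf("request: %v", err)
	}
	defer resp.Body.Close()
	body, _ := ioutil.ReadAll(resp.Body)
	log.Printf("status: %s, %d bytes", resp.Status, len(body))
}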

@odeke-em
Contributor

Thank you @FiloSottile! Indeed, as you've chimed in, it seems this issue is related to @mengzhuo's proxy, and they've worked out a resolution as described in golang/go#31217 (comment). I'll close this issue.
