Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

policy-server-asg-syncer stuck with time 1970-01-01T00:00:00Z #168

Closed
kinjelom opened this issue Sep 28, 2022 · 8 comments
Closed

policy-server-asg-syncer stuck with time 1970-01-01T00:00:00Z #168

kinjelom opened this issue Sep 28, 2022 · 8 comments
Assignees

Comments

@kinjelom
Copy link

Issue

policy-server-asg-syncer isn't working as expected in a new cf-deployment

Context

I deployed the latest CF-deployment (manifest_version: v21.11.0) and I found that changes to security groups do not work, restart / restage does not help, only restarting the cf/api/policy-server-asg-syncer job helps.

Steps to Reproduce

  • dynamic ASG is enabled by default
  • set policy-server-asg-syncer log-level to debug
  • set vxlan-policy-agent log-level to debug
  • create the a new ASG
  • wait 60 sec (by default)
  • check if the ASG is working
  • check logs
  • restart policy-server-asg-syncer
  • check if the ASG is working

Expected result

The ASG should be propagated after 60 sec.

Current result

The ASG is propagated after restart policy-server-asg-syncer.
policy-server-asg-syncer logs on sync:

{"timestamp":"2022-09-28T06:33:00.590335429Z","level":"debug","source":"cfnetworking.policy-server-asg-syncer","message":"cfnetworking.policy-server-asg-syncer.get-token","data":{"URL":{"Scheme":"https","Opaque":"","User":null,"Host":"uaa.service.cf.internal:8443","Path":"/oauth/token","RawPath":"","ForceQuery":false,"RawQuery":"","Fragment":"","RawFragment":""}}}
{"timestamp":"2022-09-28T06:33:00.631255310Z","level":"info","source":"cfnetworking.policy-server-asg-syncer","message":"cfnetworking.policy-server-asg-syncer.get-security-groups-last-update","data":{}}
{"timestamp":"2022-09-28T06:33:00.654276982Z","level":"debug","source":"cfnetworking.policy-server-asg-syncer","message":"cfnetworking.policy-server-asg-syncer.cc-internal-json-client.http-do","data":{"body":"{\n  \"last_update\": \"1970-01-01T00:00:00Z\"\n}","session":"3"}}
{"timestamp":"2022-09-28T06:33:00.654827238Z","level":"debug","source":"cfnetworking.policy-server-asg-syncer","message":"cfnetworking.policy-server-asg-syncer.get-security-groups-last-update-response","data":{"last_update":"1970-01-01T00:00:00Z"}}
{"timestamp":"2022-09-28T06:33:00.655255684Z","level":"debug","source":"cfnetworking.policy-server-asg-syncer","message":"cfnetworking.policy-server-asg-syncer.skipping-update","data":{"cc-latest-update-time":"1970-01-01T00:00:00Z","local-latest-update-time":"1970-01-01T00:00:00Z"}}
{"timestamp":"2022-09-28T06:33:00.655657413Z","level":"debug","source":"cfnetworking.policy-server-asg-syncer","message":"cfnetworking.policy-server-asg-syncer.asg-sync-complete","data":{}}

Possible Fix

Suspicious code: https://github.com/cloudfoundry/cf-networking-release/blob/develop/src/code.cloudfoundry.org/policy-server/asg_syncer/asg_syncer.go#L76

if a.lastSyncTime.IsZero() {
	a.Logger.Debug("initializing-lastSyncTime")
	a.lastSyncTime = a.Clock.Now()
}

where:

// IsZero reports whether t represents the zero time instant,
// January 1, year 1, 00:00:00 UTC.
func (t Time) IsZero() bool {
	return t.sec() == 0 && t.nsec() == 0
}

but in the log: "cc-latest-update-time":"1970-01-01T00:00:00Z"

So maybe we should check:

func timeIsInitial(timetamp time.Time) bool {
	return timetamp.IsZero() || timetamp.UnixNano() == 0
}

and then:

func valueHasNotBeenUpdated(ccTimetamp, localTimestamp time.Time) bool {
	return !timeIsInitial(ccTimetamp) && !ccTimetamp.After(localTimestamp)
}
...
func (a *ASGSyncer) Poll() error {
	syncStartTime := a.Clock.Now()
        ...
	if timeIsInitial(a.lastSyncTime) {
		a.Logger.Debug("initializing-lastSyncTime")
		a.lastSyncTime = a.Clock.Now()
	}
        ...
	if valueHasNotBeenUpdated(ccLatestUpdateTime, a.latestUpdateTime) {
		a.Logger.Debug("skipping-update", lager.Data{"cc-latest-update-time": ccLatestUpdateTime, "local-latest-update-time": a.latestUpdateTime})
		return nil
	}
@kinjelom
Copy link
Author

kinjelom commented Sep 28, 2022

a bit closer:

cd /var/vcap/jobs/policy-server-asg-syncer/config/certs/
curl --key cc_internal_client.key --cert cc_internal_client.crt --cacert cc_internal_ca.crt -H "Authorization: Bearer eyJh..." https://cloud-controller-ng.service.cf.internal:9023/internal/v4/asg_latest_update

always returns:

{ "last_update": "1970-01-01T00:00:00Z" }

cloud_controller.asg_timestamps:

describe asg_timestamps;
+--------------+-----------+------+-----+-------------------+-----------------------------+
| Field        | Type      | Null | Key | Default           | Extra                       |
+--------------+-----------+------+-----+-------------------+-----------------------------+
| {:name=>:id} | int(11)   | NO   | PRI | NULL              | auto_increment              |
| last_update  | timestamp | NO   |     | CURRENT_TIMESTAMP | on update CURRENT_TIMESTAMP |
+--------------+-----------+------+-----+-------------------+-----------------------------+
2 rows in set (0.00 sec)

select * from asg_timestamps;
Empty set (0.00 sec)
  • manifest_version: v21.11.0
  • capi-release: 1.139.0

@ameowlia
Copy link
Member

👋 Hi @kinjelom,

Thank you for this detailed report. We will investigate and get back to you.

@ameowlia ameowlia self-assigned this Sep 28, 2022
@kinjelom
Copy link
Author

kinjelom commented Sep 28, 2022

@ameowlia I think there is a mistake in the PK column name, it should probably be name but is {:name=>:id}.

There is the wrong definition:

Sequel.migration do
  change do
    create_table :asg_timestamps do
      primary_key name: :id <--- ????
      Timestamp :last_update
    end
  end
end

@kinjelom
Copy link
Author

PR: #169

@ameowlia
Copy link
Member

ameowlia commented Oct 3, 2022

Hi @kinjelom,

I was not able to reproduce this bug.

Reproduction Environment Setup

  1. I used a vanilla cf-deployment.
  2. I had 2 diego-cells.
  3. I pushed 2 apps. One on each cell.
  4. The environment already had 2 seeded security groups.
    $ cf security-groups
    Getting security groups as admin...
    
    name              organization   space   lifecycle
    public_networks   <all>          <all>   staging
    public_networks   <all>          <all>   running
    dns               <all>          <all>   staging
    dns               <all>          <all>   running
    
  5. I validated that the current ASGs were working by sshing onto the app and curling google.com
  6. I validated that the app couldn't reach private-network ips by sshing onto the app and trying to curl the other instance via it's diego-cell ip and port.
    vcap@e2c044fe-e78e-480b-4327-fe33:~$ curl 10.0.1.14:61000
    curl: (7) Failed to connect to 10.0.1.14 port 61000: Connection refused
    
  7. I validated that there were no entries for asg_timestamps in cloud controller's db.
    mysql> select * from asg_timestamps;
    Empty set (0.01 sec)
    
  8. I turned on debug logging for the policy-server-asg-syncer.
  9. I validated that the policy-server-asg-syncer logs match yours exactly.
    {"timestamp":"2022-10-03T14:30:41.239220227Z","level":"debug","source":"cfnetworking.policy-server-asg-syncer","message":"cfnetworking.policy-server-asg-syncer.asg-sync-started","data":{}}
    {"timestamp":"2022-10-03T14:30:41.239372527Z","level":"debug","source":"cfnetworking.policy-server-asg-syncer","message":"cfnetworking.policy-server-asg-syncer.get-token","data":{"URL":   {"Scheme":"https","Opaque":"","User":null,"Host":"uaa.service.cf.internal:8443","Path":"/oauth/token","RawPath":"","ForceQuery":false,"RawQuery":"","Fragment":"","RawFragment":""}}}
    {"timestamp":"2022-10-03T14:30:41.259341976Z","level":"info","source":"cfnetworking.policy-server-asg-syncer","message":"cfnetworking.policy-server-asg-syncer.get-security-groups-last-update","data":{}}
    {"timestamp":"2022-10-03T14:30:41.283610354Z","level":"debug","source":"cfnetworking.policy-server-asg-syncer","message":"cfnetworking.policy-server-asg-syncer.cc-internal-json-client.http-do","data":{"body":"{\n  \"last_update\": \"1970-01-01T00:00:00Z\"\n}","session":"3"}}
    {"timestamp":"2022-10-03T14:30:41.283884605Z","level":"debug","source":"cfnetworking.policy-server-asg-syncer","message":"cfnetworking.policy-server-asg-syncer.get-security-groups-last-update-response","data":{"last_update":"1970-01-01T00:00:00Z"}}
    {"timestamp":"2022-10-03T14:30:41.283952205Z","level":"debug","source":"cfnetworking.policy-server-asg-syncer","message":"cfnetworking.policy-server-asg-syncer.skipping-update","data":{"cc-latest-update-time":"1970-01-01T00:00:00Z","local-latest-update-time":"1970-01-01T00:00:00Z"}}
    {"timestamp":"2022-10-03T14:30:41.284051874Z","level":"debug","source":"cfnetworking.policy-server-asg-syncer","message":"cfnetworking.policy-server-asg-syncer.asg-sync-complete","data":{}}
    

Reproduction Steps

  1. I created and bound a security group that would let my apps talk to all IPs, including private-network IPs.

    $ cat asg.json
    [
      {
        "protocol": "all",
        "destination": "0.0.0.0-255.255.255.255"
      }
    ]
    
    $ cf create-security-group all-traffic asg.json
    $ cf bind-running-security-group all-traffic
    
  2. I saw the asg-syncer logs sync the new security group

{"timestamp":"2022-10-03T14:38:41.769100251Z","level":"debug","source":"cfnetworking.policy-server-asg-syncer","message":"cfnetworking.policy-server-asg-syncer.get-security-groups-with-page-response","data":{"resources":[{"guid":"fb2ef8aa-614e-49b3-95a7-02e71c4804dc","name":"public_networks","globally_enabled":{"running":true,"staging":true},"rules":[{"protocol":"all","destination":"0.0.0.0-9.255.255.255","ports":"","type":0,"code":0,"description":"","log":false},{"protocol":"all","destination":"11.0.0.0-169.253.255.255","ports":"","type":0,"code":0,"description":"","log":false},{"protocol":"all","destination":"169.255.0.0-172.15.255.255","ports":"","type":0,"code":0,"description":"","log":false},{"protocol":"all","destination":"172.32.0.0-192.167.255.255","ports":"","type":0,"code":0,"description":"","log":false},{"protocol":"all","destination":"192.169.0.0-255.255.255.255","ports":"","type":0,"code":0,"description":"","log":false}],"relationships":{"staging_spaces":{"data":[]},"running_spaces":{"data":[]}}},{"guid":"8e5ab2bf-c878-4e45-955d-6c0328e2b918","name":"dns","globally_enabled":{"running":true,"staging":true},"rules":[{"protocol":"tcp","destination":"0.0.0.0/0","ports":"53","type":0,"code":0,"description":"","log":false},{"protocol":"udp","destination":"0.0.0.0/0","ports":"53","type":0,"code":0,"description":"","log":false}],"relationships":{"staging_spaces":{"data":[]},"running_spaces":{"data":[]}}},{"guid":"2bebdab0-4ddb-4752-b82e-91ee45d435ed","name":"all-traffic","globally_enabled":{"running":true,"staging":false},"rules":[{"protocol":"all","destination":"0.0.0.0-255.255.255.255","ports":"","type":0,"code":0,"description":"","log":false}],"relationships":{"staging_spaces":{"data":[]},"running_spaces":{"data":[]}}}]}}
{"timestamp":"2022-10-03T14:38:41.769173351Z","level":"debug","source":"cfnetworking.policy-server-asg-syncer","message":"cfnetworking.policy-server-asg-syncer.get-security-groups-response","data":{"response":[{"guid":"fb2ef8aa-614e-49b3-95a7-02e71c4804dc","name":"public_networks","globally_enabled":{"running":true,"staging":true},"rules":[{"protocol":"all","destination":"0.0.0.0-9.255.255.255","ports":"","type":0,"code":0,"description":"","log":false},{"protocol":"all","destination":"11.0.0.0-169.253.255.255","ports":"","type":0,"code":0,"description":"","log":false},{"protocol":"all","destination":"169.255.0.0-172.15.255.255","ports":"","type":0,"code":0,"description":"","log":false},{"protocol":"all","destination":"172.32.0.0-192.167.255.255","ports":"","type":0,"code":0,"description":"","log":false},{"protocol":"all","destination":"192.169.0.0-255.255.255.255","ports":"","type":0,"code":0,"description":"","log":false}],"relationships":{"staging_spaces":{"data":[]},"running_spaces":{"data":[]}}},{"guid":"8e5ab2bf-c878-4e45-955d-6c0328e2b918","name":"dns","globally_enabled":{"running":true,"staging":true},"rules":[{"protocol":"tcp","destination":"0.0.0.0/0","ports":"53","type":0,"code":0,"description":"","log":false},{"protocol":"udp","destination":"0.0.0.0/0","ports":"53","type":0,"code":0,"description":"","log":false}],"relationships":{"staging_spaces":{"data":[]},"running_spaces":{"data":[]}}},{"guid":"2bebdab0-4ddb-4752-b82e-91ee45d435ed","name":"all-traffic","globally_enabled":{"running":true,"staging":false},"rules":[{"protocol":"all","destination":"0.0.0.0-255.255.255.255","ports":"","type":0,"code":0,"description":"","log":false}],"relationships":{"staging_spaces":{"data":[]},"running_spaces":{"data":[]}}}]}}
{"timestamp":"2022-10-03T14:38:41.769198741Z","level":"info","source":"cfnetworking.policy-server-asg-syncer","message":"cfnetworking.policy-server-asg-syncer.get-security-groups-last-update","data":{}}
{"timestamp":"2022-10-03T14:38:41.778005341Z","level":"debug","source":"cfnetworking.policy-server-asg-syncer","message":"cfnetworking.policy-server-asg-syncer.cc-internal-json-client.http-do","data":{"body":"{\n  \"last_update\": \"2022-10-03T14:38:21Z\"\n}","session":"3"}}
{"timestamp":"2022-10-03T14:38:41.778248761Z","level":"debug","source":"cfnetworking.policy-server-asg-syncer","message":"cfnetworking.policy-server-asg-syncer.get-security-groups-last-update-response","data":{"last_update":"2022-10-03T14:38:21Z"}}
{"timestamp":"2022-10-03T14:38:41.778414351Z","level":"debug","source":"cfnetworking.policy-server-asg-syncer","message":"cfnetworking.policy-server-asg-syncer.get-security-groups-timestamps-match","data":{"newUpdatedAt":"2022-10-03T14:38:21Z","originalUpdatedAt":"2022-10-03T14:38:21Z"}}
{"timestamp":"2022-10-03T14:38:41.778635941Z","level":"debug","source":"cfnetworking.policy-server-asg-syncer","message":"cfnetworking.policy-server-asg-syncer.get-security-groups-last-page","data":{"page":1}}
{"timestamp":"2022-10-03T14:38:41.778757941Z","level":"debug","source":"cfnetworking.policy-server-asg-syncer","message":"cfnetworking.policy-server-asg-syncer.get-security-groups-retry-loop-succeeded","data":{"security-groups":[{"guid":"fb2ef8aa-614e-49b3-95a7-02e71c4804dc","name":"public_networks","globally_enabled":{"running":true,"staging":true},"rules":[{"protocol":"all","destination":"0.0.0.0-9.255.255.255","ports":"","type":0,"code":0,"description":"","log":false},{"protocol":"all","destination":"11.0.0.0-169.253.255.255","ports":"","type":0,"code":0,"description":"","log":false},{"protocol":"all","destination":"169.255.0.0-172.15.255.255","ports":"","type":0,"code":0,"description":"","log":false},{"protocol":"all","destination":"172.32.0.0-192.167.255.255","ports":"","type":0,"code":0,"description":"","log":false},{"protocol":"all","destination":"192.169.0.0-255.255.255.255","ports":"","type":0,"code":0,"description":"","log":false}],"relationships":{"staging_spaces":{"data":[]},"running_spaces":{"data":[]}}},{"guid":"8e5ab2bf-c878-4e45-955d-6c0328e2b918","name":"dns","globally_enabled":{"running":true,"staging":true},"rules":[{"protocol":"tcp","destination":"0.0.0.0/0","ports":"53","type":0,"code":0,"description":"","log":false},{"protocol":"udp","destination":"0.0.0.0/0","ports":"53","type":0,"code":0,"description":"","log":false}],"relationships":{"staging_spaces":{"data":[]},"running_spaces":{"data":[]}}},{"guid":"2bebdab0-4ddb-4752-b82e-91ee45d435ed","name":"all-traffic","globally_enabled":{"running":true,"staging":false},"rules":[{"protocol":"all","destination":"0.0.0.0-255.255.255.255","ports":"","type":0,"code":0,"description":"","log":false}],"relationships":{"staging_spaces":{"data":[]},"running_spaces":{"data":[]}}}]}}
{"timestamp":"2022-10-03T14:38:41.778949271Z","level":"debug","source":"cfnetworking.policy-server-asg-syncer","message":"cfnetworking.policy-server-asg-syncer.successfully-fetched-security-groups","data":{}}
{"timestamp":"2022-10-03T14:38:41.779450921Z","level":"debug","source":"cfnetworking.policy-server-asg-syncer","message":"cfnetworking.policy-server-asg-syncer.successfully-sent-performance-metrics","data":{}}
{"timestamp":"2022-10-03T14:38:41.779594281Z","level":"debug","source":"cfnetworking.policy-server-asg-syncer","message":"cfnetworking.policy-server-asg-syncer.updating-local-latest-update-time","data":{"new-local-latest-update-time":"2022-10-03T14:38:21Z","old-local-latest-update-time":"1970-01-01T00:00:00Z"}}
{"timestamp":"2022-10-03T14:38:41.779690831Z","level":"debug","source":"cfnetworking.policy-server-asg-syncer","message":"cfnetworking.policy-server-asg-syncer.updating-last-sync-time","data":{"new-last-sync-time":"2022-10-03T14:38:41.779688891Z"}}
{"timestamp":"2022-10-03T14:38:41.790397860Z","level":"debug","source":"cfnetworking.policy-server-asg-syncer","message":"cfnetworking.policy-server-asg-syncer.asg-sync-complete","data":{}}
  1. I validated that the ASG is enforced by seeing that the app could reach private-network ips by sshing onto the app and trying to curl the other instance via it's diego-cell ip and port.

  2. I see that the last_update time has been updated and is no longer "zero".

{"timestamp":"2022-10-03T14:40:41.946485441Z","level":"debug","source":"cfnetworking.policy-server-asg-syncer","message":"cfnetworking.policy-server-asg-syncer.get-security-groups-last-update-response","data":{"last_update":"2022-10-03T14:38:21Z"}}

@ameowlia
Copy link
Member

ameowlia commented Oct 3, 2022

A few thoughts based on your reproduction...

Time zero is expected

It is expected for the initial cc-latest-update-time to be 1970-01-01T00:00:00Z.This is not a bug. This is how the system was architected.

ASGs take 2 minutes to propagate by default

There are two polling mechanisms. Polling between cloud controller and the policy-server. And polling between the policy server and the vxlan-policy agent. By default both of these polling cycles are set to 60s. This means it could take up to 2 minutes to propagate dynamic ASGs by default. Please see docs here for how increase the frequency of these polling cycles.

❓ Questions

  1. If you wait for 2 minutes does the ASG propagate?
  2. If not, what logs are you seeing for the policy-server-asg-syncer after you create a new security group?
  3. Did you bind your security group? I noticed in your reproduction steps you did not mention this step.

@kinjelom
Copy link
Author

kinjelom commented Oct 4, 2022

@ameowlia
What database do you use? I think there is a problem in MySQL/PXC (I have this one), there is no problem (probably) if you use PostgreSQL: cloudfoundry/cloud_controller_ng@7cb7e88

What the table definition do you have describe cloud_controller.asg_timestamps?
Because of cloudfoundry/cloud_controller_ng#2994

  1. If you wait for 2 minutes does the ASG propagate?

no

  1. If not, what logs are you seeing for the policy-server-asg-syncer after you create a new security group?

there is no additional logs, because syncer doesn't detect any changes

  1. Did you bind your security group? I noticed in your reproduction steps you did not mention this step.

I tested it using ASG running for all spaces/orgs.

@kinjelom
Copy link
Author

It is working now, tested on:

- type: replace
  path: /releases/name=capi
  value:
    name: capi
    version: "1.140.0"
    url: https://bosh.io/d/github.com/cloudfoundry/capi-release?v=1.140.0
    sha1: 2d68022c39c6e2aaeb0dd65cda211c9efc1f73b1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants