Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

during config reload, teamd is restarted twice #3822

Closed
stepanblyschak opened this issue Nov 26, 2019 · 11 comments
Closed

during config reload, teamd is restarted twice #3822

stepanblyschak opened this issue Nov 26, 2019 · 11 comments
Assignees
Labels

Comments

@stepanblyschak
Copy link
Collaborator

Description

Teamd service is restarted twice during config reload/minigraph.
This cause intfmgrd to configure IP address when team devices got created, but then teamd restarts again and recreates teams so IP address is lost on interface.
I believe that there is another issue that intfmgrd cannot reconfigure interfaces after they were recreated, which makes teamd restart whithout bringing down swss useless.
If SONiC has to support teamd individual restart, other daemons in swss should be able to restore configuration properly.

It looks that during restart of swss, dhcp_relay starts and brings teamd up, than in wait() swss restarts teamd.

Steps to reproduce the issue:

  1. config reload -y
  2. ip a s PortChannel0001

Describe the results you received:
No IP addresses on Po

Describe the results you expected:
IP address on Po

Additional information you deem important (e.g. issue happens only occasionally):

**Output of `show version`:**

```
  SONiC Software Version: SONiC.HEAD.134-794d4594

Distribution: Debian 9.11
Kernel: 4.9.0-9-2-amd64
Build commit: 794d459
Build date: Tue Nov 26 08:18:45 UTC 2019
Built by: johnar@jenkins-worker-4

Platform: x86_64-mlnx_msn2700-r0
HwSKU: ACS-MSN2700
ASIC: mellanox
Serial Number: MT1822K07815
Uptime: 14:23:26 up 2:09, 3 users, load average: 3.53, 4.92, 5.77

Docker images:
REPOSITORY TAG IMAGE ID SIZE
docker-syncd-mlnx HEAD.134-794d4594 4256768182aa 373MB
docker-syncd-mlnx latest 4256768182aa 373MB
docker-fpm-frr HEAD.134-794d4594 2f9c5c2f1898 321MB
docker-fpm-frr latest 2f9c5c2f1898 321MB
docker-sflow HEAD.134-794d4594 2153ecfa328f 305MB
docker-sflow latest 2153ecfa328f 305MB
docker-lldp-sv2 HEAD.134-794d4594 caf9d4c2102e 299MB
docker-lldp-sv2 latest caf9d4c2102e 299MB
docker-dhcp-relay HEAD.134-794d4594 3a0e1f49a73f 289MB
docker-dhcp-relay latest 3a0e1f49a73f 289MB
docker-database HEAD.134-794d4594 a27870d25c87 281MB
docker-database latest a27870d25c87 281MB
docker-snmp-sv2 HEAD.134-794d4594 e0a37b191f29 335MB
docker-snmp-sv2 latest e0a37b191f29 335MB
docker-orchagent HEAD.134-794d4594 53d0f4f69dfb 322MB
docker-orchagent latest 53d0f4f69dfb 322MB
docker-teamd HEAD.134-794d4594 1978c6a4d093 304MB
docker-teamd latest 1978c6a4d093 304MB
docker-sonic-telemetry HEAD.134-794d4594 7fa1c0f6200d 304MB
docker-sonic-telemetry latest 7fa1c0f6200d 304MB
docker-router-advertiser HEAD.134-794d4594 ed1090c14998 281MB
docker-router-advertiser latest ed1090c14998 281MB
docker-platform-monitor HEAD.134-794d4594 11c35c684d4d 565MB
docker-platform-monitor latest 11c35c684d4d 565MB

```

**Attach debug file `sudo generate_dump`:**

```

```
@stepanblyschak
Copy link
Collaborator Author

Still happens with #3823.
Config reload is fixed but reboot is not

@liat-grozovik
Copy link
Collaborator

Note: this issue is reproduced on both master and 201811

@lguohan
Copy link
Collaborator

lguohan commented Dec 1, 2019

do we know the reason why teamd restart twice in config reload?

@jleveque
Copy link
Contributor

jleveque commented Dec 2, 2019

@lguohan: The config reload issue was due to the teamd dependency being moved to swss.sh for purposes of warm-reboot, but dhcp_relay still had an explicit dependency on temd. This was fixed in #3823, confirmed by @stepanblyschak in the comments above. I have not seen this issue during reboot. @yxieca: have you seen this occur during reboot?

@yxieca
Copy link
Contributor

yxieca commented Dec 2, 2019

Joe,

I never noticed teamd starting twice with just reboot. According to our finding, this would leave system in broken state. So I am leaning towards to say it didn't happen or happened seldomly.

Regards,
Ying

@lguohan
Copy link
Collaborator

lguohan commented Dec 2, 2019

@stepanblyschak is correct, at least I can see the teamd docker has been stopped and restarted during the reboot process.

Dec  2 01:24:27.316111 vlab-01 INFO swss#supervisord: neighsyncd Listens to neigh messages...
Dec  2 01:24:27.728910 vlab-01 NOTICE root: Started syncd service...
Dec  2 01:24:27.736344 vlab-01 NOTICE root: Unlocking /tmp/swss-syncd-lock (10) from syncd service
Dec  2 01:24:27.762195 vlab-01 INFO systemd[1]: Started syncd service.
Dec  2 01:24:27.763086 vlab-01 INFO systemd[1]: Starting SNMP container...
Dec  2 01:24:27.763944 vlab-01 INFO systemd[1]: Starting LLDP container...
Dec  2 01:24:27.764688 vlab-01 INFO systemd[1]: Starting Router advertiser container...
Dec  2 01:24:27.765243 vlab-01 INFO systemd[1]: Starting DHCP relay container...
Dec  2 01:24:27.765799 vlab-01 INFO systemd[1]: Starting Telemetry container...
Dec  2 01:24:27.766365 vlab-01 INFO systemd[1]: Starting sFlow container...
Dec  2 01:24:27.773063 vlab-01 INFO systemd[1]: Stopping TEAMD container...

@lguohan
Copy link
Collaborator

lguohan commented Dec 2, 2019

I did two reboots, both show such behavior.

The image is HEAD.168-5b18aa5f , which contains #3823

below is the log, after reboot. you can see teamd docker has been stopped and restarted. I do not know why after syncd started, systemd tried to stop teamd and restart it again.

admin@vlab-01:~$ sudo zgrep -a "4.9.0-9-2-amd64" /var/log/syslog*        
/var/log/syslog.1:Dec  2 02:09:34.479547 vlab-01 NOTICE kernel: [    0.000000] Linux version 4.9.0-9-2-amd64 ([email protected]) (gcc version 6.3.0 20170516 (Debian 6.3.0-18+deb9u1) ) #1 SMP Debian 4.9.168-1+deb9u5 (2015-12-19)



/var/log/syslog.1:Dec  2 02:09:53.282530 vlab-01 INFO systemd[1]: Starting TEAMD container...
/var/log/syslog.1:Dec  2 02:09:57.601923 vlab-01 INFO systemd[1]: Started TEAMD container.
/var/log/syslog.1:Dec  2 02:09:58.916817 vlab-01 INFO systemd[1]: Stopping TEAMD container...
/var/log/syslog.1:Dec  2 02:10:00.227715 vlab-01 INFO systemd[1]: Stopped TEAMD container.
/var/log/syslog.1:Dec  2 02:10:00.228267 vlab-01 INFO systemd[1]: Starting TEAMD container...
/var/log/syslog:Dec  2 02:10:13.235114 vlab-01 INFO systemd[1]: Started TEAMD container.

Dec  2 02:09:58.876498 vlab-01 NOTICE root: Started syncd service...
Dec  2 02:09:58.882230 vlab-01 NOTICE root: Unlocking /tmp/swss-syncd-lock (10) from syncd service
Dec  2 02:09:58.890008 vlab-01 INFO systemd[1]: Started syncd service.
Dec  2 02:09:58.890847 vlab-01 INFO systemd[1]: Starting LLDP container...
Dec  2 02:09:58.891980 vlab-01 INFO systemd[1]: Starting SNMP container...
Dec  2 02:09:58.895357 vlab-01 INFO systemd[1]: Starting DHCP relay container...
Dec  2 02:09:58.895978 vlab-01 INFO systemd[1]: Starting Telemetry container...
Dec  2 02:09:58.896785 vlab-01 INFO systemd[1]: Starting Router advertiser container...
Dec  2 02:09:58.897508 vlab-01 INFO systemd[1]: Starting sFlow container...
Dec  2 02:09:58.916817 vlab-01 INFO systemd[1]: Stopping TEAMD container...
Dec  2 02:10:00.227715 vlab-01 INFO systemd[1]: Stopped TEAMD container.
Dec  2 02:10:00.228267 vlab-01 INFO systemd[1]: Starting TEAMD container...
Dec  2 02:10:06.133410 vlab-01 INFO systemd[1]: Started Telemetry container.
Dec  2 02:10:12.420400 vlab-01 INFO systemd[1]: Started DHCP relay container.
Dec  2 02:10:12.501004 vlab-01 INFO systemd[1]: Started sFlow container.
Dec  2 02:10:12.811345 vlab-01 INFO systemd[1]: Started Router advertiser container.
Dec  2 02:10:13.003408 vlab-01 INFO systemd[1]: Started LLDP container.
Dec  2 02:10:13.235114 vlab-01 INFO systemd[1]: Started TEAMD container.
Dec  2 02:10:13.246461 vlab-01 INFO systemd[1]: Stopping Router advertiser container...
Dec  2 02:10:15.768857 vlab-01 INFO systemd[1]: Stopped Router advertiser container.
Dec  2 02:10:15.769463 vlab-01 INFO systemd[1]: Starting Router advertiser container...
Dec  2 02:10:17.331494 vlab-01 INFO systemd[1]: Started SNMP container.
Dec  2 02:10:19.664483 vlab-01 INFO systemd[1]: Started Router advertiser container.
Dec  2 02:10:19.675950 vlab-01 INFO systemd[1]: Stopping DHCP relay container...
Dec  2 02:10:21.277468 vlab-01 INFO systemd[1]: Stopped DHCP relay container.
Dec  2 02:10:21.281942 vlab-01 INFO systemd[1]: Starting DHCP relay container...
Dec  2 02:10:25.013029 vlab-01 INFO systemd[1]: Started DHCP relay container.
Dec  2 02:10:27.763203 vlab-01 INFO systemd[1]: Stopped Platform monitor container.
Dec  2 02:10:27.766652 vlab-01 INFO systemd[1]: Starting Platform monitor container...
Dec  2 02:10:31.255406 vlab-01 INFO systemd[1]: Started Platform monitor container.
Dec  2 02:10:36.763129 vlab-01 INFO systemd[1]: Stopped Telemetry container.
Dec  2 02:10:36.767252 vlab-01 INFO systemd[1]: Starting Telemetry container...
Dec  2 02:10:40.119939 vlab-01 INFO systemd[1]: Started Telemetry container.
Dec  2 02:11:09.263839 vlab-01 INFO systemd[1]: Stopped Platform monitor container.

@lguohan
Copy link
Collaborator

lguohan commented Dec 2, 2019

i think it is caused by this line.

https://github.com/Azure/sonic-buildimage/blob/master/files/scripts/swss.sh#L90

after syncd get started, it will sequentially restart teamd radv dhcp_relay. the above log match.

@yxieca and @jleveque , can you check.

@yxieca
Copy link
Contributor

yxieca commented Dec 2, 2019

The restart was added to address the radv not started properly issue.

And dhcp_relay doesn't care being restarted again.

But I think teamd is currently started after swss. Therefore, in the boot sequence, at the time when we 'restart' teamd, it should not have been started the first time yet.

@yxieca
Copy link
Contributor

yxieca commented Dec 2, 2019

We should look into the change further regardless.

@lguohan
Copy link
Collaborator

lguohan commented Dec 2, 2019

"But I think teamd is currently started after swss. ", it is not the case based on the log above. as you can see swss is considered as started in the following log. Then immediately, teamd started.

Dec  2 02:09:53.265467 vlab-01 NOTICE root: Started swss service...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants