Times out and terminates aws instance before windows server password retrieval. #201

Closed
JaBurd opened this issue Oct 21, 2015 · 18 comments

Comments

@JaBurd

JaBurd commented Oct 21, 2015

The kitchen ec2 driver waits for the AWS ec2 windows server to be able to provide the password before moving forward.

# rubocop:disable Lint/UnusedBlockArgument
def fetch_windows_admin_password(server, state)
  wait_with_destroy(server, state, "to fetch windows admin password") do |aws_instance|
    enc = server.client.get_password_data(
      :instance_id => state[:server_id]
    ).password_data
    # Password data is blank until password is available
    !enc.nil? && !enc.empty?
  end
  pass = server.decrypt_windows_password(instance.transport[:ssh_key])
  state[:password] = pass
  info("Retrieved Windows password for instance <#{state[:server_id]}>.")
end

However, this process can take up to 30 minutes, as stated in Amazon's documentation: http://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/IIS4.1GettingPassword.html

kitchen-ec2's default wait time is only 5 minutes (60 tries at 5 seconds each). I suggest either better documentation on how to change these default values:

retryable_tries: 60
retryable_sleep: 5

or increasing them when the platform is a Windows instance.
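
For reference, the nominal wait budget is just the product of the two settings. A minimal sketch of that arithmetic follows; the helper names `max_wait_seconds` and `tries_for` are made up for illustration and are not part of kitchen-ec2.

```ruby
# Hypothetical helpers, not part of kitchen-ec2: they only illustrate the arithmetic.

# Nominal upper bound on how long the driver polls, in seconds.
def max_wait_seconds(retryable_tries, retryable_sleep)
  retryable_tries * retryable_sleep
end

# Number of tries needed to cover a target wait at a given sleep interval.
def tries_for(target_seconds, retryable_sleep)
  (target_seconds.to_f / retryable_sleep).ceil
end

puts max_wait_seconds(60, 5)  # defaults quoted above: 300 seconds (5 minutes)
puts tries_for(30 * 60, 5)    # 360 tries to cover 30 minutes at 5-second intervals
```

Under these assumptions, covering a 30-minute window at the default 5-second sleep would take 360 tries rather than 60.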

@rhass

rhass commented Oct 29, 2015

I have a slightly different issue... it doesn't terminate the instance -- but it does fail to obtain the password, then gets into a state in which it tries to log in, but cannot:

       [WinRM] connection failed, retrying in 1 seconds (#<WinRM::WinRMAuthorizationError: WinRM::WinRMAuthorizationError>)
       [WinRM] connection failed, retrying in 1 seconds (#<WinRM::WinRMAuthorizationError: WinRM::WinRMAuthorizationError>)
       [WinRM] connection failed, retrying in 1 seconds (#<WinRM::WinRMAuthorizationError: WinRM::WinRMAuthorizationError>)
       [WinRM] connection failed, retrying in 1 seconds (#<WinRM::WinRMAuthorizationError: WinRM::WinRMAuthorizationError>)
$$$$$$ [WinRM] connection failed, terminating (#<WinRM::WinRMAuthorizationError: WinRM::WinRMAuthorizationError>)

@JaBurd I am not really clear about your suggestion to increase the values for retryable_tries and retryable_sleep on Windows though -- did increasing these values fix the issue for you?

@zl4bv
Contributor

zl4bv commented Oct 29, 2015

@rhass a WinRM auth error is a different issue because it has successfully grabbed the password from the EC2 API and is now trying to authenticate with it.

Try running test kitchen in debug mode and checking that the username/password/etc that test kitchen attempts to authenticate with are what you would expect them to be.

$ kitchen test -l debug

@JaBurd
Author

JaBurd commented Oct 29, 2015

@rhass Yes, that error is indicating kitchen is unable to authenticate properly with the instance. Definitely run the statement suggested by @zl4bv or

$ kitchen diagnose --all  

This will show you the entire configuration kitchen is using.

Also check the following:

In the transport section of your .kitchen.yml, make sure the username: key matches the admin account of the instance.

Also, a .yml file gets created in the .kitchen folder containing the key information kitchen uses to connect to your instance. I've found that this file isn't written until the kitchen process has completed; i.e., if you exit the process before it either errors or completes, the file will be blank.

---
server_id: i-1bxxxxxxxx
hostname: 10.2xxx.xxx.xxx
password: "instance admin password"
last_action: converge

Check this file to see if the information is correct to connect to your instance.
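
A quick way to run that check is to load the state file and list what is missing. This is only a sketch: the list of expected keys is an assumption based on the example above, not a kitchen API, and it assumes the file uses plain string keys as shown.

```ruby
require "yaml"

# Keys this thread suggests checking; the list is an assumption, not a kitchen API.
EXPECTED_STATE_KEYS = %w[server_id hostname password].freeze

# Returns the expected keys that are absent or empty in a state file's YAML text.
def missing_state_keys(yaml_text)
  state = YAML.safe_load(yaml_text) || {}
  EXPECTED_STATE_KEYS.select { |key| state[key].to_s.empty? }
end

# Made-up sample resembling a state file written before the password was fetched.
sample = <<~STATE
  ---
  server_id: i-1bxxxxxxxx
  hostname: 10.0.0.1
  last_action: create
STATE

p missing_state_keys(sample)  # => ["password"]
```

To check a real file, pass its contents instead, e.g. `missing_state_keys(File.read(path))`, where `path` points at the .yml file under .kitchen for your instance.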

I've finally found that my situation is quite unique. I need to be on my corporate proxy for Kitchen to initially create my instance. Once the converge gets to the point of connecting via winrm to upload cookbooks it will fail. It seems my corporate proxy sees the winrm traffic as http web traffic and intercepts it. I have to wait for it to fail, remove my proxy and re-start the converge.

Yes, as to the initial comment above, Windows can really take 20 minutes to be ready for login. The default 5 minutes is far too short for windows instances. At a minimum the readme should call this out somewhere.

@rhass-r7

@JaBurd I realize it is unable to authenticate, but the root cause of the WinRM authentication failure is that during the initial converge it gets a null value for the password when it attempts to retrieve it from AWS and decrypt it with the ssh key. As a result, the password key/value is never created in the state file, and the driver never attempts to retrieve it again if the value is missing from the state file.

The following is what I see when I try to converge a node for the first time:

>>>>>> ------Exception-------
>>>>>> Class: Kitchen::ActionFailed
>>>>>> Message: Failed to complete #create action: [no implicit conversion of nil into String]
>>>>>> ----------------------
>>>>>> Please see .kitchen/logs/kitchen.log for more details
>>>>>> Also try running `kitchen diagnose --all` for configuration

I have attempted to hack the driver in various ways to force it to request the key every time, to prove out the issue and possibly fix it, but so far I have been unsuccessful. I think the various wait methods assume that if the state/state file is defined, no work needs to be done, which effectively skips the call to fetch_windows_admin_password. One would expect this not to be the case given the following:

https://github.com/test-kitchen/kitchen-ec2/blob/master/lib/kitchen/driver/ec2.rb#L222
https://github.com/test-kitchen/kitchen-ec2/blob/master/lib/kitchen/driver/ec2.rb#L425

However, it seems transport[:password] and state[:password] behave inconsistently. Moreover, I have tried changing this wait_with_destroy to wait_until_ready, on the assumption that Windows takes longer to boot and initialize, which may be causing the value to come back nil. It did wait a few seconds longer, but it still came back nil.
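
The behavior one would expect here (fetch the password again whenever the state has none, and never store a blank result) can be sketched as below. `ensure_password!` is a hypothetical helper written for this comment, not code from kitchen-ec2, and the block stands in for whatever actually fetches and decrypts the password.

```ruby
# Hypothetical helper, not the kitchen-ec2 implementation.
# Fetches the password only when the state hash lacks a usable value,
# and refuses to store a blank result.
def ensure_password!(state)
  current = state[:password]
  return current unless current.nil? || current.empty?

  fetched = yield
  raise ArgumentError, "password not available yet" if fetched.nil? || fetched.empty?

  state[:password] = fetched
end

state = { server_id: "i-1bxxxxxxxx" }
ensure_password!(state) { "example-password" }  # block stands in for the real fetch
p state.key?(:password)  # => true
```

Raising on a blank result, instead of writing nil into the state, would surface the problem at create time rather than later as a WinRM authorization error.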

@rhass-r7

Quick update -- I did get further with everything specified correctly in my .kitchen.local.yml for this.

---
platforms:
  - name: windows-2012r2
    driver:
      name: ec2
      region: us-west-2
      availability_zone: b
      image_id: ami-f8f715cb
      instance_type: t2.small
      associate_public_ip: true
      subnet_id: subnet-CENSORED
      aws_ssh_key_id: public-key-name-in-aws
    transport:
      ssh_key: /path/to/private/keyfile

Now it seems to get stuck in a loop with Waiting for WinRM service on http://52.x.x.x:5985/wsman, retrying in 3 seconds

@jsmickey

jsmickey commented Jan 6, 2016

@rhass-r7 Did you ever get this resolved?

I am setting this up for the first time and have a similar error
Message: Failed to complete #create action: [no implicit conversion of nil into String]

@rhass-r7

rhass-r7 commented Jan 6, 2016

@jsmickey Unfortunately, I did not. I had to move forward with other commitments and I haven't had time to revisit this issue.

Maybe @mwrock might be able to help us out here.

@mwrock
Member

mwrock commented Jan 9, 2016

I'm not sure I have ever used kitchen-ec2 to converge windows boxes. I have used knife-ec2 and passed a userdata file which creates a user on the fly with the username and password you tell it to use. I believe kitchen-ec2 does the same when you tell it the credentials to use.

Once you get past authorization, you also need to set up the winrm settings required by the transport you are using (ssl/plain text). This may include allowing unencrypted traffic. If you try to use basic auth and the target machine's winrm config does not allow it, you won't get very far. So those might be some things to look at on the target box to try and figure out why winrm connections are failing.

Might look at http://www.hurryupandwait.io/blog/understanding-and-troubleshooting-winrm-connection-and-authentication-a-thrill-seekers-guide-to-adventure

@JaBurd
Author

JaBurd commented Jan 9, 2016

I've converged many Windows machines via kitchen-ec2. The biggest issue
I've found (hence the filing of this issue) is by default kitchen doesn't
wait long enough for the machine to be ready to provide the Admin password.

In my .kitchen.yml I had to override the default retryable_tries and
retryable_sleep options to give the instance enough time to come up in AWS.

retryable_tries: 200
retryable_sleep: 8

I get winrm issues due to our corporate proxy blocking winrm traffic once
the instance has completed its initialization. I have to:

  1. Converge on proxy for the system to create in aws, and wait for the
    winrm to fail.
  2. Go off corporate proxy and re-converge to allow the winrm traffic to
    complete the converge process.


@zl4bv
Contributor

zl4bv commented Jan 13, 2016

> I've converged many Windows machines via kitchen-ec2. The biggest issue
> I've found (hence the filing of this issue) is by default kitchen doesn't
> wait long enough for the machine to be ready to provide the Admin password.

In my experience kitchen-ec2's default timeouts are usually long enough when converging Windows machines using the Amazon-provided Windows AMIs. However, when using "baked" Windows AMIs the timeouts in kitchen-ec2 are almost always reached before the Windows password is retrieved. Setting retryable_tries: 600 has become a standard in our kitchen config files due to the frequency with which we encounter this issue.

@cheeseplus changed the title from "kitchen-ec2 times out and terminates aws instance before windows server password retrieval." to "Times out and terminates aws instance before windows server password retrieval." on Feb 9, 2016

@jsmickey

@zl4bv @JaBurd
Are you able to retrieve the windows password? I added retries, but test-kitchen only tries one time to fetch the password. I looked at the code and it appears it should retry 200 times, but that's not happening. Any ideas?

Waited 176/1600s for instance <i-0357440d798afd9b2> to become ready.
Waited 0/1600s for instance <i-0357440d798afd9b2> to fetch windows admin password.
>>>>>> ------Exception-------
>>>>>> Class: Kitchen::ActionFailed
>>>>>> Message: Failed to complete #create action: [no implicit conversion of nil into String]
>>>>>> ----------------------
>>>>>> Please see .kitchen/logs/kitchen.log for more details
>>>>>> Also try running `kitchen diagnose --all` for configuration

This is my .kitchen.cloud.yml

driver:
  name: ec2
  aws_ssh_key_id: my-key
  security_group_ids: ["sg-f8129c9c"]
  region: us-west-2
  availability_zone: us-west-2b
  subnet_id: subnet-1234567
  iam_profile_name: my_profile
  instance_type: t2.medium
  associate_public_ip: false
  interface: private
  retryable_sleep: 8
  retryable_tries: 200

transport:
  connection_timeout: 10
  connection_retries: 5

provisioner:
  name: chef_zero
  require_chef_omnibus: 12.8.1

platforms:
  - name: windows-2012r2
    driver:
      tags:
        Name: my-windows-instance

suites:
  - name: default
    run_list:
      - recipe[my_recipe::default]

@zl4bv
Contributor

zl4bv commented Apr 28, 2016

@jsmickey yeah, if I set the timeout to long enough it gets the Windows password.

Would you be able to paste the full backtrace from .kitchen/logs/kitchen.log?

@darknighthunder

@jsmickey @JaBurd After increasing the retryable_tries value, I see the password being retrieved for the Windows instance, but when it tries to move further it fails with undefined method `encoding' for nil:NilClass. Do any encoding values need to be configured in kitchen.yml?

Console Output
2016/05/05 17:33:33Z: Message: Windows is Ready to use
EC2 instance ready.

------Exception-------
Class: Kitchen::ActionFailed
Message: Failed to complete #create action: [undefined method `encoding' for nil:NilClass]

This is how my .kitchen/logs/kitchen.log looks after the error

D ------Exception-------
D Class: Kitchen::ActionFailed
D Message: Failed to complete #create action: [undefined method `encoding' for nil:NilClass]
D ---Nested Exception---
D Class: NoMethodError
D Message: undefined method `encoding' for nil:NilClass
D ------Backtrace-------
D /opt/chefdk/embedded/lib/ruby/gems/2.1.0/gems/rubyntlm-0.6.0/lib/net/ntlm/encode_util.rb:42:in `encode_utf16le'
D /opt/chefdk/embedded/lib/ruby/gems/2.1.0/gems/rubyntlm-0.6.0/lib/net/ntlm/client/session.rb:187:in `oem_or_unicode_str'
D /opt/chefdk/embedded/lib/ruby/gems/2.1.0/gems/rubyntlm-0.6.0/lib/net/ntlm/client/session.rb:172:in `password'
D /opt/chefdk/embedded/lib/ruby/gems/2.1.0/gems/rubyntlm-0.6.0/lib/net/ntlm/client/session.rb:192:in `ntlmv2_hash'
D /opt/chefdk/embedded/lib/ruby/gems/2.1.0/gems/rubyntlm-0.6.0/lib/net/ntlm/client/session.rb:196:in `calculate_user_session_key!'
D /opt/chefdk/embedded/lib/ruby/gems/2.1.0/gems/rubyntlm-0.6.0/lib/net/ntlm/client/session.rb:27:in `authenticate!'
D /opt/chefdk/embedded/lib/ruby/gems/2.1.0/gems/rubyntlm-0.6.0/lib/net/ntlm/client.rb:36:in `init_context'
D /opt/chefdk/embedded/lib/ruby/gems/2.1.0/gems/winrm-1.7.2/lib/winrm/http/transport.rb:228:in `init_auth'
D /opt/chefdk/embedded/lib/ruby/gems/2.1.0/gems/winrm-1.7.2/lib/winrm/http/transport.rb:166:in `send_request'
D /opt/chefdk/embedded/lib/ruby/gems/2.1.0/gems/winrm-1.7.2/lib/winrm/winrm_service.rb:489:in `send_message'
D /opt/chefdk/embedded/lib/ruby/gems/2.1.0/gems/winrm-1.7.2/lib/winrm/winrm_service.rb:390:in `run_wql'
D /opt/chefdk/embedded/lib/ruby/gems/2.1.0/gems/winrm-1.7.2/lib/winrm/command_executor.rb:171:in `os_version'
D /opt/chefdk/embedded/lib/ruby/gems/2.1.0/gems/winrm-1.7.2/lib/winrm/command_executor.rb:130:in `code_page'
D /opt/chefdk/embedded/lib/ruby/gems/2.1.0/gems/winrm-1.7.2/lib/winrm/command_executor.rb:72:in `block in open'
D /opt/chefdk/embedded/lib/ruby/gems/2.1.0/gems/winrm-1.7.2/lib/winrm/command_executor.rb:203:in `retryable'
D /opt/chefdk/embedded/lib/ruby/gems/2.1.0/gems/winrm-1.7.2/lib/winrm/command_executor.rb:71:in `open'
D /opt/chefdk/embedded/lib/ruby/gems/2.1.0/gems/winrm-1.7.2/lib/winrm/winrm_service.rb:356:in `create_executor'
D /opt/chefdk/embedded/lib/ruby/gems/2.1.0/gems/test-kitchen-1.6.0/lib/kitchen/transport/winrm.rb:321:in `session'
D /opt/chefdk/embedded/lib/ruby/gems/2.1.0/gems/test-kitchen-1.6.0/lib/kitchen/transport/winrm.rb:135:in `wait_until_ready'
D /opt/chefdk/embedded/lib/ruby/gems/2.1.0/gems/kitchen-ec2-1.0.0/lib/kitchen/driver/ec2.rb:205:in `create'
D /opt/chefdk/embedded/lib/ruby/gems/2.1.0/gems/test-kitchen-1.6.0/lib/kitchen/instance.rb:449:in `public_send'

@rojomisin

I am launching into an ec2-classic account, and kept getting Failed to complete #create action: [no implicit conversion of nil into String] on default-windows-2008r2

Using a combination of the solutions above, what finally allowed my .kitchen.ec2.yml to create the instance and retrieve the password from a vanilla Windows AMI was adding this to the transport and platform sections:

transport:
  name: 'winrm'
  ssh_key: /Users/user1/chef-repo/cookbooks/win_config/win_config.pem

platforms:
  - name: windows-2008r2
    driver_config:
      guest: windows
      communicator: 'winrm'

full file here if it helps anyone

@rhealitycheck

@rojomisin I am so glad I read all the way to your comment because what you have snipped into your post is what got mine working after trying to increase the retry counts etc which did nothing to help. Thank you so much for posting your solution!

@BenLiyanage

BenLiyanage commented Dec 6, 2016

This has been apparently going on for quite a while. I'm also having this issue.

My yaml looks like this:

---
transport:
  ssh_key: ~/.ssh/test-kitchen

driver:
  name: ec2
  subnet_id: subnet-5b56cd02
  security_group_ids: ["sg-20694b44"]
  instance_type: t2.medium
  retryable_tries: 120
  retryable_sleep: 10

provisioner:
  name: chef_zero

my .kitchen/default.yml looks like this during the kitchen create command:

--- {}

After I cancel the command or it times out it writes out a file that looks like this:

---
server_id: i-37011ba4
hostname: 52.90.25.94

It seems to be unable to get the username/password from AWS

When I run kitchen converge --log-level=debug it throws this error:

[WinRM] opening remote shell on plaintext::http://52.90.25.94:5985/wsman<{:disable_sspi=>true, :basic_auth_only=>true, :user=>"administrator", :pass=>nil}>

This is really confirming that the password is not getting set correctly in the .kitchen/default.yml

I'm generating my ssh key with this AWS command:

aws ec2 create-key-pair --key-name "${username}" | ruby -e "require 'json'; puts JSON.parse(STDIN.read)['KeyMaterial']" > ~/.ssh/test-kitchen

I notice some people are using a .pem file. Is that the same as what I'm doing with my ssh key? How can I get the default admin password from AWS as at least a minimum workaround?
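
On the .pem question: the KeyMaterial returned by create-key-pair is a PEM-encoded private key, so the file written above is the same kind of file. As a workaround sketch, EC2 password data can be decrypted by hand with that key. The function below is an illustration, not kitchen-ec2's code; it assumes `encrypted_b64` is the Base64 text returned by get_password_data and `private_key_pem` is the contents of the key file.

```ruby
require "openssl"

# Sketch only, not kitchen-ec2's code.
# encrypted_b64:   Base64 password data as returned by get_password_data
# private_key_pem: PEM text of the key pair's RSA private key
# Returns nil while the password data is still blank.
def decrypt_windows_password(encrypted_b64, private_key_pem)
  return nil if encrypted_b64.nil? || encrypted_b64.strip.empty?

  encrypted = encrypted_b64.unpack1("m")         # Base64-decode
  key = OpenSSL::PKey::RSA.new(private_key_pem)
  key.private_decrypt(encrypted)                 # PKCS#1 v1.5 padding by default
end
```

The AWS CLI can also do the decryption itself: `aws ec2 get-password-data --instance-id i-37011ba4 --priv-launch-key ~/.ssh/test-kitchen` should print the decrypted password once it is available.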


As an aside for other people with issues with this: you need to have the default WinRM port 5985, and most likely the RDP port 3389, open for this to have a shot at connecting.
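
A small reachability check can rule out security-group or firewall problems before digging into kitchen itself. This is a sketch using only Ruby's standard library; `port_open?` is a name made up for this example.

```ruby
require "socket"

# Returns true when a TCP connection to host:port succeeds within the timeout.
# A false result can mean a closed port, a blocked security group, or a host
# that is not up yet; the check cannot tell these apart.
def port_open?(host, port, timeout: 3)
  Socket.tcp(host, port, connect_timeout: timeout) { true }
rescue SystemCallError, SocketError, IOError
  false
end
```

For example, `[5985, 3389].each { |port| puts "#{port}: #{port_open?(host, port)}" }`, with `host` set to the instance's address, reports both ports in one go.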

@evanwieren

Ok, I did some sleuthing on this issue. This is not an issue with kitchen-ec2 directly.

...lib/kitchen/driver/ec2.rb 

        begin
          server.wait_until(
            :max_attempts => config[:retryable_tries],
            :delay => config[:retryable_sleep],
            :before_attempt => wait_log,
            &block
          )
        rescue

calls the Ruby AWS library aws-sdk and function:
#wait_until(waiter_name, params = {}) {|waiter| ... } ⇒ Boolean

Now the issue is that this times out around 245 seconds or so. No matter what you put in for the timeout, it fails before the number it has been given.

From the docs, the waiter defaults are:
:password_data_available (client method #get_password_data): delay 15 seconds, max attempts 40

It could be that it is using an older version of the API that had a bug, but I am not sure. I have not run a test on the side using the aws-sdk gem to validate. Sorry about the formatting.
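
To make the max_attempts/delay semantics concrete, here is a generic polling loop. It is written for illustration and is not the aws-sdk waiter code; `poll_until` and its `sleeper` parameter are invented for this sketch so the loop can be exercised without real sleeping.

```ruby
# Generic polling loop for illustration; this is not the aws-sdk waiter code.
# Calls the block up to max_attempts times, sleeping `delay` seconds between
# attempts. Returns the attempt number that succeeded, or nil if none did.
def poll_until(max_attempts:, delay:, sleeper: method(:sleep))
  1.upto(max_attempts) do |attempt|
    return attempt if yield(attempt)
    sleeper.call(delay) if attempt < max_attempts
  end
  nil
end

# Simulated run: record the sleeps instead of actually waiting.
slept = 0
result = poll_until(max_attempts: 40, delay: 15, sleeper: ->(s) { slept += s }) { |n| n == 3 }
puts "ready after #{result} attempts, #{slept}s of waiting"
```

With the quoted defaults (delay 15, 40 attempts) a loop like this would give up after roughly 600 seconds, so a failure at around 245 seconds suggests something other than the attempt limit is ending the wait.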

@cheeseplus
Contributor

I'm closing this one out given the age and meandering nature - we've definitely addressed some of the issues in kitchen-ec2 as best we can as well as aws-sdk updates. If folks are seeing this with version 1.4+ please open a new issue with the relevant diagnostic data.
