Thursday, August 6, 2015

SRM 6 NIGHTMARE!!!

***************************
Update 2

This case is now closed. The only advice I can give at this point is to be VERY careful when you update the certificates in the MOB (Managed Object Browser). If you do it wrong, you will be rebuilding your environment.

The notes below do work for fixing the MOB; just make sure you know exactly what your syntax is supposed to be and which certificates you are replacing...


Back to the drawing board for me.....


*****SIGH*****


$#@%^&*&%$##^%^&!!!


***************************
UPDATE!!!

After going through all of this again (time has finally permitted me to get back to this), I have finally got MOST of this working. I say most because I reset my PSC and vCenter certs to the same thing, and now I have to call support to see if I can change this! *Yes, I am an idiot* I will update again as I get this last part figured out.


VMware has updated KB2121701 so many times over the last couple of months that they must really be sick of that page.

The ROOT of the problem is the certificates associated with the PSC and vCenter servers. If they get changed, for some reason VMware does not update them properly in the MOB *or wherever the info is stored*, and it then falls on you, the admin, to be clever enough to know this is the issue......


If you follow the instructions in the KB (very carefully, I might add), they walk you through how to view the certificate that the MOB currently lists for your PSCs and vCenters, how to download a copy of those certs to get a thumbprint, how to download your current certificates, and finally how to use the ls_update_certs.py script *you have to install a new copy of it from the KB article* to modify what is in the MOB pages. Below is the example from the article of the script you will run. I want to point out that if you have multiple PSCs and vCenters, you will need to do this for ALL of them! You also have to run this from the PSC server.



"%VMWARE_PYTHON_BIN%" ls_update_certs.py --url https://psc.vmware.com/lookupservice/sdk --fingerprint 13:1E:60:93:E4:E6:59:31:55:EB:74:51:67:2A:99:F8:3F:04:83:88 --certfile c:\certificates\new_machine.crt --user Administrator@vsphere.local --password Password
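
For anyone who would rather script the thumbprint step than read it off a certificate dialog, here is a minimal Python sketch. It assumes the old certificate was saved as a single PEM (Base64) file and that the `--fingerprint` value is the colon-separated SHA-1 hash of the certificate, which is what the example above looks like. The function name is made up for illustration; it is not part of the KB.

```python
import hashlib
import ssl


def sha1_thumbprint(pem_text: str) -> str:
    """Return the SHA-1 thumbprint of one PEM certificate as AA:BB:CC... pairs."""
    # The thumbprint is a hash of the binary (DER) form, so decode the Base64 body first.
    der_bytes = ssl.PEM_cert_to_DER_cert(pem_text.strip())
    digest = hashlib.sha1(der_bytes).hexdigest().upper()
    # Group the 40 hex characters into 20 colon-separated pairs.
    return ":".join(digest[i:i + 2] for i in range(0, len(digest), 2))
```

To use it, read the saved file and pass the text in, for example `sha1_thumbprint(open(r"c:\certificates\old_machine.crt").read())` (the path is only an example). Compare the result against what the MOB lists before running anything.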


You would need to run the ls_update_certs.py command above for:

1.) Your Production PSC *get the thumbprint for the old cert and download the new cert to a central location*
2.) Your Production vCenter *get the thumbprint for the old cert and download the new cert to a central location*
3.) Your DR PSC *get the thumbprint for the old cert and download the new cert to a central location*
4.) Your DR vCenter *get the thumbprint for the old cert and download the new cert to a central location*
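
Since the same command has to be repeated for each of the four servers above, it can help to write all of the command lines out ahead of time and review them before touching anything. The following is a rough Python sketch that only builds and prints the command strings; it does not run them. The host names, file paths and placeholder thumbprints are hypothetical, and pointing each command at the lookup service of the PSC in the same site is an assumption about the layout, so check it against the KB for your environment.

```python
from dataclasses import dataclass


@dataclass
class CertSwap:
    label: str            # human-readable name, used only in the printed remark
    lookup_url: str       # lookup service URL of that site's PSC (assumed layout)
    old_fingerprint: str  # thumbprint of the OLD certificate
    new_certfile: str     # path to the NEW certificate you downloaded


def build_command(swap: CertSwap, user: str = "Administrator@vsphere.local") -> str:
    """Build one ls_update_certs.py command line as text; nothing is executed."""
    return (
        '"%VMWARE_PYTHON_BIN%" ls_update_certs.py'
        f" --url {swap.lookup_url}"
        f" --fingerprint {swap.old_fingerprint}"
        f" --certfile {swap.new_certfile}"
        f" --user {user}"
        " --password <password>"  # placeholder: type the real password at run time
    )


# Hypothetical hosts, paths and placeholder thumbprints; replace with your own.
PROD = "https://prod-psc.example.com/lookupservice/sdk"
DR = "https://dr-psc.example.com/lookupservice/sdk"
swaps = [
    CertSwap("Production PSC", PROD, "<old prod PSC thumbprint>", r"c:\certificates\prod_psc.crt"),
    CertSwap("Production vCenter", PROD, "<old prod vCenter thumbprint>", r"c:\certificates\prod_vc.crt"),
    CertSwap("DR PSC", DR, "<old DR PSC thumbprint>", r"c:\certificates\dr_psc.crt"),
    CertSwap("DR vCenter", DR, "<old DR vCenter thumbprint>", r"c:\certificates\dr_vc.crt"),
]

for swap in swaps:
    print(f"REM {swap.label}")
    print(build_command(swap))
```

Printing instead of executing is deliberate: a wrong thumbprint or certificate here is exactly the mistake that leads to rebuilding the environment.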

I don't know how to make KB 2121701 easier to read, but there has to be a way....it is a wealth of knowledge, but....that knowledge is not easy to get at!


****************************


I am trying to love VMware vSphere 6 and Site Recovery Manager 6 (SRM). I am trying to show my confidence in VMware. It's not working though.....and I know, I broke the cardinal rule of IT “never adopt early.”

VMware has been my favorite technology for a long time! I drank the Kool-Aid, and in my mind there is not another company that is doing the kinds of things that they are. Let's face it though....nobody is perfect.

I have now had a case open with them about Site Recovery Manager 6 and vCenter 6 since June 8th, about 2 months. I have talked to some great techs at VMware, but I am beginning to sense that there is a lot of confusion among their ranks about the new products. I have had techs tell me that I had to have the same certificate for both the protected and recovery sites in order for things to work, and yet their install and configure manual clearly says otherwise. I have had technicians who did not know what the VMCA (VMware Certificate Authority) is or what its function is, going so far as to tell me that I needed to do individual certs for each of my vCenter servers, Platform Services Controller (PSC) servers and ESXi servers. I still have not gotten a good answer as to whether SRM and the VMCA work together or whether they will sometime in the future. Heck, the first month of my case was spent calling and begging their support team to call me back; it wasn’t until my VP called and started screaming that I started getting any serious traction on the case.

The frustrating part? I have done a bog-standard install of SRM. I have set up my environment following VMware’s best practices. I have even gone so far as to ask the technicians to verify the install.

The PSCs are external. The vCenter Servers and the SRM servers are stand-alone VM servers. I made my VMCAs subordinate Certificate Authorities to my in-house Certificate Authority so that all of my clients would trust the sites and we would not have issues.




It is exactly as VMware shows it in a standard Two-Site Topology with one vCenter Server instance per Platform Services Controller (PSC).

 














My issue?? Here goes: when I go to Site Recovery > Sites from my production server, I immediately get the message below:





Error: Failed to connect to Lookup Service at HTTPS://DRPSCSERVER.DOMAIN.COM:443/lookupservice/sdk.
Reason:
com.vmware.vim.vmomi.core.exception.CertificateValidationException: Server certificate chain not verified.

Simple, right? The certificates on my vCenter must not trust that PSC chain, right? One of the servers must not have the chain or the certificate for the DR site….but they do. VMware has verified they do. I can go to the DR PSC server from my Production vCenter Server and it shows the site as trusted…
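
A quick way to repeat that kind of trust check outside the browser is a plain TLS handshake from a script. The sketch below is a minimal Python example that assumes you have your in-house root CA chain exported as a PEM file. It only shows whether this Python client can verify the chain the server presents on port 443; it says nothing about what SRM or the lookup service have stored internally, which is where this problem turned out to live. The function name is made up for illustration.

```python
import socket
import ssl
from typing import Optional, Tuple


def check_chain(host: str, port: int = 443, cafile: Optional[str] = None,
                timeout: float = 5.0) -> Tuple[bool, str]:
    """Attempt a TLS handshake and report whether the server's chain verifies."""
    # cafile=None falls back to the OS trust store; pass your own CA bundle (PEM) to test it.
    context = ssl.create_default_context(cafile=cafile)
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            # The handshake, including chain and host name checks, happens in wrap_socket.
            with context.wrap_socket(sock, server_hostname=host):
                return True, "chain verified"
    except ssl.SSLCertVerificationError as exc:
        return False, f"chain NOT verified: {exc.verify_message}"
    except OSError as exc:
        # Covers refused connections, timeouts and other TLS failures.
        return False, f"could not complete handshake: {exc}"
```

Running it from each vCenter and SRM server against the other site's PSC, with the same CA bundle, gives a side-by-side view of who trusts whom. If every handshake verifies and SRM still complains, that points away from the trust stores and toward stale certificate entries, as described in the updates at the top of this post.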

VMware has combed the logs, and “We ain’t found….”












Now, if I try the exact same thing from the DR side, what happens, you ask? The exact same thing, but the error message says that the certificate chain is not valid for the Production PSC server. Which is really weird….because I can see both vCenter servers on both the Production and DR sites. Oh, and once again I can go to the Production PSC from the vCenter server and it shows the site as trusted as well.

Ahh….so it must be the PSCs don’t trust each other…..NOPE. I can go to each of the PSCs and they both trust the other.

Well, so that leaves the SRM servers, right? One of them must be the culprit. Well, as before …the vCenter servers all look trusted, and so do the PSC servers. The certificates that the SRM servers have are actually from the parent CA. So they are trusted all the way through….

I am bumfuzzled….

If anyone has any advice on this PLEASE speak up! Once I get a solution I promise I will append it to this entry….




















30 comments:

  1. Further to my last comment.....can you also list any KB articles you've followed without success? For example, this one is similar to your issue, but I'm not sure what you have tried so far.

    http://kb.vmware.com/selfservice/microsites/search.do?cmd=displayKC&docType=kc&externalId=2121701&sliceId=1&docTypeID=DT_KB_1_1&dialogID=720030279&stateId=0%200%20720038727

    1. Thank you so much for the reply! There have been several articles that we have looked at ...problem is that at the time they could not be run on Windows-based external PSC servers. KB2109074 and KB2121701 were two articles that we looked at previously. Interestingly enough, KB2121701 was updated again two days ago with a file to download and replace the existing one, just like the article that you pointed out. My only concern now is whether it is going to work, seeing as most of the certificates that we have in house right now are SHA256, not SHA1.

    2. Just tried KB2121701, still no joy. The last time I looked at that article it said you could not run it on Windows external PSCs.

  2. ........not sure where last comment went.......anyway sorry to hear of your issues please mail me SR number to lee at vmware dot com. I'll take a look into this with my SRM colleagues.

  3. Echoing Lee's comment, according to a colleague of mine, you might want to have a look at KB 2121689 and KB 2121701, we are looking at re-writing those though.

    1. Valentin,

      Looks like some of those articles have been edited within the last two days. I'm continuing to hope for a solution!

  4. Glad I'm not the only one...I've been fighting with this all day

  5. Nope, and unfortunately I still don't have a solution. My VP at this point is beyond livid. It has been almost 3 months now that we have been trying to get an answer to this problem from VMware. If you find a way around it, let me know please!

  6. Any updates on this issue? I am having the same problem now.

    1. I WISH!! I am getting weekly updates from SRM Tech Support saying that there is nothing further yet. In the last one that I got they said, "Still no fix from Engineering but there's been a lot of progress."

      That was this last Thursday....

      Sorry to hear you are having the issue. As promised I will update with a fix as soon as I have one!

      Martin

    2. Martin...I was able to fix this during this week. I followed KB2121701. I am at 6.0 U1. The first time I tried, it did not work. I rolled back to the snaps I took ahead of time. This time I did a more thorough investigation of all the PSC entries in the MOB. I found more than one certificate on both sites that needed to be replaced. I went through both sites' PSCs and created, ahead of time, the command to run for each certificate I needed to replace (with the fingerprint for each). Then I ran those in each site without rebooting until I had run them all on both sites. Then I rebooted the vCenter, SRM, and PSC VMs on both sites all at once. When everything came back up, my SRM was back up and working in vCenter. I could see all the sites and protection groups, etc. It was a pain, but it fixes it.

    3. Hi CSCMan. Could you be more specific as to what exactly you have done? You say "each certificate", but which certificates are we talking about? If I follow the KB, they mention only one certificate, not multiple.

      Any help would be much appreciated! I'm planning on going through the same KB one more time (already opened a SR with VMware).

    4. Martin,

      Try using Notepad++, and make sure that after every 64th character of the certificate you hit Enter to start a new line.

  7. Martin, we've resolved this issue following the steps in VMware KB2121689. The only difference with last time was the administrator@vsphere.local password. It contained a special character, which the Python script didn't like. We temporarily changed the password and all was well. We did this for both sites and now SRM is working like a charm.

    Please let me know if this helps. And if not, let me know where you are stuck, as I'm sure we can resolve your little issue!

  8. Sweet, there might be hope then. I figured it would come from the community before it did anywhere else. Any clue as to what the special character was that was killing it? Was it an @ or a $ by chance?

  9. Well, it was a question mark. Go figure. It simply made it impossible to even run the script. So the first time, we created a tempadmin@vsphere.local account, with a simpler password (only letters and numbers), but the script failed. I don't remember the exact error message we got, but Googling didn't give us any new insights.

    Does your password contain any character that isn't a simple letter or number? Then try changing it, it might help. If not, let me know. If we can tackle this issue with this KB article, then so can you :)

  10. Same issue with SRM and certificate...

    Cause:
    Failed to connect to vCenter Server at https://fqdn:443/sdk. Reason: com.vmware.vim.vmomi.core.exception.CertificateValidationException: Server certificate chain not verified

    Currently working with support, but no solutions for now.

  11. I did not have password issues with mine.....for me, I found that I had more than one URL in the MOB that had an incorrect ssltrust on it...i.e., an incorrect cert. So I ran the script on both sites for the different fingerprints to point them to the correct certificate. Then I rebooted the servers on both sides at the same time. I also needed to make sure I followed step 9 and added the carriage return after the 64th character, or it did not work.

  12. This comment has been removed by the author.

  13. Ah, so it was the carriage return! We simply imported the certificate in Windows and then exported it again. Probably does the same, because that's what helped us out with that step :)

  14. Martin, did you ever come to resolution on this? Google brought me here. My certificates are proper, even an embedded PSC to keep things simple. Yet, I too get the dreaded "Failed to connect to lookup service" in SRM due to Server certificate chain not verified.

    Getting tired of trial and error trying to fix what appears to be a bug, even in 6.0u1b :(

    1. YES!!!

      But Ugg! What a pain in the rear it was!

      The article that is listed above, KB2121701, is what you have to follow. You have to follow it step by step, to the letter, too! Don't assume you already have anything in place. The long and the short of it is that after you install your PSC and vCenter, when you change your certificates from what they originally were, you have to go into the MOB and modify the certificates that are listed in there. Again, KB2121701 is going to be your best friend on this, but OMG is it confusing! I finally did this about 3 weeks ago and got it all working ....If you want to talk further about it, let me know.

      you can email me at mcwells1974 at h o t m a i l . c o m
      It is a junk mail email account but I will make sure I check it.

    2. Good to hear! Both KB2121701 and KB2121689 (embedded PSC) indicate that this is resolved with 6.0u1b, but not until you replace the certificates again.

      I'll give it another go & be sure to report back here.

      Thanks!

  15. Yeah....they lie.

    You can verify in the MOB what certificates are being used. If you save your certificate for the vCenter and open it in notepad, you can see what hash it is using and compare it to the MOB.

  16. Well sir, I owe you a beer! What a rather irritating path to implementation, I'd much prefer the 5.5 way of doing certs as it was a pain in the ass, but at least you knew when it was going to work.

    So glad I ran across your blog, and I'm serious about the beer if I come across you at a future vmworld!

  17. All I did was replace my machine cert at the HQ and DR site to get rid of the nasty certificate errors when using vSphere web client. My SRM seems paired still, but vSphere replication is no longer working. When I go to manage each vR server at https://ipaddress:5490 and go to the configuration page, enter the SSO credentials , save and restart service - I do get the pop up about trusting the new certificate and it does show the thumbprint. I obviously hit accept, but then after a few moments it says Bad exit code: 1 at the top in red. Also VUM (vSphere Update Manager) is broken as well!

    All I did was replace the machine (reverse http proxy) certificate with one generated by our MS CA as described in this excellent blog here:
    http://www.virtually-limitless.com/certificates/replacing-or-implementing-ssl-certificates-in-vsphere-6/

    We are running 6.0 update 2, and noticed all the KB's I've found that remotely resemble this issue with VR or other services say that it was an issue that was fixed with 6.0 1b, which we jumped right past (upgraded from 5.0).

  18. Keith, Did you check the MOB website? Did you check which certificate it is trying to use for those functions? I used an in-house cert to make my PSCs into Subordinate CAs then from there I did the rest of the work on my SRM stuff.

  19. Ok found a "Known issue" for vSphere replication 6.0 that also seems to apply to vSphere replication 6.1.1 (that we are using). You apparently have to power off the vSphere replication appliance from vmware web client, and then power it back on. It does something with the registration of the ovf doing it that way, rather than restarting it through the web interface port 5490.

    Once this was done I was able to go back in the configuration page in the web interface and accept the certificate.


  20. Hi Martin, I found this page looking for information on integrating my VMCA (default mode) with external PSC's to SRM 6. Can you provide any information on how to do this, or is it done automatically when installing SRM?

  21. The VMCA is a completely separate thing. As it has been a while since I installed it, I went looking again, and this is what I found for the install:
    https://www.derekseaman.com/2015/04/vsphere-6-0-install-pt-11-vmca-as-subordinate.html

    I installed the VMCA as a subordinate to my MS AD integrated certificate server. I had to publish subordinate root certificates from the root CA. Is there something in particular you are needing help with?

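
Several of the comments above mention that the certificate file only worked once its Base64 text had a line break after every 64th character. For anyone who would rather not do that by hand in an editor, here is a minimal Python sketch that re-wraps a single PEM certificate at 64 characters per line. It assumes the file holds exactly one certificate with the usual BEGIN/END CERTIFICATE lines; the function name is made up for illustration and is not part of the KB or its script.

```python
import re

PEM_HEADER = "-----BEGIN CERTIFICATE-----"
PEM_FOOTER = "-----END CERTIFICATE-----"


def rewrap_pem(pem_text: str, width: int = 64) -> str:
    """Re-wrap the Base64 body of a single PEM certificate at `width` characters."""
    text = pem_text.strip()
    if not (text.startswith(PEM_HEADER) and text.endswith(PEM_FOOTER)):
        raise ValueError("expected a PEM certificate with BEGIN/END lines")
    body = text[len(PEM_HEADER):-len(PEM_FOOTER)]
    if PEM_HEADER in body:
        raise ValueError("expected exactly one certificate, found more than one")
    # Remove every existing line break and space so any odd wrapping is normalized.
    b64 = re.sub(r"\s+", "", body)
    lines = [b64[i:i + width] for i in range(0, len(b64), width)]
    return "\n".join([PEM_HEADER, *lines, PEM_FOOTER]) + "\n"
```

Read the certificate file as text, pass it through `rewrap_pem`, and write the result to a new file rather than overwriting the original, so the untouched copy is still there to compare against.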