This repository has been archived by the owner on Sep 6, 2024. It is now read-only.

Installation issue w/ release 2022-11-09133750 #26

Open
hugovalente-pm opened this issue Nov 9, 2022 · 31 comments

Comments

@hugovalente-pm

Was trying to install netdata and I guess this is the release used https://github.com/netdata/msi-installer/releases/tag/2022-11-09133750 (based on date/time)

Steps:

  1. I had removed my previous netdata.msi installation but I had the wmi_exporter already installed
  2. Downloaded the netdata.msi mentioned above
  3. Ran it in an Admin PowerShell with msiexec.exe /i C:\Users\hugoj\netdata\netdata.msi TOKEN=<space-token> URL=https://app.netdata.cloud
  4. Saw the PC reboot
  5. Installation tried to resume and was stuck at REGISTERING NETDATA DISTRO WITH WSL2
@hugovalente-pm
Author

On the logfile netdata.log I see
image

@dfpr
Contributor

dfpr commented Nov 9, 2022

Possibly microsoft/WSL#8714
Please attach WSL logs using these instructions https://github.com/Microsoft/WSL/blob/master/CONTRIBUTING.md#8-detailed-logs

@hugovalente-pm
Author

Could you share your e-mail with me? I'm not sure these logs are safe to share publicly.
my email: [email protected]

@dfpr
Contributor

dfpr commented Nov 9, 2022

Sent you an email.

@hugovalente-pm
Author

Some updates here: the main issue seemed to be caused by another image already using port 19999. I'm not sure whether this is something that can be surfaced to the user.

After stopping that other image the installation went ahead, but the node wasn't successfully claimed to Netdata Cloud because it couldn't reach api.netdata.cloud.
The solution was to restart the PC, enter the Netdata image, and run the claiming script netdata-claim.sh -token=<space-token>

The node was claimed, as can be seen in the image below, but I'm not able to get the node connected to Cloud; I get errors on ACLK
image

image
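
A minimal sketch (not from the thread) of how the port conflict described above could be detected before installing. It only checks whether something already accepts TCP connections on port 19999; the function name `port_in_use` and the choice of 127.0.0.1 are assumptions for illustration, and the installer itself may do something different.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1", timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise.
        return sock.connect_ex((host, port)) == 0
```

Calling `port_in_use(19999)` before running the MSI would show whether another agent or image is already listening on the default Netdata port.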

@hugovalente-pm
Author

This seems to be related to the default DNS. Checking the content of /etc/resolv.conf, the nameserver is my IPv4 address
image

I looked at another image, which has the following
image

this was gotten from https://askubuntu.com/questions/1403886/how-to-fix-wsl-domain-resolution
@dfpr is this something we need to consider while installing/setting up the image?

@dfpr
Contributor

dfpr commented Nov 10, 2022

@hugovalente-pm can you confirm it is just the DNS by pinging an IP address? Also, the wsl import now switches to WSL1 if it takes more than 2 minutes. And with the MSI argument WSL=1, WSL1 will be used. Both the DNS and the import issues appear to be related directly to WSL and not the installer.

@hugovalente-pm
Author

@dfpr it was the DNS. I did some troubleshooting with some guys on Slack to help identify this; I'll add a summary here.

Quoting @dfpr: "And with the MSI argument WSL=1 that will be used. Both dns and import issues appear to be related directly to WSL and not the installer."

Not sure I follow here. If we install the Netdata image and there are issues with /etc/resolv.conf, we can't solve them from this installation process. If that is the case, can't we at least point users to the article shared?

  • in the Netdata image in terms of name resolution everything seems ok but as soon as I try to curl an endpoint it gets stuck
    image

  • in another image Ubuntu I'm able to have a node connected to Cloud and the curls return a response
    image

  • from one of the guys it resolves to a different IP 44.207.131.212

  • execution of curl with -vvv

    ```
    hugo-pc:/mnt/c/Users/hugoj# curl -vvv https://api.netdata.cloud/
    * Trying 205.251.197.6:443...
    * Trying 2600:9000:5305:600::1:443...
    * Immediate connect fail for 2600:9000:5305:600::1: Network unreachable
    * Trying 2600:9000:5307:2200::1:443...
    * Immediate connect fail for 2600:9000:5307:2200::1: Network unreachable
    * Trying 2600:1f18:428d:5e02:7c85:d971:5a27:fe20:443...
    * Immediate connect fail for 2600:1f18:428d:5e02:7c85:d971:5a27:fe20: Network unreachable
    * Trying 2600:1f18:428d:5e01:f75a:d1e1:c99f:88ca:443...
    * Immediate connect fail for 2600:1f18:428d:5e01:f75a:d1e1:c99f:88ca: Network unreachable
    * Trying 2600:1f18:428d:5e00:f8d1:27b7:3fd1:39a0:443...
    * Immediate connect fail for 2600:1f18:428d:5e00:f8d1:27b7:3fd1:39a0: Network unreachable
    * Trying 2600:9000:5305:600::1:443...
    * Immediate connect fail for 2600:9000:5305:600::1: Network unreachable
    * Trying 2600:9000:5307:2200::1:443...
    * Immediate connect fail for 2600:9000:5307:2200::1: Network unreachable
    * connect to 205.251.197.6 port 443 failed: Operation timed out
    * Trying 205.251.199.34:443...
    * After 85265ms connect time, move on!
    * connect to 205.251.199.34 port 443 failed: Operation timed out
    * Trying 205.251.193.25:443...
    * After 42632ms connect time, move on!
    * connect to 205.251.193.25 port 443 failed: Operation timed out
    * Trying 205.251.195.76:443...
    * After 21316ms connect time, move on!
    * connect to 205.251.195.76 port 443 failed: Operation timed out
    * Trying 205.251.197.6:443...
    * After 10657ms connect time, move on!
    * connect to 205.251.197.6 port 443 failed: Operation timed out
    * Trying 205.251.199.34:443...
    * After 5328ms connect time, move on!
    * connect to 205.251.199.34 port 443 failed: Operation timed out
    * Trying 205.251.193.25:443...
    * After 2664ms connect time, move on!
    * connect to 205.251.193.25 port 443 failed: Operation timed out
    * Trying 205.251.195.76:443...
    * After 1331ms connect time, move on!
    * connect to 205.251.195.76 port 443 failed: Operation timed out
    * Trying 44.196.50.41:443...
    * Connected to api.netdata.cloud (44.196.50.41) port 443 (#0)
    * ALPN: offers h2
    * ALPN: offers http/1.1
    * CAfile: /etc/ssl/certs/ca-certificates.crt
    * CApath: none
    * TLSv1.3 (OUT), TLS handshake, Client hello (1):
    * TLSv1.3 (IN), TLS handshake, Server hello (2):
    * TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
    * TLSv1.3 (IN), TLS handshake, Certificate (11):
    * TLSv1.3 (IN), TLS handshake, CERT verify (15):
    * TLSv1.3 (IN), TLS handshake, Finished (20):
    * TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
    * TLSv1.3 (OUT), TLS handshake, Finished (20):
    * SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
    * ALPN: server accepted h2
    * Server certificate:
    * subject: CN=app.netdata.cloud
    * start date: Oct 19 06:50:32 2022 GMT
    * expire date: Jan 17 06:50:31 2023 GMT
    * subjectAltName: host "api.netdata.cloud" matched cert's "api.netdata.cloud"
    * issuer: C=US; O=Let's Encrypt; CN=R3
    * SSL certificate verify ok.
    * Using HTTP2, server supports multiplexing
    * Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
    * h2h3 [:method: GET]
    * h2h3 [:path: /]
    * h2h3 [:scheme: https]
    * h2h3 [:authority: api.netdata.cloud]
    * h2h3 [user-agent: curl/7.83.1]
    * h2h3 [accept: */*]
    * Using Stream ID: 1 (easy handle 0x7f6e110bf1c0)
    > GET / HTTP/2
    > Host: api.netdata.cloud
    > user-agent: curl/7.83.1
    > accept: */*
    >
    * TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
    * TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
    * old SSL session ID is stale, removing
    < HTTP/2 404
    < date: Thu, 10 Nov 2022 17:01:25 GMT
    < content-length: 21
    < content-type: text/plain; charset=utf-8
    <
    * Connection #0 to host api.netdata.cloud left intact
    default backend - 404
    ```
  • result of dig +trace api.netdata.cloud

    hugo-pc:/mnt/c/Users/hugoj# dig +trace api.netdata.cloud
    
    ; <<>> DiG 9.16.33 <<>> +trace api.netdata.cloud
    ;; global options: +cmd
    ;; connection timed out; no servers could be reached
    
  • result of dig +trace @8.8.8.8 api.netdata.cloud
    image

  • content of /etc/resolv.conf on Netdata image
    image

  • content of /etc/resolv.conf on Ubuntu image
    image

then I remembered I had seen and done this on the Ubuntu https://askubuntu.com/questions/1403886/how-to-fix-wsl-domain-resolution
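
The curl trace above shows several resolved addresses timing out before one finally connects. A rough sketch of the same per-address check, assuming only Python's standard library, is below; the helper name `try_each_address` is made up for illustration and is not part of Netdata or the installer.

```python
import socket


def try_each_address(host: str, port: int = 443, timeout: float = 5.0):
    """Resolve host and attempt a TCP connect to every distinct address.

    Returns a list of (address, error) tuples; error is None on success.
    """
    results = []
    seen = set()
    for family, socktype, proto, _canon, sockaddr in socket.getaddrinfo(
        host, port, type=socket.SOCK_STREAM
    ):
        address = sockaddr[0]
        if address in seen:
            continue
        seen.add(address)
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
            results.append((address, None))
        except OSError as exc:  # includes timeouts and "Network unreachable"
            results.append((address, str(exc)))
    return results
```

Running `try_each_address("api.netdata.cloud")` inside the WSL image and on the Windows host, then comparing the two lists, would show whether the resolver hands out addresses that are unreachable only from inside WSL.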

@dfpr
Contributor

dfpr commented Nov 11, 2022

I can't reproduce your issue. The askubuntu article mentions issues when connecting a VPN, so I can't pinpoint the exact solution proposed there. I don't know why your Ubuntu points to Google DNS servers. If I put

[network]
generateResolvConf = false

in /etc/resolv.conf, DNS queries fail. Manually adding the Google DNS server fixes it, but at startup the resolv.conf file is deleted. WSL can be affected by a lot of issues, and putting them all in the README seems impractical.

@Ferroin
Member

Ferroin commented Nov 11, 2022

@dfpr That needs to go in /etc/wsl.conf, not /etc/resolv.conf.
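
For illustration only, here is a small sketch that reads the text of an /etc/wsl.conf and reports whether WSL is still allowed to generate /etc/resolv.conf. It assumes the file is plain INI as in the snippet quoted above and that the setting defaults to true when absent; the function name is hypothetical.

```python
import configparser


def resolv_conf_generation_enabled(wsl_conf_text: str) -> bool:
    """Return False only if [network] generateResolvConf is explicitly off."""
    parser = configparser.ConfigParser()
    parser.read_string(wsl_conf_text)
    # Option lookup is case-insensitive; fall back to True when the
    # section or the key is missing (assumed WSL default).
    return parser.getboolean("network", "generateResolvConf", fallback=True)
```

Passing the contents of /etc/wsl.conf from each image makes it easy to see which distro has opted out of the generated resolver configuration.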

@dfpr
Contributor

dfpr commented Nov 11, 2022

Sorry, typo. I did put the lines in the right place, but WSL deletes the file, and setting an immutable flag created a lot of issues for the docker host when building. I'll try after importing. Again, this is a WSL issue, not one coming from the installer.

@dfpr
Contributor

dfpr commented Nov 11, 2022

Latest commit should fix dns issue

@cakrit
Contributor

cakrit commented Dec 1, 2022

Now we can no longer access $(hostname).local @hugovalente-pm / @Ferroin
The result is that the whole purpose of the installation is broken, as we get no wmi metrics (host unreachable).

Whatever the network issues were, resolv.conf was NOT the cause. I deleted /etc/wsl.conf and /etc/resolv.conf, restarted wsl and can happily access api.netdata.cloud, app.netdata.cloud AND $(hostname).local. @dfpr please revert this change.

@dfpr
Contributor

dfpr commented Dec 1, 2022

I have reverted the change.

@cakrit
Contributor

cakrit commented Dec 5, 2022

@hugovalente-pm try the latest version and let's try and figure out why claiming doesn't work in your case, without changing DNS again. The installer should be left as is IMO.

@hugovalente-pm
Author

sure, will try a fresh install tomorrow

@hugovalente-pm
Author

@cakrit I tried a fresh install and got this error, which I thought meant the node wasn't claimed to Cloud (I tried it twice to make sure I hadn't miscopied the token). The command I ran:

msiexec.exe /i netdata.msi TOKEN=<claim-token> URL=https://app.netdata.cloud

Looking at the log file at c:\netdata.log I saw that it was claimed, so I restarted the agent and now see the node as Unseen

Connection attempt 1 successful
uv_pipe_connect(): no such file or directory
Make sure the netdata service is running.
The claim was successful but the agent could not be notified (0)- it requires a restart to connect to the cloud.
STARTING AGENT
ADDING NETDATA TO STARTUP

Looking at the error.log on the agent I get

image

Pinging api.netdata.cloud from the Linux image works OK; pinging app.netdata.cloud doesn't, but according to the logs ACLK is trying to connect to api.netdata.cloud

image

@dfpr
Contributor

dfpr commented Dec 6, 2022

I can't ping either app.netdata.cloud or api.netdata.cloud from my host, not inside wsl. I also tried an online ping webpage and it couldn't ping them as well.

@thiagoftsm

Hello @dfpr ,

Like you I cannot ping:

bash-5.2$ nslookup app.netdata.cloud
Server:         192.168.1.1
Address:        192.168.1.1#53

Non-authoritative answer:
Name:   app.netdata.cloud
Address: 54.198.178.11
Name:   app.netdata.cloud
Address: 44.196.50.41
Name:   app.netdata.cloud
Address: 44.207.131.212
app.netdata.cloud       canonical name = main-ingress-545609a41fcaf5d6.elb.us-east-1.amazonaws.com.

bash-5.2$ ping -c 1 app.netdata.cloud
PING app.netdata.cloud (44.207.131.212) 56(84) bytes of data.

--- app.netdata.cloud ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms

bash-5.2$ nslookup api.netdata.cloud
Server:         192.168.1.1
Address:        192.168.1.1#53

Non-authoritative answer:
Name:   api.netdata.cloud
Address: 54.198.178.11
Name:   api.netdata.cloud
Address: 44.196.50.41
Name:   api.netdata.cloud
Address: 44.207.131.212
api.netdata.cloud       canonical name = main-ingress-545609a41fcaf5d6.elb.us-east-1.amazonaws.com.

bash-5.2$ ping -c 1 api.netdata.cloud
PING api.netdata.cloud (44.207.131.212) 56(84) bytes of data.

--- api.netdata.cloud ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms

, but I can access host https://app.netdata.cloud. Can you at least access it?

@cakrit
Contributor

cakrit commented Dec 6, 2022

Don't try ping, try wget or curl. ICMP isn't necessarily enabled for servers nowadays.
@hugovalente-pm get Timo to help with the debugging. Get on a call with him and he'll figure out what's happening for sure.

@dfpr
Contributor

dfpr commented Dec 7, 2022

Quoting @thiagoftsm: "but I can access host https://app.netdata.cloud. Can you at least access it?"

HTTPS works for me.

@underhood

Quoting @cakrit: "Don't try ping, try wget or curl. ICMP isn't necessarily enabled for servers nowadays. @hugovalente-pm get Timo to help with the debugging. Get on a call with him and he'll figure out what's happening for sure."

Where it seems to fail is in netdata's connect_to_this_ip46. The question, of course, is why.

@hugovalente-pm
Author

With @underhood we were able to rule out DNS as the issue, since we got an IP resolution (whether it is the right one or not, we aren't sure). The issue was reproduced by emulating the HTTP call like this:
https://api.netdata.cloud/api/v1/env?v=[NETDATAVERSIONWITHOUTvINBEGGINING]&cap=proto,ctx&claim_id=[CLAIMIDHERE]
We weren't able to get a response from inside WSL, but we got one from my local computer.

@underhood will try to get his setup installed with WSL 2 (for some reason it got WSL 1) to further investigate this network issue, which could be a config problem or a bug in WSL 2

@underhood

underhood commented Dec 7, 2022

To summarize: the agent fails in connect_to_this_ip46 while attempting an HTTPS GET call as follows (replace things in [] with your data):
https://api.netdata.cloud/api/v1/env?v=[NETDATAVERSIONWITHOUTvINBEGGINING]&cap=proto,ctx&claim_id=[CLAIMIDHERE]

We tried wget https://api.netdata.cloud/api/v1/env?v=[NETDATAVERSIONWITHOUTvINBEGGINING]&cap=proto,ctx&claim_id=[CLAIMIDHERE] on the affected machine (as this is the exact thing the agent tries to do when that error appears) and it could not connect either. The same command works on other machines (gets a response from Cloud).

Therefore, as wget seems to have the same issue, I consider this to be a network configuration issue or a bug in WSL2, not something specific to netdata.

I also tried the msi installer in Win 11 in VirtualBox with WSL1 and cloud connection was working OK. I will try to figure out why WSL2 version was not used and try to see if WSL2 version will have the aforementioned issue.
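
As a sketch of how the URL described above could be assembled without hand-editing the placeholders, assuming only the path and query parameters quoted in this comment: the function name `build_env_url` and the sample values in the usage note are hypothetical.

```python
from urllib.parse import urlencode


def build_env_url(agent_version: str, claim_id: str,
                  base: str = "https://api.netdata.cloud") -> str:
    """Build the /api/v1/env URL that the agent is described as requesting."""
    query = urlencode(
        {
            "v": agent_version.lstrip("v"),  # version without the leading "v"
            "cap": "proto,ctx",
            "claim_id": claim_id,
        },
        safe=",",  # keep the comma in "proto,ctx" unescaped, as in the thread
    )
    return f"{base}/api/v1/env?{query}"
```

For example, `build_env_url("v1.37.0", "my-claim-id")` yields a URL of the same shape as the one above. When pasting such a URL into a shell for wget or curl, it should be wrapped in quotes, since an unquoted `&` is treated by the shell as a command separator.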

@cakrit
Contributor

cakrit commented Dec 12, 2022

This is all very strange. Were you able to duplicate elsewhere @underhood? I have a different windows machine (laptop) I can try too

@cakrit
Contributor

cakrit commented Dec 12, 2022

Never mind, on Windows 10 it says it couldn't install with WSL 2 and it's reverting to WSL 1. So I can't do the test.

@underhood

For some reason I can't make WSL2 work in a VM despite trying 100 things :/

@cakrit
Contributor

cakrit commented Dec 13, 2022

WSL2 in general doesn't work in VMs, the installer should default to WSL1
We have a closed issue on this.

I may have replicated the network issue on my WSL2 on a laptop I have with me. The PC at home had Win 11 and worked great, this one just doesn't want to work with app.netdata.cloud for some reason. I'll see if anything from above will help.

@hugovalente-pm
Author

@cakrit mine is WSL 2 and I bumped into this domain resolution fix https://askubuntu.com/questions/1403886/how-to-fix-wsl-domain-resolution

@cakrit
Contributor

cakrit commented Dec 13, 2022

Yes, that works, but there should be a way to properly resolve app.netdata.cloud without losing the capability to reach the windows host via $(hostname).local

From https://superuser.com/questions/1714002/wsl2-connect-to-host-without-disabling-the-windows-firewall I got the idea to exclude the interface from the windows firewall (see screenshot below) and that at least got rid of the message
** server can't find app.netdata.cloud: REFUSED

I now get the following:

DESKTOP-KQ81AL4:/mnt/c/Windows/system32# nslookup app.netdata.cloud
;; connection timed out; no servers could be reached

image

@cakrit
Contributor

cakrit commented Dec 13, 2022

Never mind, I tried to get the rest of it working and followed some instructions in https://gist.github.com/sivinnguyen/8bc0125b274250683a97e149cf270040
to run some PowerShell commands in admin mode and reboot. After the reboot I saw a new IP in resolv.conf, but the firewall is again blocking the connection, and doing the same thing (unchecking WSL from the protected network connections) makes no difference.

I give up. This is clearly a shoddy implementation that only works occasionally. I have no idea how to get both the name resolution to work AND to get a URL that will let us access the windows_exporter metrics from inside WSL. The moment we change resolv.conf, we lose access to the /metrics endpoint and I have no idea how we can get to it. I found somewhere that if you type ip route, then the via that appears is the IP you can use to reach the windows host, but it didn't work.

If you can find a solution gents, let me know, but it needs to both allow claiming and show the metrics, not just one or the other. At this point, I'm even considering hard-coding an IP in /etc/hosts as a workaround, which is basically the same as accepting defeat.
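
The `ip route` idea mentioned above can be sketched as a small parser that pulls the "via" address out of the default route. This is only an illustration of the lookup; as noted in the comment, the resulting address did not actually give access to the Windows host in this case. The function name and the sample addresses in the usage note are made up.

```python
def default_gateway(ip_route_output: str):
    """Return the address after 'default via' in `ip route` output, or None."""
    for line in ip_route_output.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[:2] == ["default", "via"]:
            return fields[2]
    return None
```

Given output such as `default via 172.24.160.1 dev eth0`, the function returns `172.24.160.1`, which could then be tried as the host for the windows_exporter /metrics endpoint.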
