Problems with Digital Ocean

Hi everyone,

For the last three weeks I have been having trouble with DigitalOcean. Roughly every four to five days, I get a call telling me that a server is not responding. Then I get another call about another server, almost simultaneously.

One server is at SFO2 and the other is at NYC3, and they have been there for more than a year now. The latest update was on Dec 18, to version 4.2.

The same thing happened today. I got a call, then another, almost at the same time. I tried to log in, but the application simply does not respond. I cannot SSH into the server, so I have to go to Droplets, select it, force it off, and turn it back on.

Is anyone else going through the same thing?

Thanks in advance.

Anton Tananaev 6 years ago

We have over 30 servers on DigitalOcean, and only once in over two years have we had an issue that required a server reboot. Are you sure your droplets have enough resources?

Ernesto Vallejo 6 years ago

Thanks Anton,
I would say I do. The servers have around 300 devices.

Check this screenshot:
https://drive.google.com/file/d/1BLmG82NFsRbsknFtiDThkcc08DTtvj1W/view?usp=sharing (Droplet Graphs)

Anton Tananaev 6 years ago

What about disk space?

Ernesto Vallejo 6 years ago

42GB out of 80GB

Anton Tananaev 6 years ago

Not sure what the issue is, then. I guess it's possible that there is some problem with the DO servers, but how it can happen at both locations is a mystery. I can tell you that we have servers at both SFO2 and NYC3 and we have never had any issues.

Ernesto Vallejo 6 years ago

I have had trouble with this one as well.
https://drive.google.com/file/d/1DQwqFM8qyH5XixSohvtE1hYfMG3XPw__/view?usp=sharing

Anton Tananaev 6 years ago

It's the same screenshot.

Ernesto Vallejo 6 years ago

I never had these problems before. It happened about three weeks ago and I let it go, but then it happened again, and again.
I raised a ticket, but by the time they replied I had already rebooted the servers, so they answered with a "let us know if it happens again..."
I'm pretty sure it's not an issue with the application, because I simply cannot SSH into the server when it happens.

Ernesto Vallejo 6 years ago

Sorry, I have edited the post with the correct link.

Pedro 6 years ago

Install netdata on a new droplet and configure it as the "master".
Install netdata on the affected servers and configure them as slaves.
When the problem shows up again, comb through the collected data and see if anything looks out of the ordinary.
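
One way to do that combing from a script rather than the dashboard is sketched below in Python. It assumes netdata's HTTP API is reachable at a base URL such as `http://localhost:19999` and that `/api/v1/data` with `format=json` returns an object with a `labels` list and a `data` list of rows (timestamp first). The chart name `system.cpu`, the threshold, and all function names are illustrative choices, not something from this thread, and the exact URL for a streamed node on the master is not covered here.

```python
import json
import urllib.parse
import urllib.request


def build_data_url(base_url, chart, seconds):
    """Build a netdata data query for the last `seconds` seconds of a chart."""
    query = urllib.parse.urlencode(
        {"chart": chart, "after": -seconds, "format": "json"}
    )
    return "{0}/api/v1/data?{1}".format(base_url.rstrip("/"), query)


def find_spikes(payload, threshold):
    """Return (timestamp, dimension, value) for every value above threshold.

    Assumes payload looks like {"labels": ["time", ...], "data": [[ts, ...]]}.
    """
    labels = payload["labels"]
    spikes = []
    for row in payload["data"]:
        timestamp = row[0]
        for label, value in zip(labels[1:], row[1:]):
            # Gaps in collection can show up as null values; skip them.
            if value is not None and value > threshold:
                spikes.append((timestamp, label, value))
    return spikes


def fetch_chart(base_url, chart, seconds=3600):
    """Download one chart's recent data from a netdata agent."""
    url = build_data_url(base_url, chart, seconds)
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.load(response)
```

For example, `find_spikes(fetch_chart("http://localhost:19999", "system.cpu"), 80)` would list the samples in the last hour where any CPU dimension went above 80%.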

Pedro 6 years ago

Also, look for the CPU steal time value in the data. This metric shows how much time a virtual CPU spends waiting for the real CPU to process data. If it is high, your VM is on a host where other VMs are eating all the CPU and leaving none for you.
But I suspect that's not the case here, because steal only slows the VM down; it does not kill it.
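
If netdata is not installed yet, steal time can also be checked by hand. The Python sketch below assumes a Linux guest, where the first line of `/proc/stat` lists cumulative CPU counters in the order user, nice, system, idle, iowait, irq, softirq, steal, guest, guest_nice. It takes two samples and reports the share of time that was stolen in between. The function names and the 5-second interval are illustrative.

```python
import time

# Position of "steal" among the counters on the aggregate "cpu" line:
# user nice system idle iowait irq softirq steal guest guest_nice
STEAL_INDEX = 7


def parse_cpu_line(line):
    """Return the counters from the aggregate 'cpu' line of /proc/stat."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(value) for value in parts[1:]]


def steal_percent(before, after):
    """Percentage of CPU time stolen between two counter samples."""
    deltas = [new - old for old, new in zip(before, after)]
    if len(deltas) <= STEAL_INDEX:
        return 0.0  # very old kernels do not report steal at all
    # guest and guest_nice are already counted inside user and nice,
    # so only the first eight fields are summed.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    return 100.0 * deltas[STEAL_INDEX] / total


def read_cpu_counters(path="/proc/stat"):
    """Read the current aggregate CPU counters."""
    with open(path) as handle:
        return parse_cpu_line(handle.readline())


def sample_steal(interval=5.0):
    """Measure the steal percentage over `interval` seconds."""
    before = read_cpu_counters()
    time.sleep(interval)
    return steal_percent(before, read_cpu_counters())
```

Calling `sample_steal()` on the droplet while it feels slow gives a quick reading; a value that stays in the double digits would point at a busy host rather than at the application.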

Ernesto Vallejo 6 years ago

Thanks Pedro for the info. I can report that, about a month later, no changes were needed and the servers are working normally. I have no additional info to share, but I will go with your recommendation of netdata and the master/slave scheme.

Thanks for your help.

Pedro 6 years ago

You are welcome. Netdata is easy to install and run, very light on resources, and will give you a real-time overview of your servers and services. You can configure the netdata master to send you notifications via Telegram or other supported instant messaging apps, so you know on the spot when something is going wrong. You can even write simple plugins to collect (and alert on) whatever metric you want, like the number of logged-in users or the number of messages arriving at your service.
The current master/slave implementation in netdata is actually a hack, but they are working on a proper way of doing it in the near (I think) future.
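
To give an idea of what such a simple plugin can look like, here is a minimal Python sketch of an external netdata plugin. It relies on netdata's plain-text plugin protocol (`CHART`, `DIMENSION`, `BEGIN`, `SET`, `END` lines written to stdout) and on netdata passing the update interval as the first argument. The chart id `example.logged_in_users`, the dimension name, and the use of the `who` command to count operating system login sessions are assumptions made for illustration; counting users of your own application would need a different data source.

```python
import subprocess
import sys
import time

CHART_ID = "example.logged_in_users"  # hypothetical chart id
DIMENSION_ID = "users"                # hypothetical dimension id


def chart_definition(update_every):
    """Lines that declare the chart and its single dimension to netdata."""
    return [
        "CHART {0} '' 'Logged in users' 'users' example '' line 90000 {1}".format(
            CHART_ID, update_every
        ),
        "DIMENSION {0} '' absolute 1 1".format(DIMENSION_ID),
    ]


def update_lines(value):
    """Lines that report one new value for the chart."""
    return [
        "BEGIN {0}".format(CHART_ID),
        "SET {0} = {1}".format(DIMENSION_ID, value),
        "END",
    ]


def count_logged_in_users():
    """Count login sessions listed by `who` (sessions, not unique users)."""
    result = subprocess.run(
        ["who"], capture_output=True, text=True, check=True
    )
    return sum(1 for line in result.stdout.splitlines() if line.strip())


def main():
    """Declare the chart once, then emit a value every update interval."""
    update_every = int(sys.argv[1]) if len(sys.argv) > 1 else 1
    print("\n".join(chart_definition(update_every)), flush=True)
    while True:
        print("\n".join(update_lines(count_logged_in_users())), flush=True)
        time.sleep(update_every)
```

A script like this would call `main()` at the bottom and be placed, as an executable, in netdata's custom plugins directory. Alerts on the resulting chart are configured separately in netdata's health configuration.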