Next step - Monitoring

SamuelMoraesF · June 17, 2015, 11:59pm

I tested all of the following services:

IMHO, Sysdig Cloud doesn’t work for us (we can’t pay for a service that doesn’t meet the basic requirements).

I liked the support teams of Server Density and Sysdig Cloud (they immediately sent an email offering help with the setup, and it doesn’t seem to be an automated message).

New Relic and Uptime Robot are free services, we can use these as secondary monitoring.

For now, the biggest problem is the high cost of the “complete” services, but Server Density seems to be the best solution for the price.

I don’t know what the next steps are (@majken, can you help me with this?).

Will we open a budget request? If so, I think I know how to write the request document.

Pad: https://communityit.etherpad.mozilla.org/monitoring

majken · June 18, 2015, 1:23am

@tanner can you set up a time to chat with Sam on his findings?

@SamuelMoraesF Thanks for following through on this!

majken · July 1, 2015, 4:13pm

Hey Sam, this is on our radar still. Hopefully now that people are back from Whistler someone will have better info to get back to you with.

tanner · August 12, 2015, 9:29pm

Looking at other services, I think that moving forward, our best option is going to be Nagios.

My reasoning behind this:

Nagios

Free other than server cost (using t2.micro @ $9.50/mo)
Basically everything (alerts, checks, etc.) is “If you can code it, Nagios can do it”
Able to tie into CloudWatch with plugins
Rate of change monitoring
Checks can run at any interval
Nagios is more or less the industry standard. Great documentation and plugins.
Self-hosted/maintained
Pain to configure checks
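
To make the “if you can code it” and “rate of change monitoring” points concrete, here is a minimal sketch of a Nagios-style check in Python. It relies only on the standard Nagios plugin convention (exit code 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN, plus one status line with optional performance data after a `|`). The function names, thresholds, and sample numbers are hypothetical, not taken from any existing plugin.

```python
# Exit codes from the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def rate_of_change(previous, current, seconds):
    """Change per second between two samples of a metric."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return (current - previous) / seconds


def evaluate(rate, warn, crit):
    """Map a rate to a status code (thresholds are hypothetical)."""
    if abs(rate) >= crit:
        return CRITICAL
    if abs(rate) >= warn:
        return WARNING
    return OK


def check(previous, current, seconds, warn, crit):
    """Return (exit_code, status_line) for one check run."""
    try:
        rate = rate_of_change(previous, current, seconds)
    except ValueError as exc:
        return UNKNOWN, f"RATE UNKNOWN - {exc}"
    status = evaluate(rate, warn, crit)
    # Text after '|' is performance data in the plugin output format.
    return status, f"RATE {LABELS[status]} - {rate:.2f}/s | rate={rate:.2f}"


# A real plugin would finish with something like:
#   code, line = check(previous, current, seconds, warn, crit)
#   print(line); sys.exit(code)
```

For example, `check(1000, 1600, 60, warn=5, crit=20)` (a hypothetical metric that grew by 600 units in 60 seconds) yields a WARNING. How the two samples are collected and stored between runs is left out here, and that is where most of the real work would be.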

Why not DataDog/ServerDensity?

I know that people keep saying that cost shouldn’t be a big factor, but I can’t get past it. Keeping in mind that we’re currently spending about $400 on EC2 servers per month, I cannot justify spending $150/mo (SD) or $350/mo (DD) on monitoring. None of the people who work on this are employed, so counting human capital in the price of Nagios doesn’t make sense to me. I’d consider the experience that we get from maintaining Nagios to be worth something, too.

If anybody else can think of a justification to pay this much, I’m open to it, but I can’t.
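
The cost argument above can be put in numbers with a quick sketch. The figures are the monthly amounts quoted in this post (the EC2 spend is “about $400”, so the results are approximate); the helper name `share_of_ec2` is just for illustration.

```python
# Monthly figures quoted in the post (USD); the EC2 spend is approximate.
EC2_SPEND = 400.00
OPTIONS = {
    "Nagios (t2.micro)": 9.50,
    "Server Density": 150.00,
    "DataDog": 350.00,
}


def share_of_ec2(monthly_cost, ec2_spend=EC2_SPEND):
    """Monitoring cost as a percentage of the current EC2 spend."""
    return 100.0 * monthly_cost / ec2_spend


for name, cost in OPTIONS.items():
    print(f"{name}: ${cost:.2f}/mo, {share_of_ec2(cost):.1f}% of EC2 spend")
```

On these numbers, Nagios adds a little over 2% to the monthly bill, Server Density about 38%, and DataDog about 88%. This counts only the hosting or subscription fee, not anyone’s time.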

mrz · August 12, 2015, 9:35pm

I’d caution against saying that Nagios is free other than compute costs. There’s a people cost that we shouldn’t ignore. And we should include what it takes to build a fault-tolerant Nagios.

Nagios/Sensu/Zabbix/Zenoss would also all check off those same bullet points.

But which match our requirements?

majken · August 12, 2015, 9:45pm

Right, there is a people cost, but at the same time our team is supposed to be providing learning opportunities and things for people to do. We do need to make sure that we will be able to recruit enough contributors to maintain this, but if we choose something that doesn’t require the right amount of human cost, then it means we’re not providing contribution opportunities. There’s a sweet spot in there.

mrz · August 12, 2015, 9:45pm

I think we’re saying the same thing.

We need to have sufficient volunteers/contributors to run and maintain whatever solution we pick.

majken · August 12, 2015, 9:56pm

Right, one question we were discussing on IRC is how much Mozilla still uses Nagios. Choosing Nagios might make it easier for us to find mentors within the org. Certainly it would make it easier than using a tool that Mozilla hasn’t ever used. I suppose we could try to test the waters on whether we could get mentorship before totally committing.

majken · August 19, 2015, 7:47pm

So, if people really want a paid solution, or if a paid solution would be best, even if we can’t afford it, we should still go through the exercise of understanding how much the solutions we’re looking at would cost, as well as figuring out how much budget we think we could justify if we could get it.

@tad - you were saying that quotes depend on the number of hosts we use, and that we don’t need to worry about monitoring for the hosted community sites, as that is on different infrastructure. How many hosts are you thinking we’re looking at?

tad · August 19, 2015, 7:52pm

Probably 20 hosts, because that’s how many we have on AWS.

majken · August 19, 2015, 7:57pm

It was also mentioned in today’s meeting that if we take a similar approach to the one we took with Discourse, it might be easier to find contributors - that is, pick a newer, up-and-coming tool that people might be excited about trying out.

I’m not necessarily siding with this argument, but it’s definitely worth thinking about. Could we get Ops mentors if we offered them the chance to play with and mentor people through using a new tool, not an old one? Could we get more contributors interested if we’re offering them a rarer skill, or the chance to gain expertise ahead of everyone else?

majken · August 19, 2015, 8:34pm

We should also talk about timelines. Do we have the time to wait for budget approval? How long do we have? Couldn’t we try to get budget approval and fall back on Nagios if that just isn’t working? Or could we get something basic set up with Nagios and either replace it or follow through with it depending on the outcome?