This repository has been archived by the owner on Oct 7, 2021. It is now read-only.

State of ToDD Going in to 2018 #161

Open
Mierdin opened this issue Nov 24, 2017 · 1 comment

Mierdin commented Nov 24, 2017

Since joining StackStorm last year, my open source energies have been fully directed into that project, so ToDD has suffered from a bit of neglect. I still like tinkering around with various things and have done a few demos with ToDD since then, but the reality is that things are just collecting dust here.

I've learned not to make promises about when I'll get around to doing something, so the below won't contain ETAs. Still, I figured it was high time I documented all the things in my head that I feel could use some love in ToDD, in order to push it to the next stage. Many of these come from lessons I've learned recently; others are things I've always known needed to be addressed.

Anyways, for those still watching this project, I'd like to express my remorse for letting things stagnate for so long. I do still care about it, and I'm hoping to get enough momentum to start tackling the mountain of work that I feel ToDD deserves. I still think it's a neat project idea, and with some of these changes, could be useful to a lot more people. So, here are a few shorter-term changes that I've been thinking of making lately, so we can get to the cooler stuff long-term.

-- Matt

Changes to Project Messaging and Description

The messaging around ToDD is ambiguous. Those who approach from the neteng side of things tend to "get it" pretty quickly, but those with more of a software background assume I'm talking about CI/CD-style testing. So we need to be clear: ToDD is a network testing tool.

I also want to put some thought into the names of the components. "Agent" is a pretty generic term and, for some, carries negative baggage. I might consider renaming it to "sensor" to more accurately describe its role. I'm on the fence here, since "sensor" is more of a passive term and ToDD agents actively run tests, etc.

This will involve amending documentation to clarify use cases, and probably the creation of some additional materials for quickly getting a ToDD PoC spun up.

Proper API

The current API implementation is a joke. It's a bunch of http.HandleFunc calls, with no versioning, no standardized spec, and no framework. I started tinkering with Goa after hearing about it on Go Time FM, so it's high time I finish that specification. I have a few other pet projects that could use some ToDD integration, but I don't feel comfortable doing that until the API is done like an adult.
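As a rough illustration of the versioning part only, here is a minimal sketch using nothing but the standard library. It is not Goa and not ToDD's actual API: the `/v1` prefix, the `Group` type, and the `newAPI` and `get` helpers are all assumptions made up for this example.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// Group is a hypothetical, simplified stand-in for a ToDD group.
type Group struct {
	Name string `json:"name"`
}

// newAPI mounts every route under a /v1 prefix so that a future /v2 could
// coexist with it. The route and payload are illustrative only.
func newAPI(groups []Group) http.Handler {
	v1 := http.NewServeMux()
	v1.HandleFunc("/groups", func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodGet {
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(groups)
	})

	root := http.NewServeMux()
	root.Handle("/v1/", http.StripPrefix("/v1", v1))
	return root
}

// get issues an in-memory GET against h and returns the status and body.
func get(h http.Handler, path string) (int, string) {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code, rec.Body.String()
}

func main() {
	api := newAPI([]Group{{Name: "datacenter"}})
	code, body := get(api, "/v1/groups")
	fmt.Printf("%d %s", code, body)
}
```

Requests that omit the version prefix (for example `/groups`) fall through to the root mux and get a 404, which forces clients to say which version they expect.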

Another part of re-thinking the API is re-thinking the user-facing object model, like groups and testruns. The current UX around this is icky, and needs to be much more consistent.

Re-Evaluation of External Dependencies

Those that have installed and run ToDD know that there are a number of external dependencies required for ToDD to run.

In each case, existing software was selected for many reasons, not the least of which was that I didn't wish to re-invent the wheel. I wanted to get to a quick prototype without having to write my own database or messaging system. However, now that the idea of ToDD has an actual implementation (fraught with problems though it may be), it's useful to return to these subjects individually. So, I'd like to provide some thoughts for each component. In some places, I'm already convinced that a certain component can be removed; in others, more thought is needed. I'm just going to document where my head is at today.

State Database

Currently, todd-server stores information about ongoing testruns in a database - the current implementation supports etcd. This is done so that ToDD doesn't have to maintain this information in memory. In essence, incoming agent data is simply translated and sent to etcd where it resides until the entire testrun is finished. In fact, in order to know that all the agents have reported in, the ToDD server must periodically poll etcd to ensure the agents have provided test data.

There's no need for testruns to be stored persistently outside the context of a proper TSDB (which we'll get to in a second). So long-term etcd doesn't really serve a purpose, and the short term benefits it provides can and should be replaced with a goroutine that stores this information temporarily until TSDB offload occurs. So, low-hanging fruit will be to rethink the way testruns are tracked and managed so they can all be done within todd-server. This will result in ToDD no longer needing etcd or any other external database for testrun tracking.
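To make the idea concrete, here is a minimal sketch of a goroutine that tracks one testrun in memory and signals completion when every expected agent has reported, with no polling. The `report` type, the `track` function, and the metric names are hypothetical and do not reflect todd-server's current code.

```go
package main

import "fmt"

// report is a hypothetical message sent by an agent when it finishes its
// part of a testrun.
type report struct {
	agent   string
	metrics map[string]float64
}

// track collects one report per expected agent in a goroutine and delivers
// the combined results on the returned channel once every agent has reported
// (or the reports channel is closed). Nothing is polled or stored externally.
func track(expected []string, reports <-chan report) <-chan map[string]map[string]float64 {
	done := make(chan map[string]map[string]float64, 1)
	go func() {
		pending := make(map[string]bool, len(expected))
		for _, a := range expected {
			pending[a] = true
		}
		results := make(map[string]map[string]float64, len(expected))
		for len(pending) > 0 {
			r, ok := <-reports
			if !ok {
				break // sender gave up; return what we have
			}
			if !pending[r.agent] {
				continue // unknown agent or duplicate report
			}
			delete(pending, r.agent)
			results[r.agent] = r.metrics
		}
		done <- results
	}()
	return done
}

func main() {
	reports := make(chan report, 2)
	reports <- report{agent: "agent-1", metrics: map[string]float64{"avg_rtt_ms": 12.5}}
	reports <- report{agent: "agent-2", metrics: map[string]float64{"avg_rtt_ms": 48.0}}
	results := <-track([]string{"agent-1", "agent-2"}, reports)
	fmt.Println(len(results), "agents reported")
}
```

A real version would also need a timeout for agents that never report; that is left out here to keep the sketch short.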

The testrun and group definitions are also currently stored in etcd. Again, this was just a useful place to stash them, and I think I can come up with a solution that doesn't require a separate database, even if they're just cached on the server filesystem.

Another thing to consider is the UX around manual testruns. It's quite useful to be able to execute a testrun and get instant feedback when it finishes. So whatever we do in-memory should be able to support the retrieval of testrun data after it finishes. This should obviously be limited somehow to the last N testruns to keep the memory footprint in check, and of course, when the server shuts down, the expectation is that this data will be lost.
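One possible shape for the "last N testruns" idea is a small in-memory store that evicts the oldest entry once a limit is exceeded. This is only a sketch: the `history` type, its methods, and the use of a plain string for the result payload are assumptions for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// history keeps the results of at most limit finished testruns in memory.
// Everything here is lost when the process exits, by design.
type history struct {
	mu    sync.Mutex
	limit int
	order []string          // testrun IDs, oldest first
	data  map[string]string // testrun ID -> result payload (simplified)
}

func newHistory(limit int) *history {
	return &history{limit: limit, data: make(map[string]string)}
}

// put stores a result and evicts the oldest testruns beyond the limit.
func (h *history) put(id, result string) {
	h.mu.Lock()
	defer h.mu.Unlock()
	if _, exists := h.data[id]; !exists {
		h.order = append(h.order, id)
	}
	h.data[id] = result
	for len(h.order) > h.limit {
		delete(h.data, h.order[0])
		h.order = h.order[1:]
	}
}

// get returns the stored result for a testrun, if it is still retained.
func (h *history) get(id string) (string, bool) {
	h.mu.Lock()
	defer h.mu.Unlock()
	result, ok := h.data[id]
	return result, ok
}

func main() {
	h := newHistory(2)
	h.put("run-1", "ok")
	h.put("run-2", "ok")
	h.put("run-3", "degraded")
	_, stillThere := h.get("run-1")
	fmt.Println("run-1 retained:", stillThere)
}
```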

Helpful resources:

Agent Communications

Similar to the previous point about etcd, rabbitmq was selected as a message broker so that I didn't have to write my own messaging logic into ToDD for server to agent communications. And just like etcd, there's a lot about the way I'm using RMQ that doesn't really require it.

Unlike the previous point, I'm not wholly convinced that removing RMQ is a priority, or even totally needed. I'd like to do some of the lower-hanging fruit first, like removing the need for etcd, adding a proper API, and adding release mechanisms. However, once those are done - especially a brand new API - this could get re-evaluated. Maybe the API can evolve so that agents simply use the server REST API like the todd client or other 3rd party software does.

I have looked at a few alternatives, but whatever I go with (if anything) as an alternative to the current approach needs to be native Go, and needs to provide a high enough level of abstraction that I'm not rewriting a whole layer of the stack myself. The goal wouldn't be to write my own message queue, but rather to reduce the operational burden on users of maintaining a separate message broker like RMQ in addition to ToDD.

This will require more research and experimentation before I change anything here.

Some potentially helpful resources:

Time-Series Database

Right now, ToDD offloads test metrics to a TSDB when testruns finish. Note that this doesn't have anything to do with testrun coordination, which as mentioned previously is provided by etcd.

The value of ToDD is not to store and allow retrieval of time-series data, but rather to generate it. So it makes a lot of sense for ToDD to send data to a TSDB and forget about it. So I have no plans to remove this requirement as of now.

That's not to say improvements can't be made. If we do end up replacing etcd with an in-memory system, the TSDB component should be intelligent enough to know which datasets stored in memory have been uploaded to TSDB, and make sure the two are in sync when possible.

Establishing Release Cadence and Packaging

Most of the other changes in this issue are things that have been on my mind for a while. I also felt like I had to fix all these problems before ToDD was "worthy" of an official release, with proper packaging, etc. That's pretty much the only reason I haven't actually "released" ToDD. And obviously that "boil the ocean" approach just doesn't scale. I shouldn't be waiting to create releases until ToDD has solved world hunger.

As a result, the only way to install ToDD is to clone the repo and build it with the Go toolchain (i.e. compile from source). That's fine if you're familiar with Go, but my target audience is neteng/sysadmin types, and it's not fair for that to be the only option.

So, I'm planning on - in parallel to everything else - building packaging and release artifacts to fully automate the process of creating an official ToDD release. The goal of this is to make it so easy to release a version of ToDD that I don't have any excuses for not doing it. The result of this will be that it will be way easier to get started with ToDD.

That said, ToDD will remain in "alpha" quality status until at least the vast majority of these issues are solved. In the meantime, though, there's no reason to not start providing easier installation options like DEB/RPM. So that's the plan.

Managing Goroutines Better

Many of these other changes will put even more pressure on ToDD to manage concurrent tasks well. I'm sure there's a better way than a bunch of go function() statements in main.go (which is more or less what I'm doing now).

I've seen whispers of better ways of doing things, but nothing concrete yet. I just instinctively know that I'm not managing my goroutines well - certainly not the way that channels are created and consumed. So I need to look into best practices and maybe frameworks to help manage this parallel work in a way that scales.

Longer-Term Goals

I figure it would be worth putting in one last section to quickly touch on some of the very long-term goals I have in mind. In many ways, the aforementioned tasks are there to provide a better foundation for some of the really cool things I've always wanted to build in ToDD.

Built-in, Basic Test Assertion and Event System

While it's possible to get all testrun data into a TSDB and write some software to do some kind of statistical analysis on this, notifying when certain deviations occur, it would also be great to have something built-in to ToDD that's simple and fast, which would cut down on the need for a full external system to do basic notifications.

So we should consider building a basic system for describing certain expectations for testrun data, and allow clients to subscribe to a notification stream when expectations aren't met. It should also be able to send webhooks or something similar.
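A minimal sketch of what "describing expectations" could mean: each expectation names a metric and an allowed range, and a check returns the violations that a notification stream or webhook would then carry. The `expectation` and `violation` types, the `check` function, and the metric names are invented for this example.

```go
package main

import "fmt"

// expectation is a hypothetical rule: the named metric must be present and
// fall within [min, max].
type expectation struct {
	metric   string
	min, max float64
}

// violation describes one unmet expectation.
type violation struct {
	metric string
	value  float64
	reason string
}

// check compares testrun metrics against expectations and returns every
// violation found, in the order the expectations were given.
func check(exps []expectation, metrics map[string]float64) []violation {
	var out []violation
	for _, e := range exps {
		v, ok := metrics[e.metric]
		switch {
		case !ok:
			out = append(out, violation{metric: e.metric, reason: "missing"})
		case v < e.min:
			out = append(out, violation{metric: e.metric, value: v, reason: "below minimum"})
		case v > e.max:
			out = append(out, violation{metric: e.metric, value: v, reason: "above maximum"})
		}
	}
	return out
}

func main() {
	exps := []expectation{
		{metric: "latency_ms", min: 0, max: 50},
		{metric: "loss_pct", min: 0, max: 1},
	}
	metrics := map[string]float64{"latency_ms": 80, "loss_pct": 0.5}

	// A buffered channel stands in for a subscriber's notification stream.
	stream := make(chan violation, len(exps))
	for _, v := range check(exps, metrics) {
		stream <- v
	}
	close(stream)
	for v := range stream {
		fmt.Printf("%s: %s (%.1f)\n", v.metric, v.reason, v.value)
	}
}
```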

Test Scheduling

I have always wanted ToDD to be way more autonomous than it is. The current "manual" approach to executing tests is useful enough, and I have no intention of removing this, but I have always wanted to create a second layer on top of this that schedules testruns using some kind of declarative instruction like "I care about HTTP traffic between these two sites - schedule tests accordingly so I get good data about this".

Hand-in-hand with this would be a framework for either making manual assertions about testrun data (i.e. low-water and high-water marks), or (much more advanced) an automated baseline deviation mechanism. In either case, deviations from expected norms would trigger notifications to some external system.

Representing Testrun Data as Graph

This sort of goes hand-in-hand with the previous long-term goal. Basically I'd like to be able to represent the testrun data as a combination of time-series and graph. Each node in the graph would represent a particular ToDD group, and each edge would represent metrics between those nodes generated by testruns between them. This would be useful for quickly answering questions like "I wonder how HTTP traffic between my two sites was performing yesterday?"

GUI

Ah yes, the GUI. Not much to say here, just a nice-to-have for those that don't want to deal with the ToDD CLI or the API. No idea on the scope of this, though it should be noted that influx/grafana is able to do a lot in terms of representing testrun data today, so duplicating this should be avoided.

This may also be useful for showing the aforementioned graph data. While the data itself should still come from a TSDB, the GUI could provide an easy way of showing the graph of ToDD groups.

Mierdin commented Oct 13, 2018

I have created https://github.com/orgs/toddproject/projects/1 to track these and other issues that I would like to address in order to move ToDD from "fun science project" to something worthy of being deployed in production.
