This repository has been archived by the owner on Sep 2, 2020. It is now read-only.

segment search regex leads to timeouts due to large return #246

Open
tehlers320 opened this issue Sep 23, 2016 · 16 comments

Comments

@tehlers320
Contributor

tehlers320 commented Sep 23, 2016

To reproduce
Grafana or graphite-api search:
prod.us-west-2.collectd_metrics.*.*.disk-*.disk_ops.write

Cyanite will query this:
SELECT * from segment WHERE pos = 8 AND segment LIKE 'prod.us-west-2.collectd_metrics.%' ALLOW FILTERING;

This query returns many metrics that we do not need, for example:

prod.us-west-2.collectd_metrics.caps-competition-api-web.9b95f0-ip-10-1-1-247.interface-eth0.if_octets

cqlsh -k metric -e "SELECT * from segment WHERE pos = 8 AND segment LIKE 'prod.us-west-2.collectd_metrics.%'  ALLOW FILTERING;"

<stdin>:1:errors={'127.0.0.1': 'Client request timeout. See Session.execute[_async](timeout)'}, last_host=127.0.0.1
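
(A debugging note, not a fix: cqlsh's client-side timeout can be raised so the oversized result at least comes back for inspection; the 120-second value below is just an example.)

# Workaround for inspection only: raise the request timeout (seconds) so the large result set can be returned.
cqlsh --request-timeout=120 -k metric -e "SELECT * from segment WHERE pos = 8 AND segment LIKE 'prod.us-west-2.collectd_metrics.%'  ALLOW FILTERING;"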

I tried to query for this, but it doesn't seem to be supported in Cassandra:
SELECT * from segment WHERE pos = 8 AND segment LIKE 'prod.us-west-2.collectd_metrics.%.disk_ops.write' ALLOW FILTERING;
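
(For context, and hedged on Cassandra's SASI LIKE semantics rather than anything Cyanite-specific: a LIKE pattern can only carry a single wildcard region, so the index can serve a prefix or a suffix match but not a pattern with literals on both sides of the %. A suffix query is expressible, assuming the index was built in a CONTAINS-capable mode, but the middle components would still have to be filtered client-side.)

-- Sketch only: suffix matching is the closest the index can get; the
-- 'prod.us-west-2.collectd_metrics.' prefix still has to be checked outside Cassandra.
SELECT * from segment WHERE pos = 8 AND segment LIKE '%.disk_ops.write' ALLOW FILTERING;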

Should Cyanite consider using tags? I wonder how much this would grow the segment table size.

tags { "prod", "us-west-2", "collectd_metrics", "cyanite-cassandra", "ip-10-1-1-23", "disk-xvdf", "disk_ops", "write" }

Then the query could be built from every '.'-separated component that is not a wildcard.

I did not test this...
SELECT * FROM segment WHERE tags CONTAINS 'prod' AND tags CONTAINS 'us-west-2' AND tags CONTAINS 'collectd_metrics' AND tags CONTAINS 'disk_ops' AND tags CONTAINS 'write';
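
For illustration only, the schema change that query assumes might look roughly like the sketch below; the column and index names are made up, and (as the next comment points out) Cassandra would still pick one index and post-filter the other CONTAINS restrictions.

-- Hypothetical sketch, not Cyanite's actual schema: keep the path
-- components in a set<text> and index the set values so CONTAINS works.
ALTER TABLE segment ADD tags set<text>;
CREATE INDEX segment_tags_idx ON segment (tags);

-- Combining several CONTAINS restrictions still needs ALLOW FILTERING;
-- Cassandra queries one index and filters the remaining predicates row by row.
SELECT * FROM segment
 WHERE tags CONTAINS 'collectd_metrics'
   AND tags CONTAINS 'disk_ops'
   AND tags CONTAINS 'write'
 ALLOW FILTERING;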

@ifesdjeen
Collaborator

I'm afraid the query you mention will work even worse than the current implementation. What Cassandra does is pick the most selective index, query it, and filter out the rest of the results, even if more indexes are available.

Although after checking your query I've noticed two things:

  1. We can use a different tokenizer. Currently we're using one that splits results letter by letter; since we do not require that level of detail, we can use a tokenizer that splits on words instead, which may result in smaller trees and better traversal (see the sketch after this list).
  2. We can do a CONTAINS query and do less post-filtering in the case you indicated.
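
As an aside (an illustration only, not the tokeniser prototyped in #249): the same idea exists at the Cassandra level, where a SASI index (which LIKE restrictions on a non-key column require) can be built with an analyzer that splits values into word-like terms instead of indexing them character by character. The index name is assumed here; the keyspace is the 'metric' keyspace from the report above.

-- Illustrative sketch only: a word-splitting analyzer turns 'prod',
-- 'disk_ops', 'write', ... into individual index terms.
CREATE CUSTOM INDEX segment_segment_idx ON metric.segment (segment)
USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = {
  'mode': 'CONTAINS',
  'analyzed': 'true',
  'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer'
};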

I'm going to start with (2) right after I'm done with #244

@ifesdjeen
Collaborator

I've tested an alternative tokeniser and I have good news: we most likely will be able to (yet again) significantly improve the performance.

I'll still have to modify it to support disk-*-style queries; right now it only supports full segment skips (like *), but in general it turns out we can still improve.

@ifesdjeen
Collaborator

Tokeniser impl (prototype) can be found in #249.

@tehlers320
Contributor Author

I pulled #248 and #249 into our test environment and truncated my segment and metric tables. Searching the tree seems snappy (but it always does when I truncate), so I will let it re-populate over a day or so. I am not able to retrieve metrics, though.

our "query" host is spamming this:

ERROR [2016-10-03 21:26:35,953] epollEventLoopGroup-3-1 - io.cyanite.api could not process request
clojure.lang.ArityException: Wrong number of args (2) passed to: index/fn--6086/G--6068--6095

@ifesdjeen
Collaborator

@tehlers320 do you have a full stack trace?..

@ifesdjeen
Collaborator

@tehlers320 I've found the reason and pushed the fix to #248. Was incorrect arity usage from my side...

@jacobrichard

jacobrichard commented Oct 7, 2016

I pulled in the #248 fix and pushed it into our environment. As @tehlers320 reported, the tree is snappy but now I'm seeing a different error when retrieving metrics.

ERROR [2016-10-07 18:12:30,789] epollEventLoopGroup-3-1 - io.cyanite.api could not process request
java.lang.IllegalArgumentException: Don't know how to create ISeq from: clojure.core$partial$fn__4759
    at clojure.lang.RT.seqFrom(RT.java:542) ~[cyanite-0.5.1-standalone.jar:na]
    at clojure.lang.RT.seq(RT.java:523) ~[cyanite-0.5.1-standalone.jar:na]
    at clojure.core$seq__4357.invokeStatic(core.clj:137) ~[cyanite-0.5.1-standalone.jar:na]
    at clojure.core$map$fn__4785.invoke(core.clj:2637) ~[cyanite-0.5.1-standalone.jar:na]
    at clojure.lang.LazySeq.sval(LazySeq.java:40) ~[cyanite-0.5.1-standalone.jar:na]
    at clojure.lang.LazySeq.seq(LazySeq.java:49) ~[cyanite-0.5.1-standalone.jar:na]
    at clojure.lang.RT.seq(RT.java:521) ~[cyanite-0.5.1-standalone.jar:na]
    at clojure.core$seq__4357.invokeStatic(core.clj:137) ~[cyanite-0.5.1-standalone.jar:na]
    at clojure.core$apply.invokeStatic(core.clj:641) ~[cyanite-0.5.1-standalone.jar:na]
    at clojure.core$mapcat.invokeStatic(core.clj:2674) ~[cyanite-0.5.1-standalone.jar:na]
    at io.cyanite.api$maybe_multiplex.invokeStatic(api.clj:109) ~[cyanite-0.5.1-standalone.jar:na]
    at io.cyanite.api$fn__7657.invokeStatic(api.clj:151) ~[cyanite-0.5.1-standalone.jar:na]
    at io.cyanite.api$fn__7657.invoke(api.clj:145) ~[cyanite-0.5.1-standalone.jar:na]
    at clojure.lang.MultiFn.invoke(MultiFn.java:229) ~[cyanite-0.5.1-standalone.jar:na]
    at io.cyanite.api$process.invokeStatic(api.clj:89) ~[cyanite-0.5.1-standalone.jar:na]
    at io.cyanite.api$make_handler$fn__7666.invoke(api.clj:167) [cyanite-0.5.1-standalone.jar:na]
    at io.cyanite.http$request_handler$fn__1910.invoke(http.clj:110) [cyanite-0.5.1-standalone.jar:na]
    at io.cyanite.http$netty_handler$fn__1918.invoke(http.clj:125) [cyanite-0.5.1-standalone.jar:na]
    at io.cyanite.http.proxy$io.netty.channel.ChannelInboundHandlerAdapter$ff19274a.channelRead(Unknown Source) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:307) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:293) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:307) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:293) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.fireChannelRead(CombinedChannelDuplexHandler.java:428) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:276) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:263) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.channel.CombinedChannelDuplexHandler.channelRead(CombinedChannelDuplexHandler.java:243) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:307) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:293) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:840) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:830) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:348) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:264) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:112) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137) [cyanite-0.5.1-standalone.jar:na]

@ifesdjeen
Collaborator

@jacobrichard thanks for catching this one. Could you tell me what kind of queries you are running?

@jacobrichard

jacobrichard commented Oct 7, 2016

This is just a metric retrieval from graphite-web. Specifically for one of the internal reporter metrics for cyanite:

cyanite.us-west-2.$hostname.cyanite.ingestq.events.count

I redacted the hostname (since it was an IP), but that's the path to the metric from graphite-web.

@ifesdjeen
Collaborator

ifesdjeen commented Oct 7, 2016

@jacobrichard @tehlers320 right, it was my bad. My usual testing path did not include graphite-web (until now); I'll test more thoroughly with graphite-web from now on. It's fixed and force-pushed to #248.

For #248 I would not expect changes in performance yet (this is a job for #249), but I hope to finish both of them over this weekend. #248 only exposes _min, _max and other metrics (to close #244).

@tehlers320
Contributor Author

Out of curiosity, would the ElasticSearch index have had this problem as well?

@ifesdjeen
Collaborator

Yes, this was only because we've included these names; it's only because of these "fake" metrics. For Grafana, you would expect the metrics to pop up in autocomplete. For name expansion via graphite-web you don't, and that leads to the trouble...

@tehlers320
Contributor Author

Are we talking about the same issue? I mean the timeout issue due to the table being too big.

@ifesdjeen
Collaborator

You're right. It all boils down to what kind of wildcard is supported: if we can only query by prefix and/or suffix, it'll still be the same...

Technically, we could have a Cyanite node-local index, but then we'd have synchronisation and/or update problems...

@tehlers320
Contributor Author

tehlers320 commented Nov 11, 2016

It's really Docker that makes this unmanageable, I think. Our tree just grows infinitely, since the containers change so often and create new subtrees over and over.

stats.gauges.foo-app.ads1239adsfz
stats.gauges.foo-app.shadsf1239ad
stats.gauges.foo-app.89asdf39adsf

The hash comes from the server name inside the container. We are kicking around an idea to have the host come online, "register" with something, and then change the name to "server001", "server002", etc. based on how many of the same nodes are up. I wonder if anybody has solved this problem already. Even with InfluxDB, over time you would have millions of hostname tags.

@ifesdjeen
Collaborator

After a lot of back-and-forth, I've figured out how to use the tokeniser for better and faster queries and more lightweight trees in #256.
