This repository has been archived by the owner on Sep 2, 2020. It is now read-only.

segment search regex leads to timeouts due to large return #246

Open
tehlers320 opened this issue Sep 23, 2016 · 16 comments

Comments

@tehlers320
Contributor

tehlers320 commented Sep 23, 2016

To reproduce
Grafana or graphite-api search:
prod.us-west-2.collectd_metrics.*.*.disk-*.disk_ops.write

Cyanite will query this:
SELECT * from segment WHERE pos = 8 AND segment LIKE 'prod.us-west-2.collectd_metrics.%' ALLOW FILTERING;

This query returns many metrics that we do not need, for example:

prod.us-west-2.collectd_metrics.caps-competition-api-web.9b95f0-ip-10-1-1-247.interface-eth0.if_octets

cqlsh -k metric -e "SELECT * from segment WHERE pos = 8 AND segment LIKE 'prod.us-west-2.collectd_metrics.%'  ALLOW FILTERING;"

<stdin>:1:errors={'127.0.0.1': 'Client request timeout. See Session.execute[_async](timeout)'}, last_host=127.0.0.1
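
(A debugging note, not a fix: cqlsh's client-side timeout can be raised so the oversized result at least comes back for inspection; the 120-second value below is just an example.)

# Workaround for inspection only: raise the request timeout (seconds) so the large result set can be returned.
cqlsh --request-timeout=120 -k metric -e "SELECT * from segment WHERE pos = 8 AND segment LIKE 'prod.us-west-2.collectd_metrics.%'  ALLOW FILTERING;"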

I tried to query for this, but it doesn't seem to be supported in Cassandra:
SELECT * from segment WHERE pos = 8 AND segment LIKE 'prod.us-west-2.collectd_metrics.%.disk_ops.write' ALLOW FILTERING;
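
(For context, and hedged on Cassandra's SASI LIKE semantics rather than anything Cyanite-specific: a LIKE pattern can only carry a single wildcard region, so the index can serve a prefix or a suffix match but not a pattern with literals on both sides of the %. A suffix query is expressible, assuming the index was built in a CONTAINS-capable mode, but the middle components would still have to be filtered client-side.)

-- Sketch only: suffix matching is the closest the index can get; the
-- 'prod.us-west-2.collectd_metrics.' prefix still has to be checked outside Cassandra.
SELECT * from segment WHERE pos = 8 AND segment LIKE '%.disk_ops.write' ALLOW FILTERING;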

Should Cyanite consider using tags? I wonder how much this would grow the segment table size.

tags { "prod", "us-west-2", "collectd_metrics", "cyanite-cassandra", "ip-10-1-1-23", "disk-xvdf", "disk_ops", "write" }

Then the query could be built from every '.'-separated component that is not a wildcard.

I did not test this...
SELECT * FROM segment WHERE tags CONTAINS 'prod' AND tags CONTAINS 'us-west-2' AND tags CONTAINS 'collectd_metrics' AND tags CONTAINS 'disk_ops' AND tags CONTAINS 'write';
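
For illustration only, the schema change that query assumes might look roughly like the sketch below; the column and index names are made up, and (as the next comment points out) Cassandra would still pick one index and post-filter the other CONTAINS restrictions.

-- Hypothetical sketch, not Cyanite's actual schema: keep the path
-- components in a set<text> and index the set values so CONTAINS works.
ALTER TABLE segment ADD tags set<text>;
CREATE INDEX segment_tags_idx ON segment (tags);

-- Combining several CONTAINS restrictions still needs ALLOW FILTERING;
-- Cassandra queries one index and filters the remaining predicates row by row.
SELECT * FROM segment
 WHERE tags CONTAINS 'collectd_metrics'
   AND tags CONTAINS 'disk_ops'
   AND tags CONTAINS 'write'
 ALLOW FILTERING;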

@ifesdjeen
Collaborator

I'm afraid the query you mention will work even worse than the current implementation. What Cassandra does is pick the most selective index, query it, and filter out the rest of the results, even if more indexes are available.

Although after checking your query I've noticed two things:

  1. We can use a different tokenizer. Currently we're using one that splits results letter by letter; since we do not require that level of detail, we can use a tokenizer that splits on words instead, which may result in smaller trees and better traversal (see the sketch after this list).
  2. We can do a CONTAINS query and do less post-filtering in the case you indicated.
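
As an aside (an illustration only, not the tokeniser prototyped in #249): the same idea exists at the Cassandra level, where a SASI index (which LIKE restrictions on a non-key column require) can be built with an analyzer that splits values into word-like terms instead of indexing them character by character. The index name is assumed here; the keyspace is the 'metric' keyspace from the report above.

-- Illustrative sketch only: a word-splitting analyzer turns 'prod',
-- 'disk_ops', 'write', ... into individual index terms.
CREATE CUSTOM INDEX segment_segment_idx ON metric.segment (segment)
USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = {
  'mode': 'CONTAINS',
  'analyzed': 'true',
  'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer'
};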

I'm going to start with (2) right after I'm done with #244

@ifesdjeen
Collaborator

I've tested an alternative tokeniser and I have good news: we most likely will be able to (yet again) significantly improve the performance.

I'll still have to modify it to support disk-*-style queries; right now it only supports full segment skips (like *), but in general it turns out we can still improve.

@ifesdjeen
Collaborator

Tokeniser impl (prototype) can be found in #249.

@tehlers320
Contributor Author

I pulled #248 and #249 into our test environment and truncated my segment and metric tables. Searching the tree seems snappy (but it always does when I truncate), so I will let it re-populate over a day or so. I am not able to retrieve metrics, though.

our "query" host is spamming this:

ERROR [2016-10-03 21:26:35,953] epollEventLoopGroup-3-1 - io.cyanite.api could not process request
clojure.lang.ArityException: Wrong number of args (2) passed to: index/fn--6086/G--6068--6095

@ifesdjeen
Collaborator

@tehlers320 do you have a full stack trace?..

@ifesdjeen
Collaborator

@tehlers320 I've found the reason and pushed the fix to #248. Was incorrect arity usage from my side...

@jacobrichard

jacobrichard commented Oct 7, 2016

I pulled in the #248 fix and pushed it into our environment. As @tehlers320 reported, the tree is snappy but now I'm seeing a different error when retrieving metrics.

ERROR [2016-10-07 18:12:30,789] epollEventLoopGroup-3-1 - io.cyanite.api could not process request
java.lang.IllegalArgumentException: Don't know how to create ISeq from: clojure.core$partial$fn__4759
    at clojure.lang.RT.seqFrom(RT.java:542) ~[cyanite-0.5.1-standalone.jar:na]
    at clojure.lang.RT.seq(RT.java:523) ~[cyanite-0.5.1-standalone.jar:na]
    at clojure.core$seq__4357.invokeStatic(core.clj:137) ~[cyanite-0.5.1-standalone.jar:na]
    at clojure.core$map$fn__4785.invoke(core.clj:2637) ~[cyanite-0.5.1-standalone.jar:na]
    at clojure.lang.LazySeq.sval(LazySeq.java:40) ~[cyanite-0.5.1-standalone.jar:na]
    at clojure.lang.LazySeq.seq(LazySeq.java:49) ~[cyanite-0.5.1-standalone.jar:na]
    at clojure.lang.RT.seq(RT.java:521) ~[cyanite-0.5.1-standalone.jar:na]
    at clojure.core$seq__4357.invokeStatic(core.clj:137) ~[cyanite-0.5.1-standalone.jar:na]
    at clojure.core$apply.invokeStatic(core.clj:641) ~[cyanite-0.5.1-standalone.jar:na]
    at clojure.core$mapcat.invokeStatic(core.clj:2674) ~[cyanite-0.5.1-standalone.jar:na]
    at io.cyanite.api$maybe_multiplex.invokeStatic(api.clj:109) ~[cyanite-0.5.1-standalone.jar:na]
    at io.cyanite.api$fn__7657.invokeStatic(api.clj:151) ~[cyanite-0.5.1-standalone.jar:na]
    at io.cyanite.api$fn__7657.invoke(api.clj:145) ~[cyanite-0.5.1-standalone.jar:na]
    at clojure.lang.MultiFn.invoke(MultiFn.java:229) ~[cyanite-0.5.1-standalone.jar:na]
    at io.cyanite.api$process.invokeStatic(api.clj:89) ~[cyanite-0.5.1-standalone.jar:na]
    at io.cyanite.api$make_handler$fn__7666.invoke(api.clj:167) [cyanite-0.5.1-standalone.jar:na]
    at io.cyanite.http$request_handler$fn__1910.invoke(http.clj:110) [cyanite-0.5.1-standalone.jar:na]
    at io.cyanite.http$netty_handler$fn__1918.invoke(http.clj:125) [cyanite-0.5.1-standalone.jar:na]
    at io.cyanite.http.proxy$io.netty.channel.ChannelInboundHandlerAdapter$ff19274a.channelRead(Unknown Source) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:307) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:293) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:307) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:293) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.fireChannelRead(CombinedChannelDuplexHandler.java:428) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:276) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:263) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.channel.CombinedChannelDuplexHandler.channelRead(CombinedChannelDuplexHandler.java:243) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:307) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:293) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:840) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:830) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:348) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:264) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:112) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137) [cyanite-0.5.1-standalone.jar:na]

@ifesdjeen
Collaborator

@jacobrichard thanks for catching this one. Could you tell me what kind of queries you are running?

@jacobrichard

jacobrichard commented Oct 7, 2016

This is just a metric retrieval from graphite-web. Specifically for one of the internal reporter metrics for cyanite:

cyanite.us-west-2.$hostname.cyanite.ingestq.events.count

I redacted the hostname (since it was an IP), but that's the path to the metric from graphite-web.

@ifesdjeen
Collaborator

ifesdjeen commented Oct 7, 2016

@jacobrichard @tehlers320 right, it was my bad. My usual testing path did not include graphite-web (until now); I'll test more thoroughly with graphite-web from now on. It's fixed and force-pushed to #248.

For #248 I would not expect changes in performance yet (this is a job for #249), but I hope to finish both of them over this weekend. #248 only exposes _min, _max and other metrics (to close #244).

@tehlers320
Contributor Author

Out of curiosity, would the ElasticSearch index have had this problem as well?

@ifesdjeen
Collaborator

Yes, this was only because we've included these names; it's only because of these "fake" metrics. For Grafana, you would expect the metrics to pop up in autocomplete. For name expansion via graphite-web you don't, and that leads to the trouble...

@tehlers320
Contributor Author

Are we talking about the same issue? I mean the timeout issue due to the table being too big.

@ifesdjeen
Collaborator

You're right. It all boils down to what kind of wildcard is supported: if we can only query by prefix and/or suffix, it'll still be the same...

Technically, we could have a Cyanite node-local index, but then we'd have synchronisation and/or update problems...

@tehlers320
Contributor Author

tehlers320 commented Nov 11, 2016

It's really Docker that makes this unmanageable, I think. Our tree just grows infinitely, since the containers change so often and create new subtrees over and over.

stats.gauges.foo-app.ads1239adsfz
stats.gauges.foo-app.shadsf1239ad
stats.gauges.foo-app.89asdf39adsf

The hash comes from the server name inside the container. We are kicking around an idea to have the host come online, "register" with something, and then change the name to "server001", "server002", etc. based on how many of the same nodes are up. I wonder if anybody has solved this problem already. Even with InfluxDB, over time you would have millions of hostname tags.

@ifesdjeen
Collaborator

After a lot of back-and-forth, I've figured out how to use the tokeniser for better and faster queries and more lightweight trees in #256.
