meet a problem when use owl to monitor yarn #6

Closed
zenglinxi0615 opened this issue Jan 2, 2014 · 19 comments

@zenglinxi0615

When I click a YARN task id on the owl web page, the page made up of OpenTSDB monitoring views fails to load and instead reports: "A server error occurred. Please contact the administrator."

Checking the log serve.log, I found the following:
[02/Jan/2014 15:49:15] "GET /monitor/task/225 HTTP/1.1" 301 0
Traceback (most recent call last):
File "/usr/local/lib/python2.7/wsgiref/handlers.py", line 85, in run
self.result = application(self.environ, self.start_response)
File "/usr/local/lib/python2.7/site-packages/django/contrib/staticfiles/handlers.py", line 67, in __call__
return self.application(environ, start_response)
File "/usr/local/lib/python2.7/site-packages/django/core/handlers/wsgi.py", line 209, in __call__
response = self.get_response(request)
File "/usr/local/lib/python2.7/site-packages/django/core/handlers/base.py", line 200, in get_response
response = self.handle_uncaught_exception(request, resolver, sys.exc_info())
File "/usr/local/lib/python2.7/site-packages/django/core/handlers/base.py", line 230, in handle_uncaught_exception
'request': request
File "/usr/local/lib/python2.7/logging/__init__.py", line 1154, in error
self._log(ERROR, msg, args, **kwargs)
File "/usr/local/lib/python2.7/logging/__init__.py", line 1246, in _log
self.handle(record)
File "/usr/local/lib/python2.7/logging/__init__.py", line 1256, in handle
self.callHandlers(record)
File "/usr/local/lib/python2.7/logging/__init__.py", line 1293, in callHandlers
hdlr.handle(record)
File "/usr/local/lib/python2.7/logging/__init__.py", line 740, in handle
self.emit(record)
File "/usr/local/lib/python2.7/site-packages/django/utils/log.py", line 106, in emit
connection=self.connection())
File "/usr/local/lib/python2.7/site-packages/django/core/mail/__init__.py", line 98, in mail_admins
mail.send(fail_silently=fail_silently)
File "/usr/local/lib/python2.7/site-packages/django/core/mail/message.py", line 284, in send
return self.get_connection(fail_silently).send_messages([self])
File "/usr/local/lib/python2.7/site-packages/django/core/mail/backends/smtp.py", line 92, in send_messages
new_conn_created = self.open()
File "/usr/local/lib/python2.7/site-packages/django/core/mail/backends/smtp.py", line 51, in open
self.connection = connection_class(self.host, self.port, **connection_params)
File "/usr/local/lib/python2.7/smtplib.py", line 239, in __init__
(code, msg) = self.connect(host, port)
File "/usr/local/lib/python2.7/smtplib.py", line 295, in connect
self.sock = self._get_socket(host, port, self.timeout)
File "/usr/local/lib/python2.7/smtplib.py", line 273, in _get_socket
return socket.create_connection((port, host), timeout)
File "/usr/local/lib/python2.7/socket.py", line 567, in create_connection
raise error, msg
error: [Errno 111] Connection refused

What could be causing this problem?

@wuzesheng
Contributor

This stack trace is entirely Django and Python internals. Is there any stack information from minos itself?

@zenglinxi0615
Author

I couldn't find anything related to minos itself in server.log, and the other log files don't appear to be related to this problem.

@wuzesheng
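Contributor

It looks like Django is trying to connect to an SMTP server that isn't running, but I suspect that is not the root cause.
Judging from the symptoms above, the path is probably: owl hits an error -> Django tries to email the administrators -> sending the email fails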
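
To see the underlying owl error rather than the SMTP failure, one option is to stop Django from mailing administrators on request errors. The fragment below is a minimal sketch of a Django settings override, not owl's actual configuration: it assumes owl's settings module accepts the standard `LOGGING` and `EMAIL_BACKEND` settings, and the handler and formatter names (`console`, `verbose`) are made up for illustration.

```python
import logging
import logging.config

# Hypothetical settings fragment (not taken from owl): send errors logged on
# "django.request" to stderr instead of emailing the administrators, so an
# unreachable SMTP server cannot hide the original exception.
LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        "verbose": {"format": "%(levelname)s %(asctime)s %(name)s %(message)s"},
    },
    "handlers": {
        "console": {
            "class": "logging.StreamHandler",
            "formatter": "verbose",
        },
    },
    "loggers": {
        "django.request": {
            "handlers": ["console"],
            "level": "ERROR",
            "propagate": False,
        },
    },
}

# Alternative assumption: keep the admin emails but write them to stdout
# rather than opening an SMTP connection.
EMAIL_BACKEND = "django.core.mail.backends.console.EmailBackend"

if __name__ == "__main__":
    # Outside Django, the same dictionary can be checked with the stdlib.
    logging.config.dictConfig(LOGGING)
    logging.getLogger("django.request").error("example request failure")
```

Either setting on its own should be enough to keep the SMTP connection out of the error path; which one fits depends on whether the admin emails are wanted at all.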

@ghost ghost assigned wuzesheng Jan 2, 2014
@wuzesheng
Contributor

Please paste the link you were trying to open, and also check the HTTP status code for that request in the backend Django log.

@zenglinxi0615
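Author

The link is an intranet address, in a format like http://10.10.65.13:8080/monitor/cluster/6/task/. I think it is what you described: "owl hits an error -> Django tries to email the administrators -> sending the email fails". I'll check the logs again.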

@wuzesheng
Contributor

OK. Check the HTTP status code Django returned for the failing request; it may help.

@zenglinxi0615
Author

WARNING 2014-01-03 15:07:01,349 collect 16994 140143162611456 <Task: yarn/hadoop-crete/nodemanager/26> failed to update metric: OperationalError(2006, 'MySQL server has gone away')
I suspect the database connection was dropped.

@zenglinxi0615
Author

When owl updates the monitoring data in MySQL, does it first open the MySQL connection, then fetch the JSON data via JMX, and then update the MySQL table?

@wuzesheng
Contributor

The MySQL connection is maintained by Django under the hood; it should be a long-lived connection.
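
A long-lived connection can go stale if the server closes it, which would be consistent with the `OperationalError(2006, ...)` above. The sketch below shows one generic way to retry after such an error. It is not owl's code: `call_with_reconnect` and `is_mysql_connection_lost` are hypothetical helpers, and the caller supplies `reset_connection` (in a Django process this could be `django.db.connection.close`, after which Django opens a fresh connection on the next query).

```python
def is_mysql_connection_lost(exc):
    """Return True for MySQL client errors 2006 (server has gone away)
    and 2013 (lost connection during query)."""
    args = getattr(exc, "args", ())
    return bool(args) and args[0] in (2006, 2013)


def call_with_reconnect(operation, reset_connection, retries=1):
    """Run operation(); on a lost-connection error, reset the connection
    and try again, at most `retries` extra times."""
    attempt = 0
    while True:
        try:
            return operation()
        except Exception as exc:
            # Re-raise anything that is not a stale-connection error,
            # and give up once the retry budget is spent.
            if attempt >= retries or not is_mysql_connection_lost(exc):
                raise
            attempt += 1
            reset_connection()
```

The connection reset is passed in as a callable so the helper stays independent of Django and can be exercised with a fake operation.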

@wuzesheng
Contributor

Could you post the CPU usage of the owl collect process and of MySQL in your deployment?

@zenglinxi0615
Author

20182 minos 20 0 101m 23m 3388 S 8 0.0 0:01.99 python2.7
24174 mysql 20 0 265m 41m 7444 S 7 0.1 9:44.78 mysqld

Neither is high. After I set the data collection period to 30, the MySQL problem no longer occurs. There is a new problem:
WARNING 2014-01-03 16:22:19,891 collect 16772 139817798592256 <Task: hbase/hadoop-crete/master/0> failed to get metric: KeyError('hadoop:service=Master,name=Master',)
I'm debugging it. I think the error I posted yesterday was triggered by a failure while executing update state in minos/owl/collector/management/commands/collect.py.

@wuzesheng
Contributor

Check whether your HBase JMX page has this entry: "name" : "hadoop:service=Master,name=Master"

@wuzesheng
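Contributor

There are two possible causes:

  1. The HBase version is old, and its JMX output does not include the entry above.
  2. The HBase active master has started but has not finished loading its data, in which case the entry is also absent.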
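
One way to check which bean names a master actually exposes is to list them from a saved copy of its JMX JSON output. The helper below is a rough sketch: `master_like_beans` is a hypothetical name, and it assumes the payload has the shape `{"beans": [{"name": ...}, ...]}`, which matches the `"name" : "..."` entries quoted in this thread.

```python
import json


def master_like_beans(jmx_json_text):
    """Return the sorted names of all beans whose name contains
    'name=Master' (compared case-insensitively)."""
    payload = json.loads(jmx_json_text)
    names = []
    for bean in payload.get("beans", []):
        name = bean.get("name", "")
        if "name=master" in name.lower():
            names.append(name)
    return sorted(names)
```

An empty result is consistent with either cause listed above; a result containing a differently spelled name suggests the bean exists but was renamed.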

@zenglinxi0615
Author

It is probably the HBase version. We are using 0.96, where the entries on the JMX page have changed: "name" : "Hadoop:service=HBase,name=MetricsSystem,sub=Stats"

@wuzesheng
Contributor

Oh, I see. It looks like there is still quite a lot of work to do on compatibility across versions.

@zenglinxi0615
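Author

Yes, this is hard-coded at the moment. It would be good to turn it into a configuration option, ideally with ready-made configurations for the commonly used HBase versions (if the JMX output differs between those versions).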

@zenglinxi0615
Author

I made a mistake just now. In 0.96, the JMX entry you mentioned should be Hadoop:service=HBase,name=Master
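
The configuration idea suggested earlier could look roughly like the sketch below: instead of indexing the JMX data with one hard-coded key (which raises the `KeyError` seen above), try a configurable list of candidate bean names. This is an illustration only, not owl's implementation; `MASTER_BEAN_NAMES` and `find_first_bean` are hypothetical names, and the payload shape `{"beans": [...]}` is an assumption.

```python
# Candidate bean names, in lookup order. Both strings come from this thread;
# treating them as a configurable list is the assumption.
MASTER_BEAN_NAMES = (
    "hadoop:service=Master,name=Master",  # name owl looks up (see the KeyError)
    "Hadoop:service=HBase,name=Master",   # name reported for HBase 0.96
)


def find_first_bean(jmx_payload, candidate_names=MASTER_BEAN_NAMES):
    """Return the first bean whose name matches a candidate, or None."""
    beans_by_name = {}
    for bean in jmx_payload.get("beans", []):
        beans_by_name[bean.get("name")] = bean
    for name in candidate_names:
        if name in beans_by_name:
            return beans_by_name[name]
    return None
```

Returning `None` lets the caller log a clear "bean not found" message, which also covers the case where the active master has not finished loading.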

@wuzesheng
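Contributor

OK, understood, thanks for the feedback. Your suggestion is a good one and we will consider it. We are short on people at the moment, though, so we won't get to it soon; for now, please patch it yourself on your side.
BTW: is everything running normally on your side now?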

@wuzesheng
Contributor

I've created a new issue to track this: #18
