Support for Hadoop 3? #657

Open
theyaa opened this issue Feb 17, 2020 · 12 comments

@theyaa

theyaa commented Feb 17, 2020

Does Dr. Elephant provide support for Hadoop 3 with Yarn ATS V2 please?

@ShubhamGupta29 ShubhamGupta29 self-assigned this Mar 2, 2020
@ShubhamGupta29
Contributor

@theyaa, no, Dr. Elephant currently doesn't support Hadoop 3 with ATS v2. But you can use Dr. Elephant with Hadoop 3 in prod, provided that your YARN REST APIs and history servers are in sync with what Dr. Elephant is expecting.
Kindly try this if you can, let us know the result, and reach out if you need any help.

@theyaa
Author

theyaa commented Mar 4, 2020

Hi @ShubhamGupta29, in HDP3 with Hadoop 3, all Hive queries run on the Tez engine, and Tez is built to send query updates/progress to YARN ATSv2. With the YARN timeline server v1 REST API, we cannot get Tez query progress information anymore. We have to use YARN ATSv2, or read from the query_data and dag_data tables in Hive's sys db.

@ShubhamGupta29
Contributor

@theyaa, I understand the need for ATSv2. I will look at the changes this requirement calls for and prioritize accordingly.

@theyaa
Author

theyaa commented Mar 6, 2020

@ShubhamGupta29 thank you very much. Please let me know when you have a working version so I can download and try it out.

@shkhrgpt
Contributor

@theyaa Is the Tez UI working in your HDP 3 install?
Can you also provide the value of the property tez.history.logging.service.class, which should be present in tez-site.xml?
Thank you.

@theyaa
Author

theyaa commented Mar 18, 2020

Hi @shkhrgpt the value is: org.apache.tez.dag.history.logging.proto.ProtoHistoryLoggingService
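
For anyone checking the same setting, here is a minimal sketch that reads the property from the text of a tez-site.xml file. The helper name `read_hadoop_property` and the sample content are illustrative assumptions; a real file has many more properties.

```python
import xml.etree.ElementTree as ET


def read_hadoop_property(xml_text, name):
    """Return the value of a property in Hadoop-style *-site.xml text, or None."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name", default="").strip() == name:
            value = prop.findtext("value")
            return value.strip() if value is not None else None
    return None


# Illustrative content only (assumption): a trimmed-down tez-site.xml.
SAMPLE_TEZ_SITE = """<configuration>
  <property>
    <name>tez.history.logging.service.class</name>
    <value>org.apache.tez.dag.history.logging.proto.ProtoHistoryLoggingService</value>
  </property>
</configuration>"""

print(read_hadoop_property(SAMPLE_TEZ_SITE, "tez.history.logging.service.class"))
```

To run it against a real install, pass the contents of the cluster's tez-site.xml instead of `SAMPLE_TEZ_SITE`; the file location depends on the distribution.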

@shkhrgpt
Contributor

@theyaa That may be why the timeline server is not returning data for Tez. org.apache.tez.dag.history.logging.proto.ProtoHistoryLoggingService doesn't allow data to go to the timeline server, and therefore the timeline API used by the Tez fetcher is not working.

If you change the value of tez.history.logging.service.class to org.apache.tez.dag.history.logging.ats.ATSHistoryLoggingService, it might work, as described here:

https://tez.apache.org/tez-ui.html

I haven't tested it yet so I don't know if it causes any problem. But maybe you can try?
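
One way to check whether the timeline server has any Tez data, before or after such a change, is to list `TEZ_DAG_ID` entities through the ATS v1 REST API. This is only a sketch: the base URL is a placeholder you must supply, and the helper names are made up for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def tez_dag_url(timeline_base, limit=5):
    """Build the ATS v1 URL that lists Tez DAG entities."""
    query = urlencode({"limit": limit})
    return f"{timeline_base.rstrip('/')}/ws/v1/timeline/TEZ_DAG_ID?{query}"


def dag_ids(response_text):
    """Extract DAG ids from an ATS v1 entity-list response body."""
    body = json.loads(response_text)
    return [entity.get("entity") for entity in body.get("entities", [])]


def fetch_dag_ids(timeline_base, limit=5, timeout=10):
    """Query the timeline server; an empty list means no Tez DAGs were returned."""
    with urlopen(tez_dag_url(timeline_base, limit), timeout=timeout) as resp:
        return dag_ids(resp.read().decode("utf-8"))
```

Calling `fetch_dag_ids("http://<timeline-host>:<port>")` with your own host and port returns an empty list when Tez events are not reaching the timeline server. A secured cluster would need authentication that this sketch does not handle.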

@shkhrgpt
Contributor

@theyaa Did you look at the solution described here:

#529

@theyaa
Author

theyaa commented Mar 18, 2020

Hi @shkhrgpt This will cause issues with YARN and Hive logging, since YARN with Hadoop 3 and HDP3 logs to YARN ATSv2, and the latter uses Protobuf and writes to HBase. If I switch to the old class for Tez, I will lose that logging and cause issues in YARN. That is why I was asking if there is a way to modify Dr. Elephant to be able to read from YARN ATSv2.

@shkhrgpt
Contributor

Okay, @theyaa.
Do you know if the ATSv2 REST API provides the Tez data that was provided by the older ATS REST API?
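
A rough way to find out is to ask the ATSv2 timeline reader for entities of a given type under one application. The sketch below only builds the reader URL and parses a response body. Whether Tez publishes `TEZ_DAG_ID` (or any other type) to ATSv2 is exactly the open question, so the entity type is a parameter, and the host, port, and application id are placeholders.

```python
import json
from urllib.parse import quote


def atsv2_entities_url(reader_base, app_id, entity_type):
    """Build the ATSv2 reader URL for generic entities of one application."""
    base = reader_base.rstrip("/")
    return f"{base}/ws/v2/timeline/apps/{quote(app_id)}/entities/{quote(entity_type)}"


def entity_ids(response_text):
    """Return entity ids from an ATSv2 response; error bodies yield an empty list."""
    body = json.loads(response_text)
    if not isinstance(body, list):
        return []
    return [entity.get("id") for entity in body]
```

Fetching the URL from `atsv2_entities_url(...)` with any HTTP client and passing the body to `entity_ids` shows whether the reader holds entities of that type for the application.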

@shkhrgpt
Contributor

@theyaa
I wrote a logging service that writes Tez events to both ATSv1 and protobuf. Please check the following if you want to try it:

https://github.com/shkhrgpt/tez-logging

The goal is that Dr. Elephant should be able to get data from the ATSv1 REST API, while the data is also written to protobuf so that nothing else is affected.
If you can, please try this and let me know if it works for you.

@theyaa
Author

theyaa commented Mar 25, 2020

Hi @shkhrgpt Tez+Hive in Hive3 do log all query/dag events to a hive database called sys. Under the sys db, there are 2 tables query_data and dag_data. Those are the main two tables. If you can get Dr. Elephant to read from those two tables, then it will be able to process hive queries the same way as before.

Cloudera has a tool called "Data Analytics Studio". It does exactly this and presents the query in a web user interface. I believe that if Dr. Elephant can parse the two tables below from Hive's sys db, it will be able to work in exactly the same way.

  • query_data

  • dag_data
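
To see what these two tables contain, a small helper along the following lines could preview them through any Python DB-API cursor connected to Hive. This is a sketch under assumptions: it takes the table names exactly as given above (`sys.query_data`, `sys.dag_data`), assumes nothing about their columns, and leaves creating the cursor to whichever Hive client library you use. The name `preview_table` is made up for illustration.

```python
ALLOWED_TABLES = {"query_data", "dag_data"}


def preview_table(cursor, table, limit=5):
    """Return (column_names, rows) for the first rows of a sys-db table."""
    if table not in ALLOWED_TABLES:
        # Refuse arbitrary names so the table is never built from untrusted input.
        raise ValueError(f"unexpected table: {table}")
    cursor.execute(f"SELECT * FROM sys.{table} LIMIT {int(limit)}")
    columns = [col[0] for col in (cursor.description or [])]
    return columns, cursor.fetchall()
```

Calling `preview_table(cursor, "query_data")` and then `preview_table(cursor, "dag_data")` returns the column names alongside a few rows, which is enough to judge what a fetcher would have to parse.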
