Table names for multi-part inserts #186

Open
wants to merge 1 commit into
base: master
from


sundeepn
Contributor

Fix to ensure that table names for multi-insert/partitioned cached tables are reflected on the Shark UI.

@harveyfeng
Member

Hi Sundeep, the current Shark master doesn't include support for partitioned cached tables.
"insert into" commands that involve UnionRDDs in MemoryStoreSinkOperator are appends to a single, non-partitioned table.
It seems like this patch tracks how many sequential appends (i.e., "insert into"s) have been done to each table, but doesn't account for new RDDs created by interleaved "insert overwrite"s - those RDDs are assigned the table name.
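To make the concern concrete, here is a minimal plain-Python model of the behavior described above. It is only a sketch, not Shark code: the class, the method names, and the `table_N` name format are all hypothetical, chosen to show how appends get numbered while interleaved overwrites all end up under the bare table name.

```python
from collections import defaultdict


class NamingModel:
    """Toy model of the naming behavior described above; not Shark code."""

    def __init__(self):
        # table name -> number of "insert into" appends seen so far
        self.append_counts = defaultdict(int)
        # names in the order they would show up on the storage UI
        self.displayed = []

    def insert_into(self, table):
        # Assumption: each append gets a per-table sequence number.
        self.append_counts[table] += 1
        name = f"{table}_{self.append_counts[table]}"
        self.displayed.append(name)
        return name

    def insert_overwrite(self, table):
        # The overwrite path is not counted: the new RDD just gets the table name.
        self.displayed.append(table)
        return table
```

In this model, one append followed by two overwrites on table `t` yields the names `t_1`, `t`, `t`, so the two overwrite RDDs cannot be told apart by name.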

@sundeepn
Contributor Author

Hi Harvey, the current patch is meant to let users track storage/memory usage on the Shark Storage UI per table, rather than as 'rdd_###'. Inserts/overwrites to cached tables make the current Storage UI quite hard to follow.

It does not handle dropped partitions or overwrites in any special way, but it does guarantee that each block of data is identified by a unique number and has the table name associated with it on the UI.

Once we have partition support, I am planning to submit another patch with naming conventions derived from Hive's partition information.

@AmplabJenkins

Can one of the admins verify this patch?

@harveyfeng
Member

Yeah, the storage UI is a bit confusing right now :(
Assigning unique IDs to RDDs created from "insert into" definitely helps, but is there a way to assign unique identifiers to RDDs created from "insert overwrite", and possibly distinguish between valid or invalid RDDs? For example, right now it seems like five "insert overwrite" commands will result in five RDDs displayed under the same (table) name.
One way might be to mark overwritten RDDs with something like "stale_table-name".
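As a rough sketch of the "stale" idea, the plain-Python model below relabels the previous RDD whenever a table is overwritten. It is not Shark code: the class, its fields, and the integer RDD ids are hypothetical, and only the `stale_` prefix comes from the suggestion above.

```python
class StorageNames:
    """Toy model of marking overwritten RDDs as stale; not Shark code."""

    def __init__(self):
        self.names = {}    # rdd id -> name shown on the storage UI
        self.current = {}  # table name -> id of the RDD that is still valid
        self.next_id = 0   # hypothetical stand-in for an RDD id counter

    def insert_overwrite(self, table):
        # Relabel the RDD being replaced, if there is one.
        old_id = self.current.get(table)
        if old_id is not None:
            self.names[old_id] = f"stale_{table}"
        # Register the new RDD under the plain table name.
        rdd_id = self.next_id
        self.next_id += 1
        self.names[rdd_id] = table
        self.current[table] = rdd_id
        return rdd_id
```

With this model, five overwrites of the same table leave four entries labeled `stale_<table>` and one entry with the plain table name, and each entry is still distinguishable by its id.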

@sundeepn
Contributor Author

Based on Hive's documentation, shouldn't an insert overwrite on a table unpersist the existing RDDs? (For partitioned tables, it would unpersist only the overwritten partitions.) If this is the case, I can push a fix on that front.
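A minimal sketch of that semantics in plain Python, assuming an overwrite replaces everything cached for the table while an append only adds to it. This is a model for discussion, not Shark code; the class and method names are hypothetical, and partitions are left out.

```python
class CachedTables:
    """Toy model of append vs. overwrite on a cached table; not Shark code."""

    def __init__(self):
        self.cached = {}  # table name -> list of cached block labels

    def insert_into(self, table, block):
        # Append: existing blocks stay cached.
        self.cached.setdefault(table, []).append(block)

    def insert_overwrite(self, table, block):
        # Overwrite: everything previously cached for the table is dropped.
        dropped = self.cached.get(table, [])
        self.cached[table] = [block]
        return dropped  # the blocks that would be unpersisted
```

Here `insert_overwrite` returns the blocks that a real implementation would unpersist, so after any overwrite only a single block remains listed for the table.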

@harveyfeng
Member

Yeah, that sounds good - created a ticket for that here: https://spark-project.atlassian.net/browse/SHARK-202.
Could you assign yourself to it? :)

@sundeepn
Contributor Author

Sure. I do not seem to have permissions to assign myself the ticket. If you can help with that, I will take on the ticket. :)

@harveyfeng
Member

Done - assigned it to you. Thx!

@harveyfeng
Member

Oh, it looks like the assignments were concurrent....

@rxin
Member

rxin commented Nov 1, 2013

What's the status of this PR?

4 participants