Clean and extract information from VK data (a social application) (see task_description)
Data size: 44GB
Sample data size: 16GB
Platform: pyspark 2.2.0 on Ubuntu 16.04 LTS
Uncomment the corresponding task functions in vk_project to run the tasks listed below:
Since the data is fairly large, only a sample result (usually 20 rows) is presented for each task. To check the full result, please run vk_project.
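For orientation, here is a minimal sketch of how a task result can be inspected; it assumes the VK dump is stored as JSON lines under a hypothetical path (`/data/vk/posts`), while the actual loading code lives in vk_project.

```python
from pyspark.sql import SparkSession

# Hypothetical path and format; adjust to the real layout used by vk_project.
spark = SparkSession.builder.appName("vk_project_sample").getOrCreate()
posts = spark.read.json("/data/vk/posts")

# Each task produces a DataFrame; .show(20) prints the 20-row sample mentioned above.
posts.show(20)
```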
- count of comments, posts (all), original posts, reposts and likes made by user (see the per-user count sketch after this list)
  (sample output: count of comments per user)
- count of friends, groups, followers
- count of videos, audios, photos, gifts
- count of "incoming" (made by other users) comments, max and mean "incoming" comments per post
- count of "incoming" likes, max and mean "incoming" likes per post
- count of geo tagged posts
- count of open / closed (e.g. private) groups a user participates in
- count of reposts from subscribed and not-subscribed groups
- count of deleted users in friends and followers
- Aggregate (e.g. count, max, mean) characteristics for comments and likes (separately) made by (a) friends and (b) followers per post (a join-based sketch follows the list)
- Aggregate (e.g. count, max, mean) characteristics for comments and likes (separately) made by (a) friends and (b) followers per user
- find emoji (separately, count of: all, negative, positive, others) in (a) user's posts (b) user's comments (an emoji-counting sketch follows the list)
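A minimal sketch for the per-user post counts from the first bullet, assuming a `posts` DataFrame with hypothetical columns `owner_id` (the wall owner) and `copy_history` (non-null for reposts, as in the VK API); counts of comments and likes made by a user follow the same groupBy pattern on their own DataFrames.

```python
from pyspark.sql import functions as F

# Assumed columns: owner_id (wall owner), copy_history (present only for reposts).
post_counts = (
    posts
    .withColumn("is_repost", F.col("copy_history").isNotNull())
    .groupBy("owner_id")
    .agg(
        F.count("*").alias("posts_all"),
        F.sum(F.when(F.col("is_repost"), 1).otherwise(0)).alias("reposts"),
        F.sum(F.when(~F.col("is_repost"), 1).otherwise(0)).alias("original_posts"),
    )
)
post_counts.show(20)
```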
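A sketch for the "incoming" comment statistics (count, max and mean per post), assuming a `comments` DataFrame with hypothetical columns `post_owner_id`, `post_id` and `from_id`; the "incoming" like statistics are computed the same way from a likes DataFrame.

```python
from pyspark.sql import functions as F

# "Incoming" comments are those left by someone other than the post owner.
incoming = comments.where(F.col("from_id") != F.col("post_owner_id"))

per_post = incoming.groupBy("post_owner_id", "post_id").agg(
    F.count("*").alias("incoming_comments")
)

per_user = per_post.groupBy("post_owner_id").agg(
    F.sum("incoming_comments").alias("incoming_comments_total"),
    F.max("incoming_comments").alias("incoming_comments_max_per_post"),
    F.mean("incoming_comments").alias("incoming_comments_mean_per_post"),
)
per_user.show(20)
```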
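A sketch for the like characteristics made by friends, assuming hypothetical DataFrames `likes` (`post_owner_id`, `post_id`, `liker_id`) and `friends` (`user_id`, `friend_id`); the follower and comment variants only swap the joined table or the source DataFrame, and the per-user characteristics are the second aggregation step below.

```python
from pyspark.sql import functions as F

# Keep only likes made by the post owner's friends.
friend_likes = likes.join(
    friends,
    (likes.post_owner_id == friends.user_id) & (likes.liker_id == friends.friend_id),
    "inner",
)

# Per-post characteristics.
per_post = friend_likes.groupBy("post_owner_id", "post_id").agg(
    F.count("*").alias("friend_likes")
)

# Per-user characteristics over that user's posts.
per_user = per_post.groupBy("post_owner_id").agg(
    F.sum("friend_likes").alias("friend_likes_total"),
    F.max("friend_likes").alias("friend_likes_max_per_post"),
    F.mean("friend_likes").alias("friend_likes_mean_per_post"),
)
per_user.show(20)
```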
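A sketch for emoji counting in post text, assuming a `posts` DataFrame with hypothetical columns `owner_id` and `text`; the positive/negative emoji sets here are placeholders (the real lists belong in vk_project), and the same UDFs apply to comment text.

```python
import re
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

EMOJI_RE = re.compile(u"[\U0001F300-\U0001FAFF\u2600-\u27BF]")
POSITIVE = set(u"\U0001F600\U0001F601\U0001F60A")  # placeholder set
NEGATIVE = set(u"\U0001F61E\U0001F620\U0001F62D")  # placeholder set

def count_emoji(text, subset=None):
    # Count all emoji in the text, or only those in the given subset.
    found = EMOJI_RE.findall(text or u"")
    return len(found) if subset is None else sum(1 for e in found if e in subset)

all_udf = F.udf(lambda t: count_emoji(t), IntegerType())
pos_udf = F.udf(lambda t: count_emoji(t, POSITIVE), IntegerType())
neg_udf = F.udf(lambda t: count_emoji(t, NEGATIVE), IntegerType())

emoji_per_user = (
    posts
    .withColumn("emoji_all", all_udf("text"))
    .withColumn("emoji_pos", pos_udf("text"))
    .withColumn("emoji_neg", neg_udf("text"))
    .withColumn("emoji_other", F.col("emoji_all") - F.col("emoji_pos") - F.col("emoji_neg"))
    .groupBy("owner_id")
    .agg(
        F.sum("emoji_all").alias("emoji_all"),
        F.sum("emoji_pos").alias("emoji_positive"),
        F.sum("emoji_neg").alias("emoji_negative"),
        F.sum("emoji_other").alias("emoji_other"),
    )
)
emoji_per_user.show(20)
```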