Clean and extract information from VK data (a social application) (see task_description)
Data size: 44GB
Sample data size: 16GB
Platform: pyspark 2.2.0 on Ubuntu 16.04 LTS
Uncomment the corresponding task functions in vk_project to run the tasks listed below:
Since the data is fairly large, only a sample result (usually 20 rows) is presented for each task. To check the full result, please run vk_project.
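For orientation, here is a minimal sketch of how a task result can be inspected; it assumes the VK dump is stored as JSON lines under a hypothetical path (`/data/vk/posts`), while the actual loading code lives in vk_project.

```python
from pyspark.sql import SparkSession

# Hypothetical path and format; adjust to the real layout used by vk_project.
spark = SparkSession.builder.appName("vk_project_sample").getOrCreate()
posts = spark.read.json("/data/vk/posts")

# Each task produces a DataFrame; .show(20) prints the 20-row sample mentioned above.
posts.show(20)
```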
- count of comments, posts (all), original posts, reposts and likes made by user (see the per-user count sketch after this list)
  (sample output: count of comments per user)
- count of friends, groups, followers
- count of videos, audios, photos, gifts
- count of "incoming" (made by other users) comments, max and mean "incoming" comments per post
- count of "incoming" likes, max and mean "incoming" likes per post
- count of geo tagged posts
- count of open / closed (e.g. private) groups a user participates in
- count of reposts from subscribed and not-subscribed groups
- count of deleted users in friends and followers
- Aggregate (e.g. count, max, mean) characteristics for comments and likes (separately) made by (a) friends and (b) followers per post (a join-based sketch follows the list)
- Aggregate (e.g. count, max, mean) characteristics for comments and likes (separately) made by (a) friends and (b) followers per user
- find emoji (separately, count of: all, negative, positive, others) in (a) user's posts (b) user's comments (an emoji-counting sketch follows the list)
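A minimal sketch for the per-user post counts from the first bullet, assuming a `posts` DataFrame with hypothetical columns `owner_id` (the wall owner) and `copy_history` (non-null for reposts, as in the VK API); counts of comments and likes made by a user follow the same groupBy pattern on their own DataFrames.

```python
from pyspark.sql import functions as F

# Assumed columns: owner_id (wall owner), copy_history (present only for reposts).
post_counts = (
    posts
    .withColumn("is_repost", F.col("copy_history").isNotNull())
    .groupBy("owner_id")
    .agg(
        F.count("*").alias("posts_all"),
        F.sum(F.when(F.col("is_repost"), 1).otherwise(0)).alias("reposts"),
        F.sum(F.when(~F.col("is_repost"), 1).otherwise(0)).alias("original_posts"),
    )
)
post_counts.show(20)
```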
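A sketch for the "incoming" comment statistics (count, max and mean per post), assuming a `comments` DataFrame with hypothetical columns `post_owner_id`, `post_id` and `from_id`; the "incoming" like statistics are computed the same way from a likes DataFrame.

```python
from pyspark.sql import functions as F

# "Incoming" comments are those left by someone other than the post owner.
incoming = comments.where(F.col("from_id") != F.col("post_owner_id"))

per_post = incoming.groupBy("post_owner_id", "post_id").agg(
    F.count("*").alias("incoming_comments")
)

per_user = per_post.groupBy("post_owner_id").agg(
    F.sum("incoming_comments").alias("incoming_comments_total"),
    F.max("incoming_comments").alias("incoming_comments_max_per_post"),
    F.mean("incoming_comments").alias("incoming_comments_mean_per_post"),
)
per_user.show(20)
```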
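A sketch for the like characteristics made by friends, assuming hypothetical DataFrames `likes` (`post_owner_id`, `post_id`, `liker_id`) and `friends` (`user_id`, `friend_id`); the follower and comment variants only swap the joined table or the source DataFrame, and the per-user characteristics are the second aggregation step below.

```python
from pyspark.sql import functions as F

# Keep only likes made by the post owner's friends.
friend_likes = likes.join(
    friends,
    (likes.post_owner_id == friends.user_id) & (likes.liker_id == friends.friend_id),
    "inner",
)

# Per-post characteristics.
per_post = friend_likes.groupBy("post_owner_id", "post_id").agg(
    F.count("*").alias("friend_likes")
)

# Per-user characteristics over that user's posts.
per_user = per_post.groupBy("post_owner_id").agg(
    F.sum("friend_likes").alias("friend_likes_total"),
    F.max("friend_likes").alias("friend_likes_max_per_post"),
    F.mean("friend_likes").alias("friend_likes_mean_per_post"),
)
per_user.show(20)
```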
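A sketch for emoji counting in post text, assuming a `posts` DataFrame with hypothetical columns `owner_id` and `text`; the positive/negative emoji sets here are placeholders (the real lists belong in vk_project), and the same UDFs apply to comment text.

```python
import re
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

EMOJI_RE = re.compile(u"[\U0001F300-\U0001FAFF\u2600-\u27BF]")
POSITIVE = set(u"\U0001F600\U0001F601\U0001F60A")  # placeholder set
NEGATIVE = set(u"\U0001F61E\U0001F620\U0001F62D")  # placeholder set

def count_emoji(text, subset=None):
    # Count all emoji in the text, or only those in the given subset.
    found = EMOJI_RE.findall(text or u"")
    return len(found) if subset is None else sum(1 for e in found if e in subset)

all_udf = F.udf(lambda t: count_emoji(t), IntegerType())
pos_udf = F.udf(lambda t: count_emoji(t, POSITIVE), IntegerType())
neg_udf = F.udf(lambda t: count_emoji(t, NEGATIVE), IntegerType())

emoji_per_user = (
    posts
    .withColumn("emoji_all", all_udf("text"))
    .withColumn("emoji_pos", pos_udf("text"))
    .withColumn("emoji_neg", neg_udf("text"))
    .withColumn("emoji_other", F.col("emoji_all") - F.col("emoji_pos") - F.col("emoji_neg"))
    .groupBy("owner_id")
    .agg(
        F.sum("emoji_all").alias("emoji_all"),
        F.sum("emoji_pos").alias("emoji_positive"),
        F.sum("emoji_neg").alias("emoji_negative"),
        F.sum("emoji_other").alias("emoji_other"),
    )
)
emoji_per_user.show(20)
```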