FuHsinyu/pyspark

Manage and process 44 GB of data on Spark

Task description

Clean and extract information from VK data (VK is a social network application) (see task_description)
Data size: 44 GB
Sample data size: 16 GB
Platform: PySpark 2.2.0 on Ubuntu 16.04 LTS

How to run

In vk_project, uncomment the task functions you want to run:
runtasks

NOTE

Because the data is large, only a sample of each result (usually 20 rows) is shown here. To see the full results, run vk_project.

Results for level basic

  1. count of comments, posts (all), original posts, reposts and likes made by user
    sample count of comments per user
    • COMMENTS COUNT
      countcomm
    • ALL POSTS COUNT
      allposts
    • ORIGINAL POSTS COUNT
      originalposts
    • REPOSTS COUNT
      reposts
    • LIKES COUNT
      likescount
  2. count of friends, groups, followers
  3. count of videos, audios, photos, gifts
    • COMBINED COUNTS
      videosgroupsetc
  4. count of "incoming" (made by other users) comments, max and mean "incoming" comments per post
    • INCOMING COMMENTS STATS:
      incmingcommstats
  5. count of "incoming" likes, max and mean "incoming" likes per post
    • INCOMING LIKES
      incominglikesstats
  6. count of geo tagged posts
    • Count of geo tagged posts
      geotaggedposts
  7. count of open / closed (e.g. private) groups a user participates in
    • Count of open and closed groups
      coungopenandclosedgroup
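
The "incoming" statistics in item 4 can be checked on a handful of rows before running them on the full data. Below is a minimal plain-Python sketch, not the code from vk_project. It assumes each comment row is a `(post_owner_id, post_id, commenter_id)` tuple (hypothetical field names), treats a comment as incoming when the commenter is not the post owner, and takes the mean only over posts that received at least one incoming comment.

```python
from collections import defaultdict


def incoming_comment_stats(comments):
    """Return {owner: {"count", "max", "mean"}} of incoming comments per post.

    comments: iterable of (post_owner_id, post_id, commenter_id) tuples.
    The tuple layout is an assumption made for this sketch.
    """
    # Step 1: count incoming comments per (owner, post), skipping self-comments.
    per_post = defaultdict(int)
    for owner, post, commenter in comments:
        if commenter != owner:
            per_post[(owner, post)] += 1

    # Step 2: collect the per-post counts for each owner.
    per_owner = defaultdict(list)
    for (owner, _post), n in per_post.items():
        per_owner[owner].append(n)

    # Step 3: aggregate count, max and mean per owner.
    return {
        owner: {"count": sum(c), "max": max(c), "mean": sum(c) / len(c)}
        for owner, c in per_owner.items()
    }


sample = [
    (1, 10, 2),  # user 2 comments on post 10 of user 1
    (1, 10, 3),  # user 3 comments on the same post
    (1, 10, 1),  # self-comment, not incoming
    (1, 11, 2),  # user 2 comments on post 11 of user 1
    (2, 20, 2),  # self-comment, not incoming
]
stats = incoming_comment_stats(sample)
print(stats)
```

On a Spark DataFrame the same shape would typically be a filter on commenter versus owner, a `groupBy` on owner and post with a count, and then a second `groupBy` on owner with the `count`, `max` and `avg` functions from `pyspark.sql.functions`. The same pattern applies to the incoming likes in item 5.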

Results for level medium

  1. count of reposts from subscribed and not-subscribed groups

    • COUNTS OF REPOSTS FROM SUB AND NONSUB GROUPS
      countsofsubandnonsub
  2. count of deleted users in friends and followers

    • COUNT OF DELETED USER
      countdeluser
  3. Aggregate (e.g. count, max, mean) characteristics for comments and likes (separately) made by (a) friends and (b) followers per post

    • LIKE PER POST FROM FOLLOWERS AND FRIENDS
      likeperpostFOLandFRI
    • COMMENTS PER POST FROM FOLLOWERS AND FRIENDS
      commsperpostFOLandFRI
  4. Aggregate (e.g. count, max, mean) characteristics for comments and likes (separately) made by (a) friends and (b) followers per user

    • LIKE PER USER FROM FOLLOWERS AND FRIENDS
      likeperuserFOLandFRI
    • COMMENTS PER POST FROM FOLLOWERS AND FRIENDS
      commsperpostFOLandFRI
  5. find emoji (separately, count of: all, negative, positive, others) in (a) user's posts (b) user's comments

    • EMOJI CLASSIFICATIONS COUNT
      emojicountcombined
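
The emoji classification in item 5 comes down to a per-text counting function. The following is a rough sketch under stated assumptions, not the classifier used in vk_project: the `POSITIVE` and `NEGATIVE` sets are small illustrative examples, and emoji are detected by two Unicode code point ranges only, so multi-code-point sequences (flags, skin tone modifiers, variation selectors) are not handled.

```python
# Illustrative sentiment sets; the real lists are an assumption of this sketch.
POSITIVE = {
    "\U0001F600",  # grinning face
    "\U0001F60A",  # smiling face with smiling eyes
    "\U0001F44D",  # thumbs up
    "\u2764",      # heavy black heart
}
NEGATIVE = {
    "\U0001F621",  # pouting face
    "\U0001F622",  # crying face
    "\U0001F44E",  # thumbs down
}


def is_emoji(ch):
    """Approximate emoji test based on two common code point ranges."""
    cp = ord(ch)
    return 0x1F300 <= cp <= 0x1FAFF or 0x2600 <= cp <= 0x27BF


def count_emoji(text):
    """Count all, positive, negative and other emoji in one text."""
    counts = {"all": 0, "positive": 0, "negative": 0, "others": 0}
    for ch in text:
        if not is_emoji(ch):
            continue
        counts["all"] += 1
        if ch in POSITIVE:
            counts["positive"] += 1
        elif ch in NEGATIVE:
            counts["negative"] += 1
        else:
            counts["others"] += 1
    return counts


# Grinning face, thumbs up, crying face, party popper.
text = "Great \U0001F600\U0001F44D but \U0001F622 and \U0001F389"
result = count_emoji(text)
print(result)
```

A function like this can be applied to post and comment text separately, for example by wrapping it with `udf` from `pyspark.sql.functions`, and the per-text counts can then be summed per user.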

About

Big Data Technology / ITMO University