TV Recommendations (Using IMDb Dataset)

Setup

  1. Install conda (see the installation guide)

  2. Run

    conda env create -f environment.yml
    conda activate tv-recommendations
    jupyter lab


  3. Update README.md

    jupyter nbconvert --to markdown README.ipynb

Extract, Transform, Load

Extract

!wget -xP'/tmp' --accept '.tsv.gz' --no-parent --recursive 'https://datasets.imdbws.com/'
!tree -h '/tmp/datasets.imdbws.com'
--2022-01-02 14:35:57--  https://datasets.imdbws.com/
Resolving datasets.imdbws.com (datasets.imdbws.com)... 143.204.98.32, 143.204.98.111, 143.204.98.41, ...
Connecting to datasets.imdbws.com (datasets.imdbws.com)|143.204.98.32|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 945 [text/html]
Saving to: ‘/tmp/datasets.imdbws.com/index.html.tmp’

datasets.imdbws.com 100%[===================>]     945  --.-KB/s    in 0s      

2022-01-02 14:35:57 (6.23 MB/s) - ‘/tmp/datasets.imdbws.com/index.html.tmp’ saved [945/945]

Loading robots.txt; please ignore errors.
--2022-01-02 14:35:57--  https://datasets.imdbws.com/robots.txt
Reusing existing connection to datasets.imdbws.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 945 [text/html]
Saving to: ‘/tmp/datasets.imdbws.com/robots.txt.tmp’

datasets.imdbws.com 100%[===================>]     945  --.-KB/s    in 0s      

2022-01-02 14:35:58 (19.7 MB/s) - ‘/tmp/datasets.imdbws.com/robots.txt.tmp’ saved [945/945]

Removing /tmp/datasets.imdbws.com/index.html.tmp since it should be rejected.

--2022-01-02 14:35:58--  https://datasets.imdbws.com/name.basics.tsv.gz
Reusing existing connection to datasets.imdbws.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 221263899 (211M) [binary/octet-stream]
Saving to: ‘/tmp/datasets.imdbws.com/name.basics.tsv.gz’

datasets.imdbws.com 100%[===================>] 211.01M  9.61MB/s    in 21s     

2022-01-02 14:36:21 (10.0 MB/s) - ‘/tmp/datasets.imdbws.com/name.basics.tsv.gz’ saved [221263899/221263899]

--2022-01-02 14:36:21--  https://datasets.imdbws.com/title.akas.tsv.gz
Reusing existing connection to datasets.imdbws.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 258943798 (247M) [binary/octet-stream]
Saving to: ‘/tmp/datasets.imdbws.com/title.akas.tsv.gz’

datasets.imdbws.com 100%[===================>] 246.95M  10.3MB/s    in 24s     

2022-01-02 14:36:49 (10.1 MB/s) - ‘/tmp/datasets.imdbws.com/title.akas.tsv.gz’ saved [258943798/258943798]

--2022-01-02 14:36:49--  https://datasets.imdbws.com/title.basics.tsv.gz
Reusing existing connection to datasets.imdbws.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 150032645 (143M) [binary/octet-stream]
Saving to: ‘/tmp/datasets.imdbws.com/title.basics.tsv.gz’

datasets.imdbws.com 100%[===================>] 143.08M  10.3MB/s    in 14s     

2022-01-02 14:37:05 (10.0 MB/s) - ‘/tmp/datasets.imdbws.com/title.basics.tsv.gz’ saved [150032645/150032645]

--2022-01-02 14:37:05--  https://datasets.imdbws.com/title.crew.tsv.gz
Reusing existing connection to datasets.imdbws.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 58444351 (56M) [binary/octet-stream]
Saving to: ‘/tmp/datasets.imdbws.com/title.crew.tsv.gz’

datasets.imdbws.com 100%[===================>]  55.74M  9.72MB/s    in 5.6s    

2022-01-02 14:37:11 (9.95 MB/s) - ‘/tmp/datasets.imdbws.com/title.crew.tsv.gz’ saved [58444351/58444351]

--2022-01-02 14:37:11--  https://datasets.imdbws.com/title.episode.tsv.gz
Reusing existing connection to datasets.imdbws.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 34630107 (33M) [binary/octet-stream]
Saving to: ‘/tmp/datasets.imdbws.com/title.episode.tsv.gz’

datasets.imdbws.com 100%[===================>]  33.03M  10.3MB/s    in 3.2s    

2022-01-02 14:37:15 (10.3 MB/s) - ‘/tmp/datasets.imdbws.com/title.episode.tsv.gz’ saved [34630107/34630107]

--2022-01-02 14:37:15--  https://datasets.imdbws.com/title.principals.tsv.gz
Reusing existing connection to datasets.imdbws.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 383673006 (366M) [binary/octet-stream]
Saving to: ‘/tmp/datasets.imdbws.com/title.principals.tsv.gz’

datasets.imdbws.com 100%[===================>] 365.90M  9.62MB/s    in 36s     

2022-01-02 14:37:52 (10.1 MB/s) - ‘/tmp/datasets.imdbws.com/title.principals.tsv.gz’ saved [383673006/383673006]

--2022-01-02 14:37:52--  https://datasets.imdbws.com/title.ratings.tsv.gz
Reusing existing connection to datasets.imdbws.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 5989985 (5.7M) [binary/octet-stream]
Saving to: ‘/tmp/datasets.imdbws.com/title.ratings.tsv.gz’

datasets.imdbws.com 100%[===================>]   5.71M  10.0MB/s    in 0.6s    

2022-01-02 14:37:53 (10.0 MB/s) - ‘/tmp/datasets.imdbws.com/title.ratings.tsv.gz’ saved [5989985/5989985]

FINISHED --2022-01-02 14:37:53--
Total wall clock time: 1m 56s
Downloaded: 9 files, 1.0G in 1m 45s (10.1 MB/s)
wget -xP'/tmp' --accept '.tsv.gz' --no-parent --recursive   3.73s user 8.09s system 10% cpu 1:56.19 total
[4.0K]  /tmp/datasets.imdbws.com
├── [211M]  name.basics.tsv.gz
├── [ 945]  robots.txt.tmp
├── [247M]  title.akas.tsv.gz
├── [143M]  title.basics.tsv.gz
├── [ 56M]  title.crew.tsv.gz
├── [ 33M]  title.episode.tsv.gz
├── [366M]  title.principals.tsv.gz
└── [5.7M]  title.ratings.tsv.gz

0 directories, 8 files
neo4j_staging = "/tmp/neo4j_staging/datasets.imdbws.com"
!rm -fr {neo4j_staging}
!mkdir -p '{neo4j_staging}/data' '{neo4j_staging}/import' '{neo4j_staging}/logs'
!tree {neo4j_staging}
/tmp/neo4j_staging/datasets.imdbws.com
├── data
├── import
└── logs

3 directories, 0 files

Transform

%load_ext lab_black

import pyspark
from IPython.display import Markdown

# IMDb TSVs are tab-separated, unquoted, and use "\N" for missing values.
kwargs_read_csv = dict(header=True, nullValue=r"\N", sep="\t", quote="")
# Staged CSVs for neo4j-admin import: gzipped, with headers, overwritten on re-runs.
kwargs_write_csv = dict(compression="gzip", escape='"', header=True, mode="overwrite")

# Local Spark session using all available cores.
spark = pyspark.sql.SparkSession.builder.master("local[*]").getOrCreate()

name.basics.tsv.gz – Contains the following information for names:

  • nconst (string) - alphanumeric unique identifier of the name/person
  • primaryName (string) – name by which the person is most often credited
  • birthYear – in YYYY format
  • deathYear – in YYYY format if applicable, else '\N'
  • primaryProfession (array of strings) – the top-3 professions of the person
  • knownForTitles (array of tconsts) – titles the person is known for
spark.read.csv(
    "/tmp/datasets.imdbws.com/name.basics.tsv.gz",
    schema="""
        nconst string,
        primaryName string,
        birthYear integer,
        deathYear integer,
        primaryProfession string,
        knownForTitles string
    """,
    **kwargs_read_csv
).createOrReplaceTempView("`name.basics`")
spark.table("`name.basics`").limit(10).toPandas().T
0 1 2 3 4 5 6 7 8 9
nconst nm0000001 nm0000002 nm0000003 nm0000004 nm0000005 nm0000006 nm0000007 nm0000008 nm0000009 nm0000010
primaryName Fred Astaire Lauren Bacall Brigitte Bardot John Belushi Ingmar Bergman Ingrid Bergman Humphrey Bogart Marlon Brando Richard Burton James Cagney
birthYear 1899 1924 1934 1949 1918 1915 1899 1924 1925 1899
deathYear 1987.0 2014.0 NaN 1982.0 2007.0 1982.0 1957.0 2004.0 1984.0 1986.0
primaryProfession soundtrack,actor,miscellaneous actress,soundtrack actress,soundtrack,music_department actor,soundtrack,writer writer,director,actor actress,soundtrack,producer actor,soundtrack,producer actor,soundtrack,director actor,soundtrack,producer actor,soundtrack,director
knownForTitles tt0050419,tt0072308,tt0053137,tt0031983 tt0075213,tt0038355,tt0117057,tt0037382 tt0054452,tt0049189,tt0057345,tt0056404 tt0072562,tt0077975,tt0080455,tt0078723 tt0083922,tt0060827,tt0050986,tt0050976 tt0077711,tt0038109,tt0036855,tt0034583 tt0033870,tt0034583,tt0037382,tt0043265 tt0078788,tt0047296,tt0070849,tt0068646 tt0087803,tt0057877,tt0059749,tt0061184 tt0035575,tt0031867,tt0029870,tt0042041
spark.sql(
    """
    select nconst          as `nconst:ID(Name)`,
           primaryName     as `primaryName`,
           birthYear       as `birthYear:long`,
           deathYear       as `deathYear:long`,
           array_join(array('Name') ||
                      ifnull(transform(split(primaryProfession, ','), `_` -> 'primaryProfession=' || `_`),
                             array()),
                      ';') as `:LABEL`
    from `name.basics`
    """
).coalesce(1).write.csv(f"{neo4j_staging}/import/name.basics", **kwargs_write_csv)
spark.sql(
    """
    select nconst                              as `:START_ID(Name)`,
           'knownForTitles'                    as `:TYPE`,
           explode(split(knownForTitles, ',')) as `:END_ID(Title)`
    from `name.basics`
    """
).coalesce(1).write.csv(
    f"{neo4j_staging}/import/name.basics.knownForTitles", **kwargs_write_csv
)
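The column headers written above follow the neo4j-admin import conventions: ':ID(Name)' plus ':LABEL' describe nodes, while ':START_ID', ':TYPE', and ':END_ID' describe relationships. As an optional sanity check (not part of the original pipeline), the staged headers can be peeked at directly:

# Optional peek (illustrative only): print the header line of each staged CSV
# to confirm the neo4j-admin import column roles.
for staged in ["name.basics", "name.basics.knownForTitles"]:
    header = spark.read.text(f"{neo4j_staging}/import/{staged}").first()[0]
    print(staged, "->", header)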

title.akas.tsv.gz - Contains the following information for titles:

  • titleId (string) - a tconst, an alphanumeric unique identifier of the title
  • ordering (integer) – a number to uniquely identify rows for a given titleId
  • title (string) – the localized title
  • region (string) - the region for this version of the title
  • language (string) - the language of the title
  • types (array) - Enumerated set of attributes for this alternative title. One or more of the following: "alternative", "dvd", "festival", "tv", "video", "working", "original", "imdbDisplay". New values may be added in the future without warning
  • attributes (array) - Additional terms to describe this alternative title, not enumerated
  • isOriginalTitle (boolean) – 0: not original title; 1: original title
spark.read.csv(
    "/tmp/datasets.imdbws.com/title.akas.tsv.gz",
    schema="""
        titleId string,
        ordering integer,
        title string,
        region string,
        language string,
        types string,
        attributes string,
        isOriginalTitle integer
    """,
    **kwargs_read_csv
).createOrReplaceTempView("`title.akas`")
spark.table("`title.akas`").limit(10).toPandas().T
0 1 2 3 4 5 6 7 8 9
titleId tt0000001 tt0000001 tt0000001 tt0000001 tt0000001 tt0000001 tt0000001 tt0000001 tt0000002 tt0000002
ordering 1 2 3 4 5 6 7 8 1 2
title Карменсіта Carmencita Carmencita - spanyol tánc Καρμενσίτα Карменсита Carmencita Carmencita カルメンチータ Le clown et ses chiens Le clown et ses chiens
region UA DE HU GR RU US None JP None FR
language None None None None None None None ja None None
types imdbDisplay None imdbDisplay imdbDisplay imdbDisplay imdbDisplay original imdbDisplay original imdbDisplay
attributes None literal title None None None None None None None None
isOriginalTitle 0 0 0 0 0 0 1 0 1 0
spark.sql(
    """
    select titleId || '#' || ordering as `tconst:ID(TitleAka)`,
           title                      as `title`,
           region                     as `region`,
           language                   as `language`,
           boolean(isOriginalTitle)   as `isOriginalTitle:boolean`,
           array_join(array('TitleAka') ||
                      ifnull(transform(split(attributes, ','), `_` -> 'attributes=' || `_`),
                             array()) ||
                      ifnull(transform(split(types, ','), `_` -> 'types=' || `_`),
                             array()),
                      ';')            as `:LABEL`
    from `title.akas`
    """
).coalesce(1).write.csv(f"{neo4j_staging}/import/title.akas", **kwargs_write_csv)
spark.sql(
    """
    select titleId                    as `:START_ID(Title)`,
           'akas'                     as `:TYPE`,
           titleId || '#' || ordering as `:END_ID(TitleAka)`,
           ordering                   as `ordering:long`
    from `title.akas`
    """
).coalesce(1).write.csv(f"{neo4j_staging}/import/title.akas.akas", **kwargs_write_csv)

title.basics.tsv.gz - Contains the following information for titles:

  • tconst (string) - alphanumeric unique identifier of the title
  • titleType (string) – the type/format of the title (e.g. movie, short, tvseries, tvepisode, video, etc)
  • primaryTitle (string) – the more popular title / the title used by the filmmakers on promotional materials at the point of release
  • originalTitle (string) - original title, in the original language
  • isAdult (boolean) - 0: non-adult title; 1: adult title
  • startYear (YYYY) – represents the release year of a title. In the case of TV Series, it is the series start year
  • endYear (YYYY) – TV Series end year. ‘\N’ for all other title types
  • runtimeMinutes – primary runtime of the title, in minutes
  • genres (string array) – includes up to three genres associated with the title
spark.read.csv(
    "/tmp/datasets.imdbws.com/title.basics.tsv.gz",
    schema="""
        tconst string,
        titleType string,
        primaryTitle string,
        originalTitle string,
        isAdult integer,
        startYear integer,
        endYear integer,
        runtimeMinutes integer,
        genres string
    """,
    **kwargs_read_csv
).createOrReplaceTempView("`title.basics`")
spark.table("`title.basics`").limit(10).toPandas().T
0 1 2 3 4 5 6 7 8 9
tconst tt0000001 tt0000002 tt0000003 tt0000004 tt0000005 tt0000006 tt0000007 tt0000008 tt0000009 tt0000010
titleType short short short short short short short short short short
primaryTitle Carmencita Le clown et ses chiens Pauvre Pierrot Un bon bock Blacksmith Scene Chinese Opium Den Corbett and Courtney Before the Kinetograph Edison Kinetoscopic Record of a Sneeze Miss Jerry Leaving the Factory
originalTitle Carmencita Le clown et ses chiens Pauvre Pierrot Un bon bock Blacksmith Scene Chinese Opium Den Corbett and Courtney Before the Kinetograph Edison Kinetoscopic Record of a Sneeze Miss Jerry La sortie de l'usine Lumière à Lyon
isAdult 0 0 0 0 0 0 0 0 0 0
startYear 1894 1892 1892 1892 1893 1894 1894 1894 1894 1895
endYear NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
runtimeMinutes 1 5 4 12 1 1 1 1 40 1
genres Documentary,Short Animation,Short Animation,Comedy,Romance Animation,Short Comedy,Short Short Short,Sport Documentary,Short Romance,Short Documentary,Short

title.ratings.tsv.gz – Contains the IMDb rating and votes information for titles

  • tconst (string) - alphanumeric unique identifier of the title
  • averageRating – weighted average of all the individual user ratings
  • numVotes - number of votes the title has received
spark.read.csv(
    "/tmp/datasets.imdbws.com/title.ratings.tsv.gz",
    schema="""
        tconst string,
        averageRating float,
        numVotes integer
    """,
    **kwargs_read_csv
).createOrReplaceTempView("`title.ratings`")
spark.table("`title.ratings`").limit(10).toPandas().T
0 1 2 3 4 5 6 7 8 9
tconst tt0000001 tt0000002 tt0000003 tt0000004 tt0000005 tt0000006 tt0000007 tt0000008 tt0000009 tt0000010
averageRating 5.7 6.0 6.5 6.1 6.2 5.2 5.4 5.5 5.9 6.9
numVotes 1847 237 1609 154 2432 160 760 1992 192 6651
spark.sql(
    """
    select tconst           as `tconst:ID(Title)`,
           primaryTitle     as `primaryTitle`,
           originalTitle    as `originalTitle`,
           boolean(isAdult) as `isAdult:boolean`,
           startYear        as `startYear:long`,
           endYear          as `endYear:long`,
           runtimeMinutes   as `runtimeMinutes:long`,
           averageRating    as `averageRating:double`,
           numVotes         as `numVotes:long`,
           array_join(array('Title') ||
                      ifnull(transform(array(titleType), `_` -> 'titleType=' || `_`), array()) ||
                      ifnull(transform(split(genres, ','), `_` -> 'genres=' || `_`), array()),
                      ';')  as `:LABEL`
    from `title.basics`
             left join `title.ratings` using (tconst)
    """
).coalesce(1).write.csv(f"{neo4j_staging}/import/title.basics", **kwargs_write_csv)

title.crew.tsv.gz – Contains the director and writer information for all the titles in IMDb. Fields include:

  • tconst (string) - alphanumeric unique identifier of the title
  • directors (array of nconsts) - director(s) of the given title
  • writers (array of nconsts) – writer(s) of the given title
spark.read.csv(
    "/tmp/datasets.imdbws.com/title.crew.tsv.gz",
    schema="""
        tconst string,
        directors string,
        writers string
    """,
    **kwargs_read_csv
).createOrReplaceTempView("`title.crew`")
spark.table("`title.crew`").limit(10).toPandas().T
0 1 2 3 4 5 6 7 8 9
tconst tt0000001 tt0000002 tt0000003 tt0000004 tt0000005 tt0000006 tt0000007 tt0000008 tt0000009 tt0000010
directors nm0005690 nm0721526 nm0721526 nm0721526 nm0005690 nm0005690 nm0005690,nm0374658 nm0005690 nm0085156 nm0525910
writers None None None None None None None None nm0085156 None
spark.sql(
    """
    select tconst                         as `:START_ID(Title)`,
           'directors'                    as `:TYPE`,
           explode(split(directors, ',')) as `:END_ID(Name)`
    from `title.crew`
    union
    select tconst                       as `:START_ID(Title)`,
           'writers'                    as `:TYPE`,
           explode(split(writers, ',')) as `:END_ID(Name)`
    from `title.crew`
    """
).coalesce(1).write.csv(f"{neo4j_staging}/import/title.crew", **kwargs_write_csv)

title.episode.tsv.gz – Contains the tv episode information. Fields include:

  • tconst (string) - alphanumeric identifier of episode
  • parentTconst (string) - alphanumeric identifier of the parent TV Series
  • seasonNumber (integer) – season number the episode belongs to
  • episodeNumber (integer) – episode number of the tconst in the TV series
spark.read.csv(
    "/tmp/datasets.imdbws.com/title.episode.tsv.gz",
    schema="""
        tconst string,
        parentTconst string,
        seasonNumber integer,
        episodeNumber integer
    """,
    **kwargs_read_csv
).createOrReplaceTempView("`title.episode`")
spark.table("`title.episode`").limit(10).toPandas().T
0 1 2 3 4 5 6 7 8 9
tconst tt0020666 tt0020829 tt0021166 tt0021612 tt0021655 tt0021663 tt0021664 tt0021701 tt0021802 tt0022009
parentTconst tt15180956 tt15180956 tt15180956 tt15180956 tt15180956 tt15180956 tt15180956 tt15180956 tt15180956 tt15180956
seasonNumber 1 1 1 2 2 2 2 2 2 2
episodeNumber 2 1 3 2 5 6 4 1 11 10
spark.sql(
    """
    select parentTconst  as `:START_ID(Title)`,
           'episodes'    as `:TYPE`,
           tconst        as `:END_ID(Title)`,
           seasonNumber  as `seasonNumber:long`,
           episodeNumber as `episodeNumber:long`
    from `title.episode`
    """
).coalesce(1).write.csv(f"{neo4j_staging}/import/title.episode", **kwargs_write_csv)

title.principals.tsv.gz – Contains the principal cast/crew for titles

  • tconst (string) - alphanumeric unique identifier of the title
  • ordering (integer) – a number to uniquely identify rows for a given titleId
  • nconst (string) - alphanumeric unique identifier of the name/person
  • category (string) - the category of job that person was in
  • job (string) - the specific job title if applicable, else '\N'
  • characters (string) - the name of the character played if applicable, else '\N'
spark.read.csv(
    "/tmp/datasets.imdbws.com/title.principals.tsv.gz",
    schema="""
        tconst string,
        ordering integer,
        nconst string,
        category string,
        job string,
        characters string
    """,
    **kwargs_read_csv
).createOrReplaceTempView("`title.principals`")
spark.table("`title.principals`").limit(10).toPandas().T
0 1 2 3 4 5 6 7 8 9
tconst tt0000001 tt0000001 tt0000001 tt0000002 tt0000002 tt0000003 tt0000003 tt0000003 tt0000003 tt0000004
ordering 1 2 3 1 2 1 2 3 4 1
nconst nm1588970 nm0005690 nm0374658 nm0721526 nm1335271 nm0721526 nm1770680 nm1335271 nm5442200 nm0721526
category self director cinematographer director composer director producer composer editor director
job None None director of photography None None None producer None None None
characters ["Self"] None None None None None None None None None
spark.sql(
    """
    select tconst          as `:START_ID(Title)`,
           'principals'    as `:TYPE`,
           nconst          as `:END_ID(Name)`,
           ordering        as `ordering:long`,
           category        as `category`,
           job             as `job`,
           array_join(from_json(characters, 'array<string>'),
                      ';') as `characters`
    from `title.principals`
    """
).coalesce(1).write.csv(f"{neo4j_staging}/import/title.principals", **kwargs_write_csv)

Load

!tree -h '{neo4j_staging}/import'
[4.0K]  /tmp/neo4j_staging/datasets.imdbws.com/import
├── [4.0K]  name.basics
│   ├── [148M]  part-00000-4adc0db8-e1be-42b1-8009-093744e4c1a9-c000.csv.gz
│   └── [   0]  _SUCCESS
├── [4.0K]  name.basics.knownForTitles
│   ├── [ 98M]  part-00000-0fc6773a-b8ea-472d-afe7-843875c4197b-c000.csv.gz
│   └── [   0]  _SUCCESS
├── [4.0K]  title.akas
│   ├── [252M]  part-00000-c812541e-2d10-4815-91e7-7facc74186e5-c000.csv.gz
│   └── [   0]  _SUCCESS
├── [4.0K]  title.akas.akas
│   ├── [ 89M]  part-00000-d57d1d81-5587-4717-98ed-f1af47f10328-c000.csv.gz
│   └── [   0]  _SUCCESS
├── [4.0K]  title.basics
│   ├── [153M]  part-00000-2a4fa35f-fdc7-4596-b2b1-3b9c52bd1ead-c000.csv.gz
│   └── [   0]  _SUCCESS
├── [4.0K]  title.crew
│   ├── [117M]  part-00000-ddc7238b-bd6c-41e4-875a-c110a35cc530-c000.csv.gz
│   └── [   0]  _SUCCESS
├── [4.0K]  title.episode
│   ├── [ 35M]  part-00000-c3653084-841d-4972-a927-66d263d57cec-c000.csv.gz
│   └── [   0]  _SUCCESS
└── [4.0K]  title.principals
    ├── [373M]  part-00000-65eb0487-4aa4-4d26-b3f9-a3e89e919142-c000.csv.gz
    └── [   0]  _SUCCESS

8 directories, 16 files

Run the following command to ingest data into Neo4j:

Markdown(
    fr"""
```shell
docker pull neo4j:4.1.4-community

docker run \
   --rm \
   -e NEO4J_AUTH=none \
   -p 7474:7474 \
   -p 7687:7687 \
   -v {neo4j_staging}/data:/data \
   -v {neo4j_staging}/logs:/logs \
   -v {neo4j_staging}/import:/var/lib/neo4j/import \
 neo4j:4.1.4-community bin/neo4j-admin import \
   --database=imdb \
   --high-io=true \
   --max-memory=2G \
   --nodes='import/name.basics/.+.csv.gz' \
   --nodes='import/title.akas/.+.csv.gz' \
   --nodes='import/title.basics/.+.csv.gz' \
   --relationships='import/name.basics.knownForTitles/.+.csv.gz' \
   --relationships='import/title.akas.akas/.+.csv.gz' \
   --relationships='import/title.crew/.+.csv.gz' \
   --relationships='import/title.episode/.+.csv.gz' \
   --relationships='import/title.principals/.+.csv.gz' \
   --skip-bad-relationships=true \
   --skip-duplicate-nodes=true
```"""
)
docker pull neo4j:4.1.4-community

docker run \
   --rm \
   -e NEO4J_AUTH=none \
   -p 7474:7474 \
   -p 7687:7687 \
   -v /tmp/neo4j_staging/datasets.imdbws.com/data:/data \
   -v /tmp/neo4j_staging/datasets.imdbws.com/logs:/logs \
   -v /tmp/neo4j_staging/datasets.imdbws.com/import:/var/lib/neo4j/import \
 neo4j:4.1.4-community bin/neo4j-admin import \
   --database=imdb \
   --high-io=true \
   --max-memory=2G \
   --nodes='import/name.basics/.+.csv.gz' \
   --nodes='import/title.akas/.+.csv.gz' \
   --nodes='import/title.basics/.+.csv.gz' \
   --relationships='import/name.basics.knownForTitles/.+.csv.gz' \
   --relationships='import/title.akas.akas/.+.csv.gz' \
   --relationships='import/title.crew/.+.csv.gz' \
   --relationships='import/title.episode/.+.csv.gz' \
   --relationships='import/title.principals/.+.csv.gz' \
   --skip-bad-relationships=true \
   --skip-duplicate-nodes=true
IMPORT DONE in 9m 34s 920ms. 
Imported:
  50354453 nodes
  119199226 relationships
  455432935 properties
Peak memory usage: 748.0MiB

Run the following command to boot up Neo4j:

Markdown(
    fr"""
```shell
docker run \
    --rm \
    -e NEO4J_AUTH=none \
    -e NEO4J_dbms_default__database=imdb \
    -p 7474:7474 \
    -p 7687:7687 \
    -v {neo4j_staging}/data:/data \
    -v {neo4j_staging}/logs:/logs \
    -v {neo4j_staging}/import:/var/lib/neo4j/import \
  neo4j:4.1.4-community
```"""
)
docker run \
    --rm \
    -e NEO4J_AUTH=none \
    -e NEO4J_dbms_default__database=imdb \
    -p 7474:7474 \
    -p 7687:7687 \
    -v /tmp/neo4j_staging/datasets.imdbws.com/data:/data \
    -v /tmp/neo4j_staging/datasets.imdbws.com/logs:/logs \
    -v /tmp/neo4j_staging/datasets.imdbws.com/import:/var/lib/neo4j/import \
  neo4j:4.1.4-community
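The server takes a little while to come up. A minimal, stdlib-only way to wait for it (an optional sketch, assuming Neo4j is exposed on the default HTTP port 7474 as in the docker run command above):

# Optional sketch: poll the Neo4j HTTP endpoint until it responds.
import time
import urllib.request

for _ in range(60):
    try:
        with urllib.request.urlopen("http://localhost:7474") as response:
            if response.status == 200:
                print("Neo4j is up")
                break
    except OSError:
        time.sleep(5)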

Exploratory Data Analysis

MATCH (alex_garland:Name {nconst: 'nm0307497'}),
      (denis_villeneuve:Name {nconst: 'nm0898288'})
RETURN shortestPath((alex_garland)-[*..10]-(denis_villeneuve));
MATCH (alex_garland:Name {nconst: 'nm0307497'}),
      (denis_villeneuve:Name {nconst: 'nm0898288'})
RETURN allShortestPaths((alex_garland)-[*..5]-(denis_villeneuve));
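These queries can be run in the Neo4j Browser at http://localhost:7474, or from the notebook with the official Python driver. A minimal sketch, assuming the neo4j package is available (it is not listed in the setup above, so it may need to be added to environment.yml) and that authentication is disabled as in the docker run command:

# Sketch only: requires the `neo4j` Python driver; NEO4J_AUTH=none above means
# no credentials are needed.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687")
with driver.session(database="imdb") as session:
    record = session.run(
        """
        MATCH (alex_garland:Name {nconst: 'nm0307497'}),
              (denis_villeneuve:Name {nconst: 'nm0898288'})
        RETURN shortestPath((alex_garland)-[*..10]-(denis_villeneuve)) AS path
        """
    ).single()
    for node in record["path"].nodes:
        print(dict(node))
driver.close()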
