Welcome to the 2000s Movie Database. The dataset contains 2100 films released between 2000 and 2009. Data points include title, genre, year, language and country of production, content rating, duration, aspect ratio, director, cast, budget, box office, number of reviews (by critics and users) and IMDB score.
The Heroku-based command line interface (CLI) allows the user to browse the dataset and retrieve statistics, rankings and specific information. The instructions are written as simply as possible, and only minimal interaction is needed to achieve the desired result.
How it's done
How it works
Features
User Stories
Testing
Technologies
Deployment
Credits
This project is inspired by the five stages of Design Thinking, and its further development will strictly follow the same principles.
I'm a film blogger. I want to write about what ingredients make a film successful, and I want to do this by analysing film-related data from the last few decades. I also want to be able to explore and query my database whenever I write an article. But I don't know how to get meaningful information from my database.
I turned this problem into an opportunity.
I tried to understand why this problem is important to the blogger by getting to the heart of the matter and developing a targeted solution from there, keeping this question in mind: "What solves the problem according to the blogger's needs and goals?" I shared the blogger's vision and hit their needs right where they needed to be addressed.
To do this, I researched film blogs and conducted interviews that led to the creation of a potential user persona.
This program is coded with the potential needs of film bloggers in mind: people in their thirties or forties with intermediate to low IT skills who want to gain insights into their personalised film database and tell their followers about it.
Let us call our blogger Nastya.
- She wants to set up her database on her own with information about films that is relevant to her.
- The database should be hosted on software Nastya is familiar with.
- She wants to gain insight into this data through an "old-fashioned" and easy-to-use interface.
- Nastya wants a developer to code a program that processes the data and turns it into the information she is looking for, which she can then access through the interface.
- She has followers to whom she needs to deliver content, so it is important that the developer doesn't break this chain of expectations. Therefore, she is keen to find a developer with whom she can establish a close working relationship.
The brainstorming phase, followed by research, challenges and discussions, led to the integration of the following tools to solve the stated problem.
- Python and its libraries are ideal for working with small dataframes.
- The Heroku-based app provides an easy-to-use solution with an old-fashioned interface. (What is Heroku)
- Google Sheets can be shared and are intuitive and easy to edit.
- GitHub is the go-to solution for software development.
And that is how I arrived at the current prototype. It is a scaled-down version of the main idea with a demo of the potential features:
SEARCH the database by keywords and get pre-calculated DATA.
- The functions have been developed with the user's goal in mind and designed so that the user does not get lost when using the programme.
- The prompts are obvious, short and direct, but a HELP option is always available.
Other important points are:
- Consistency of the displayed messages in colour and language.
- Constant availability of information and functions.
- Creation of a recursive architecture to avoid dead ends.
- Handling of invalid inputs and the avoidance of unexpected behaviour.
My personal challenge was to put myself in the shoes of the user: What is clear, obvious and self-evident to the developer may not be to the user.
I brought together people with different IT abilities who matched the persona, our film blogger Nastya.
I presented the prototype to them and asked them to comment and raise questions while using the app.
I listened to their comments, observed their reactions, took notes and showed my appreciation for their feedback.
I then used their feedback to move back and forth between the Design Thinking stages.
This app, created with Heroku, offers simple functions but is well structured to facilitate further development, expansion of the dataset and troubleshooting.
I believe that in development, work is better than rework. Adding features according to the client's and team's input is more efficient and time-saving than removing program features that the developer has spent time on but the client doesn't actually need.
My goal is to meet the clients' needs by taking their suggestions and frequently keeping in touch with them through chat, emails and video calls. This way I can make timely and frequent adjustments and fix problems as they arise. I believe in offering a tailor-made solution that adds value to the client.
My promise to the client is that I'll take care of all phases of development while striving for improvement. The client is on board with the developer through the plan > design > develop > test > release > review cycle. I'm highly motivated to develop this project and want to assemble a team of goal-oriented, autonomous and empowered programmers.
Read more about the guiding principles of Agile Development
- The user is welcomed by a large title and a short message presenting the dataset and its main functionalities.
- The app has two main features: display processed data and perform queries.
- Side features are HELP and EXIT which can be invoked at any point by typing the desired functionality after any question (outside the search functionalities).
- The first time the app is launched, the user is offered the choice to get HELP, EXIT the program or press the Enter key to continue (useful when the user is already familiar with the app and wants to skip the HELP section).
- Through a series of questions, the user is led to the desired output.
- Each answer (input) from the user is verified. If the check fails, a message explaining the error is shown and the question is asked again.
- After each result (output), the user is returned to the main question and can choose again how to explore the database (SEARCH/DATA).
- The app won't terminate unless the user types exit, closes the window or refreshes the page.
- The app is designed to avoid dead ends that would force the user to restart the app in order to continue. The user can always type a command.
A flowchart of the program's main process was created with Lucid.app.
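The flow described above (validate every answer, keep HELP and EXIT available, return to the main question after each output) can be sketched as follows. This is a minimal illustration, not the project's actual code: the names `ask`, `main`, `HELP_TEXT` and the wording of the error message are assumptions, and a plain loop stands in for whatever control structure the app really uses.

```python
HELP_TEXT = "Type SEARCH to query the dataset or DATA for statistics."


def ask(prompt, valid, read=input, write=print):
    """Repeat the question until the answer is valid; HELP is always available."""
    while True:
        answer = read(prompt).strip().lower()  # input is not case-sensitive
        if answer == "help":
            write(HELP_TEXT)
        elif answer == "exit" or answer in valid:
            return answer
        else:
            write(f"'{answer}' is not a valid option. Please try again.")


def main(read=input, write=print):
    """Return to the main question after every output, so there are no dead ends."""
    while True:
        choice = ask("Type SEARCH or DATA to explore the database: ",
                     {"search", "data"}, read, write)
        if choice == "exit":
            write("Thank you! Goodbye!")
            return
        # Placeholder for the real SEARCH / DATA features.
        write(f"[{choice.upper()} output would be shown here]")
```

Passing `read` and `write` as parameters is only a convenience for trying the loop with scripted answers; calling `main()` with no arguments uses the terminal.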
The option is available by typing data in response to the question "Type SEARCH or DATA to explore the database:", which is asked after each output. Users are offered ten options with pre-calculated statistics and rankings to choose from.
- The average budget, score and duration of the films of this decade.
- Number of films in each language.
- Number of films produced each year.
- The most prolific directors of the decade and their scores.
- Top 10 countries that produced films with the highest IMDB score.
- The 10 best films of the decade.
- The 10 worst films of the decade.
- The most profitable films in terms of return on investment.
- Top 10 box-office flops: the most unprofitable films.
- The content ratings and their average IMDB Score.
After the choice is validated and the output displayed, the user is returned to the main question and can choose again how to explore the database (SEARCH/DATA).
The option is available by typing search in response to the question "Type SEARCH or DATA to explore the database:", which is asked after each output.
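A few of the statistics listed above can be computed with pandas in one or two lines each. The sketch below uses a tiny made-up table; the column names (`title`, `year`, `budget`, `gross`, `duration`, `score`) and all values are assumptions for illustration and do not come from the real spreadsheet.

```python
import pandas as pd

# Hypothetical sample data; the real dataset has 2100 rows and its own headers.
films = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "year": [2000, 2000, 2001, 2001],
    "budget": [10.0, 20.0, 5.0, 40.0],
    "gross": [30.0, 10.0, 20.0, 40.0],
    "duration": [90, 120, 100, 110],
    "score": [7.0, 5.0, 8.0, 6.0],
})

# Average budget, score and duration of the decade.
averages = films[["budget", "score", "duration"]].mean()

# Number of films produced each year.
films_per_year = films.groupby("year")["title"].count()

# Best and worst films by score.
best = films.nlargest(2, "score")["title"].tolist()
worst = films.nsmallest(2, "score")["title"].tolist()

# Return on investment: profit relative to budget.
roi = (films["gross"] - films["budget"]) / films["budget"]
most_profitable = films.assign(roi=roi).nlargest(2, "roi")["title"].tolist()
```

In the app these results are described as pre-calculated, so a computation like this would run once rather than on every request.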
- Users can browse the dataset searching by title, genre, actor and director and get info related to that entry.
- Matching is also possible with partial text but limited to 10 results due to Heroku's terminal constraints (80 characters by 24 rows), so a targeted entry will yield accurate results.
- Searching by title is the only query that returns all available information (genre, year, language and country of production, content rating, duration, aspect ratio, director, cast, budget, box office, number of reviews and IMDB score).
- The other options, which are more likely to find multiple matches, display only the most relevant information (title, genre, director, cast and IMDB score) to improve readability given the Heroku terminal limitations mentioned above.
- After the choice is validated and the output displayed, the user is returned to the main question and can choose again how to explore the database (SEARCH/DATA).
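The search behaviour described in this list (partial matching, at most 10 results, full details only for title searches) could be sketched like this. The function name `search`, the constants and the column names are hypothetical, and the sample rows are invented.

```python
import pandas as pd

MAX_RESULTS = 10  # the 80x24 terminal cannot show much more
SUMMARY_COLUMNS = ["title", "genre", "director", "cast", "score"]


def search(films, column, text):
    """Return up to MAX_RESULTS rows whose `column` contains `text`."""
    mask = films[column].str.contains(text, case=False, regex=False)
    matches = films[mask].head(MAX_RESULTS)
    # Title searches show every field; the other searches show a summary only.
    return matches if column == "title" else matches[SUMMARY_COLUMNS]


# Invented sample: twelve films by the same director.
films = pd.DataFrame({
    "title": [f"Film {i}" for i in range(12)],
    "genre": ["Drama"] * 12,
    "director": ["Jane Doe"] * 12,
    "cast": ["Some Actor"] * 12,
    "score": [7.0] * 12,
    "year": [2000] * 12,
})

by_director = search(films, "director", "jane")  # capped at 10 rows, summary columns
by_title = search(films, "title", "film 1")      # all columns
```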
The exit option can be called by typing exit after each prompt (outside the search functions). The function prints the message "Thank you! Goodbye!", clears the screen and causes the program to quit after 3 seconds.
The exit function is not available in the search functions because it could be part of a name or title, therefore causing the app to quit and the result not to be shown.
The app can be restarted by clicking Heroku's red button "RUN PROGRAM" above the terminal.
The help option can be called by typing help after each prompt (outside the search functions). The function provides basic information about the dataframe and instructions on how to explore the program. The help function is not available in the search functions because the word could be part of a name or title, in which case the help text would be displayed instead of the search result ('Help' is a movie from 2021 and 'The Help' is a movie from 2011). After the help text is displayed, the user is asked to press the Enter key to continue.
All functions have a general purpose and can be applied to a similar dataset or, for this particular project, allow the current dataset to be extended with minimal further implementation.
- The app is intuitive and the instructions are clear and simple, requiring minimal interaction from the user to achieve the result.
- The text displayed on the black background of Heroku's CLI is legible and bright. The four colours (blue, yellow, red, white) are chosen consistently to differentiate instructions, messages, errors and outputs.
- Input isn't case-sensitive, but output is consistently presented with the first letter capitalised.
- The code is iterative so that users can perform multiple searches/actions without restarting the program.
- The program can be restarted at any time by clicking the red "Run Program" button on the Heroku app page.
- The app is not available on mobile; it is accessible from desktop only.
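The consistent use of four colours can be illustrated with a small helper. The project lists colorama as a dependency; to keep this sketch free of third-party imports it uses the raw ANSI escape sequences that colorama's `Fore.BLUE`, `Fore.YELLOW`, `Fore.RED`, `Fore.WHITE` and `Style.RESET_ALL` constants stand for. Which colour goes with which kind of text is an assumption, as is the helper name `paint`.

```python
# Assumed mapping of message kinds to colours (ANSI foreground codes).
COLOURS = {
    "instruction": "\033[34m",  # blue
    "message": "\033[33m",      # yellow
    "error": "\033[31m",        # red
    "output": "\033[37m",       # white
}
RESET = "\033[0m"


def paint(kind, text):
    """Wrap `text` in the colour reserved for this kind of message."""
    return f"{COLOURS[kind]}{text}{RESET}"


print(paint("instruction", "Type SEARCH or DATA to explore the database:"))
print(paint("error", "Invalid input, please try again."))
```

Routing every printed string through one helper like this is one way to keep colours from drifting apart between functions.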
Some potential features include:
- Searches possible with two or more options at the same time (e.g.: search by genre AND actor, search by actor AND director).
- A collection of films from the 90s and 10s to be added to the dataset.
- Additional statistics and lists.
- Deployment with Jupyter Lab to create meaningful histograms, distributions and charts.
Future features will be based on the users' requests and consequent necessities.
The following user stories with their respective acceptance criteria and tasks are available on the Issues tab of this repository. The user stories were considered completed and subsequently closed.
Looking at our "persona" from the design thinking process, the following user story was the crucial point around which I created an efficient query. The acceptance criteria points have been addressed and documented in the following Fixed Issue section. The User Story is available here.
The instructions have been tailored looking at our "persona" with intermediate to low IT skills. All the tasks were accomplished and documented in the How it works and Features sections of this file. The User Story is available here.
I manually tested this project throughout the development process by doing the following:
- I ran the code through the PEP8 linter.
- Gave invalid input and checked the logical and visual consistency of the error messages.
- Entered substrings, extended ASCII characters, strings containing ' (apostrophe), and lower and upper case letters.
- Checked how many lines to display for better readability.
- Tested colours and their consistency for better readability.
The user will test the program simply by using it and will be asked to provide feedback.
The program has so far proven to be free of arithmetic, syntax, resource, multi-threading and interfacing bugs. The program operates correctly and doesn't terminate abnormally. The following logical errors produced undesired output: while the output was consistent with the input, a much broader result was desired.
- Matching is not possible with a partial string, e.g. the title must be complete and an actor or director must be searched by full name in order to display the desired result.
  - Solution: Implementation of a nested loop to work efficiently with a multi-dimensional data structure like this dataset (a list that contains other lists). The loop iterates through the spreadsheet and its columns; if the substring provided by the user is matched, a boolean variable is set to true and the output is displayed.
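The nested-loop matching described in this fix could look roughly like the sketch below. The function name `find_matches` and the sample rows are made up, and lowercasing both sides is an added assumption to keep the comparison case-insensitive.

```python
def find_matches(rows, text):
    """Return every row in which at least one cell contains `text`."""
    needle = text.lower()
    matches = []
    for row in rows:              # outer loop: one film per row
        found = False
        for cell in row:          # inner loop: the columns of that film
            if needle in str(cell).lower():
                found = True      # the boolean flag mentioned above
                break
        if found:
            matches.append(row)
    return matches


# The dataset is a list that contains other lists.
rows = [
    ["Gladiator", "Ridley Scott"],
    ["Memento", "Christopher Nolan"],
    ["Hannibal", "Ridley Scott"],
]
ridley_films = find_matches(rows, "ridley")
```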
- Extended ASCII characters (character codes 128-255) present in some names couldn't be matched by typing printable ASCII characters (character codes 32-127).
  - Solution: In each search function (title, director, actor, genre) I created a copy of the dataframe and applied the normalize, encode and decode methods to the Series (columns) I wanted to parse. I applied unicodedata's normalize to the user's input.
  - Explanation: In this way, strings with diacritics (extended ASCII) can be matched by typing the closest Latin letter (printable ASCII). The normalization method decomposes a letter with a diacritic into its equivalent Latin character and its diacritic symbol. Additionally, similar names with different diacritics, such as (Zoe/Zoë/Zoé) and (Chloe/Chloë/Chlöe/Chloé), will be matched in all of their forms, e.g. Input: "Zoe" Output: "Zoe Saldaña", "Zoë Kravitz". Example of the output
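The normalize, encode and decode chain from this fix can be sketched as follows. The choice of the `"NFKD"` form, the helper name `strip_diacritics` and the sample names are assumptions; the text above only says that normalization was applied to both the column copy and the user's input.

```python
import unicodedata

import pandas as pd


def strip_diacritics(text):
    """Decompose accented letters, then drop the non-ASCII diacritic marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", errors="ignore").decode("utf-8")


cast = pd.Series(["Zoe Saldaña", "Zoë Kravitz", "Chloë Sevigny"])

# Work on a normalized copy so the original spelling is kept for display.
plain = (cast.str.normalize("NFKD")
             .str.encode("ascii", errors="ignore")
             .str.decode("utf-8"))

query = strip_diacritics("Zoé")  # becomes "Zoe"
matches = cast[plain.str.contains(query, regex=False)].tolist()
```

The match is made against the normalized copy, while the rows returned come from the original Series, so the diacritics still appear in the output.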
- Entries with ' (apostrophe) are not matched by the queries.
  - Definition of the problem: Apostrophes are found in movie titles as contractions or possessives and in names, as part of the name or quoting a nickname.
  - Context of the problem: In an attempt to match the user's input (whether lowercase or uppercase) with the case of the dataset entries, the .title() method was applied to the user's input. Sample of dataset entries
  - Reason of the problem: Apostrophes act as word boundaries; this became evident when applying the .title() method to the user's input. Example, which doesn't match (Ripley's) as present in the dataset:

        input = "ripley's"
        input = input.title()  # Ripley'S

  - Attempt to solve the problem: I harnessed the .title() method's behaviour by passing the user's input as an argument to a function that used a regex, as suggested here.
  - It failed because: It worked for the abovementioned example but not for names containing quoted nicknames like (Joanna 'JoJo' Levesque) and names like (Mo'Nique) or (DJ Pooh).
  - Solution: I applied the .lower() method to the user's input and to the copy of the dataframe so that the query makes an exact comparison. The .lower() method also proved useful for matching movie titles such as (Mission: Impossible II) or (Jurassic Park III), which would otherwise have escaped the query with the .title() method.
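The difference between the two approaches to case handling can be reproduced with plain strings. The entries below are illustrative samples in the spirit of the names quoted above, and the two helper functions are hypothetical.

```python
# Illustrative entries; not copied from the real spreadsheet.
entries = [
    "Ripley's Game",
    "Joanna 'JoJo' Levesque",
    "Mission: Impossible II",
]


def matches_with_title(user_input):
    """First attempt: title-case the input and compare with the raw entries."""
    return [e for e in entries if user_input.title() in e]


def matches_with_lower(user_input):
    """Fix: lowercase both the input and the entries before comparing."""
    return [e for e in entries if user_input.lower() in e.lower()]


# .title() capitalises the letter after an apostrophe and lowercases "II".
print("ripley's".title())                # Ripley'S
print("mission: impossible ii".title())  # Mission: Impossible Ii

print(matches_with_title("ripley's"))    # []
print(matches_with_lower("ripley's"))    # ["Ripley's Game"]
```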
- Input ? in the search function resulted in an error and an app crash.
  - Solution: The following message was shown: Error: nothing to repeat at position 0, and no further actions were possible through the app's CLI. The issue was fixed by setting regex=False in the .contains() method. In this way, the input is treated as a literal string. Documentation is available here.
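The crash and its fix can be reproduced in a few lines. With the default `regex=True`, pandas treats the pattern as a regular expression, and a lone `?` is a quantifier with nothing to repeat. The sample titles are for illustration only, and `dtype=object` is set explicitly (an assumption about the environment) so that the pattern is compiled by Python's `re` module.

```python
import re

import pandas as pd

titles = pd.Series(
    ["Dude, Where's My Car?", "What Women Want", "O Brother, Where Art Thou?"],
    dtype=object,
)

# Default behaviour: the input is compiled as a regular expression.
try:
    titles.str.contains("?")
    error_message = None
except re.error as err:
    error_message = str(err)

# Fix: treat the input as a literal string.
literal = titles[titles.str.contains("?", regex=False)].tolist()
```

Because user input is free text, treating it as a literal also protects against other regex metacharacters such as `*`, `+` or `(`.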
While fixing these issues I worked with Unicode normalization, the .lower() and .title() methods, regular expressions, lambda functions, and the nature and behaviour of Python's pandas objects, and I practised debugging by printing intermediate results.
The terminal constraints don't allow large results and graphs to be displayed. When the project undergoes further development, a different deployment system may be taken into consideration.
- PEP8: no errors were returned from the PEP8 validator.
The project is coded with Python and relies on pandas 1.4.2 to analyse data.
- unicodedata
- pyfiglet 0.8.post1
- pandas 1.4.2, which was installed along with seaborn 0.11.2
- colorama 0.4.4
- The dataset is a Google Sheet file. A copy of the dataset is available in this repository.
The project is coded and hosted on GitHub and deployed with Heroku.
The steps needed to deploy this project are as follows:
- Create a requirements.txt file in GitHub, for Heroku to read, listing the dependencies the program needs in order to run.
- Push the recent changes to GitHub and go to your Heroku account page to create and deploy the app running the project.
- Choose "CREATE NEW APP", give it a unique name, and select a geographical region.
- From the Settings tab, configure the environment variables (config var section).
- If the project has credentials, copy/paste the content of the CREDS.json file into the VALUE field, type CREDS in the corresponding KEY box and click the "ADD" button.
- Create another config var, set PORT as KEY and assign it the VALUE 8000.
- Add two buildpacks from the Settings tab. The ordering is as follows:
  - heroku/python
  - heroku/nodejs
- From the Deployment tab, choose GitHub as the deployment method, connect to GitHub and select the project's repository.
- Click "Enable Automatic Deploys" or choose "Deploy Branch" from the Manual Deploy section.
- Wait for the logs to run while the dependencies are installed and the app is being built.
- The mock terminal is then ready and accessible from a link similar to https://your-projects-name.herokuapp.com/
Extract from Heroku Incident 2413:
Based on Salesforce’s initial investigation, it appears that unauthorized access to Heroku's GitHub account was the result of a compromised OAuth token. Salesforce immediately disabled the compromised user’s OAuth tokens and disabled the compromised user’s GitHub account. Additionally, GitHub reported that the threat actor was enumerating GitHub customer accounts using OAuth tokens issued to Heroku’s OAuth integration dashboard hosted on GitHub.
Since this issue arose, and until further notice or in case automatic deployments are not available for whatever reason, the steps to deploy the Heroku app are as follows:
A visual example of the following instructions can be found here.
Deploying your app to Heroku:
- Log in to Heroku and enter your details. From GitPod bash, enter:
command: heroku login -i
- Get your app name from Heroku.
command: heroku apps
- Set the Heroku remote. (Replace <app_name> with your actual app name)
command: heroku git:remote -a <app_name>
- Add and commit your changes.
command: git add . && git commit -m "Deploy to Heroku via CLI"
- Push to both GitHub and Heroku.
command: git push origin main
command: git push heroku main
In case the app needs API keys (for example, when MFA/2FA is enabled), these additional steps have to be considered:
- Click on Account Settings (under the avatar menu).
- Scroll down to the API Key section and click Reveal. Copy the key.
- Enter the command heroku_config and, when prompted, enter the API key you copied.
- Complete the steps above. If you see an input box at the top middle of the editor:
  - a. enter your Heroku username
  - b. enter the API key you just copied
Note: Thanks to Code Institute for providing the abovementioned Heroku app deployment steps.
By forking this GitHub repository you make a copy of the original repository in your GitHub account to view and/or make changes without affecting the original repository. The steps to fork the repository are as follows:
- Locate the GitHub repository of this project and log into your GitHub account.
- Click on the "Fork" button, on the top right of the page, just above the Settings.
- Decide where to fork the repository (your account for instance)
- You now have a copy of the original repository in your GitHub account.
Cloning a repository pulls down a full copy of all the repository data that GitHub.com has at that point in time, including all versions of every file and folder for the project. The steps to clone a repository are as follows:
- Locate the GitHub repository of this project and log into your GitHub account.
- Click on the "Code" button, on the top right of the page, next to the green "Gitpod" button.
- Choose one of the available options: Clone with HTTPS, Open with GitHub Desktop, Download ZIP.
- To clone the repository using HTTPS, under "Clone with HTTPS", copy the link.
- Open Git Bash. How to download and install.
- Choose the location where you want the repository to be created.
- Type:
$ git clone https://github.com/cla-cif/movie-DB-2000s.git
- Press Enter, the following lines will appear and your repository is now created.
Cloning into 'movie-DB-2000s'...
remote: Enumerating objects: 257, done.
remote: Counting objects: 100% (257/257), done.
remote: Compressing objects: 100% (182/182), done.
remote: Total 257 (delta 157), reused 158 (delta 72), pack-reused 0
Receiving objects: 100% (257/257), 54.76 KiB | 549.00 KiB/s, done.
Resolving deltas: 100% (157/157), done.
- Click here for a more detailed explanation.
- All content written by developer Claudia Cifaldi - cla-cif on GitHub.
- The template used for this project belongs to CodeInstitute - GitHub and website.
- The dataset is part of Kaggle's "The Movies Dataset" under CC0: Public Domain Licence.