Hadoop MapReduce using AWS
Steps:
- Create a Maven project.
- Create the Map and Reduce classes.
- Update dependencies in pom.xml.
- Run the project as a Java application.
- Test the application locally.
- Export the project as a Runnable JAR file.
- Create an Amazon S3 bucket and upload all input files and JARs to it.
- Create a cluster in Amazon EMR, selecting the required EC2 instance type and the number of instances. One instance will be the master and the others will be slave instances.
- Once the instances are running, create a task by selecting the JAR and input files uploaded earlier and specifying an output folder in the S3 bucket.
- When the task completes, the output is written to the specified output folder in the S3 bucket.
- Download the output files if required.
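
For local testing before the JAR is uploaded, the logic of the Map and Reduce classes can be sketched in plain Java. The example below is only an illustration and is not the project's actual code: it uses standard collections instead of Hadoop's Mapper and Reducer classes so that it runs without a cluster, the class and method names (`WordCountSketch`, `map`, `reduce`, `run`) are hypothetical, and lowercasing plus whitespace splitting are assumed tokenization rules.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical class: mimics the map, shuffle and reduce phases without Hadoop.
class WordCountSketch {

    // Map phase: emit a (word, 1) pair for every whitespace-separated token.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                // Assumption: words are compared case-insensitively.
                pairs.add(new SimpleEntry<>(token.toLowerCase(), 1));
            }
        }
        return pairs;
    }

    // Reduce phase: sum all counts that were emitted for a single word.
    static int reduce(List<Integer> counts) {
        int sum = 0;
        for (int count : counts) {
            sum += count;
        }
        return sum;
    }

    // Driver: group mapped pairs by key (the shuffle step), then reduce each group.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            result.put(group.getKey(), reduce(group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints {cat=2, ran=1, sat=1, the=2}
        System.out.println(run(Arrays.asList("the cat sat", "the cat ran")));
    }
}
```

In an actual Hadoop job the grouping in `run` is performed by the framework between the map and reduce phases, so only the bodies of `map` and `reduce` would carry over to the real classes.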
Tasks Completed:
- Counting the number of occurrences of each word in the given input file.
- Counting the number of occurrences of each two-word sequence in the given input file.
- Counting the number of occurrences of the individual words in the given input file that also appear in another text file, using Hadoop's distributed cache feature on AWS.
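
The last two tasks differ from the plain word count mainly in what the map side emits. The following is a rough plain-Java sketch of that difference, not the project's implementation: the names `CountVariantsSketch`, `countWordPairs` and `countListedWords` are hypothetical, pairs are assumed not to span line boundaries, and the `Set` argument stands in for the word list that a Hadoop job would read from the cached file (typically in the mapper's `setup` method).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical class: plain-Java stand-ins for the two-word and filtered counts.
class CountVariantsSketch {

    // Split a line into lowercase tokens on whitespace (assumed tokenization).
    static String[] tokens(String line) {
        String trimmed = line.trim().toLowerCase();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    // Count every pair of adjacent words within a line.
    static Map<String, Integer> countWordPairs(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            String[] t = tokens(line);
            for (int i = 0; i + 1 < t.length; i++) {
                counts.merge(t[i] + " " + t[i + 1], 1, Integer::sum);
            }
        }
        return counts;
    }

    // Count only the words that appear in the lookup set.
    // The set plays the role of the second text file shared through the cache.
    static Map<String, Integer> countListedWords(List<String> lines, Set<String> wanted) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String word : tokens(line)) {
                if (wanted.contains(word)) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("the cat sat", "the cat ran");
        Set<String> wanted = new HashSet<>(Arrays.asList("cat", "ran"));
        // Prints {cat ran=1, cat sat=1, the cat=2}
        System.out.println(countWordPairs(lines));
        // Prints {cat=2, ran=1}
        System.out.println(countListedWords(lines, wanted));
    }
}
```

Keeping the lookup words in an in-memory set means each input word is checked in constant time, which is the usual reason for distributing a small file to every node rather than joining against it.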