Piotr Wendykier edited this page Sep 20, 2013 · 8 revisions

Introduction

CoAnSys (COntent ANalysis SYStem) is a framework for mining scientific publications using Apache Hadoop. It is developed primarily by employees of the Centre for Open Science (CeON) at the Interdisciplinary Centre for Mathematical and Computational Modelling (ICM), University of Warsaw (UW).

Apache Hadoop

To run the application, you need Cloudera's Hadoop installed. The installation steps are given below. IMPORTANT: Because of a bug in the Oozie version shipped with Cloudera's Hadoop (the bug is fixed in the Oozie version available in the source code repository), you need Oracle Java JDK 1.6 installed. Oozie does not work with JDK 1.7.
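The JDK requirement above can be checked before installing anything. The following is a minimal sketch, not part of the original instructions: the helper names `java_major_minor` and `is_jdk6` are hypothetical, and the parsing assumes the legacy `1.x.y_zz` version string that JDK 6 and 7 print.

```shell
# Extract "major.minor" (e.g. 1.6) from the first line of `java -version` output.
# ASSUMPTION: the version is quoted and uses the legacy "1.x.y_zz" scheme.
java_major_minor() {
    # $1: a line such as: java version "1.6.0_45"
    printf '%s\n' "$1" | sed -n 's/.*"\([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}

# Succeeds only when the given version line reports JDK 1.6.
is_jdk6() {
    [ "$(java_major_minor "$1")" = "1.6" ]
}

# Guarded usage: runs only when a java binary is on the PATH.
# `java -version` writes to stderr, hence the redirection.
if command -v java >/dev/null 2>&1; then
    version_line=$(java -version 2>&1 | head -n 1)
    if is_jdk6 "$version_line"; then
        echo "JDK 1.6 detected: $version_line"
    else
        echo "Warning: Oozie from CDH4 needs JDK 1.6, found: $version_line"
    fi
fi
```

If the warning appears, install JDK 1.6 and make it the default `java` before continuing with the Oozie steps.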

The instructions below show how to install Cloudera Hadoop CDH4 with MRv1, following the Cloudera CDH4 installation guide.

It is important to know that Hadoop can be run in one of three modes:

  • standalone mode - runs all of the Hadoop processes in a single JVM, which makes it easy to debug the application.
  • pseudo-distributed mode - runs a full-fledged Hadoop on your local computer.
  • distributed mode - runs the application on a cluster consisting of many nodes/hosts.

Below we show how to install Hadoop in pseudo-distributed mode first, while keeping the option to switch between standalone and pseudo-distributed mode.

Installation

Installing Hadoop in pseudo-distributed mode on 64-bit Ubuntu 12.04 (based on the Cloudera CDH4 pseudo-distributed mode installation guide):

  • create a new file /etc/apt/sources.list.d/cloudera.list with contents:

      deb http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh precise-cdh4 contrib
      deb-src http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh precise-cdh4 contrib
    
  • add a repository key:

      curl -s http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh/archive.key | sudo apt-key add -
    
  • update the package index:

      sudo apt-get update
    
  • install the packages:

      sudo apt-get install hadoop-0.20-conf-pseudo
    
  • next, follow the steps in the Cloudera CDH4 pseudo-distributed mode installation guide, starting from "Step 1: Format the NameNode".

Configuration

Switching between Hadoop modes

When you have Hadoop installed, you can switch between standalone and pseudo-distributed configurations (or other kinds of configurations) of Hadoop using the update-alternatives command, e.g.:

  • update-alternatives --display hadoop-conf lists the available configurations and shows which one is currently active
  • sudo update-alternatives --set hadoop-conf /etc/hadoop/conf.empty sets the active configuration to /etc/hadoop/conf.empty, which corresponds to Hadoop standalone mode.
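Switching modes can be wrapped in a small helper. This is a sketch under stated assumptions: the function names are hypothetical, and the pseudo-distributed directory `/etc/hadoop/conf.pseudo.mr1` is an assumption about what the `hadoop-0.20-conf-pseudo` package installs, so confirm it against the output of `update-alternatives --display hadoop-conf` first.

```shell
# Map a mode name to a Hadoop configuration directory.
hadoop_conf_dir() {
    case "$1" in
        standalone) echo "/etc/hadoop/conf.empty" ;;
        # ASSUMPTION: directory name for the MRv1 pseudo-distributed config;
        # verify with `update-alternatives --display hadoop-conf`.
        pseudo)     echo "/etc/hadoop/conf.pseudo.mr1" ;;
        *)          echo "unknown mode: $1" >&2; return 1 ;;
    esac
}

# Hypothetical helper: activate the configuration for the given mode.
switch_hadoop_mode() {
    conf_dir=$(hadoop_conf_dir "$1") || return 1
    sudo update-alternatives --set hadoop-conf "$conf_dir"
}
```

With these functions loaded, `switch_hadoop_mode standalone` selects standalone mode and `switch_hadoop_mode pseudo` switches back.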

Web interfaces

You can reach the web interfaces of the following services at these addresses:

  • NameNode - provides a web console for viewing HDFS, the number of DataNodes, and logs - http://localhost:50070/
    • In the pseudo-distributed configuration, you should see one live DataNode named "localhost".
  • JobTracker - allows viewing the completed, currently running, and failed jobs along with their logs - http://localhost:50030/
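The two addresses above can also be probed from the command line, which helps when no browser is available on the machine. The sketch below assumes `curl` is installed; the function names `service_url` and `check_service` are hypothetical.

```shell
# Return the web interface address of a service, as listed above.
service_url() {
    case "$1" in
        namenode)   echo "http://localhost:50070/" ;;
        jobtracker) echo "http://localhost:50030/" ;;
        *)          return 1 ;;
    esac
}

# Report whether the web interface of the given service answers over HTTP.
check_service() {
    url=$(service_url "$1") || { echo "$1: unknown service"; return 1; }
    # -s: silent, -f: fail on HTTP errors, --max-time: give up after 5 seconds
    if curl -s -f -o /dev/null --max-time 5 "$url"; then
        echo "$1: reachable at $url"
    else
        echo "$1: NOT reachable at $url"
        return 1
    fi
}
```

Running `check_service namenode` and `check_service jobtracker` after starting Hadoop in pseudo-distributed mode should report both services as reachable.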

Oozie

Apache Oozie Workflow Scheduler for Hadoop is a workflow and coordination service for managing Apache Hadoop jobs. The description below is based on the Cloudera CDH4 Oozie installation guide.

  • Install Oozie with

      sudo apt-get install oozie oozie-client
    
  • Create Oozie database schema

      sudo -u oozie /usr/lib/oozie/bin/ooziedb.sh create -run
    
    • this should produce output similar to the following:

        Validate DB Connection
        DONE
        Check DB schema does not exist
        DONE
        Check OOZIE_SYS table does not exist
        DONE
        Create SQL schema
        DONE
        Create OOZIE_SYS table
        DONE
      
        Oozie DB has been created for Oozie version '3.1.3-cdh4.0.1'
      
        The SQL commands have been written to: /tmp/ooziedb-8221670220279408806.sql
      
  • Install version 2.2 of ExtJS library:

  • Install Oozie ShareLib:

      mkdir /tmp/ooziesharelib
      cd /tmp/ooziesharelib
      tar -zxf /usr/lib/oozie/oozie-sharelib.tar.gz
      sudo -u hdfs hadoop fs -mkdir /user/oozie
      sudo -u hdfs hadoop fs -chown oozie /user/oozie
      sudo -u oozie hadoop fs -put share /user/oozie/share
    
  • Start the Oozie server:

      sudo service oozie start
    
  • Check the status of the server:

    • From command-line:

        oozie admin -oozie http://localhost:11000/oozie -status
      

    The command should print:

      	System mode: NORMAL
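Because the server may need a moment to come up after `sudo service oozie start`, the status check can be repeated until it reports NORMAL. This is a minimal sketch: `oozie_is_normal` and `wait_for_oozie` are hypothetical helper names, and the retry count and 3-second delay are arbitrary choices.

```shell
# Succeeds when the given status text reports normal system mode.
oozie_is_normal() {
    # $1: output of `oozie admin ... -status`
    printf '%s\n' "$1" | grep -q 'System mode: NORMAL'
}

# Poll the Oozie server until it reports NORMAL or the attempts run out.
wait_for_oozie() {
    attempts=${1:-10}   # number of polls, default 10
    i=0
    while [ "$i" -lt "$attempts" ]; do
        status=$(oozie admin -oozie http://localhost:11000/oozie -status 2>&1)
        if oozie_is_normal "$status"; then
            echo "Oozie is up: $status"
            return 0
        fi
        i=$((i + 1))
        sleep 3
    done
    echo "Oozie did not report NORMAL after $attempts attempts" >&2
    return 1
}
```

Calling `wait_for_oozie 5` polls at most five times and returns a non-zero status if the server never reports NORMAL.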
    

To check that Oozie executes its workflows correctly, you can run some of the example workflows provided with Oozie, as described in Cloudera Oozie example workflows. Note that, contrary to what is written there, the Oozie server is not available at http://localhost:8080/oozie but at http://localhost:11000/oozie.

Citation Matching

During citation matching, links from bibliography entries to the referenced publications are created. Such links indicate topical similarity between the linked texts, are used in assessing the impact of the referenced document, and improve navigation in the user interfaces of digital libraries. The citation matching module in CoAnSys scales to large amounts of data by using appropriate indexing and the MapReduce paradigm.

References

  1. Fedoryszak, M., Tkaczyk, D., and Bolikowski, Ł. Large Scale Citation Matching Using Apache Hadoop. Research and Advanced Technology for Digital Libraries, Springer Berlin Heidelberg, 2013, 8092, 362-365.

  2. Dendek, P. J., Czeczko, A., Fedoryszak, M., Kawa, A., Wendykier, P., and Bolikowski, Ł. Taming the zoo - about algorithms implementation in the ecosystem of Apache Hadoop. arXiv, 2013.

  3. Dendek, P. J., Czeczko, A., Fedoryszak, M., Kawa, A., Wendykier, P., and Bolikowski, Ł. How to perform research in Hadoop environment not losing mental equilibrium - case study. arXiv, 2013.