diff --git a/README.md b/README.md
index e7b9bcd..6de9535 100644
--- a/README.md
+++ b/README.md
@@ -11,15 +11,100 @@ This README provides instructions on how to replicate our work.
 # Setup
 
-## anserini
+Clone [dstlr](https://github.com/dstlry/dstlr):
 
-Download and build [Anserini](http://anserini.io) and then follow the [Solrini](https://github.com/castorini/anserini/blob/master/docs/solrini.md) instructions to get a Solr instance running for indexing text documents. Index a document collection with Anserini, such as the Washington Post collection, and ensure the appropriate Solr [command-line parameters](https://github.com/dstlry/dstlr/blob/master/src/main/scala/io/dstlr/package.scala) for `dstlr` are adjusted if use non-default options.
+```
+git clone https://github.com/dstlry/dstlr.git
+```
+
+[sbt](https://www.scala-sbt.org/) is the build tool used for Scala projects; download it if you don't have it yet.
+
+Build the JAR using sbt:
+
+```
+sbt assembly
+```
+
+There is a [known issue](https://github.com/stanfordnlp/CoreNLP/issues/556) between recent Spark versions and CoreNLP 3.8. To fix this, delete the `protobuf-java-2.5.0.jar` file in `$SPARK_HOME/jars` and replace it with [version 3.0.0](https://repo1.maven.org/maven2/com/google/protobuf/protobuf-java/3.0.0/protobuf-java-3.0.0.jar).
+
+## Anserini
+
+### Download and build Anserini
+
+Clone [Anserini](http://anserini.io):
+
+```
+git clone https://github.com/castorini/anserini.git
+
+cd anserini
+```
+
+Change the [config file](https://github.com/castorini/anserini/blob/master/src/main/resources/solr/anserini/conf/managed-schema#L521) so that the "contents" field is stored as well as indexed:
+
+```
+sed -i.bak 's/field name="contents" type="text_en_anserini" indexed="true" stored="false" multiValued="false"/field name="contents" type="text_en_anserini" indexed="true" stored="true" multiValued="false"/g' src/main/resources/solr/anserini/conf/managed-schema
+```
+
+Build Anserini using Maven:
+
+```
+mvn clean package appassembler:assemble
+```
+
+### Setting up a SolrCloud Instance for indexing text documents
+
+From the Solr [archives](https://archive.apache.org/dist/lucene/solr/), find the Solr version that matches Anserini's [Lucene version](https://github.com/castorini/anserini/blob/master/pom.xml#L36), download the `solr-[version].tgz` (not the `-src` version), and move it into the `anserini/` directory.
+
+Extract the archive:
+
+```
+mkdir solrini && tar -zxvf solr*.tgz -C solrini --strip-components=1
+```
+
+Start Solr:
+
+```
+solrini/bin/solr start -c -m 8G
+```
+
+Note: Adjust the memory usage (i.e., `-m 8G`) as appropriate for your machine.
+
+Run the Solr bootstrap script to copy the Anserini JAR into Solr's classpath and upload the configsets to Solr's internal ZooKeeper:
+
+```
+pushd src/main/resources/solr && ./solr.sh ../../../../solrini localhost:9983 && popd
+```
+
+Solr should now be available at [http://localhost:8983/](http://localhost:8983/) for browsing.
+
+### Indexing document collections into SolrCloud from Anserini
+
+We'll index the [Washington Post collection](https://github.com/castorini/anserini/blob/master/docs/regressions-core18.md) as an example.
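+
+Before creating the collection, you can optionally confirm that Solr is running and that the `anserini` configset was uploaded. This is just a quick check using Solr's ConfigSets API; the port and configset name are the defaults used above:
+
+```
+curl "http://localhost:8983/solr/admin/configs?action=LIST"
+```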
+
+First, create the `core18` collection in Solr:
+
+```
+solrini/bin/solr create -n anserini -c core18
+```
+
+Run the Solr indexing command for `core18`:
+
+```
+sh target/appassembler/bin/IndexCollection -collection WashingtonPostCollection -generator WapoGenerator \
+ -threads 8 -input /path/to/WashingtonPost \
+ -solr -solr.index core18 -solr.zkUrl localhost:9983 \
+ -storePositions -storeDocvectors -storeTransformedDocs
+```
+
+Note: Make sure `/path/to/WashingtonPost` points to the actual location of the collection.
+
+Once indexing has completed, you should be able to query `core18` from the Solr [query interface](http://localhost:8983/solr/#/core18/query).
 
 ## neo4j
 
 Start a neo4j instance via Docker with the command:
 
 ```bash
-docker run -d --publish=7474:7474 --publish=7687:7687 \
+docker run -d --name neo4j --publish=7474:7474 --publish=7687:7687 \
     --volume=`pwd`/neo4j:/data \
     -e NEO4J_dbms_memory_pagecache_size=2G \
     -e NEO4J_dbms_memory_heap_initial__size=4G \
@@ -29,6 +114,8 @@ docker run -d --publish=7474:7474 --publish=7687:7687 \
 
 Note: You may wish to update the memory settings based on the amount of available memory on your machine.
 
+neo4j should be available shortly at [http://localhost:7474/](http://localhost:7474/) with the default username/password of `neo4j`/`neo4j`. You will be prompted to change the password; this new password is the one you will pass to the load script.
+
 To enable efficient inserts and queries, create the following indexes in neo4j:
 
 ```
 CREATE INDEX ON :Document(id)
@@ -45,32 +132,110 @@ CREATE INDEX ON :Relation(type)
 CREATE INDEX ON :Relation(type, confidence)
 ```
 
+## Running
+
+### Extraction
+
+For each document in the collection, we extract mentions of named entities, the relations between them, and links to entities in an external knowledge graph.
+
+Run `ExtractTriples`:
+
+```
+./bin/extract.sh
+```
+
+Note: Modify `extract.sh` based on your environment (e.g., available memory, number of executors, Solr settings, neo4j password, etc.); the available options are listed [here](src/main/scala/io/dstlr/package.scala).
+
+After the extraction is done, check that an output folder (called `triples/` by default) has been created and contains several Parquet files.
+
+If you want to inspect the Parquet files:
+
+- Download and build [parquet-tools](https://github.com/apache/parquet-mr/tree/master/parquet-tools) following its instructions. Note: On macOS, you can also install it with Homebrew: `brew install parquet-tools`.
+
+- View a Parquet file in JSON format:
+
+```
+parquet-tools cat --json [filename]
+```
+
+### Enrichment
+
+We augment the raw knowledge graph with facts from the external knowledge graph (Wikidata in our case).
+
+Run `EnrichTriples`:
+
+```
+./bin/enrich.sh
+```
+
+Note: Modify `enrich.sh` based on your environment.
+
+After the enrichment is done, check that an output folder (called `triples-enriched/` by default) has been created with the output Parquet files.
+
+### Load
+
+Load the raw and enriched knowledge graphs produced by the commands above into neo4j.
+
+Set `--input triples` in `load.sh` and run `LoadTriples`:
+
+```
+./bin/load.sh
+```
+
+Note: Modify `load.sh` based on your environment.
+
+Then set `--input triples-enriched` in `load.sh` and run `LoadTriples` again:
+
+```
+./bin/load.sh
+```
+
+Open [http://localhost:7474/](http://localhost:7474/) to view the loaded knowledge graph in neo4j.
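+
+As a quick sanity check on the load, you can run a simple count query in the neo4j browser at [http://localhost:7474/](http://localhost:7474/). This is a minimal example; `Document` is one of the node labels indexed above, and the count should be non-zero after a successful load:
+
+```
+MATCH (d:Document) RETURN count(d)
+```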
+
 ## Data Cleaning Queries
-Find CITY_OF_HEADQUARTERS relation between two mentions:
+The following queries can be run against the knowledge graph in neo4j to discover sub-graphs of interest.
+
+### Supporting Information
+
+This query finds sub-graphs where the value extracted from the document matches the ground truth from Wikidata.
+
 ```
 MATCH (d:Document)-->(s:Mention)-->(r:Relation {type: "ORG_CITY_OF_HEADQUARTERS"})-->(o:Mention)
 MATCH (s)-->(e:Entity)-->(f:Fact {relation: r.type})
+WHERE o.span = f.value
 RETURN d, s, r, o, e, f
-LIMIT 25
 ```
 
-Find CITY_OF_HEADQUARTERS relation between two mentions where the subject node doesn't have a linked entity:
+### Inconsistent Information
+
+This query finds sub-graphs where the value extracted from the document does not match the ground truth from Wikidata.
+
 ```
 MATCH (d:Document)-->(s:Mention)-->(r:Relation {type: "ORG_CITY_OF_HEADQUARTERS"})-->(o:Mention)
-OPTIONAL MATCH (s)-->(e:Entity)
-WHERE e IS NULL
-RETURN d, s, r, o, e
-LIMIT 25
+MATCH (s)-->(e:Entity)-->(f:Fact {relation: r.type})
+WHERE NOT(o.span = f.value)
+RETURN d, s, r, o, e, f
 ```
 
 ### Missing Information
-Find CITY_OF_HEADQUARTERS relation between two mentions where the linked entity doesn't have the relation we're looking for:
+
+This query finds sub-graphs where the value extracted from the document does not have a corresponding ground truth in Wikidata.
+
 ```
 MATCH (d:Document)-->(s:Mention)-->(r:Relation {type: "ORG_CITY_OF_HEADQUARTERS"})-->(o:Mention)
 MATCH (s)-->(e:Entity)
 OPTIONAL MATCH (e)-->(f:Fact {relation: r.type})
 WHERE f IS NULL
 RETURN d, s, r, o, e, f
-LIMIT 25
+```
+
+### Delete All Data
+
+This query deletes all nodes and relationships in the database.
+
+```
+MATCH (n) DETACH DELETE n
 ```
diff --git a/bin/enrich.sh b/bin/enrich.sh
index 60d5c3b..2068f92 100755
--- a/bin/enrich.sh
+++ b/bin/enrich.sh
@@ -2,6 +2,6 @@
 
 spark-submit --class io.dstlr.EnrichTriples \
   --num-executors 1 --executor-cores 1 \
-  --driver-memory 64G --executor-memory 64G \
-  --conf spark.executor.heartbeatInterval=60 \
-  target/scala-2.11/dstlr-assembly-0.1.jar --input triples --output triples-enriched --partitions 1
\ No newline at end of file
+  --driver-memory 8G --executor-memory 8G \
+  --conf spark.executor.heartbeatInterval=10000 \
+  target/scala-2.11/dstlr-assembly-0.1.jar --input triples --output triples-enriched --partitions 1
diff --git a/bin/extract.sh b/bin/extract.sh
index a5f3273..8099d88 100755
--- a/bin/extract.sh
+++ b/bin/extract.sh
@@ -3,7 +3,7 @@
 spark-submit --class io.dstlr.ExtractTriples \
   --num-executors 32 --executor-cores 8 \
   --driver-memory 64G --executor-memory 48G \
-  --conf spark.executor.heartbeatInterval=60 \
+  --conf spark.executor.heartbeatInterval=10000 \
   --conf spark.executorEnv.JAVA_HOME=/usr/lib/jvm/java-9-openjdk-amd64 \
   target/scala-2.11/dstlr-assembly-0.1.jar \
-  --solr.uri 192.168.1.111:9983 --solr.index core18 --query *:* --partitions 2048 --output triples-$RANDOM --doc-length-threshold 10000 --sent-length-threshold 256
\ No newline at end of file
+  --solr.uri localhost:9983 --solr.index core18 --query *:* --partitions 2048 --output triples --sent-length-threshold 256
diff --git a/bin/load.sh b/bin/load.sh
index 4ada382..4cc25d5 100755
--- a/bin/load.sh
+++ b/bin/load.sh
@@ -2,7 +2,7 @@
 
 spark-submit --class io.dstlr.LoadTriples \
   --num-executors 1 --executor-cores 1 \
-  --driver-memory 16G --executor-memory 16G \
-  --conf spark.executor.heartbeatInterval=60 \
+  --driver-memory 8G --executor-memory 8G \
+  --conf spark.executor.heartbeatInterval=10000 \
   target/scala-2.11/dstlr-assembly-0.1.jar \
-  --input triples-5000d-128s --neo4j.password password --neo4j.uri bolt://192.168.1.110:7687 --neo4j.batch.size 10000
\ No newline at end of file
+  --input triples --neo4j.password password --neo4j.uri bolt://localhost:7687 --neo4j.batch.size 10000