Update README.md and running scripts #26

Merged (16 commits) on Oct 9, 2019
189 changes: 177 additions & 12 deletions README.md
This README provides instructions on how to replicate our work.

# Setup

## dstlr
Clone [dstlr](https://github.com/dstlry/dstlr):

```
git clone https://github.com/dstlry/dstlr.git
```

[sbt](https://www.scala-sbt.org/) is the build tool used for Scala projects; download it if you don't have it yet.
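For example, it can be installed via a package manager (one option among several):

```
# macOS, with Homebrew
brew install sbt

# or, on any platform with SDKMAN! installed
sdk install sbt
```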

Build the JAR using sbt:

```
sbt assembly
```

There is a [known issue](https://github.com/stanfordnlp/CoreNLP/issues/556) between recent Spark versions and CoreNLP 3.8. To fix this, delete the `protobuf-java-2.5.0.jar` file in `$SPARK_HOME/jars` and replace it with [version 3.0.0](https://repo1.maven.org/maven2/com/google/protobuf/protobuf-java/3.0.0/protobuf-java-3.0.0.jar).
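A minimal sketch of that swap, assuming `$SPARK_HOME` is set and `wget` is available:

```
# remove the old protobuf JAR and download version 3.0.0 in its place
rm $SPARK_HOME/jars/protobuf-java-2.5.0.jar
wget -P $SPARK_HOME/jars https://repo1.maven.org/maven2/com/google/protobuf/protobuf-java/3.0.0/protobuf-java-3.0.0.jar
```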

## Anserini

### Download and build Anserini

Clone [Anserini](http://anserini.io):

```
git clone https://github.com/castorini/anserini.git

cd anserini
```

Change the [config file](https://github.com/castorini/anserini/blob/master/src/main/resources/solr/anserini/conf/managed-schema#L521) so that the "contents" field is stored as well as indexed:

```
sed -i.bak 's/field name="contents" type="text_en_anserini" indexed="true" stored="false" multiValued="false"/field name="contents" type="text_en_anserini" indexed="true" stored="true" multiValued="false"/g' src/main/resources/solr/anserini/conf/managed-schema
```
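To confirm the change took effect, you can check the field definition (it should now read `stored="true"`):

```
grep 'field name="contents"' src/main/resources/solr/anserini/conf/managed-schema
```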

Build Anserini using Maven:

```
mvn clean package appassembler:assemble
```

### Setting up a SolrCloud Instance for indexing text documents

From the Solr [archives](https://archive.apache.org/dist/lucene/solr/), find the Solr version that matches Anserini's [Lucene version](https://github.com/castorini/anserini/blob/master/pom.xml#L36), download the `solr-[version].tgz` (non `-src`), and move it into the `anserini/` directory.
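For example, if `pom.xml` points at Lucene 8.3, the matching download would look like the following (the version here is only a placeholder; substitute the one you actually find):

```
wget https://archive.apache.org/dist/lucene/solr/8.3.0/solr-8.3.0.tgz
```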

Extract the archive:

```
mkdir solrini && tar -zxvf solr*.tgz -C solrini --strip-components=1
```

Start Solr:

```
solrini/bin/solr start -c -m 8G
```

Note: Adjust the memory usage (i.e., `-m 8G`) as appropriate.

Run the Solr bootstrap script to copy the Anserini JAR into Solr's classpath and upload the configsets to Solr's internal ZooKeeper:

```
pushd src/main/resources/solr && ./solr.sh ../../../../solrini localhost:9983 && popd
```

Solr should now be available at [http://localhost:8983/](http://localhost:8983/) for browsing.

### Indexing document collections into SolrCloud from Anserini

We'll index the [Washington Post collection](https://github.com/castorini/anserini/blob/master/docs/regressions-core18.md) as an example.

First, create the `core18` collection in Solr:

```
solrini/bin/solr create -n anserini -c core18
```

Run the Solr indexing command for `core18`:

```
sh target/appassembler/bin/IndexCollection -collection WashingtonPostCollection -generator WapoGenerator \
-threads 8 -input /path/to/WashingtonPost \
-solr -solr.index core18 -solr.zkUrl localhost:9983 \
-storePositions -storeDocvectors -storeTransformedDocs
```

Note: Make sure `/path/to/WashingtonPost` is updated with the appropriate path.

Once indexing has completed, you should be able to query `core18` from the Solr [query interface](http://localhost:8983/solr/#/core18/query).
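You can also run a quick sanity check from the command line via Solr's select API (the query term here is just an example):

```
curl "http://localhost:8983/solr/core18/select?q=contents:washington&rows=1&fl=id"
```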

## neo4j

Start a neo4j instance via Docker with the command:
```bash
docker run -d --name neo4j --publish=7474:7474 --publish=7687:7687 \
--volume=`pwd`/neo4j:/data \
-e NEO4J_dbms_memory_pagecache_size=2G \
-e NEO4J_dbms_memory_heap_initial__size=4G \
neo4j
```

Note: You may wish to update the memory settings based on the amount of available memory on your machine.

neo4j should be available shortly at [http://localhost:7474/](http://localhost:7474/) with the default username/password of `neo4j`/`neo4j`. You will be prompted to change the password; this is the password you will pass to the load script.
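If you prefer to change the password from the command line rather than the browser, something like the following works on the Neo4j 3.x images (a sketch; `newpassword` is a placeholder, and the container name `neo4j` comes from the `docker run` command above):

```
docker exec -it neo4j cypher-shell -u neo4j -p neo4j "CALL dbms.changePassword('newpassword')"
```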

For efficient inserts and queries, build the following indexes in neo4j:
```
CREATE INDEX ON :Document(id)
CREATE INDEX ON :Relation(type)
CREATE INDEX ON :Relation(type, confidence)
```

## Running

### Extraction

For each document in the collection, we extract mentions of named entities, the relations between them, and links to entities in an external knowledge graph.

Run `ExtractTriples`:

```
./bin/extract.sh
```

Note: Modify `extract.sh` based on your environment (e.g., available memory, number of executors, Solr connection settings, neo4j password); the available options are listed [here](src/main/scala/io/dstlr/package.scala).

After the extraction is done, check that an output folder (`triples/` by default) has been created and that several Parquet files were generated inside it.
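A quick way to check (the exact part-file names will differ on your machine):

```
ls triples/
```

You should see a `_SUCCESS` marker along with one or more `part-*.parquet` files.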

If you want to inspect the Parquet files:

- Download and build [parquet-tools](https://github.com/apache/parquet-mr/tree/master/parquet-tools) following its instructions.

  Note: If you are on a Mac, you can also install it with Homebrew: `brew install parquet-tools`.

- View a Parquet file in JSON format:

```
parquet-tools cat --json [filename]
```
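Alternatively, since Spark is already required for `dstlr`, you can peek at the output with `spark-shell` (a sketch, assuming the default `triples/` output directory and that `spark-shell` is on your `PATH`):

```
echo 'spark.read.parquet("triples").show(5, false)' | spark-shell
```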

### Enrichment

We augment the raw knowledge graph with facts from the external knowledge graph (Wikidata in our case).

Run `EnrichTriples`:

```
./bin/enrich.sh
```

Note: Modify `enrich.sh` based on your environment.

After the enrichment is done, check that an output folder (`triples-enriched/` by default) has been created containing the output Parquet files.

### Load

Load the raw and enriched knowledge graphs produced by the above commands into neo4j.

Set `--input triples` in `load.sh`, then run `LoadTriples`:

```
./bin/load.sh
```

Note: Modify `load.sh` based on your environment.
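If you prefer not to edit `load.sh` by hand between the two runs, a `sed` one-liner such as the following can flip the flag (a sketch that assumes the flag appears in the script exactly as `--input triples`):

```
sed -i.bak 's/--input triples /--input triples-enriched /' bin/load.sh
```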

Set `--input triples-enriched` in `load.sh`, then run `LoadTriples` again:

```
./bin/load.sh
```

Open [http://localhost:7474/](http://localhost:7474/) to view the loaded knowledge graph in neo4j.
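To sanity-check the load from the command line, you can also count nodes by label with `cypher-shell` inside the container (a sketch; replace `<password>` with the password you set earlier):

```
docker exec -it neo4j cypher-shell -u neo4j -p <password> "MATCH (n) RETURN labels(n) AS label, count(*) AS count"
```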

## Data Cleaning Queries

The following queries can be run against the knowledge graph in neo4j to discover sub-graphs of interest.

### Supporting Information

This query finds sub-graphs where the value extracted from the document matches the ground-truth from Wikidata.

```
MATCH (d:Document)-->(s:Mention)-->(r:Relation {type: "ORG_CITY_OF_HEADQUARTERS"})-->(o:Mention)
MATCH (s)-->(e:Entity)-->(f:Fact {relation: r.type})
WHERE o.span = f.value
RETURN d, s, r, o, e, f
LIMIT 25
```

### Inconsistent Information

This query finds sub-graphs where the value extracted from the document does not match the ground-truth from Wikidata.

```
MATCH (d:Document)-->(s:Mention)-->(r:Relation {type: "ORG_CITY_OF_HEADQUARTERS"})-->(o:Mention)
MATCH (s)-->(e:Entity)-->(f:Fact {relation: r.type})
WHERE NOT(o.span = f.value)
RETURN d, s, r, o, e, f
```

### Missing Information

This query finds sub-graphs where the value extracted from the document does not have a corresponding ground-truth in Wikidata.

```
MATCH (d:Document)-->(s:Mention)-->(r:Relation {type: "ORG_CITY_OF_HEADQUARTERS"})-->(o:Mention)
MATCH (s)-->(e:Entity)
OPTIONAL MATCH (e)-->(f:Fact {relation: r.type})
WHERE f IS NULL
RETURN d, s, r, o, e, f
LIMIT 25
```

### Delete Relationships

This query deletes all nodes and relationships in the database.

```
MATCH (n) DETACH DELETE n
```
6 changes: 3 additions & 3 deletions bin/enrich.sh

spark-submit --class io.dstlr.EnrichTriples \
--num-executors 1 --executor-cores 1 \
--driver-memory 8G --executor-memory 8G \
--conf spark.executor.heartbeatInterval=10000 \
target/scala-2.11/dstlr-assembly-0.1.jar --input triples --output triples-enriched --partitions 1
4 changes: 2 additions & 2 deletions bin/extract.sh
spark-submit --class io.dstlr.ExtractTriples \
--num-executors 32 --executor-cores 8 \
--driver-memory 64G --executor-memory 48G \
--conf spark.executor.heartbeatInterval=10000 \
--conf spark.executorEnv.JAVA_HOME=/usr/lib/jvm/java-9-openjdk-amd64 \
target/scala-2.11/dstlr-assembly-0.1.jar \
--solr.uri localhost:9983 --solr.index core18 --query *:* --partitions 2048 --output triples --sent-length-threshold 256
6 changes: 3 additions & 3 deletions bin/load.sh

spark-submit --class io.dstlr.LoadTriples \
--num-executors 1 --executor-cores 1 \
--driver-memory 8G --executor-memory 8G \
--conf spark.executor.heartbeatInterval=10000 \
target/scala-2.11/dstlr-assembly-0.1.jar \
--input triples --neo4j.password password --neo4j.uri bolt://localhost:7687 --neo4j.batch.size 10000