Update README.md
cgnorthcutt authored May 5, 2021
1 parent b65494b commit 756c81e
Showing 1 changed file with 3 additions and 3 deletions.
6 changes: 3 additions & 3 deletions examples/amazon_reviews_dataset/README.md
@@ -3,21 +3,21 @@
We released a pre-prepared version of the Amazon 5core reviews dataset.
Download it here: https://github.com/cgnorthcutt/label-errors/releases/tag/amazon-reviews-dataset

- From the Amazon 5core dataset (40+ million examples), select only the data that adheres to:
+ From the Amazon 5core dataset (40+ million examples), we select only the data that adheres to:
1. non-empty reviews.
2. label must be 1 star, 3 stars, or 5 stars. (2 and 4 star reviews are removed)
3. Only consider reviews with more upvotes than downvotes (and at least one upvote).

You should have about 10 million examples left over. These are higher quality, which gives us more control over noise in the labels (rather than general noise in the text itself).
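
The three selection rules above can be sketched as a small filter. This is only an illustration, not the script used to build the release: it assumes a hypothetical tab-separated intermediate file with the columns `stars`, `upvotes`, `downvotes`, `text` (the raw Amazon data is not distributed in this layout), and the function name `filter_reviews` is made up for the example.

```shell
# Hypothetical helper (not from the original README): keep only rows that
# satisfy the three rules and print them in fastText format.
# Assumed input columns, tab-separated: stars, upvotes, downvotes, text.
filter_reviews() {
  awk -F '\t' '
    $4 ~ /^[[:space:]]*$/          { next }  # rule 1: drop empty reviews
    $1 != 1 && $1 != 3 && $1 != 5  { next }  # rule 2: keep 1, 3, and 5 stars only
    !($2 >= 1 && $2 > $3)          { next }  # rule 3: at least one upvote, more up than down
    { printf "__label__%d %s\n", $1, $4 }
  ' "$@"
}
```

For example, `printf '5\t3\t1\tGreat book\n' | filter_reviews` prints `__label__5 Great book`, while a 4-star row or a row with no upvotes produces no output. Review text that itself contains tab characters would need to be cleaned before using a layout like this.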

- Pre-process the data for reading by fastText. Here are the first two lines of my formatted training data file:
+ The dataset has been formatted in [fastText format](https://fasttext.cc/docs/en/supervised-tutorial.html#getting-and-preparing-the-data) for you. Here are the first two lines of my formatted training data file:

```
__label__5 I bought this for my husband who plays the piano. He is having a wonderful time playing these old hymns. The music is at times hard to read because we think the book was published for singing from more than playing from. Great purchase though!
__label__4 This work bears deep connections to themes first explored in Tad Williams' original breakout novel, about a brave young cat who travels to an underground netherworld to face an ancient evil. As the owner of two cats myself, after the second read-through, I realized that this novel has much to teach about the critical importance of dealing with fur and dust.I could only give four stars, though, because the cats do not agree, and indeed wish I had not made this purchase.
```
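
Each line in this format starts with a `__label__<stars>` token, followed by a space and the review text. As a quick sanity check, the label distribution of a file can be tallied with standard tools. The sketch below is an illustration only; the function name `label_counts` is hypothetical, and it assumes one example per line with the label as the first space-separated token.

```shell
# Hypothetical helper (not from the original README): count how many
# lines carry each label. Reads the named files, or stdin if none given.
label_counts() {
  # Take the first token of each line, group identical labels, and
  # print "label count" pairs without the padding that uniq -c adds.
  cut -d ' ' -f 1 "$@" | sort | uniq -c | awk '{ print $2, $1 }'
}
```

Running `label_counts amazon5core.txt` would print one line per label with its count, which makes it easy to see whether any unexpected labels are present in the file.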

- Pre-process the training data as follows:
+ When training, we pre-process the training data as follows:

```bash
cat amazon5core.txt | sed -e "s/\([.\!?,'/()]\)/ \1 /g" | tr "[:upper:]" "[:lower:]" > amazon5core.preprocessed.txt
