<!DOCTYPE html>
<html lang="en">
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="description" content="RecSys Challenge 2022">
<meta name="keywords"
content="Recommender Systems, RecSys Challenge, Social Media, Dressipi, Fashion recommendations">
<meta name="author" content="RecSysChallenge 2022 Organizers">
<!-- <link rel="icon" href="http://recsys.acm.org/wp-content/themes/primo-wp/favicon.ico"> -->
<title>RecSys Challenge 2022</title>
<!-- CSS -->
<link href="./css/bootstrap.min.css" rel="stylesheet">
<link href="./css/ekko-lightbox.min.css" rel="stylesheet">
<link href="./css/main.css" rel="stylesheet">
<link rel="apple-touch-icon" sizes="180x180" href="images/apple-touch-icon.png">
<link rel="icon" type="image/png" sizes="32x32" href="images/favicon-32x32.png">
<link rel="icon" type="image/png" sizes="16x16" href="images/favicon-16x16.png">
<link rel="manifest" href="images/site.webmanifest">
<meta name="msapplication-TileColor" content="#da532c">
<meta name="theme-color" content="#ffffff">
</head>
<body>
<div id="top" class="container">
<div class="header clearfix" style="padding-bottom:5px;">
<span style="float:left; padding-right: 15px;">
<!-- <img src="./images/logo.png" alt="logo" width="50"> </span> <span style="float:right; padding-right: 15px;"> -->
<a style="text-decoration: none;" href="https://twitter.com/acmrecsys" title="Twitter"> <img
src="./images/ico-twitter.svg" alt="" width="24">
</a>
</span>
<h3><a href="http://recsyschallenge.com/2022" style="text-decoration:none;">
RecSys Challenge 2022</a>
</h3>
<div style="margin-top:30px;font-family: Arial,sans-serif">
<ul class="nav nav-pills navbar-nav navbar-left">
<!--
<li role="presentation" class="dropdown"> <a class="dropdown-toggle" data-toggle="dropdown" href="#" role="button" aria-haspopup="true" aria-expanded="false">About <span class="caret"></span>
</a> <ul class="dropdown-menu">
<li><a href="#about">About</a></li> <li><a href="#scenario">Scenario</a></li>
<li><a href="#challenges">Challenges</a></li> <li><a href="#evaluation">Evaluation</a></li>
<li><a href="#dataset">Dataset</a></li> <li><a href="#baseline">Baseline</a>
</ul> </li>-->
<!--
<li role="presentation" class="dropdown"> <a class="dropdown-toggle" data-toggle="dropdown" href="#" role="button" aria-haspopup="true" aria-expanded="false">Participate <span class="caret"></span>
</a> <ul class="dropdown-menu">
<li><a href="#participation">Participation</a></li> <li><a href="#leaderboard">Leaderboard</a>
<li><a href="#rules">Rules</a></li> <li><a href="#questions">Ask questions</a></li>
<li><a href="#prizes">Prizes</a></li> </ul>
</li>-->
<li role="presentation"><a href="index.html#about">About</a></li>
<!--<li role="presentation"><a href="#publications">Publications</a></li>-->
<li role="presentation"><a href="index.html#participation">Participation</a></li>
<li role="presentation"><a href="index.html#dates">Timeline</a></li>
<li role="presentation"><a href="index.html#program">Program</a></li>
<li role="presentation"><a href="index.html#organizers">Organization</a></li>
<li role="presentation"><a href="https://recsys.acm.org/recsys22/" target="_blank">RecSys 2022</a></li>
</ul>
</div>
</div>
<!-- ABOUT -->
<div class="lead">
<h1>Dataset<sup><a class="dropup" style="font-size:10px;" href="#top"><span class="caret"></span>
top</a></sup>
</h1>
<p>
The full dataset consists of 1.1 million online retail sessions in the fashion domain, sampled from an 18-month period. All sessions in the dataset are “purchasing sessions” that resulted in at least one item purchased. The items viewed and purchased are clothing and footwear. The dataset contains content data for each of the items viewed and purchased. This is an extract of Dressipi’s fashion item data and represents descriptive labels assigned to the items, such as color, neckline, and sleeve length. The task is to predict the item that was purchased.
</p>
<p>
The dataset has:
<ul>
<li>Sessions: The items that were viewed in a session. In this dataset a session is equal to a day, so a session is one user's activity on one day. Content: session_id, item_id, timestamp.</li>
<li>Purchases: The purchase that happened at the end of the session. One purchased item is given per session. Content: session_id, item_id.</li>
<li>Item features: The label data of items. Things like “color: green”, “neckline: v-neck”, etc. Content: item_id, feature_category_id, feature_value_id.</li>
</ul>
</p>
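<p>
As a minimal sketch of how the sessions file described above could be read, the snippet below groups item views by session and orders them by time. The sample rows, the timestamp format, and the helper name <code>load_sessions</code> are assumptions made for illustration; only the column names come from the description above.
</p>

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows. Only the column names are taken from the
# dataset description; the values and timestamp format are made up.
SESSIONS_CSV = """session_id,item_id,timestamp
3,20,2020-12-18 21:25:00
3,10,2020-12-18 21:19:48
13,30,2020-03-13 19:35:27
"""

def load_sessions(handle):
    """Return {session_id: [item_id, ...]} with views in chronological order."""
    views = defaultdict(list)
    for row in csv.DictReader(handle):
        views[int(row["session_id"])].append((row["timestamp"], int(row["item_id"])))
    # Sorting (timestamp, item_id) tuples orders the views by time, assuming
    # the timestamps sort correctly as plain strings.
    return {sid: [item for _, item in sorted(rows)] for sid, rows in views.items()}

sessions = load_sessions(io.StringIO(SESSIONS_CSV))
```

<p>
With the real data, the same function would be called on an open file handle instead of the in-memory sample.
</p>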
<br>
<h2>Training - Test Split<sup><a class="dropup" style="font-size:10px;" href="#top"><span class="caret"></span>
top</a></sup>
</h2>
<p>
The data is split into a training and test set by date. The test set covers one month; the training set covers the 17 months prior to the test month. The test month is then further randomly split into the “leaderboard” test set, used for the public leaderboard, and the “final” test set, which will determine the final winners of the challenge. The task is to submit 100 predictions for each query session in the leaderboard and final test sets. The training set contains 1 million sessions; the two test sets contain 50,000 sessions each.
</p>
<figure>
<img src="images/image2.png"
alt="Training - Test Split">
<caption>Fig 1: Training - Test Split</caption>
</figure>
<br>
<h2>Training Set<sup><a class="dropup" style="font-size:10px;" href="#top"><span class="caret"></span>
top</a></sup>
</h2>
<p>
The training data are sessions of users who bought something. A session spans one day. Each session has all item views up to and not including the first view of the item that was bought on that day. The purchased item for each session is provided in a separate file.
</p>
<br>
<h2>Test Set<sup><a class="dropup" style="font-size:10px;" href="#top"><span class="caret"></span>
top</a></sup>
</h2>
<p>
These are the query sessions to generate recommendations for. There are two test sets:
</p>
<p>
<ul>
<li>Leaderboard Test Set: Determines the leaderboard position. Your predictions are evaluated against it every time you submit a prediction file.</li>
<li>Final Test Set: Determines the final winners. Your predictions are evaluated against it once, at the end of the challenge.</li>
</ul>
</p>
<br>
<h2>Candidate Items<sup><a class="dropup" style="font-size:10px;" href="#top"><span class="caret"></span>
top</a></sup>
</h2>
<p>
The candidate set of items is given in a separate file and is the set of items that were purchased in the test month. The same candidate set is used for the leaderboard and final test sets.
</p>
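<p>
One way to use the candidate file is to restrict a model's ranked output to candidate items before truncating it to 100 entries. The sketch below assumes that predictions are meant to be drawn from the candidate set and that duplicate items should be dropped; the function name <code>restrict_to_candidates</code> and its parameters are hypothetical.
</p>

```python
def restrict_to_candidates(ranked_items, candidates, k=100):
    """Keep the first k distinct items of ranked_items that are candidates."""
    seen = set()
    result = []
    for item in ranked_items:
        # Skip non-candidates and items that were already emitted.
        if item in candidates and item not in seen:
            seen.add(item)
            result.append(item)
            if len(result) == k:
                break
    return result

# Toy example: items 9 is not a candidate, item 5 appears twice.
top_items = restrict_to_candidates([5, 9, 5, 7, 3], {5, 7, 3})
```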
</div>
<br><br>
<!-- <div id="publications" class="lead">
<h1>Publications</h1>
<p>Luca Belli, Sofia Ira Ktena, Alykhan Tejani, Alexandre Lung-Yut-Fon, Frank Portman, Xiao Zhu, Yuanpu Xie, Akshay Gupta, Michael Bronstein, Amra Delić, Gabriele Sottocornola, Walter Anelli, Nazareno Andrade, Jessie Smith, and Wenzhe Shi. 2020. <a href="https://arxiv.org/abs/2004.13715" target="_blank">Privacy-Preserving Recommender Systems Challenge on Twitter's Home Timeline</a>, arXiv:2004.13715.</p>
</div> -->
<div class="lead">
<h1>Submitting Predictions<sup><a class="dropup" style="font-size:10px;" href="#top"><span class="caret"></span>
top</a></sup>
</h1>
<h2>Predictions Format<sup><a class="dropup" style="font-size:10px;" href="#top"><span class="caret"></span>
top</a></sup>
</h2>
<p>
The task is to submit a CSV file that has 100 ranked predictions for each query session.
</p>
<p>
The header and columns must be as in the example below. The header is required. The order of rows does not matter for the evaluation system, but we recommend sorting the file by session_id and rank for easier manual inspection.
<code>
session_id,item_id,rank <br>
1,100,1 <br>
1,105,2 <br>
1,107,3 <br>
... <br>
1,101,100 <br>
2,108,1 <br>
2,107,2 <br>
... <br>
</code>
</p>
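<p>
A file in this format could be produced along the following lines. This is only a sketch: the function name <code>write_submission</code> and the input structure (a mapping from session id to a ranked list of item ids) are assumptions, and the example uses fewer than 100 items per session to stay short.
</p>

```python
import csv
import io

def write_submission(ranked, handle):
    """Write {session_id: [item_id, ...]} as session_id,item_id,rank rows."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["session_id", "item_id", "rank"])  # required header
    # Sorting by session_id is optional but eases manual inspection.
    for session_id in sorted(ranked):
        for rank, item_id in enumerate(ranked[session_id], start=1):
            writer.writerow([session_id, item_id, rank])

buffer = io.StringIO()
write_submission({2: [108, 107], 1: [100, 105, 107]}, buffer)
```

<p>
For a real submission, <code>handle</code> would be a file opened for writing and every session would carry 100 ranked items.
</p>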
<h2>Evaluation Metric<sup><a class="dropup" style="font-size:10px;" href="#top"><span class="caret"></span>
top</a></sup>
</h2>
<p>
The evaluation metric will be Mean Reciprocal Rank (<a href="https://en.wikipedia.org/wiki/Mean_reciprocal_rank" target="_blank">https://en.wikipedia.org/wiki/Mean_reciprocal_rank</a>). The higher the purchased item appears in the ranking (rank 1 is best), the better the score.
</p>
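<p>
For local validation, Mean Reciprocal Rank can be computed roughly as follows. This is a sketch, not the official scoring code: it assumes that a session whose purchased item is missing from the submitted list contributes a reciprocal rank of 0, and the function and argument names are hypothetical.
</p>

```python
def mean_reciprocal_rank(predictions, purchases):
    """Average of 1/rank of the purchased item over all sessions.

    predictions: {session_id: [item_id, ...]} ranked best first.
    purchases:   {session_id: purchased item_id}.
    """
    total = 0.0
    for session_id, purchased in purchases.items():
        ranked = predictions.get(session_id, [])
        if purchased in ranked:
            # list.index is 0-based, ranks are 1-based.
            total += 1.0 / (ranked.index(purchased) + 1)
        # Assumption: a missing item adds 0 to the sum.
    return total / len(purchases)

score = mean_reciprocal_rank(
    {1: [100, 105, 107], 2: [108, 107]},
    {1: 105, 2: 999},
)
```

<p>
In the toy example, session 1 has the purchased item at rank 2 and session 2 misses it entirely, so the score is the mean of 1/2 and 0.
</p>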
</div>
<br><br>
<div class="lead">
<h1>Data Characteristics <sup><a class="dropup" style="font-size:10px;" href="#top"><span class="caret"></span>
top</a></sup>
</h1>
<ul>
<li>Sessions are anonymous. There is a session_id but no user_id, so you won't know if two sessions are by the same person.</li>
<li>The dataset only contains one purchased item per order (chosen at random). This means a session might have resulted in the purchase of a shirt and a pair of trousers, but in the dataset you can only see the shirt purchase. This is a limitation, but the dataset should be large enough to compensate for it.</li>
<li>A large share of test sessions have only one or two item views as input for prediction. This is partly because many sessions in the underlying data are very short and partly because of how the challenge dataset was constructed; see "Details on Dataset Construction" below.</li>
<li>
Content data
<ul>
<li>Content data (garment labels data) is supplied for all items in the dataset. Some candidate items might not have any data in the training sessions or purchases but they will have content data.</li>
<li>Content data follows a “category: value” taxonomy, e.g. “color: blue” or “neckline: v-neck”. Most feature categories will only have one value for a garment; however, there are some that have multiple values. For example, an item might have both “secondary_color: black” and “secondary_color: white”. In these cases an item will have two or more entries (rows) with the same category id and different value ids.</li>
<li>Some items may not share any feature_category_ids if they are different types of items. For example, trousers will share very few category ids with shirts and even fewer with sandals.</li>
<li>Items will have a different number of label assignments depending on how complicated they are. A basic black t-shirt will have fewer feature category ids, and thus fewer rows in the content data, than an evening dress with intricate details.</li>
</ul>
</li>
</ul>
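<p>
Because a category can carry several values for the same item, the content data is naturally represented as a set of value ids per item and category. The sketch below shows one such representation; the function name <code>group_features</code> and the sample ids are made up for illustration.
</p>

```python
from collections import defaultdict

def group_features(rows):
    """Map item_id -> feature_category_id -> set of feature_value_ids."""
    features = defaultdict(lambda: defaultdict(set))
    for item_id, category_id, value_id in rows:
        # A set keeps every value of a multi-valued category.
        features[item_id][category_id].add(value_id)
    return features

# Hypothetical rows: item 1 has two values for category 7.
rows = [(1, 7, 100), (1, 7, 101), (1, 3, 5), (2, 3, 6)]
features = group_features(rows)

# Categories that two items have in common (may be empty for
# items of different types).
shared_categories = set(features[1]) & set(features[2])
```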
</div>
<br><br>
<div class="lead">
<h1>Details on Dataset Construction<sup><a class="dropup" style="font-size:10px;" href="#top"><span class="caret"></span>
top</a></sup>
</h1>
<h2>Sessions and Purchases<sup><a class="dropup" style="font-size:10px;" href="#top"><span class="caret"></span>
top</a></sup>
</h2>
<p>
The dataset consists only of “purchasing sessions”, which are sessions that resulted in at least one item being bought. For each order placed we have chosen one purchased item at random to be the purchase of the session. The view activity of the session is then the item views up to and not including the first view of the item that was purchased. The diagram below shows how the purchasing sessions were constructed from the full session and purchasing data.
</p>
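<p>
The rule for the view activity can be sketched in a few lines. The function name <code>views_before_purchase</code> is hypothetical, and the behavior when the purchased item was never viewed (all views are kept) is an assumption, since the description does not cover that case.
</p>

```python
def views_before_purchase(full_views, purchased_item):
    """Return the views up to, not including, the first view of the purchase."""
    if purchased_item in full_views:
        # list.index finds the first occurrence, so later repeat views
        # of the purchased item are also cut off.
        return full_views[: full_views.index(purchased_item)]
    # Assumption: if the purchased item was never viewed, keep every view.
    return list(full_views)

# Item 9 is first viewed at position 3, so only the first three views remain.
session_views = views_before_purchase([5, 8, 5, 9, 8, 2], 9)
```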
<figure>
<img src="images/image1.png"
alt="Sessions and Purchases">
<caption>Fig 2: Sessions and Purchases</caption>
</figure>
<br>
<h2>Constructing Data for Test Sessions<sup><a class="dropup" style="font-size:10px;" href="#top"><span class="caret"></span>
top</a></sup>
</h2>
<p>
For each test session, the input for prediction is the first x% of views in the session, where x is a randomly selected value between 50 and 100 for each session. The maximum length of the input data for a session is therefore all views up to and not including the first view of the item that was bought. You won’t know at which point each test session was cut.
</p>
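<p>
To mimic these cuts on held-out training sessions, something like the following could be used. It is a sketch under assumptions: the description does not say how the fraction is rounded to a whole number of views or whether at least one view is always kept, so rounding up and a minimum of one view are choices made here, and the function name <code>cut_session</code> is hypothetical.
</p>

```python
import math
import random

def cut_session(views, rng):
    """Keep a random prefix covering between 50% and 100% of the views."""
    fraction = rng.uniform(0.5, 1.0)
    # Assumption: round up and keep at least one view.
    keep = max(1, math.ceil(fraction * len(views)))
    return views[:keep]

rng = random.Random(0)  # seeded for reproducible validation splits
full_session = [11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
cut = cut_session(full_session, rng)
```

<p>
Applying such a cut to a validation split of the training data may give local scores that are closer to the leaderboard setting than evaluating on uncut sessions.
</p>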
<p>
The reasoning behind the random cuts is that in the real system recommendations are shown to the user at various points in their session. We want a recommender that can predict the item the user will purchase as early as possible in the session, but we have to balance that with having more information available for better accuracy of predictions as the session goes on. At some point the recommendation may no longer be useful because the user has worked through a sufficiently long journey (in the existing ranking of items presented to them) and is about to find the item they want themselves, without the intervention of the recommender. The random cuts in the test sessions are an effort to have the challenge evaluation be as close as possible to what success means in reality without making it overly complex. Please note that no cuts like this are applied to the training data and you have all item views leading up to the purchase there.
</p>
<p>
The diagram below shows how the input data for prediction is constructed from the hidden full test set.
</p>
<figure>
<img src="images/image3.png"
alt="Constructing Data for Test Sessions">
<caption>Fig 3: Constructing Data for Test Sessions</caption>
</figure>
</div>
<!-- /.container -->
<!-- JavaScript -->
<script src="./js/jquery-1.11.3.min.js"></script>
<script src="./js/bootstrap.min.js"></script>
<script src="./js/ekko-lightbox.min.js"></script>
<script type="text/javascript">
$(document).delegate('*[data-toggle="lightbox"]', 'click', function (event) {
event.preventDefault();
$(this).ekkoLightbox();
});
</script>
<script>
(function (i, s, o, g, r, a, m) {
i['GoogleAnalyticsObject'] = r; i[r] = i[r] || function () {
(i[r].q = i[r].q || []).push(arguments)
}, i[r].l = 1 * new Date(); a = s.createElement(o),
m = s.getElementsByTagName(o)[0]; a.async = 1; a.src = g; m.parentNode.insertBefore(a, m)
})(window, document, 'script', '//www.google-analytics.com/analytics.js', 'ga');
ga('create', 'UA-70716117-1', 'auto');
ga('send', 'pageview');
</script>
</div>
</body>
</html>