8. Finding other artists that IU fans might like

Learning objectives

  • Understand the concept and purpose of recommender systems.
  • Build a Matrix Factorization (MF) based recommendation model with the implicit library.
  • Use music listening records to find similar artists and to recommend artists.
  • Learn the CSR matrix, a data structure frequently used in recommender systems.
  • Learn the difference between explicit data and implicit data in user behavior data.
  • Build a recommendation model yourself on a new dataset.

What is a recommender system?

A recommender system is a feature that uses data to find content a user is likely to enjoy and then shows or recommends it automatically. Put simply, it recommends to me things similar to what other users like me enjoy. Recommender systems fall into two broad categories.

http://www.kocca.kr/insight/vol05/vol05_04.pdf

(1) Collaborative filtering


Collaborative filtering analyzes large amounts of existing user behavior data and recommends items that users with tastes similar to the target user have liked in the past.

  • Advantages
    • Results are intuitive, and there is no need to analyze the detailed content of the items.
  • Disadvantages
    • It has the cold start problem: a new user or item has no behavior history to base recommendations on.
    • The computation is relatively heavy, so it cannot recommend efficiently when there are many users.
    • It has the long tail problem: even when the system holds many items, users pay attention only to a small number of popular ones.
  • Methods such as Matrix Factorization and the k-Nearest Neighbor (kNN) algorithm are widely used.
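
The idea of "recommend what similar users liked" can be sketched with a tiny user-based nearest-neighbor example. The play-count matrix and the choice of cosine similarity are assumptions made for illustration; this is not the model trained later in these notes.

```python
import numpy as np

# Hypothetical play-count matrix: rows are users, columns are artists.
plays = np.array([
    [5, 3, 0, 0],   # user 0
    [4, 0, 0, 1],   # user 1 (the user we recommend for)
    [0, 0, 4, 5],   # user 2
], dtype=float)

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

target = 1
# Similarity of the target user to every other user (self is excluded with -1).
similarities = [
    cosine_similarity(plays[target], plays[u]) if u != target else -1.0
    for u in range(len(plays))
]
neighbor = int(np.argmax(similarities))

# Recommend artists the nearest neighbor played that the target has not played.
candidates = np.where((plays[neighbor] > 0) & (plays[target] == 0))[0]
print(neighbor, candidates.tolist())
```

With real data the similarity has to be computed against every other user, which is where the computational cost listed above comes from.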

(2) Content-based filtering


Content-based filtering builds recommendations by analyzing the items themselves. To recommend music, for example, it analyzes the music itself and recommends similar music.

Content-based filtering extracts a profile of each item (item profile) and a profile of each user's preferences (user profile), then computes the similarity between them. The well-known music site Pandora, for example, analyzes each newly released song and extracts about 400 characteristics such as genre, beat, and timbre. It then builds a user's profile from the characteristics of the songs that user marked with 'like'. Comparing song characteristics with the user profile lets it offer music the user is likely to prefer.

  • Techniques such as clustering analysis, artificial neural networks, and tf-idf (term frequency-inverse document frequency) are used.
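
A minimal sketch of the item-profile and user-profile comparison described above. The three features and their values are made up for illustration (Pandora's real features are not reproduced here), and averaging the liked items is only one simple way to build a user profile.

```python
import numpy as np

# Hypothetical item profiles; columns = (tempo, acoustic, electronic), each in [0, 1].
item_profiles = np.array([
    [0.9, 0.1, 0.8],  # track 0
    [0.2, 0.9, 0.1],  # track 1
    [0.8, 0.2, 0.9],  # track 2
    [0.3, 0.8, 0.2],  # track 3
])
liked = [0]  # tracks the user marked 'like'

# User profile: the mean of the profiles of the liked tracks.
user_profile = item_profiles[liked].mean(axis=0)

# Cosine similarity between the user profile and every item profile.
norms = np.linalg.norm(item_profiles, axis=1) * np.linalg.norm(user_profile)
scores = item_profiles @ user_profile / norms
scores[liked] = -np.inf          # do not recommend what was already liked
best = int(np.argmax(scores))
print(best)
```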

A recommender system is a model that predicts which items a user is likely to consume when there are very many items and user tastes vary widely.

  • YouTube: a huge number of videos are uploaded every day, and user tastes vary (gaming, beauty, educational content, news).
  • Facebook: a huge number of posts are published, and the pages, friends, and groups each user cares about are all different.
  • Amazon: even within a single category there are a huge number of products, and preferred brands and purchase criteria vary.

๋ฐ์ดํ„ฐ ํƒ์ƒ‰๊ณผ ์ „์ฒ˜๋ฆฌ

(1) ๋ฐ์ดํ„ฐ ์ค€๋น„


http://ocelma.net/MusicRecommendationDataset/lastfm-360K.html ๋ฐ์ดํ„ฐ์…‹ ํ™ˆํŽ˜์ด์ง€

# Load the file
import pandas as pd
import os

fname = os.getenv('HOME') + '/aiffel/recommendata_iu/data/lastfm-dataset-360K/usersha1-artmbid-artname-plays.tsv'
col_names = ['user_id', 'artist_MBID', 'artist', 'play']   # column names chosen arbitrarily
data = pd.read_csv(fname, sep='\t', names= col_names)      # the file is tab-separated
data.head(10)

# Keep only the columns we use
using_cols = ['user_id', 'artist', 'play']
data = data[using_cols]
data.head(10)

data['artist'] = data['artist'].str.lower() # lowercase the artist names to make searching easier
data.head(10)

# Check the first user's data
condition = (data['user_id']== data.loc[0, 'user_id'])
data.loc[condition]

(2) Exploring the data


Information to check

  • Number of users, number of artists, and the most popular artists
  • Statistics on how many artists each user listens to
  • Statistics on each user's median play count
# Number of users
data['user_id'].nunique()

# Number of artists
data['artist'].nunique()

# Most popular artists (by number of listening users)
artist_count = data.groupby('artist')['user_id'].count()
artist_count.sort_values(ascending=False).head(30)

# Statistics on how many artists each user listens to
user_count = data.groupby('user_id')['artist'].count()
user_count.describe()

# Statistics on each user's median play count
user_median = data.groupby('user_id')['play'].median()
user_median.describe()

# The names must be ones that exist in the dataset
my_favorite = ['black eyed peas' , 'maroon5' ,'jason mraz' ,'coldplay' ,'beyoncé']

# Assume a user_id 'zimin' played each of the artists above 30 times
my_playlist = pd.DataFrame({'user_id': ['zimin']*5, 'artist': my_favorite, 'play':[30]*5})

if not data.isin({'user_id':['zimin']})['user_id'].any(): # if user_id has no 'zimin' rows yet
    # DataFrame.append was removed in pandas 2.0, so add the my_favorite rows with pd.concat
    data = pd.concat([data, my_playlist], ignore_index=True)

data.tail(10) # check that the rows were added

(3) Preprocessing for the model


To make the data easier to manage, we replace user and artist names with integer indices.

# Find the unique users and artists
user_unique = data['user_id'].unique()
artist_unique = data['artist'].unique()

# Index the users and artists (idx is short for index)
user_to_idx = {v:k for k,v in enumerate(user_unique)}
artist_to_idx = {v:k for k,v in enumerate(artist_unique)}

# Check that the indexing worked.
print(user_to_idx['zimin'])    # 'zimin' was added last among 358869 users, so this should print 358868.
print(artist_to_idx['black eyed peas'])

# Replace the values in the data columns with their indices.
# For the dictionary get method, see https://wikidocs.net/16.

# Use user_to_idx.get to build a Series with every user_id value mapped to its index.
# Any row that fails to map becomes NaN, so remove it with dropna().
temp_user_data = data['user_id'].map(user_to_idx.get).dropna()
if len(temp_user_data) == len(data):   # every row was indexed successfully
    print('user_id column indexing OK!!')
    data['user_id'] = temp_user_data   # replace data['user_id'] with the indexed Series
else:
    print('user_id column indexing Fail!!')

# Index the artist column the same way with artist_to_idx.
temp_artist_data = data['artist'].map(artist_to_idx.get).dropna()
if len(temp_artist_data) == len(data):
    print('artist column indexing OK!!')
    data['artist'] = temp_artist_data
else:
    print('artist column indexing Fail!!')

data

Explicit and implicit user feedback

  • Explicit data: data in which users express their preference directly, such as likes and ratings.
  • Implicit data: data that reveals preference or taste indirectly, such as search history, visited pages, purchase history, and mouse movement logs.
# Share of rows where the artist was played only once
only_one = data[data['play']<2]
one, all_data = len(only_one), len(data)
print(f'{one},{all_data}')
print(f'Ratio of only_one over all data is {one/all_data:.2%}')

The model built here interprets the implicit data with the following rules.

  • If a user played an artist even once, we treat that as a preference.
  • Artists played many times get more weight, so we treat the preference as more certain.
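
The two rules can be written as a binary preference and a confidence weight. The sketch below uses the linear form from the Collaborative Filtering for Implicit Feedback Datasets paper (confidence = 1 + alpha × plays); the value of alpha and the play counts are illustrative assumptions, not values taken from these notes.

```python
import numpy as np

alpha = 40.0  # illustrative scaling constant; tune it for real data

# Hypothetical play counts of one user for four artists.
plays = np.array([0, 1, 30, 200], dtype=float)

# Rule 1: any play at all counts as a preference.
preference = (plays > 0).astype(float)

# Rule 2: more plays mean higher confidence in that preference.
confidence = 1.0 + alpha * plays

print(preference.tolist())
print(confidence.tolist())
```

An artist with zero plays still gets confidence 1, so "not played" acts as weak evidence of no preference rather than being ignored.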

Matrix Factorization (MF)

Among the many recommender models, we use Matrix Factorization (MF).

[Figure: images00]

MF decomposes the rating matrix R into two feature matrices, P and Q. In the figure below, P holds the users' feature vectors and Q holds the movies' feature vectors. The inner product of a user vector and a movie vector is taken as that user's preference for the movie.

[Figure: images01]

The criterion for good vectors is that the inner product of user i's vector and item j's vector is close to the rating user i actually gave item j.
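
As a minimal sketch of this inner-product view, the snippet below builds random P and Q and shows that one entry of the reconstructed matrix is exactly the dot product of a user vector and an item vector. The sizes and random values are assumptions for illustration; no training happens here.

```python
import numpy as np

rng = np.random.default_rng(0)
num_users, num_items, k = 4, 5, 2     # k is the number of latent features

P = rng.normal(size=(num_users, k))   # one feature vector per user
Q = rng.normal(size=(num_items, k))   # one feature vector per item

# Predicted rating matrix: every user vector dotted with every item vector.
R_hat = P @ Q.T

# The prediction for user i and item j is a single inner product.
i, j = 1, 3
single = float(P[i] @ Q[j])
print(R_hat.shape, single)
```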

[Image: formula]

The model used here is the one proposed in the paper Collaborative Filtering for Implicit Feedback Datasets.
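
For reference, the objective minimized in that paper can be written as follows. The notation is adapted to the P and Q used in these notes; the symbols b_ij (binary preference), c_ij (confidence), alpha, and lambda are introduced here and do not appear in the original text.

```latex
\min_{P,\,Q} \; \sum_{i,j} c_{ij}\,\bigl(b_{ij} - p_i^{\top} q_j\bigr)^2
  \;+\; \lambda \Bigl( \sum_i \lVert p_i \rVert^2 + \sum_j \lVert q_j \rVert^2 \Bigr),
\qquad
b_{ij} = \begin{cases} 1 & r_{ij} > 0 \\ 0 & r_{ij} = 0 \end{cases},
\qquad
c_{ij} = 1 + \alpha\, r_{ij}
```

Here r_ij is the observed play count of item j by user i, p_i and q_j are the rows of P and Q, and lambda controls the regularization strength.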

CSR (Compressed Sparse Row) Matrix


Storing the user × item rating matrix as a dense matrix would take 360,000 × 290,000 × 1 byte ≈ 1.04 × 10^11 bytes, or about 97 GiB. Holding that much in memory is practically impossible, so we use CSR instead.

CSR is a data structure that stores only the values and positions of the non-zero entries of a sparse matrix. It minimizes memory use while still representing exactly the same matrix.
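
A small sketch of what CSR actually stores, using scipy on a made-up 3 × 3 matrix. The three arrays correspond to the "values and positions" mentioned above.

```python
import numpy as np
from scipy.sparse import csr_matrix

# A small, mostly-zero matrix (toy data).
dense = np.array([
    [0, 0, 3],
    [4, 0, 0],
    [0, 5, 6],
])
sparse = csr_matrix(dense)

values = sparse.data.tolist()      # non-zero values, stored row by row
columns = sparse.indices.tolist()  # column index of each stored value
row_ptr = sparse.indptr.tolist()   # row r owns values[row_ptr[r]:row_ptr[r + 1]]

print(values, columns, row_ptr)
```

Only four values, four column indices, and four row pointers are stored, however many zeros the dense matrix contains.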

# Exercise: read the explanation above and build the matrix yourself
from scipy.sparse import csr_matrix

num_user = data['user_id'].nunique()
num_artist = data['artist'].nunique()

# Rows are user indices, columns are artist indices, values are play counts
csr_data = csr_matrix((data.play, (data.user_id, data.artist)), shape= (num_user, num_artist))
csr_data

Training the MF model


Let's train a Matrix Factorization model with the implicit package.

  • The implicit package trains a variety of models on implicit datasets, as described in the previous step, and does so very quickly.
  • We use the ALS (AlternatingLeastSquares) model implemented in this package. Training both feature matrices of the factorization at once does not converge well, so alternating between fixing one matrix and learning the other, the AlternatingLeastSquares approach, is known to be effective.
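
The alternating idea can be sketched in a few lines of numpy: hold Q fixed and solve a ridge regression for P, then hold P fixed and solve for Q. This is a plain least-squares version on a tiny dense matrix, written for illustration; it is not the confidence-weighted implementation inside the implicit package, and the matrix, rank, and regularization value are assumptions.

```python
import numpy as np

def als_step(R, fixed, reg):
    """Solve for one factor matrix while the other (`fixed`) is held constant."""
    k = fixed.shape[1]
    gram = fixed.T @ fixed + reg * np.eye(k)
    # Closed-form ridge solution: X = R @ fixed @ inv(gram)
    return np.linalg.solve(gram, fixed.T @ R.T).T

def loss(R, P, Q, reg):
    """Squared reconstruction error plus L2 regularization."""
    return float(np.sum((R - P @ Q.T) ** 2) + reg * (np.sum(P ** 2) + np.sum(Q ** 2)))

rng = np.random.default_rng(0)
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)   # toy user x item matrix
k, reg = 2, 0.1
P = rng.normal(scale=0.1, size=(R.shape[0], k))
Q = rng.normal(scale=0.1, size=(R.shape[1], k))

losses = [loss(R, P, Q, reg)]
for _ in range(15):
    P = als_step(R, Q, reg)      # update users with items fixed
    Q = als_step(R.T, P, reg)    # update items with users fixed
    losses.append(loss(R, P, Q, reg))

print(losses[0], losses[-1])
```

Each half-step has a closed-form solution, which is why the loss never increases from one step to the next.
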
from implicit.als import AlternatingLeastSquares
import os
import numpy as np

# Settings recommended by the implicit library
os.environ['OPENBLAS_NUM_THREADS']='1'
os.environ['KMP_DUPLICATE_LIB_OK']='True'
os.environ['MKL_NUM_THREADS']='1'

# Declare the implicit AlternatingLeastSquares model
als_model = AlternatingLeastSquares(factors=100, regularization=0.01, use_gpu=False, iterations=15, dtype=np.float32)

# The ALS model used here takes an item x user matrix as input, so we transpose.
# (This matches older implicit releases; implicit 0.5 and later expect user x item, so check the installed version.)
csr_data_transpose = csr_data.T
csr_data_transpose

# Train the model
als_model.fit(csr_data_transpose)

# Inspect the learned vectors
zimin, black_eyed_peas = user_to_idx['zimin'], artist_to_idx['black eyed peas']
zimin_vector, black_eyed_peas_vector = als_model.user_factors[zimin], als_model.item_factors[black_eyed_peas]

zimin_vector
black_eyed_peas_vector

# Inner product of the zimin and black_eyed_peas vectors
np.dot(zimin_vector, black_eyed_peas_vector) # 0.5098079

# Preference for another artist
queen = artist_to_idx['queen']
queen_vector = als_model.item_factors[queen]
np.dot(zimin_vector, queen_vector) # 0.3044492

Finding similar artists + recommending to a user

(1) Finding similar artists


We find similar artists with the similar_items method of the AlternatingLeastSquares class.

# Find similar artists
favorite_artist = 'coldplay'
artist_id = artist_to_idx[favorite_artist]
similar_artist = als_model.similar_items(artist_id, N=15)
similar_artist

# Invert artist_to_idx to build a dict that maps an index back to the artist name.
idx_to_artist = {v:k for k,v in artist_to_idx.items()}
[idx_to_artist[i[0]] for i in similar_artist]

# Function that returns artists similar to the given one
def get_similar_artist(artist_name: str):
    artist_id = artist_to_idx[artist_name]
    similar_artist = als_model.similar_items(artist_id)
    similar_artist = [idx_to_artist[i[0]] for i in similar_artist]
    return similar_artist

# Try other artists
get_similar_artist('2pac')
get_similar_artist('lady gaga')

People who prefer a particular genre concentrate their listening within it, so genre-specific characteristics stand out in the results.
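
One common way to rank similar items from learned factors is cosine similarity between item vectors. The sketch below shows that idea on made-up two-dimensional factors; the helper most_similar is hypothetical, and this is an illustration of the concept rather than the library's internal code.

```python
import numpy as np

# Hypothetical learned item factors (one row per artist).
item_factors = np.array([
    [1.0, 0.1],   # artist 0
    [0.9, 0.2],   # artist 1
    [0.1, 1.0],   # artist 2
    [2.0, 0.3],   # artist 3: nearly the same direction as artist 0, larger norm
])

def most_similar(item_id, factors, n=3):
    """Return the n items with the highest cosine similarity to item_id."""
    unit = factors / np.linalg.norm(factors, axis=1, keepdims=True)
    scores = unit @ unit[item_id]
    order = np.argsort(-scores)[:n]
    return [(int(i), float(scores[i])) for i in order]

result = most_similar(0, item_factors)
print(result)
```

In this sketch the query artist itself comes first, and artist 3 ranks above artist 1 because cosine similarity ignores vector length and compares direction only.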

(2) Recommending artists to a user


We get artist recommendations with the recommend method of the AlternatingLeastSquares class. filter_already_liked_items is the argument that excludes items the user has already rated.

user = user_to_idx['zimin']
# recommend takes the user x item CSR matrix.
artist_recommended = als_model.recommend(user, csr_data, N=20, filter_already_liked_items=True)
artist_recommended

# index to artist
[idx_to_artist[i[0]] for i in artist_recommended]

# Check how much each played artist contributed to the recommendation
rihanna = artist_to_idx['rihanna']
explain = als_model.explain(user, csr_data, itemid=rihanna)

[(idx_to_artist[i[0]], i[1]) for i in explain[1]]
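
Conceptually, recommending for a user means scoring every item by its inner product with the user's vector, masking items already played, and taking the top N. The sketch below shows this on made-up factors; it illustrates the idea and is not the library's implementation.

```python
import numpy as np

# Hypothetical learned factors.
user_vector = np.array([1.0, 0.5])
item_factors = np.array([
    [0.9, 0.1],   # artist 0 (already played by the user)
    [0.8, 0.6],   # artist 1
    [0.1, 0.2],   # artist 2
    [0.7, 0.9],   # artist 3
])
already_played = [0]

# Score every artist with one matrix-vector product.
scores = item_factors @ user_vector

# Mask what the user already played, then take the top 2.
scores[already_played] = -np.inf
top_n = np.argsort(-scores)[:2]
print(top_n.tolist())
```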

(3) Wrap-up


We built an artist recommendation model with MF, which is often used as a baseline in recommender systems. The model has a few shortcomings, however.

  1. It is hard to reflect metadata about users and artists. Music taste may differ by age group, for example, but the model cannot easily take that into account.
  2. It is hard to reflect when a user played something. A play from 10 years ago counts the same as a play today.

Retrospective

  • When building csr_data, passing the unique counts as shape raised an error saying the row index exceeded the matrix dimensions, and I did not know why. Building it without shape worked, but the shape of the resulting csr_data differed from the unique counts. I am not sure why... maybe there are NaN values?
  • Today's assignment felt like the hardest one so far. It may be because I am not yet used to data preprocessing, but preprocessing seems to be almost all of the work.
  • It feels like good preprocessing is half the battle. You cannot even start without data, so I need to get comfortable with preprocessing and with tools such as pandas and numpy.
  • After training the model and looking at the actual recommendations, picking Toy Story returned Toy Story 2, A Bug's Life, Aladdin, and so on, so it is recommending animated films. I think that means the training went well.
  • I found out why the shape error occurs when building csr_data. csr_matrix sizes the matrix from the largest index in the (row_ind, col_ind) parameter, that is max(row_ind) and max(col_ind), not from the number of unique values. That is harmless when the row and column indices are contiguous, but movie_id has gaps, so 3953, taken from max(row_ind), is used instead of the 3628 distinct movie_id values. Likewise, user_id has 6040 unique values, but the index starts at 1 and ends at 6041, so 6042 is used for col_ind. I tried to fix this, but each movie_id already has a title assigned, so I would have had to add the matching title column back to the ratings DataFrame, drop duplicates, sort by movie_id, and reassign movie_id to each title. That was too tedious, so I gave up.
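
The shape behavior described in the last note can be reproduced on toy data. The ids below are made up; the sketch shows the inferred shape, the error when the unique counts are passed as shape, and one possible fix that re-indexes the ids to 0..n-1 with np.unique (the same effect as the dict-based indexing used for the artist data above).

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical ratings whose ids start at 1 and have gaps.
user_ids = np.array([1, 1, 2, 4])
movie_ids = np.array([1, 5, 9, 5])
ratings = np.array([4, 3, 5, 2])

# Without shape, scipy infers (max index + 1) along each axis.
inferred = csr_matrix((ratings, (user_ids, movie_ids)))

# Passing the unique counts as shape fails because some ids exceed them.
num_users = len(set(user_ids.tolist()))
num_movies = len(set(movie_ids.tolist()))
try:
    csr_matrix((ratings, (user_ids, movie_ids)), shape=(num_users, num_movies))
    shape_error_raised = False
except ValueError:
    shape_error_raised = True

# Re-index the ids to contiguous 0..n-1 positions; unique_movies[movie_idx]
# recovers the original ids, so titles can still be looked up afterwards.
unique_users, user_idx = np.unique(user_ids, return_inverse=True)
unique_movies, movie_idx = np.unique(movie_ids, return_inverse=True)
compact = csr_matrix((ratings, (user_idx, movie_idx)), shape=(num_users, num_movies))

print(inferred.shape, shape_error_raised, compact.shape)
```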

Useful links

https://orill.tistory.com/entry/Explicit-vs-Implicit-Feedback-Datasets?category=1066301 Explicit vs. implicit feedback

https://lovit.github.io/nlp/machine learning/2018/04/09/sparse_mtarix_handling/#csr-matrix CSR

https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html How to build a CSR matrix

https://danthetech.netlify.app/DataScience/evaluation-metrics-for-recommendation-system Evaluation methods for recommender systems