As a rule, algorithm modifications that rely on specific features of a particular task are considered less valuable, because they are hard to generalize to a broader class of problems. That does not mean such modifications are unnecessary. On the contrary, they can often improve results substantially even on simple classic problems, which matters a great deal when algorithms are applied in practice. As an example, in this post I solve the Mountain Car problem with reinforcement learning and show that, by using knowledge of how the task is set up, it can be solved much faster.

About me
My name is Oleg Svidchenko. I am currently studying at the School of Physics, Mathematics and Computer Science at HSE St. Petersburg; before that I studied for three years at a university in St. Petersburg. I also work as a researcher at JetBrains Research. Before university I studied at the SSC of Moscow State University and, as a member of the Moscow team, became a prize winner of the All-Russian Olympiad in Informatics for school students.
What do we need?
If you want to try reinforcement learning, the Mountain Car challenge is a great fit. Today we need Python with the Gym and PyTorch libraries installed, plus basic knowledge of neural networks.
Task description
In a two-dimensional world, a car has to climb from the hollow between two hills to the top of the right-hand hill. The difficulty is that its engine is not powerful enough to overcome gravity and get there on the first attempt. We are asked to train an agent (in our case, a neural network) that controls the car so that it climbs the right hill as quickly as possible.
The car is controlled through interaction with the environment. The interaction is divided into independent episodes, and each episode runs step by step. At every step the agent receives a state s and a reward r from the environment in response to its action a. The environment may also report that the episode has ended. In this task s is a pair of numbers: the first is the car's position along the curve (one coordinate is enough because the car cannot leave the surface), and the second is its signed velocity along the surface. The reward r is a number that always equals -1 in this task, which encourages the agent to finish the episode as quickly as possible. There are only three possible actions: push the car left, do nothing, and push the car right. They correspond to the numbers 0 to 2. An episode ends when the car reaches the top of the right hill or when the agent has taken 200 steps.
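To make the difficulty concrete, here is a minimal pure-Python sketch of the dynamics. The constants and the update rule are written from memory of Gym's MountainCar-v0 implementation and should be treated as assumptions, as should the helper names `step` and `run_episode` and the fixed start position of -0.5 (Gym draws the start at random).

```python
import math

# Assumed constants, modelled on Gym's MountainCar-v0
MIN_POSITION, MAX_POSITION = -1.2, 0.6
MAX_SPEED = 0.07
GOAL_POSITION = 0.5
FORCE, GRAVITY = 0.001, 0.0025


def step(state, action):
    """Advance one step. action: 0 = push left, 1 = do nothing, 2 = push right."""
    position, velocity = state
    velocity += (action - 1) * FORCE - math.cos(3 * position) * GRAVITY
    velocity = max(-MAX_SPEED, min(MAX_SPEED, velocity))
    position += velocity
    position = max(MIN_POSITION, min(MAX_POSITION, position))
    if position == MIN_POSITION and velocity < 0:
        velocity = 0.0  # the car stops at the left wall
    done = position >= GOAL_POSITION
    return (position, velocity), -1, done


def run_episode(policy, max_steps=200, start=(-0.5, 0.0)):
    """Return the number of steps needed to reach the goal, or None on timeout."""
    state = start
    for t in range(1, max_steps + 1):
        state, _, done = step(state, policy(state))
        if done:
            return t
    return None


always_right = run_episode(lambda state: 2)
# Swing back and forth: push in the direction the car is already moving
with_momentum = run_episode(lambda state: 2 if state[1] >= 0 else 0)
print(always_right, with_momentum)
```

Under these assumptions, pushing right all the time never leaves the hollow, while rocking back and forth to build up speed reaches the goal within the 200-step limit.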
A bit of theory
Habr already has an article about DQN in which the author explains all the necessary theory quite well. Still, to make this post easier to read, I repeat it here in a more formal form.
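A reinforcement learning task is defined by a state space S, an action space A, a coefficient γ, a transition function T, and a reward function R. In general the transition and reward functions can be random variables, but here we consider a simpler version in which they are deterministic. The goal is to maximize the cumulative reward $\sum_{t=0}^{T} \gamma^t r_t$, where t is the step number in the environment and T is the number of steps in the episode.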
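The cumulative reward can be computed directly from a list of per-step rewards. A minimal sketch, assuming the usual discounted form in which the reward at step t is weighted by γ to the power t; the function name `discounted_return` is illustrative.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over the steps of one episode."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total


# A Mountain Car episode that runs into the 200-step limit: every reward is -1
rewards = [-1] * 200
undiscounted = discounted_return(rewards, gamma=1.0)
discounted = discounted_return(rewards, gamma=0.99)
print(undiscounted, discounted)
```

With γ below 1, rewards far in the future count for less, so the discounted total is smaller in magnitude than the plain sum of -200.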
To solve this problem, we define the value function V of a state s as the maximum cumulative reward obtainable when starting from s. If we knew this function, we could solve the problem simply by moving, at each step, to the state s with the highest value. It is not that simple, though: in most cases we do not know which action leads to the desired state. We therefore add the action a as a second parameter. The resulting function is called the Q-function. It gives the maximum cumulative reward obtainable by taking action a in state s. This function is enough to solve the problem: in state s we simply choose the a for which Q(s, a) is largest.
In practice we do not know the true Q-function, but it can be approximated in various ways. One such technique is the Deep Q-Network (DQN). Its idea is to approximate the Q-function for each action with a neural network.
Environment
Now let's move on to practice. First we need a way to emulate the MountainCar environment. The Gym library, which provides a large number of standard reinforcement learning environments, takes care of this. To create an environment, call the make method of the gym module and pass the name of the desired environment as a parameter:
    import gym

    env = gym.make("MountainCar-v0")
Detailed documentation and a description of the environment are available in the Gym project's documentation.
Let's look more closely at what we can do with the environment we created:
env.reset() - ends the current episode and starts a new one. Returns the initial state.
env.step(action) - performs the given action. Returns the new state, the reward, whether the episode has ended, and additional information that can be used for debugging.
env.seed(seed) - sets the random seed. It determines how initial states are generated during env.reset().
env.render() - displays the current state of the environment.
Implementing DQN
DQN is an algorithm that uses a neural network to estimate the Q-function. In the original paper, DeepMind defined a standard architecture for Atari games based on convolutional neural networks. Unlike those games, Mountain Car does not use an image as its state, so we have to choose the architecture ourselves.
Take, for example, an architecture with two hidden layers of 32 neurons each, with ReLU as the activation function after each hidden layer. The two numbers describing the state are fed to the network's input, and the output gives the estimates of the Q-function, one per action.
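    import copy

    import torch.nn as nn

    # Two hidden layers of 32 units; 2 state inputs, 3 Q-value outputs (one per action)
    model = nn.Sequential(
        nn.Linear(2, 32),
        nn.ReLU(),
        nn.Linear(32, 32),
        nn.ReLU(),
        nn.Linear(32, 3)
    )

    # Separate copy of the network, used to compute training targets
    target_model = copy.deepcopy(model)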
Since we will train the neural network on a GPU, we need to move the network there. The device variable will be global, because we will also need to move the data to it.
We also need to define an optimizer that will update the model's weights using gradient descent. Yes, there is more than one to choose from:
optimizer = optim.Adam(model.parameters(), lr=0.00003)
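All together:

    import copy

    import torch
    import torch.nn as nn
    import torch.optim as optim

    # Fall back to the CPU when no GPU is available
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    def create_new_model():
        model = nn.Sequential(
            nn.Linear(2, 32),
            nn.ReLU(),
            nn.Linear(32, 32),
            nn.ReLU(),
            nn.Linear(32, 3)
        )
        target_model = copy.deepcopy(model)

        # Move both networks to the chosen device
        model.to(device)
        target_model.to(device)

        optimizer = optim.Adam(model.parameters(), lr=0.00003)
        return model, target_model, optimizer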
Now let's declare a function that computes the loss, takes its gradient, and applies a descent step. Before that, though, the data from the batch has to be moved to the GPU:
state, action, reward, next_state, done = batch
Next we need the true values of the Q-function, but since we do not know them, we estimate them from the values for the next state:
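    target_q = torch.zeros(reward.size()[0]).float().to(device)
    with torch.no_grad():
        # Assumption: done is a boolean tensor; terminal transitions keep a zero bootstrap value
        target_q[~done] = target_model(next_state).max(1)[0][~done]
    target_q = reward + gamma * target_q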
And the current prediction:
q = model(state).gather(1, action.unsqueeze(1))
Using target_q and q, we compute the loss function and update the model:
loss = F.smooth_l1_loss(q, target_q.unsqueeze(1))
All together:

    gamma = 0.99

    def fit(batch, model, target_model, optimizer):
        state, action, reward, next_state, done = batch
        # The target, prediction, and loss steps shown above follow here
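The target and loss computed by such a training step can be illustrated without PyTorch. The sketch below is plain Python under two assumptions: terminal transitions get no bootstrap term (the usual DQN convention), and `smooth_l1` follows the piecewise definition of the smooth L1 loss with a threshold of 1. The function names and Q-values are made up for illustration.

```python
def td_targets(rewards, next_q_values, dones, gamma=0.99):
    """Target per transition: r + gamma * max_a' Q_target(s', a'), or just r at episode end."""
    targets = []
    for r, next_q, done in zip(rewards, next_q_values, dones):
        bootstrap = 0.0 if done else max(next_q)
        targets.append(r + gamma * bootstrap)
    return targets


def smooth_l1(prediction, target):
    """Loss for one value: quadratic for small errors, linear for large ones."""
    diff = abs(prediction - target)
    return 0.5 * diff ** 2 if diff < 1.0 else diff - 0.5


# Two transitions: an ordinary step and the final step of an episode
rewards = [-1, -1]
next_q_values = [[-10.0, -9.0, -9.5], [-3.0, -2.0, -4.0]]  # made-up target-network outputs
dones = [False, True]

targets = td_targets(rewards, next_q_values, dones)
losses = [smooth_l1(q, t) for q, t in zip([-9.0, -1.5], targets)]
print(targets, losses)
```

The second transition ends the episode, so its target is just the reward, regardless of what the target network predicts for the next state.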
Since the model only computes the Q-function and does not act, we need a function that decides which actions the agent takes. As the decision rule we use an ε-greedy policy. The idea is that the agent usually acts greedily, choosing the action that maximizes the Q-function, but with probability ε it takes a random action. Random actions let the algorithm try actions it would never take under a purely greedy policy; this process is called exploration.
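    import random

    def select_action(state, epsilon, model):
        # Explore: with probability epsilon, pick one of the three actions at random
        if random.random() < epsilon:
            return random.randint(0, 2)
        # Exploit: pick the action with the largest predicted Q-value
        return model(torch.tensor(state).to(device).float().unsqueeze(0))[0].max(0)[1].view(1, 1).item()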
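The same decision rule can be tried out on a plain list of Q-values, without a neural network. This is a minimal sketch; the function name `epsilon_greedy`, the seeded generator, and the Q-values are assumptions made for illustration.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, otherwise the best-valued one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


rng = random.Random(0)
q_values = [-12.0, -11.5, -10.0]  # made-up estimates for left, do nothing, right

greedy = [epsilon_greedy(q_values, 0.0, rng) for _ in range(100)]
exploring = [epsilon_greedy(q_values, 1.0, rng) for _ in range(1000)]
print(set(greedy), sorted(set(exploring)))
```

With epsilon set to 0 the rule always returns the best-valued action; with epsilon set to 1 every action gets tried.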
Because we train the neural network on batches, we need a buffer that stores the experience of interacting with the environment and from which batches are sampled:
    import random

    class Memory:
        def __init__(self, capacity):
            self.capacity = capacity
            self.memory = []
            self.position = 0

        def push(self, element):
            """Store an element, overwriting the oldest one once the buffer is full."""
            if len(self.memory) < self.capacity:
                self.memory.append(None)
            self.memory[self.position] = element
            self.position = (self.position + 1) % self.capacity

        def sample(self, batch_size):
            """Return a random batch, transposed into one tuple per field."""
            return list(zip(*random.sample(self.memory, batch_size)))

        def __len__(self):
            return len(self.memory)
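The same behaviour (a fixed capacity, dropping the oldest entries, and sampling batches transposed by field) can be sketched with `collections.deque` from the standard library. This is an alternative illustration rather than the class used in this post, and the transition values are made up.

```python
import random
from collections import deque

buffer = deque(maxlen=3)  # the oldest transitions are dropped automatically
for step in range(5):
    # (state, action, reward, new_state, done) with made-up values
    buffer.append(((step, 0.0), 1, -1, (step + 1, 0.0), False))

batch = random.sample(list(buffer), 2)
# Transpose a list of transitions into one tuple per field
states, actions, rewards, new_states, dones = zip(*batch)
print(len(buffer), actions, rewards)
```

Transposing with `zip(*batch)` is what lets a training step unpack the batch into separate state, action, reward, next-state, and done collections.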
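Naive solution
First we declare the constants used during training and create the model.
Although it would be natural to split the interaction into episodes, for describing the training process it is more convenient to split it into individual steps, because we want to make one gradient descent step after every environment step.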
Let's look in more detail at what a single training step looks like. Suppose we are at step number step out of max_steps and the current state is state. Taking an action with the ε-greedy policy then looks like this:
    epsilon = max_epsilon - (max_epsilon - min_epsilon) * step / max_steps
    action = select_action(state, epsilon, model)
    new_state, reward, done, _ = env.step(action)
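The first line is a linear schedule: epsilon starts at max_epsilon and falls to min_epsilon by the last step. A small sketch of that formula on its own; the numeric settings are hypothetical and chosen only for illustration.

```python
def linear_epsilon(step, max_steps, max_epsilon, min_epsilon):
    """Linearly interpolate from max_epsilon at step 0 to min_epsilon at max_steps."""
    return max_epsilon - (max_epsilon - min_epsilon) * step / max_steps


# Hypothetical settings, not the values used for the experiments in this post
max_steps, max_epsilon, min_epsilon = 1000, 0.5, 0.1

schedule = [linear_epsilon(s, max_steps, max_epsilon, min_epsilon) for s in (0, 500, 1000)]
print(schedule)
```

Early on the agent explores a lot; as training progresses it relies more and more on its own Q-value estimates.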
We immediately add the new experience to memory and start a new episode if the current one has ended:
    memory.push((state, action, reward, new_state, done))
    if done:
        state = env.reset()
        done = False
    else:
        state = new_state
Then we take a gradient descent step (provided, of course, that we can already collect at least one batch):
    if step > batch_size:
        fit(memory.sample(batch_size), model, target_model, optimizer)
All that remains is to update target_model:
    if step % target_update == 0:
        target_model = copy.deepcopy(model)
We would also like to monitor the training process. To do that, every time target_model is updated we play one extra episode with epsilon = 0 and store its total reward in the rewards_by_target_updates buffer:
    if step % target_update == 0:
        target_model = copy.deepcopy(model)

        # Play one evaluation episode greedily (epsilon = 0)
        state = env.reset()
        total_reward = 0
        while not done:
            action = select_action(state, 0, target_model)
            state, reward, done, _ = env.step(action)
            total_reward += reward

        # Start a fresh training episode afterwards
        done = False
        state = env.reset()
        rewards_by_target_updates.append(total_reward)
Running this code produces something like the following graph:

What went wrong?
Is it a bug? Is it the wrong algorithm? Are the parameters bad? Not really. The problem is in the task itself, namely in the reward function. Let's look at it more closely. At every step the agent receives a reward of -1, and this continues until the episode ends. Such a reward motivates the agent to finish the episode as quickly as possible, but it says nothing about how to do so. As a result, the only way for the agent to learn to solve the problem in this formulation is to solve it many times through exploration.
We could, of course, try more sophisticated exploration algorithms in place of our ε-greedy policy. But first, using them would make our model more complex, which we would like to avoid, and second, there is no guarantee that they would work well enough for this task. Instead, we can remove the source of the problem by changing the task itself, namely by changing the reward function. This is known as reward shaping.
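Speeding up convergence
Intuition tells us that to climb the hill the car has to build up speed. The higher the speed, the closer the agent is to solving the problem. We can tell the agent this, for example, by adding the absolute value of the velocity, multiplied by some coefficient, to the reward: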
    modified_reward = reward + 10 * abs(new_state[1])
Accordingly, the line
    memory.push((state, action, reward, new_state, done))
in the training step has to be replaced with
    memory.push((state, action, modified_reward, new_state, done))
Now let's look at the new graph (it shows the original reward, without the modification):
Here RS stands for Reward Shaping.
Is it OK to do this?
鲿©ã¯æããã§ããè³ã-200ãšã¯ç°ãªãå§ããããããšãŒãžã§ã³ãã¯äžãç»ãããšãæç¢ºã«åŠã³ãŸããã æ®ã£ãŠãã質åã¯1ã€ã ãã§ããå ±é
¬ã®æ©èœã倿Žãããšãã¿ã¹ã¯èªäœã倿ŽãããŸããèŠã€ãã£ãæ°ããåé¡ã®è§£æ±ºçã¯ãå€ãåé¡ã«åœ¹ç«ã€ã®ã§ããããã
First, let's work out what "good" means in our case. In solving the problem we are looking for the optimal policy, the one that maximizes the total reward per episode. So we can replace the word "good" with "optimal", because that is what we are after. We also optimistically hope that sooner or later our DQN will find the optimal solution to the modified problem rather than get stuck in a local maximum. The question can therefore be restated: if changing the reward function changed the problem itself, is the optimal solution to the new problem also optimal for the old one?
As it turns out, in the general case no such guarantee can be given. The answer depends on exactly how we changed the reward function, how it was structured before, and how the environment itself is structured. Fortunately, there is a paper whose authors studied how changing the reward function affects the optimality of the solution found.
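First, they found a whole class of "safe" changes based on the method of potentials: $R' = R + \gamma \Phi(s') - \Phi(s)$, where $\Phi$ is a potential that depends only on the state. For such functions the authors proved that if a solution is optimal for the new problem, it is also optimal for the old problem.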
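Second, the authors showed that for any other change there exist a problem, a reward function R, and an optimal solution to the modified problem such that this solution is not optimal for the original problem. This means that we cannot guarantee the quality of the solution found if we use a change that is not based on the method of potentials.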
Thus, using potential functions to change the reward function can only change the algorithm's rate of convergence; it does not affect the final solution.
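One way to see why potential-based changes are safe: along any trajectory the discounted shaping terms telescope, so the shaped and original returns differ only by a term that depends on the first and last states. Below is a small numeric check of that identity. The trajectory, the potential (proportional to the car's speed), and the helper names are assumptions made for illustration.

```python
def potential(state, scale=300.0):
    """Illustrative potential: proportional to the car's speed."""
    return scale * abs(state[1])


def discounted_sum(values, gamma):
    return sum((gamma ** t) * v for t, v in enumerate(values))


gamma = 0.99
# A made-up trajectory of (position, velocity) states
states = [(-0.5, 0.0), (-0.49, 0.01), (-0.47, 0.02), (-0.46, 0.01), (-0.43, 0.03)]
rewards = [-1] * (len(states) - 1)

# Shaping term for each transition: gamma * potential(next) - potential(current)
shaping = [gamma * potential(s_next) - potential(s)
           for s, s_next in zip(states, states[1:])]
shaped = [r + f for r, f in zip(rewards, shaping)]

difference = discounted_sum(shaped, gamma) - discounted_sum(rewards, gamma)
# The shaping terms telescope: only the first and last states matter
endpoints_only = gamma ** len(rewards) * potential(states[-1]) - potential(states[0])
print(difference, endpoints_only)
```

The two printed numbers agree up to floating-point error, whatever intermediate states the trajectory passes through.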
Speeding up convergence the right way
Now that we know how to change the reward safely, let's modify the task again, this time using the method of potentials instead of a naive heuristic:
    modified_reward = reward + 300 * (gamma * abs(new_state[1]) - abs(state[1]))
Let's look at the graph of the original reward:

As it turned out, besides providing theoretical guarantees, modifying the reward with potential functions also improved the result considerably, especially in the early stages. It is certainly possible that better hyperparameters (random seed, gamma, and other coefficients) could be found for training the agent, but reward shaping still speeds up the model's convergence significantly.
Afterword
Thank you for reading to the end! I hope you enjoyed this small, practice-oriented excursion into reinforcement learning. Mountain Car is clearly a "toy" task, but, as we have seen, teaching an agent to solve even a task that looks this simple from a human point of view can be difficult.