A Study on Efficient Transfer Learning for Reinforcement Learning Using Sparse Coding

In this study, we have proposed a novel technique to employ sparse coding in transfer learning for reinforcement learning. By applying the knowledge to the bases of sparse coding dictionary matrix, we succeeded to choose the correct knowledge that should be transferred to new environment target tasks. We experimented the proposed method on the 5 target tasks and confirmed that the method provided jumpstart and reduced the total cell costs. By the result of these experiments, our proposed method was able to contribute significant improvement on the effective online transfer learning for reinforcement learning. However, as mentioned in section VII-E, negative transfer happened on some cells of the target task. For future work, to solve this problem we would like to reconsider the number of bases and the quality of the dictionary, and then aim to improve the accuracy of transfer learning. In concrete, we will increase the number of source tasks to obtain a larger dictionary and obtain a large number of high diversity bases in the dictionary. On the other hand, we also have to take the processing time into account. It will become unrealistic if the number of bases becomes enormous. So, in practice we have to limit the maximum number of bases. In this point, to deal with this problem, we are going to take the methods employed by prior studies such as [9] in our study and aim to construct an ideal dictionary within the upper limit of the number of bases.

7 trang | Chia sẻ: huongthu9 | Lượt xem: 394 | Lượt tải: 0

Bạn đang xem nội dung tài liệu A Study on Efficient Transfer Learning for Reinforcement Learning Using Sparse Coding, để tải tài liệu về máy bạn click vào nút DOWNLOAD ở trên

A Study on Efficient Transfer Learning for Reinforcement Learning Using Sparse Coding Midori Saito and Ichiro Kobayashi Advanced Sciences, Graduate School of Humanities and Sciences, Ochanomizu University, Tokyo, Japan Email: {saito.midori, koba}@is.ocha.ac.jp Abstract—By applying the knowledge previously obtained by reinforcement learning to new tasks, transfer learning has been successful in achieving efficient learning, rather than re-learning knowledge about action policies from scratch. However, in the case of applying transfer learning to reinforcement learning, it is not easy to determine which and how much the obtained knowledge should be transferred. With this background, in this study, we propose a novel method that enables to decide the knowledge and to determine the ratio of transference by adopting sparse coding in transfer learning. The transferred knowledge is represented as a linear combination of the accumulated knowledge by means of sparse coding. In the experiments, we have adopted colored mazes as tasks and confirmed that our proposed method significantly improved in terms of jumpstart and of the reduction of the total learning cost, compared with normal Q-learning. Index Terms—sparse coding, transfer learning, reinforcement learning, maze I. INTRODUCTION In reinforcement learning [1], an agent explores a target environment repeatedly, performing given tasks by trial and error to obtain optimal action policies in the environment. However, the way of learning requires quite a number of random explores until satisfied action policies are obtained, so this is regarded as a big problem with the method [2]. To solve this problem, many approaches have been studied to aim to reduce the number of exploration steps. Among such approaches, it has been reported that transfer learning is especially useful and succeeds in achieving efficient reinforcement learning by reusing the previously learned action policies in similar environments in the target environment [3][4]. However, because the similarities among task environments are not clearly defined in the framework of transfer learning, the similarity is required to be calculated in response to each task [5]. If the number of the states of a task and the number of agents’ actions are a quite a few, the information to be transferred shall be a huge amount, and then the calculation shall be so much complicated. With respect to this problem, in this paper, we employ sparse coding to calculate the similarities between the environments of a source task and a target task, and determine which and how much action policies Manuscript received June 11, 2015; revised October 2, 2015. are transferred. By this, we aim to make possible to efficiently transfer action policies to new target tasks. II. RELATED STUDIES This section presents the related studies to our study, in particular, the ones focus on transfer learning employed in reinforcement learning and also sparse coding employed in transfer learning. First, as for the transfer learning employed in reinforcement learning, it has been successful in generalizing information across multiple tasks. Even though between two different tasks, transferring the knowledge an agent has learned in a source task is useful in a target task [6]. Fernando et al. [3] employed Policy Reuse as a technique to improve transfer efficiency in reinforcement learning by reusing similar policies leaned in a past. The technique improves its exploration in target environments by probabilistically including the exploitation of those past policies. Then, they succeeded to improve the learning performance over different strategies with policy reuse, and contributed policy reuse as transfer learning among different domains [4]. Trung et al. [7] proposed a method to transfer old knowledge, and evaluated new options to see if they worked better. They succeeded in achieving efficient and online transfer, and improved jumpstart and faster convergence to near optimum policy. Furthermore, they proposed model-based reinforcement learning that supports efficient online-learning of the relevant features and introduced an online sparse coding learning technique for feature selection in high-dimensional spaces [8]. Then, they demonstrated practicality of their proposed model in both simulated and real robotics domains. Next, as for sparse coding, sparse coding was originally developed to achieve a good result in the field of signal processing, i.e., audio and natural images. It can approximately represent the input signal expressed by a vector with the linear combination of a few bases of the vector well [9]. For instance, in the field of image processing, Kai et al. [10] presented a method for earning image representations using a two-layer sparse coding scheme. The algorithm provided excellent results for hand-written digit recognition and object recognition, and achieved to automatically learn the features of the target. In this research, they proposed an approach that accounts for high-order dependency among patterns in a local image neighborhood. Then, as an example of applying Journal of Automation and Control Engineering Vol. 4, No. 4, August 2016 ©2016 Journal of Automation and Control Engineering doi: 10.18178/joace.4.4.324-330 324 sparse coding to transfer learning, Haithman et al. [5][11] proposed a novel transfer learning for reinforcement learning method capable of autonomously creating an inter-task mapping by using a novel combination of sparse coding. They succeeded to show not only transfer of information between similar tasks, but also between two very different domains in their experiments. Then, they illustrated that the learned inter-task mapping can be successfully used to improve the performance of a learned policy, reduce the learning times, and converge faster to a near-optimal policy. In this study, we propose a novel online transfer method making use of sparse coding in reinforcement learning, based on those state-of-the-art researches mentioned above. III. REINFORCEMENT LEARNING AND TRANSFER LEARNING A. Reinforcement Learning The reinforcement learning [1] is a machine learning method to obtain optimal action policies by making an agent repeatedly search in a target environment. In concrete, the learning process is shown as the following three steps: 1. An agent observes the states of an environment. 2. An agent selects and performs an action among the possible actions in the current environment. 3. The action performed at an environment is evaluated by being given reward or a penalty. The reinforcement learning is defined as a Markov Decision Processes (MDPs), and its state is represented as a tuple, . Here, S is a set of states; A is a set of actions; P is the transition probability expressed as P=Pr{st+1=s’ | st=s, at=a}; and R is reward given to an agent from the environment. An agent's action policy is expressed as π(s, a)=Pr{at=a | st=s}. The reinforcement learning aims to acquire the optimal action policies that maximize the total expectation value of the reward given from an environment as in (1). }|{}|{)( 1 0 ssrEssREsV tkt k k tt      (1) here, V π (s)is called state-value function for policyπ. γ indicates the discount ratio. B. Q-learning We employ Q-learning [14] as a reinforcement learning algorithm. It is a kind of Temporal Differential learning, and aims to maximize the evaluation value of actions, called Q-value. The equation for updating Q- value is shown in (2). Q(st,at ) =Q(st,at )+ a(r +g maxaQ(st+1,a) -Q(st,at )) (2) here, Q(s,a)=E[R|st=s,at=a], called action-value function, which expresses the value obtained from the action a at the state s. αindicates the learning ratio and γ indicates the discount ratio. In this study, we employ ε-greedy algorithm in deciding agent's action. In the action selection by ε-greedy algorithm, the actions are randomly selected with the probability of ε, and are selected so as maximum Q-value with the probability of 1-ε. C. Transfer Learning The transfer learning employed in the framework for reinforcement learning aims to reduce the number of random exploration in a new task. First, we obtain knowledge as action policies or Q-values by executing reinforcement learning in a source task, and then apply the obtained knowledge to a target task. Thereby, the efficiency of learning the target task will be improved, even if the target task is not the same as the source task. However, if the environment of the target task is considerably different from that of the source task, the transferred knowledge is useless at the target task. So, we should correctly choose the knowledge to be transferred in accordance with the situation of the target task. Therefore, we introduce sparse coding into transfer learning. IV. SPARSE CODING A. Sparse Coding In this study, we propose a method to realize efficient transfer learning in the framework for reinforcement learning by introducing sparse coding [13]. Sparse coding is a method of the signal processing. It selects the number of the basis vectors as small as possible for a signal, and expresses input signal with linear combination of the basis vectors. We show an equation representing the idea of sparse coding below. y = Dx (3) here, y indicates an input signal, D is a set of basis vectors called dictionary, and x is an activation matrix which indicates a set of coefficient corresponding to each of the base. Like this, in sparse coding, y is represented in two matrices, D and x. The objective function to be optimized for sparse coding is defined in (4). x* = minx 1 2 y - Dx 2 2 + l x 1 (4) here, the first right term of (4) indicates the term for minimizing the square error between the original information y and the restored information Dx, and the second term is for regularization which provides a constraint on deriving x in a sparse condition. λ is the parameter for regularization. By (4), we obtain an optimal sparse activation matrix x. B. Sparse Coding for Transfer Learning Now we explain an overview of how to apply sparse coding to transfer learning. In the case of reusing the knowledge, i.e., action policies, obtained by reinforcement learning in a new target task environment, the accuracy of transfer learning will be considerably different depending on which pieces of knowledge are chosen from a huge quantity of knowledge. Moreover, only a part of the whole accumulated knowledge is used to search an appropriate action policy for a target task in short term, though a huge quantity of knowledge is Journal of Automation and Control Engineering Vol. 4, No. 4, August 2016 ©2016 Journal of Automation and Control Engineering 325 necessary for the whole transfer. Therefore, it is thought that sparse coding works well in transfer learning by regarding one piece of knowledge obtained in a source task environment as one basic vector of dictionary matrix, namely, all accumulated knowledge is regarded as a set of basis vectors. Then, by regarding the state s observed in a target task as input vector, it can be represented with linear combination of knowledge in the dictionary matrix. C. Proposed Method As seen in Fig. 1, at first, in the target task, the costs of the target cell and its surrounding 5-5 square cells are observed and compiled as an input vector (step 1). Secondary, sparse coding is applied to the input vector with the dictionary built by source task, and an activation matrix is calculated (step 2). Then, the Q-value obtained in the source tasks which corresponds to the index of the non-zero elements of the activation matrix is multiplied with the value of the non-zero elements of the activation matrix (step 3). Lastly, the Q-value calculated in step 3 is returned to the current target cell (step4). In this way, in the target task, by repeating the above 4 steps and sequentially providing the Q-value obtained in the source tasks as the value of the target cell, an efficient transfer learning is achieved. The detail of this our proposed method is described in section V. Figure 1. Overview of proposed method. V. EXPERIMENTS A. Source Task At first, in order to construct a dictionary by means of sparse cording, we prepared 4 different source tasks (see, Fig. 2).We adopted colored mazes as tasks in [14], these mazes have 5 different cell colors in themselves, and the size of all 4 source tasks are the same, the height is 30 cells and the width is 30 cells. Each color has its own cost: i.e., white: 0.0, blue : -2.0, green : -3.0, red : -5.0, and black : -10.0. We regard the cell costs as the reward, and +100 is given when an agent arrives at goal. In each cell, the agent can select an action among 4 actions, i.e., moving 1 cell in either up, down, right, or left direction. The task we adopt in our experiments is moving to the goal at the right corner of the bottom of the maze from the start point at the left corner of the top of the maze. As for the parameter settings for Q-learning, α is set as 0.1 and γ is set as 0.9. Moreover, as the algorithm of selecting actions, we employ ε-greedy algorithm in which it takes random action with 20% probability and the action based on maximum Q-value with 80% probability, aiming that the agent can learn action policies to find the optimal route to the goal with total cell costs and total steps as low as possible. With these settings, the agent executed Q-learning and learned the optimal route to the goal. We took reinforcement learning for 1000000 episodes repeatedly on each maze, and after 1000000 episodes simulation, we recorded the information about cost and Q-value of each cell as the knowledge to be used for transfer learning. B. Construction of a Dictionay Matrix As mentioned in V-A, a dictionary matrix is constructed based on the information obtained from the result of 1000000 episodes Q-learning. Figure 2. 4 different source tasks. We regard the information in 5-5 square cell costs of the source tasks as a basis vector in a dictionary and extract possible information in 5-5 square cells from 30- 30 square cells. By this, the number of basis vectors in each task is 676 (=26*26) and then becomes 2704 in total (=676*4 tasks). So, the size of the dictionary matrix becomes 25*2704. The image of how to make a dictionary is shown in Fig. 3. C. Target Task As for target tasks, we use the same size square colored mazes as the source tasks. As well as the source tasks, the start point is at the left corner of the top of the maze and the goal point is at the right corner of the bottom of the maze (Fig. 4). As explained at step 1 in IV- C, the agent obtained the cell cost data of the currently exploring cell and its 5-5 square surroundings, and regarded as an input vector. Then, sparse coding chose some environment in 4 source tasks (the bases of the dictionary shown in V-B) that are similar to the target 5-5 square cell cost information, and calculated how similar those chosen environment (the results of the activations). Next, we extracted the index of non-zero activations from Journal of Automation and Control Engineering Vol. 4, No. 4, August 2016 ©2016 Journal of Automation and Control Engineering 326 the activation matrix, and the 5-5 cells’ Q-value data (gained in V-A) corresponding to that index, is multiplied by the corresponding activation values, and lastly took these linear combination. In that calculated Q-value data, the Q-value data for 4 action policies: up, down, right and left, corresponding to the middle cell of 5-5 square was extracted, and it returned to the target cell as its Q-value. While exploring the target task, if it is the first time for the agent to visit the cell, because there is no prior knowledge about a proper action at the cell, Q-value is transferred through the execution of sparse coding. On the other hand, if the agent visits the cell more than twice, normal Q-learning is applied to update the Q-value of the cell because there already exists prior knowledge about actions. Figure 3. How to construct a dictionary matrix. D. Experimental Settings In this paper, we examined our proposed method with 5 different target tasks (see, Fig. 4). In order to show the effectiveness of our proposed method, we employed a normal Q-learning as a baseline method to compare. The Q-learning parameters of the proposed method and the base line method are α=0.1,γ=0.9, and ε =0.2 of ε-greedy. We set the total value of the elements in the activation matrix to be 10.0 for a sparse coding setting. These parameter settings are empirically decided. Fig. 5 shows the relation between bases in the dictionary matrix and the elements in the activation matrix at a particular input cell when the total value of the elements of the activation matrix changes in the range of [0,100].For example, when the total activation value gets 40 in horizontal axis, we see that 3 bases in the dictionary are used to represent the input and the sum of the activation values of those 3 bases gets 40. Fig. 6 shows the result that activation value when the total value of activations changed to 1, 10, and 50. As seen from this figure, a few bases contain non- zero value in their activation, other bases have 0 activation value by sparse coding. Through these experiments, we evaluated 2 points: how the proposed method could reduce the total cell cost and supported jumpstart which means the reduction of number of steps necessary to reach the goal. Figure 4. 5 different target tasks. Value of total activations V al u e o f e ac h a ct iv at io n Figure 5. The trace of activations. Number of bases V al u e o f ac ti va ti o n s o f ea ch b as e Figure 6. The value of activations. E. Result Fig. 7-Fig. 11 represent the experimental results of the number of total steps and the values of total cell costs obtained in 5 target tasks. In the experiments, each task is repeatedly performed 100 times. In all the figures, the red lines indicate the proposed method and the blue lines are normal Q-learning. And these figures show the number of episodes in the horizontal-axis and the number of total steps or the value of total cell costs in the vertical-axis. Journal of Automation and Control Engineering Vol. 4, No. 4, August 2016 ©2016 Journal of Automation and Control Engineering 327 Figure 7. The result of target task 1. Figure 8. The result of target task 2. Figure 9. The result of target task 3. Figure 10. The result of target task 4. Figure 11. The result of target task 5. F. Discussions Table I shows the number of total steps and total cell costs when the agent first arrived at the goal in 5 target tasks. As this table shows, the proposed method could decrease the number of necessary steps taken to reach the goal. We see clearly from the results that the jumpstart was achieved. We also see that the total costs decreased. On the whole, the proposed method, which transfers the Q-values obtained in the source tasks to the target task, was more efficiently than the normal Q-learning that learns Q-value of the target task from scratch. However, as seen in Fig. 7-Fig. 11, there are some points where the proposed method did not work well than the normal Q- learning, especially in task 3 and 5. As the reason for this, it can be thought that the negative transfer has arisen at some cells in the target task and then more steps were necessary to get recovered from the problem. That also led the value of total cell cost to be much worse. Depending on task environment, there is some difference among the effect of transfer via sparse coding. We have verified the assumption that negative transfer happened in the target tasks. As clear examples, we focus on the 61 st episode that took the largest number of steps to reach the goal and the 2 nd episode that took the smallest number of steps in target task 5. Fig. 12 shows how many times the agent arrived at each cell of target task 5 in the 61 st episode. In the figure, the horizontal-axis is the x- coordinate of the target task5, the depth-axis is the y- coordinate and vertical-axis is the number of steps at each cell. We see that the top left corner of the x-y plane is the start point and the front of bottom right corner is the goal point. As this graph shows, there are many useless steps observed at around the bottom left. Likewise, it is observed that the number of steps increased around the bottom left in 2 nd episode, though the total number of steps is considerably lower than that of 61 st episode (see, Fig. 13). TABLE I. THE RESULT OF THE FIRST EPISODE Task1 Task2 Task3 Task4 Task5 steps proposed 208 196 480 166 506 Q-learinig 5728 7878 5910 5972 13238 costs proposed -125 -196 -476 -157 -208 Q-learning -10522 -12391 -9783 -9516 -15016 Figure 12. Number of steps in 61st Episode of target task 5. Figure 13. Number of steps in 2nd Episode of target task5 Among the steps each cell took, the number of steps of the cell at (8, 28) in 61 st episode took 516 steps, the largest steps in the episode. Therefore, we investigated which piece of knowledge was transferred to the cell at (8, Journal of Automation and Control Engineering Vol. 4, No. 4, August 2016 ©2016 Journal of Automation and Control Engineering 328 28). Fig. 14 shows knowledge transfer happened in the cell at (8, 28) of target task 5. The left 5-5 square cell of Fig. 14 indicates the cell cost, obtained when the agent steps into the cell at (8, 28), which is used as input information for sparse coding. In the case of the 5-5 square, we treat the cost of the bottom row as -100, because of the out of search range. Depending on the input information, sparse coding is executed. As the result of execution of sparse coding, we confirmed that 100% of the Q-value of the cell at (23, 24) in source task 3 was transferred to the target cell. However, even though compared the left target circumstance to the right source circumstance, there is less similar between the two environments. It can therefore be thought that the transfer was influenced by the cost at the bottom row of the target task. In addition, Fig. 15 shows a result of the case where the input information for sparse coding does not include the cost at the bottom row. Here, the figure shows the case where knowledge is transferred to the cell at (10,6) of the target task 5. As a result of this case, there were several bases employed as the knowledge to be transferred. Among the bases, the base of the cell at (23, 14) in the source task 1 had the largest transfer ratio, which was about 55.8%. In this example, we see that the transferred Q-value was successfully represented as linear combination of the Q-values of several bases. To consider the difference between these two transfers, because there were lack of the bases which correspond to the out of range, that led transfer accuracy to get worse. In this study, we think if the bases including the out of range cost, the influence by those bases might cause imbalance knowledge transfer, so, we did not take those bases in the dictionary. Thereby, we think the reason why knowledge transfer did not work well at some points as the dictionary was not well prepared. Figure 14. Bad transfer. Figure 15. Good transfer. VI. CONCLUSIONS REFERENCES [1] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, The MIT Press, 1998. [2] T. Takano, H. Takase, H. Kawanaka, and S. Tsuruoka, “A study on selection method of transfer knowledge in same transition model for reinforcement learning (in Japanese),” in Proc. 27th Fuzzy System Symposium, MB1-3, 2011. [3] F. Fernandez and M. Veloso, “Probabilistic policy reuse in a reinforcement learning agent,” AAMAS’06, May 8–12, 2006. [4] F. Fernández, J. García, and M. Veloso, “Probabilistic policy reuse for inter-task transfer learning,” Robotics and Autonomous Systems 58, pp. 866–871, 2010. [5] H. B. Ammar, K. Tuyls, M. E. Taylor, K. Driessens, and G. Weiss, “Reinforcement learning transfer via sparse coding,” in Proc. the 11th International Conference on Autonomous Agents and Multiagent Systems, 2012, pp. 4-8. [6] M. E. Taylor and P. Stone, “An introduction to inter-task transfer for reinforcement learning,” Association for the Advancement of Artificial Intelligence, 2011. [7] T. T. Nguyen, T. Silander, and T. Y. Leong, “Transferring expectations in model-based reinforcement learning,” NIPS, 2012. [8] T. T. Nguyen, Z. Li, T. Silander, and T. Y. Leong, “Online feature selection for model-based reinforcement learning,” ICML, 2013. [9] I. Ramirez, P. Sprechmann, and G. Sapiro, “Classification and clustering via dictionary learning with structured incoherence and shared features,” IEEE, 2010. [10] K. Yu, Y. Q. Lin, and J. Lafferty, “Learning image representations from the pixel level via hierarchical sparse coding,” in Proc. 2011 IEEE Conference on Computer Vision and Pattern Recognition, 2011, pp. 1713-1720. [11] H. B. Ammar, M. E. Taylor, K. Tuyls, and G. Weiss, “Reinforcement learning transfer using a sparse coded inter-task mapping,” EUMAS, Lecture Notes in Computer Science, Springer, vol. 7541, pp. 1-16, 2011. Journal of Automation and Control Engineering Vol. 4, No. 4, August 2016 ©2016 Journal of Automation and Control Engineering 329 In this study, we have proposed a novel technique to employ sparse coding in transfer learning for reinforcement learning. By applying the knowledge to the bases of sparse coding dictionary matrix, we succeeded to choose the correct knowledge that should be transferred to new environment target tasks. We experimented the proposed method on the 5 target tasks and confirmed that the method provided jumpstart and reduced the total cell costs. By the result of these experiments, our proposed method was able to contribute significant improvement on the effective online transfer learning for reinforcement learning. However, as mentioned in section VII-E, negative transfer happened on some cells of the target task. For future work, to solve this problem we would like to reconsider the number of bases and the quality of the dictionary, and then aim to improve the accuracy of transfer learning. In concrete, we will increase the number of source tasks to obtain a larger dictionary and obtain a large number of high diversity bases in the dictionary. On the other hand, we also have to take the processing time into account. It will become unrealistic if the number of bases becomes enormous. So, in practice we have to limit the maximum number of bases. In this point, to deal with this problem, we are going to take the methods employed by prior studies such as [9] in our study and aim to construct an ideal dictionary within the upper limit of the number of bases. [12] B. A. Olshausen and D. J. Field, “Emergence of simple-cell receptive field properties by learning a sparse code for natural images,” Nature, vol. 381, pp. 607-609, 1996. [13] A. Wilson, A. Fern, and P. Tadepalli, “Transfer learning in sequential decision problems: a hierarchical bayesian approach,” JMLR: Workshop and Conference Proceedings, vol. 27, pp. 217– 227, 2012. [14] C. J. C. H. Watkins, “Learning from delayed rewards,” PhD thesis, King’s College, Cambridge, UK, 1989． Midori Saito (Aichi, Mar. 16th, 1991), Master student at Advanced Sciences, Graduated School of Humanities and Sciences, Ochanomizu University. She graduated from Dept. of Information Sciences, Faculty of Sciences, Ochanomizu University in 2013. Ichiro Kobayashi (Tokyo, Aug. 2nd, 1965), 2011- Professor, Advanced Sciences, Graduated School of Humanities and Sciences, Ochanomizu University. 2003-2010 Associate Professor, Advanced Sciences, Graduated School of Humanities and Sciences, Ochanomizu University. 1996-2003 Associate Professor, Faculty of Economics, Hosei University. 1995 Assistant Professor, Faculty of Economics, Hosei University. Journal of Automation and Control Engineering Vol. 4, No. 4, August 2016 ©2016 Journal of Automation and Control Engineering 330

Các file đính kèm theo tài liệu này:

a_study_on_efficient_transfer_learning_for_reinforcement_lea.pdf