Optimistic Initialization and Greediness Lead to Polynomial-Time Learning in Factored MDPs


Description
Optimistic Initialization and Greediness Lead to Polynomial-Time Learning in Factored MDPs. István Szita and András Lőrincz. University of Alberta, Canada. Eötvös Loránd University, Hungary. Outline. Factored MDPs: motivation, definitions, planning in FMDPs. Optimism
Transcripts
Slide 1

Optimistic Initialization and Greediness Lead to Polynomial-Time Learning in Factored MDPs
István Szita & András Lőrincz
University of Alberta, Canada
Eötvös Loránd University, Hungary

Slide 2

Outline: Factored MDPs (motivation, definitions, planning in FMDPs); Optimism & FMDPs & model-based learning.
Szita & Lőrincz: Optimistic Initialization and Greediness Lead to Polynomial-Time Learning in Factored MDPs

Slide 3

Reinforcement learning: the agent makes decisions … in an unknown world, makes some observations (including rewards), and tries to maximize the collected reward.

Slide 4

What kind of observation? Structured observations; the structure is unclear ???

Slide 5

How to "solve an RL task"? A model is useful: it can reuse experience from previous trials and can learn offline. Observations are structured; the structure is unknown. Structured + model + RL = FMDP! (Or linear dynamical systems, neural networks, etc…)

Slide 6

Factored MDPs: ordinary MDPs in which everything is factored: states, rewards, transition probabilities, (value functions).

Slide 7

Factored state space: all functions depend on a few variables only.

Slide 8

Factored dynamics
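The formula on this slide did not survive transcription. As a minimal sketch of the standard factored transition model, assume binary state variables and that each next-state component depends only on a small parent set of current variables, so the joint probability is a product of local terms. The names `parents`, `cpt`, and `transition_prob`, the probability values, and the omission of the action are all illustrative assumptions, not taken from the talk.

```python
from itertools import product

# Hypothetical 3-variable binary FMDP fragment (one fixed action).
# parents[i] lists the current-state variables that component i depends on.
parents = {0: (0, 2), 1: (1,), 2: (2,)}

# cpt[i][parent_values] = probability that component i equals 1 in the next state.
# The numbers are made up for illustration.
cpt = {
    0: {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.7, (1, 1): 0.9},
    1: {(0,): 0.2, (1,): 0.8},
    2: {(0,): 0.3, (1,): 0.6},
}

def transition_prob(x, y):
    """P(y | x) as a product of per-component local probabilities."""
    p = 1.0
    for i, pa in parents.items():
        p_one = cpt[i][tuple(x[j] for j in pa)]
        p *= p_one if y[i] == 1 else 1.0 - p_one
    return p

# The local factors define a proper distribution over all 2**3 next states.
x = (1, 0, 1)
total = sum(transition_prob(x, y) for y in product((0, 1), repeat=3))
```

The model stores 4 + 2 + 2 local entries instead of an 8 x 8 flat transition table, which is the point of the factored representation.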

Slide 9

Factored rewards
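The slide's formula is missing from the transcript. A minimal sketch of the usual convention, assuming the reward is a sum of local reward functions that each read only a few state variables; `scopes`, `local_rewards`, and the numeric values are hypothetical.

```python
# scopes[j] lists the state variables that local reward j is allowed to read.
scopes = [(0,), (1, 2)]

# Hypothetical local reward functions; each receives only the values in its scope.
local_rewards = [
    lambda v: 1.0 if v[0] == 1 else 0.0,   # depends on x0 only
    lambda v: 0.5 * (v[0] + v[1]),         # depends on x1 and x2
]

def reward(x):
    """R(x) as a sum of local rewards, each evaluated on its own scope."""
    return sum(f(tuple(x[i] for i in scope))
               for f, scope in zip(local_rewards, scopes))

r = reward((1, 0, 1))
```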

Slide 10

(Factored value functions) V* is not factored in general; we will make an approximation error.

Slide 11

Solving a known FMDP: NP-hard; either exponential-time or non-optimal…
Exponential-time worst case: flattening the FMDP; approximate policy iteration [Koller & Parr, 2000; Boutilier, Dearden, Goldszmidt, 2000].
Non-optimal solution (approximating the value function in a factored form): approximate linear programming [Guestrin, Koller, Parr & Venkataraman, 2002]; ALP + policy iteration [Guestrin et al., 2002]; factored value iteration [Szita & Lőrincz, 2008].

Slide 12

Factored value iteration: H := matrix of basis functions; N(H^T) := row normalization of H^T. The iteration converges to a fixed point w^×, which can be computed quickly for FMDPs. Let V^× = H w^×. Then V^× has bounded error: […]
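The update rule itself was lost in transcription. The sketch below assumes one plausible form, w ← N(H^T) · max_a (r^a + γ P^a H w), where N scales each row of H^T so that its absolute values sum to 1. That form, the toy 4-state chain, and every function name are assumptions for illustration. The code also works on a flat MDP and does not exploit factored structure, so it shows the fixed-point iteration only, not the efficient computation the slide refers to.

```python
def row_normalize(M):
    """Scale each row so that its absolute values sum to 1."""
    return [[v / sum(abs(u) for u in row) for v in row] for row in M]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def projected_value_iteration(P, r, H, gamma, iters=500):
    """Iterate w <- G * max_a (r_a + gamma * P_a * H * w), with G = N(H^T)."""
    G = row_normalize([list(col) for col in zip(*H)])
    w = [0.0] * len(H[0])
    for _ in range(iters):
        V = matvec(H, w)  # current value estimate on all states
        backup = [
            max(r[a][s] + gamma * sum(p * v for p, v in zip(P[a][s], V))
                for a in range(len(P)))
            for s in range(len(H))
        ]
        w = matvec(G, backup)  # project the backed-up values onto the weights
    return w

# Toy chain: action 0 stays, action 1 moves right; reward 1 in the last state.
n = 4
def one_hot(j):
    return [1.0 if k == j else 0.0 for k in range(n)]

P = [[one_hot(s) for s in range(n)],
     [one_hot(min(s + 1, n - 1)) for s in range(n)]]
r = [[1.0 if s == n - 1 else 0.0 for s in range(n)] for _ in range(2)]
# Two indicator basis functions: states {0, 1} and states {2, 3}.
H = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]

w = projected_value_iteration(P, r, H, gamma=0.9)
```

With indicator basis functions both H and the normalized H^T are max-norm non-expansions, so the composed update is a contraction with factor γ and the loop settles on a unique fixed point.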

Slide 13

Learning in unknown FMDPs: unknown factor decompositions (structure), unknown rewards, unknown transitions (dynamics).

Slide 14

Learning in unknown FMDPs: unknown factor decompositions (structure), unknown rewards, unknown transitions (dynamics).

Slide 15

Outline: Factored MDPs (motivation, definitions, planning in FMDPs); Optimism & FMDPs & model-based learning.

Slide 16

Learning in an unknown FMDP, a.k.a. "Explore or exploit?" After trying a few action sequences… … try to discover better ones? … do the best thing according to current knowledge?

Slide 17

Be optimistic! (when facing uncertainty)

Slide 18

either you get experience…

Slide 19

or you get reward!

Slide 20

Outline: Factored MDPs (motivation, definitions, planning in FMDPs); Optimism & FMDPs & model-based learning.

Slide 21

Factored initial model: component x1, parents: (x1, x3); component x2, parent: (x2); …

Slide 22

Factored optimistic initial model: "Garden of Eden" state with +$10000 reward (or something very high); component x1, parents: (x1, x3); component x2, parent: (x2); …

Slide 23

Later on… According to the initial model, all states have high value. In frequently visited states, the model becomes more realistic → reward expectations get lower → the agent explores other areas. Component x1, parents: (x1, x3); component x2, parent: (x2); …

Slide 24

Factored optimistic initial model
initialize the model (optimistically)
for each time step t: solve the approximate model using factored value iteration, take the greedy action, observe the next state, update the model
The number of non-near-optimal steps (w.r.t. V^×) is polynomial with probability ≈ 1.
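The loop above can be sketched on a small flat (non-factored) MDP. This is an illustrative analogue under stated assumptions, not the authors' implementation: plain value iteration is swapped in for factored value iteration, optimism is modeled as one imagined transition from every state-action pair to a "Garden of Eden" state of fixed value `v_eden`, and the chain environment `chain_step` and all names and constants are hypothetical.

```python
import random

def plan(counts, rew_sum, n_states, n_actions, gamma, v_eden, sweeps=100):
    """Value iteration on the empirical model; index n_states is the Eden state."""
    V = [0.0] * n_states + [v_eden]
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(sweeps):
        for s in range(n_states):
            for a in range(n_actions):
                n = sum(counts[s][a])  # includes the one imagined Eden visit
                exp_next = sum(c * V[t] for t, c in enumerate(counts[s][a])) / n
                Q[s][a] = rew_sum[s][a] / n + gamma * exp_next
            V[s] = max(Q[s])
    return Q

def optimistic_greedy(step, n_states, n_actions, gamma=0.9, v_eden=100.0,
                      T=300, seed=0):
    rng = random.Random(seed)
    # Optimistic initialization: one imagined transition from each (s, a) to Eden.
    counts = [[[0] * n_states + [1] for _ in range(n_actions)]
              for _ in range(n_states)]
    rew_sum = [[0.0] * n_actions for _ in range(n_states)]
    s, total = 0, 0.0
    for _ in range(T):
        Q = plan(counts, rew_sum, n_states, n_actions, gamma, v_eden)
        a = max(range(n_actions), key=lambda b: Q[s][b])  # always greedy
        s2, rew = step(s, a, rng)
        counts[s][a][s2] += 1   # update the empirical model
        rew_sum[s][a] += rew
        total += rew
        s = s2
    return counts, total

def chain_step(s, a, rng):
    """Hypothetical 4-state chain: action 1 moves right, action 0 moves left."""
    s2 = min(s + 1, 3) if a == 1 else max(s - 1, 0)
    return s2, (1.0 if s2 == 3 else 0.0)

counts, total = optimistic_greedy(chain_step, n_states=4, n_actions=2)
```

No exploration bonus or random action appears anywhere in the loop. Untried actions look attractive only because the imagined Eden transition still dominates their estimates, and that effect fades as real visit counts grow.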

Slide 25

Elements of proof: some standard stuff. If […], then […]. If […] for all i, then […]. Let m_i be the number of visits to […]. If m_i is large, then […] for all y_i. More precisely: […] with prob. […] (Hoeffding/Azuma inequality).
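The slide's own inequalities were lost, so the following is only a generic illustration of the kind of concentration bound named here, not the talk's exact statement. For m independent samples bounded in [0, 1], Hoeffding's inequality gives P(|empirical mean − expectation| ≥ ε) ≤ 2·exp(−2mε²). Solving for m tells how many visits make an estimate ε-accurate with probability at least 1 − δ. The function name `hoeffding_samples` is hypothetical.

```python
import math

def hoeffding_samples(eps, delta):
    """Smallest m with 2*exp(-2*m*eps**2) <= delta, for samples in [0, 1]."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

# Visits needed for accuracy 0.1 with failure probability at most 0.05.
m = hoeffding_samples(0.1, 0.05)
```

The required count grows as 1/ε² but only logarithmically in 1/δ, which is what keeps the number of visits per local factor polynomial.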

Slide 26

Elements of proof: main lemma. For any […], approximate Bellman updates will be more optimistic than the real ones: […]. If V_E is large enough, the bonus term dominates for a long time. If all elements of H are nonnegative, projection preserves optimism. Annotations: lower bound by Azuma's inequality; bonus promised by the Garden of Eden state.

Slide 27

Elements of proof: wrap up. For a long time, V_t is optimistic enough to boost exploration. At most polynomially many exploration steps can be made. Except for those, the agent must be near-V^×-optimal.

Slide 28

Previous approaches: extensions of E^3, Rmax, MBIE to FMDPs.
Using the current model, make a smart plan (explore or exploit). Explore: make the model more accurate. Exploit: collect near-optimal reward.
Unspecified planners; requirement: the output plan is close to optimal …e.g., solve the flat MDP.
Polynomial sample complexity, exponential amounts of computation!

Slide 29

Unknown rewards? "To simplify the presentation, we assume the reward function is known and does not need to be learned. All results can be extended to the case of an unknown reward function." False. Problem: we cannot observe the reward components, only their sum → UAI poster [Walsh, Szita, Diuk, Littman, 2009]

Slide 30

Unknown structure? It can be learnt in polynomial time: SLF-Rmax [Strehl, Diuk, Littman, 2007]; Met-Rmax [Diuk, Li, Littman, 2009].

Slide 31

Take-home message: if your model starts out optimistically enough, you get efficient exploration for free! (even if your planner is non-optimal (as long as it is monotonic))

Slide 32

Thank you for your attention!

Slide 33

Optimistic initial model for FMDPs: add a value to each state variable; add reward factors for each state variable; initialize the transition model.

Slide 34

Outline

Slide 35

Outline
