Together these data suggest that hallmarks of both strategies are seen significantly at the population level and within many individuals, but that there may be between-subject variability in their deployment. Motivated by these results, we considered the fit of full model-based and model-free [SARSA(λ) TD; Rummery and Niranjan, 1994] RL algorithms to the choice sequences. The former evaluates actions by prospective simulation in a learned model; the latter uses a generalized principle of reinforcement. The generalization, controlled by the reinforcement eligibility
parameter λ, is that the estimated value of the second-stage state should act as the same sort of model-free reinforcer for the first-stage choice as the final reward actually received after the second-stage choice. The parameter λ governs the relative importance of these two reinforcers, with λ = 1 being
the special case of Figure 2A in which only the final reward is important, and λ = 0 being the purest case of the TD algorithm in which only the second-stage value plays a role. We also considered a hybrid theory (Gläscher et al., 2010) in which subjects could run both algorithms in parallel and make choices according to the weighted combination of the action values that they produce (see Experimental Procedures). We took the relative weight of the two algorithms’ values in determining the choices to be a free parameter, which we allowed to vary across subjects but assumed to be constant throughout the experiment. Thus, this algorithm contains both the model-based and TD algorithms as special cases, where one or the other gets all the weight. We first verified that the model fit significantly better than
chance; it did so, at p < 0.05 for all 17 subjects (likelihood ratio tests). We estimated the theory's free parameters individually for each subject by maximum likelihood (Table 1). Such an analysis treats each subject as occupying a point on a continuum trading off the two strategies; tests of the parameter estimates across subjects seek effects that are generalizable to other members of the population (analogous to the random effects level in fMRI; Holmes and Friston, 1998). Due to non-Gaussian statistics (because the parameters are expected to lie in the unit range), we analyzed the estimated parameters’ medians using nonparametric tests. Across subjects, the median weighting for model-free RL values was 61% (with model-based RL at 39%), which was significantly different from both 0% and 100% (sign tests, p < 0.005), again suggesting that both strategies were mixed in the population.
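
The hybrid valuation described above can be illustrated with a minimal Python sketch. It is not the fitted model from the Experimental Procedures, only an illustration under stated assumptions: a single learning rate `alpha` shared by both stages, a softmax choice rule with inverse temperature `beta`, and a fixed transition table are all hypothetical simplifications, and every function and variable name is invented here. The weight `w` is the share given to model-based values, so `w = 1` recovers the pure model-based learner and `w = 0` the pure SARSA(λ) TD learner.

```python
import math


def sarsa_lambda_update(q1, q2, a1, s2, a2, reward, alpha, lam):
    """Apply one trial of a SARSA(lambda)-style update, in place.

    q1: list of first-stage action values.
    q2: dict mapping second-stage state -> list of action values.
    alpha (single learning rate) is an assumed simplification.
    """
    delta1 = q2[s2][a2] - q1[a1]  # second-stage value acting as reinforcer
    delta2 = reward - q2[s2][a2]  # final reward acting as reinforcer
    # lam scales how much the final reward reaches the first-stage choice:
    # lam = 1 reduces to alpha * (reward - q1[a1]); lam = 0 uses delta1 only.
    q1[a1] += alpha * delta1 + alpha * lam * delta2
    q2[s2][a2] += alpha * delta2


def model_based_values(q2, transition):
    """Evaluate first-stage actions by one-step lookahead in the model.

    transition[a] maps second-stage state -> P(state | first-stage action a).
    """
    return [sum(p * max(q2[s]) for s, p in row.items()) for row in transition]


def hybrid_values(q_mb, q_td, w):
    """Weighted combination; w is the model-based weight in [0, 1]."""
    return [w * mb + (1.0 - w) * td for mb, td in zip(q_mb, q_td)]


def softmax(values, beta):
    """Choice probabilities; the softmax rule itself is an assumption."""
    top = max(values)  # subtract the max for numerical stability
    exps = [math.exp(beta * (v - top)) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Illustrative values only (not taken from the study).
transition = [{0: 0.7, 1: 0.3}, {0: 0.3, 1: 0.7}]
q1 = [0.0, 0.0]
q2 = {0: [0.5, 0.0], 1: [0.0, 0.0]}

sarsa_lambda_update(q1, q2, a1=0, s2=0, a2=0, reward=1.0, alpha=0.5, lam=1.0)
q_net = hybrid_values(model_based_values(q2, transition), q1, w=0.39)
choice_probs = softmax(q_net, beta=3.0)
```

Fitting such a model per subject by maximum likelihood would amount to summing the log of the chosen action's `softmax` probability over trials and maximizing that sum over `w`, `alpha`, `lam`, and `beta`; the optimizer and any parameter bounds are left out of this sketch.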