Maximum likelihood parameters deviate from posterior distributions



I have a likelihood function $\mathcal{L}(d \mid \theta)$ for the probability of my data $d$ given some model parameters $\theta \in \mathbf{R}^N$, which I would like to estimate. Assuming flat priors on the parameters, the likelihood is proportional to the posterior probability. I use an MCMC method to sample this probability.

Looking at the resulting converged chain, I find that the maximum likelihood parameters are not consistent with the posterior distributions. For example, the marginalized posterior probability distribution for one of the parameters might be $\theta_0 \sim N(\mu=0, \sigma^2=1)$, while the value of $\theta_0$ at the maximum likelihood point is $\theta_0^{ML} \approx 4$, essentially being almost the maximum value of $\theta_0$ traversed by the MCMC sampler.

This is an illustrative example, not my actual results. The real distributions are far more complicated, but some of the ML parameters have similarly unlikely p-values in their respective posterior distributions. Note that some of my parameters are bounded (e.g. $0 \leq \theta_1 \leq 1$); within the bounds, the priors are always uniform.

My questions are:

1. Is such a deviation a problem per se? Obviously I do not expect the ML parameters to coincide exactly with the maxima of each of their marginalized posterior distributions, but intuitively it feels like they should also not be found deep in the tails. Does this deviation automatically invalidate my results?

2. Whether this is necessarily problematic or not, could it be symptomatic of specific pathologies at some stage of the data analysis? For example, is it possible to make any general statement about whether such a deviation could be induced by an improperly converged chain, an incorrect model, or excessively tight bounds on the parameters?

Tags: bayesian, maximum-likelihood, optimization, inference, mcmc

asked Apr 5 at 11:25 by mgc70

2 Answers

Some possible generic explanations for this perceived discrepancy, assuming of course there is no issue with the code, the likelihood definition, the MCMC implementation, the number of MCMC iterations, or the convergence of the likelihood maximiser (thanks, Jacob Socolar):

1. In large dimensions $N$, the posterior does not concentrate on the maximum but at a distance of order $\sqrt{N}$ from the mode, meaning that the largest values of the likelihood function encountered by an MCMC sampler are often well below the value of the likelihood at its maximum. For instance, if the posterior is $\theta \mid \mathbf{x} \sim \mathcal{N}_N(0, I_N)$, the squared distance $\|\theta\|^2$ from the mode, $0$, is with high probability at least $N - 2\sqrt{2N}$.

2. While the MAP and the MLE do coincide under a flat prior, the marginal densities of the different parameters of the model may have (marginal) modes that are far away from the corresponding MLEs (i.e., MAPs).

3. The MAP is the position in the parameter space where the posterior density is highest, but this does not convey any indication of posterior weight or volume for neighbourhoods of the MAP. A very thin spike carries essentially no posterior weight. This is also the reason why MCMC exploration of a posterior may face difficulties in identifying the posterior mode.

4. The fact that most parameters are bounded may lead to some components of the MAP = MLE occurring at a boundary.

See, e.g., Druilhet and Marin (2007) for arguments on the un-Bayesian nature of MAP estimators. One is the dependence of these estimators on the dominating measure, another being the lack of invariance under reparameterisation (unlike MLEs).

As an example of point 1 above, here is a short R code

    library(mvtnorm)   # provides rmvnorm() and dmvnorm()
    N=100              # dimension
    T=1e4              # number of MCMC iterations
    lik=dis=rep(0,T)
    mu=rmvnorm(1,mean=rep(0,N))
    xobs=rmvnorm(1,mean=rep(0,N))
    lik[1]=dmvnorm(xobs,mu,log=TRUE)
    dis[1]=(xobs-mu)%*%t(xobs-mu)
    for (t in 2:T){
      # random-walk proposal with variance 1/N per coordinate
      prop=rmvnorm(1,mean=mu,sigma=diag(1/N,N))
      proike=dmvnorm(xobs,prop,log=TRUE)
      # Metropolis-Hastings acceptance step
      if (log(runif(1))<proike-lik[t-1]){
        mu=prop;lik[t]=proike
      }else{
        lik[t]=lik[t-1]}
      dis[t]=(xobs-mu)%*%t(xobs-mu)}

which mimics a random-walk Metropolis-Hastings sequence in dimension N=100. The value of the log-likelihood at the MAP is -91.89, but the visited likelihoods never come close:

    > range(lik)
    [1] -183.9515 -126.6924

which is explained by the fact that the sequence never comes near the observation:

    > range(dis)
    [1] 69.59714 184.11525
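
As a rough check of these numbers (a sketch that assumes only the Gaussian model used in the R code, not anything about the OP's actual model): the chain targets $\mu \mid x_{\text{obs}} \sim \mathcal{N}_N(x_{\text{obs}}, I_N)$, so `dis` is the squared distance $\|x_{\text{obs}}-\mu\|^2$, which at stationarity follows a $\chi^2_N$ distribution with mean $N$ and standard deviation $\sqrt{2N}$. The log-likelihood is a deterministic function of that squared distance:

$$
\log L(\mu) = -\frac{N}{2}\log(2\pi) - \frac{1}{2}\|x_{\text{obs}}-\mu\|^2 .
$$

For $N=100$ this gives

$$
\begin{aligned}
\text{at the mode } (\|x_{\text{obs}}-\mu\|^2 = 0): &\quad -50\log(2\pi) \approx -91.89,\\
\text{typical draw } (\|x_{\text{obs}}-\mu\|^2 \approx 100): &\quad -91.89 - 50 = -141.89,\\
\text{closest visited point } (\|x_{\text{obs}}-\mu\|^2 = 69.597): &\quad -91.89 - 34.80 = -126.69,\\
\text{farthest visited point } (\|x_{\text{obs}}-\mu\|^2 = 184.115): &\quad -91.89 - 92.06 = -183.95,
\end{aligned}
$$

which reproduces both reported ranges. Getting within one log-likelihood unit of the maximum would require a squared distance below $2$, and a $\chi^2_{100}$ variable puts essentially no mass there.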








I'd just add that in addition to worrying about the code or likelihood definition or MCMC implementation, the OP might also worry about whether the software used to obtain the ML estimate got trapped in a local optimum. stats.stackexchange.com/questions/384528/…
– Jacob Socolar, Apr 5 at 15:30
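
One common way to guard against the local-optimum issue raised in this comment is to restart the optimiser from several starting values and keep the best fit. The snippet below is only a minimal sketch: `neg_log_lik` is a made-up one-dimensional objective with two local minima, standing in for a real negative log-likelihood, and the grid of starting values is an arbitrary choice.

```r
# Hypothetical negative log-likelihood with two local minima (illustration only):
# the global minimum lies near x = -1, a shallower local minimum near x = +1.
neg_log_lik <- function(x) (x^2 - 1)^2 + 0.3 * x

# Run a local optimiser from several starting values and keep the best fit.
starts <- seq(-2, 2, length.out = 9)
fits <- lapply(starts, function(s) optim(par = s, fn = neg_log_lik, method = "BFGS"))
values <- vapply(fits, function(f) f$value, numeric(1))
best <- fits[[which.min(values)]]

cat("best parameter:", best$par, " objective:", best$value, "\n")
# Listing the distinct end points shows whether different starts disagree.
cat("end points found:", unique(round(vapply(fits, function(f) f$par, numeric(1)), 2)), "\n")
```

If the restarts end at clearly different points with different objective values, a single optimiser run cannot be trusted to have found the MLE, and the comparison with the MCMC output should use the best of the restarts.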











With flat priors, the posterior is identical to the likelihood up to a constant. Thus:

1. The MLE (estimated with an optimizer) should be identical to the MAP (maximum a posteriori value = multivariate mode of the posterior, estimated with MCMC). If you don't get the same value, you have a problem with your sampler or optimiser.

2. For complex models, it is very common that the marginal modes are different from the MAP. This happens, for example, if correlations between parameters are nonlinear. This is perfectly fine, but marginal modes should therefore not be interpreted as the points of highest posterior density, and should not be compared to the MLE.

3. In your specific case, however, I suspect that the posterior runs against the prior boundary. In this case, the posterior will be strongly asymmetric, and it doesn't make sense to interpret it in terms of mean and sd. There is no problem in principle with this situation, but in practice it often hints at model misspecification or poorly chosen priors.
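
Point 2 can be illustrated with a toy case. The sketch below uses a made-up two-parameter "funnel" density as a stand-in for a posterior (the density, the constant `a`, and all variable names are assumptions for illustration, not the OP's model). The marginal of `theta0` is exactly $N(0,1)$, yet the joint mode sits at `theta0 = -a`, deep in the tail of that marginal.

```r
# Hypothetical two-parameter "funnel" posterior, chosen only for illustration:
#   theta0 ~ N(0, 1)   and   theta1 | theta0 ~ N(0, sd = exp(a * theta0))
set.seed(1)
a <- 3

log_post <- function(theta0, theta1) {
  dnorm(theta0, mean = 0, sd = 1, log = TRUE) +
    dnorm(theta1, mean = 0, sd = exp(a * theta0), log = TRUE)
}

# Joint mode (MAP, equal to the MLE under a flat prior). For any theta0 the
# density is largest at theta1 = 0, so it suffices to maximise along that line.
joint_mode_theta0 <- optimize(function(t0) log_post(t0, 0),
                              interval = c(-10, 10), maximum = TRUE)$maximum

# Exact draws from the same joint distribution (a stand-in for a converged chain)
n <- 1e5
theta0 <- rnorm(n)
theta1 <- rnorm(n, mean = 0, sd = exp(a * theta0))

marginal_mean_theta0 <- mean(theta0)               # centre of the marginal
tail_fraction <- mean(theta0 < joint_mode_theta0)  # share of draws beyond the joint mode

cat("theta0 at the joint mode:", joint_mode_theta0, "\n")
cat("marginal mean of theta0:", marginal_mean_theta0, "\n")
cat("fraction of draws below the joint mode:", tail_fraction, "\n")
```

Here the sampler is behaving correctly and the joint mode is still several marginal standard deviations from the centre of the `theta0` marginal, because the region around the mode is very narrow in `theta1` and so carries little posterior mass.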







answered Apr 5 at 11:54 (edited Apr 8 at 11:42) by Florian Hartig

                    $begingroup$

                    prop=rmvnorm(1,mean=mu,sigma=diag(1/N,N))
                    proike=dmvnorm(xobs,prop,log=TRUE)
                    if (log(runif(1))<proike-lik[t-1])
                    mu=prop;lik[t]=proike
                    elselik[t]=lik[t-1]
                    dis[t]=(xobs-mu)%*%t(xobs-mu)


                    which mimics a random-walk Metropolis-Hastings sequence in dimension N=100. The value of the log-likelihood at the MAP is -91.89, but the visited likelihoods never come close:



                    > range(lik)
                    [1] -183.9515 -126.6924


                    which is explained by the fact that the sequence never comes near the observation:



                    > range(dis)
                    [1] 69.59714 184.11525





                    share|cite|improve this answer











                    $endgroup$











                    edited Apr 8 at 11:40

























                    answered Apr 5 at 11:57









                    Xi'an























