On the choice of LSTM input/output dimension for a spatio-temporal problem













$begingroup$


I am using LSTM neural networks from (R)Keras for a spatio-temporal interpolation problem. I manage to get the network to output predictions, but the results are not outstanding (very little improvement in validation loss). I am wondering about the shapes of the training data and labels.



Say I have 50 dates of measurements of the variable of interest $y$, accompanied by about 100 covariates $x$ (spatial coordinates, temperatures...). Each date has 24 measurements of $y$, so nsamples = 50*24 = 1200. If I set the LSTM timestep hyperparameter to, e.g., 3 and use a moving window with step 1, I therefore have an input array $X$ of shape (1200, 3, 100).



On the other hand, should the labels array $Y$ be of dimension (1200, 3) or (1200, 1)? More precisely, which of the following best describes the problem:
$$
(X_{n,t-2} ; X_{n,t-1} ; X_{n,t}) \rightarrow (Y_{n,t-2} ; Y_{n,t-1} ; Y_{n,t})
$$

$$
(X_{n,t-2} ; X_{n,t-1} ; X_{n,t}) \rightarrow (Y_{n,t} ; Y_{n,t} ; Y_{n,t})
$$

$$
(X_{n,t-2} ; X_{n,t-1} ; X_{n,t}) \rightarrow Y_{n,t}
$$

$$
(X_{n,t-2} ; X_{n,t-1} ; X_{n,t}) \rightarrow Y_{n,t+1}
$$



Or are they all plausible ways of addressing slightly different problems? As I said, I'm trying to spatially interpolate $Y$ for the 50 dates of measurements, as well as to predict $Y$ for the year(s) to come. So I expect one is more relevant than the others, but I have no clue which one.

I hope this is understandable, as I am clearly missing some technical vocabulary here.

















$endgroup$























      neural-network keras r lstm














      edited Mar 26 at 12:39 by Esmailian

      asked Mar 25 at 17:20 by Yo B.




















          1 Answer












          $begingroup$

          On time-series models



          All the models you mention are valid; which one is practical depends on the problem (the index $n$ is not needed to describe them). The second one, however, repeats the same target three times, so its output is redundant and wastes computation. Even
          $$
          (X_t ; X_{t+1} ; X_{t+2}) \rightarrow (Y_{t-1})
          $$

          is valid if you are fitting on an archive and want to estimate a year from the covariates of the following three years.



          But only the last model
          $$
          (X_{t-2} ; X_{t-1} ; X_t) \rightarrow (Y_{t+1})
          $$

          is a forecasting model. So in general, if you want to forecast $k$ years ahead of the current time $t$, you should use:
          $$
          (X_{t-2} ; X_{t-1} ; X_t) \rightarrow (Y_{t+k})
          $$

          or

          $$
          (X_{t-2} ; X_{t-1} ; X_t) \rightarrow (Y_{t+1}, \dots, Y_{t+k})
          $$

          An even better model, which takes advantage of the known past values of $Y$, would be:

          $$
          (X_{t-2}|Y_{t-2} ; X_{t-1}|Y_{t-1} ; X_t|Y_t) \rightarrow (Y_{t+k})
          $$

          where $|$ denotes vector concatenation, producing a (100 + 1)-dimensional vector for each known year.
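As a rough illustration of how these formulations differ only in the way the arrays are arranged, here is a minimal sketch in plain Python. It assumes a single series of 50 time steps with 100 covariates each; the helper name `make_windows` and the toy values are hypothetical and not part of the original post.

```python
def make_windows(X, Y, timesteps=3, k=1, include_past_y=False):
    """Build (window, target) pairs for (X_{t-2}; X_{t-1}; X_t) -> Y_{t+k}.

    X: list of T feature vectors (lists of floats); Y: list of T floats.
    k=0 gives the same-step target Y_t; k>=1 gives a forecasting target.
    include_past_y appends Y_s to every X_s (the X_s|Y_s concatenation).
    """
    inputs, targets = [], []
    # t is the index of the last time step inside the window.
    for t in range(timesteps - 1, len(X) - k):
        window = []
        for s in range(t - timesteps + 1, t + 1):
            step = list(X[s])
            if include_past_y:
                step.append(Y[s])
            window.append(step)
        inputs.append(window)
        targets.append(Y[t + k])
    return inputs, targets


# Toy series (arbitrary values): 50 time steps, 100 covariates per step.
T, n_features = 50, 100
X = [[float(t)] * n_features for t in range(T)]
Y = [float(t) for t in range(T)]

inputs, targets = make_windows(X, Y, timesteps=3, k=1, include_past_y=True)
print(len(inputs), len(inputs[0]), len(inputs[0][0]))  # 47 3 101
```

Note that the number of windows is smaller than the number of time steps (47 here, not 50), because the first windows lack history and the last ones lack a target. Combining `include_past_y=True` with `k=0` would put the target inside the input, so that combination is best avoided.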



          As a personal opinion, for the time-series prediction task, 24 data points per year is very few compared to the dimension of $X$, which is 100. Treating all 1200 samples as a plain $X \rightarrow Y$ regression (ignoring time) is more practical; if you can select about 10 of the 100 covariates, even better.
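To give a rough sense of scale, the sketch below counts the trainable weights of a single standard LSTM layer (four gates, each with an input kernel, a recurrent kernel, and a bias) followed by a dense output. The choice of 10 units is a hypothetical example, not a value from the question.

```python
def lstm_param_count(input_dim, units):
    """Weights of one standard LSTM layer: 4 gates, each with an input
    kernel (input_dim x units), a recurrent kernel (units x units)
    and a bias vector (units)."""
    return 4 * (units * (input_dim + units) + units)


def dense_param_count(input_dim, units):
    """Weights of a fully connected layer with bias."""
    return input_dim * units + units


n_units = 10  # hypothetical hidden size

# 100 covariates per time step, then a dense layer down to one output.
with_covariates = lstm_param_count(100, n_units) + dense_param_count(n_units, 1)
# Past values of Y only (one feature per time step).
y_only = lstm_param_count(1, n_units) + dense_param_count(n_units, 1)

print(with_covariates, y_only)  # 4451 491
```

With roughly 1200 samples, the first configuration has several times more weights than training examples, while the second stays well below the sample count.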



          Because of the small data set, I would suggest:

          1. $(Y_{t-m} ; \dots ; Y_{t-1} ; Y_t) \rightarrow (Y_{t+k})$ for time-series prediction, and

          2. $X \rightarrow Y$ regression for estimating the relation between $X$ and $Y$.
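The first suggestion can be sketched as a small lag-window builder that uses only past values of $Y$. The function name `lag_windows` and the toy series are hypothetical.

```python
def lag_windows(y, m=2, k=1):
    """Pairs ((Y_{t-m}, ..., Y_t), Y_{t+k}) for every t with full history
    and an available target."""
    pairs = []
    for t in range(m, len(y) - k):
        pairs.append((y[t - m:t + 1], y[t + k]))
    return pairs


y = [10, 11, 13, 16, 20, 25]  # toy series, arbitrary values
pairs = lag_windows(y, m=2, k=1)
print(pairs[0])    # ([10, 11, 13], 16)
print(len(pairs))  # 3
```

Each window has a single feature per time step, so the input to a recurrent layer would have shape (samples, m + 1, 1).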


          Relation to LSTM and RNN



          LSTMs and RNNs used to model time series are stateful by construction. That is, when input $X_{t-2}$ is fed to an LSTM, it keeps an internal (hidden) state that is combined with the next input $X_{t-1}$, and so on. Regarding the input/output dimensions, an RNN animation from a Medium post by Raimi Karim shows an arbitrary step among the 3 steps of feeding $(X_{t-2} ; X_{t-1} ; X_t)$ to the network.

          [Animation not reproduced: one step of an RNN processing a 3-step input sequence.]

          As it illustrates, the dimension and number of inputs are independent of the output dimension. We can feed 5 inputs $X_{t-4}$ to $X_t$, each 100-dimensional (100d), and receive a 1d output by setting the dimension of the hidden state to 1d; or set it to 10d and use an extra dense layer at the end to convert 10d to 1d; or receive a 50d output, or a 150d (three 50d) output, etc.
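The independence between input and output sizes can be checked with a toy recurrent cell. The sketch below uses a vanilla RNN update (a single tanh, not the gated LSTM update) with random weights, purely to show the shapes; all names and values are hypothetical.

```python
import math
import random


def rnn_last_state(inputs, hidden_dim, seed=0):
    """Run a vanilla RNN over `inputs` and return the last hidden state.

    Weights are random: the goal is only to show the shapes involved.
    """
    rng = random.Random(seed)
    input_dim = len(inputs[0])
    w_x = [[rng.uniform(-0.1, 0.1) for _ in range(input_dim)]
           for _ in range(hidden_dim)]
    w_h = [[rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)]
           for _ in range(hidden_dim)]
    h = [0.0] * hidden_dim
    for x in inputs:
        # h_new[j] = tanh(w_x[j] . x + w_h[j] . h_old)
        h = [math.tanh(sum(a * b for a, b in zip(w_x[j], x))
                       + sum(a * b for a, b in zip(w_h[j], h)))
             for j in range(hidden_dim)]
    return h


def dense(h, out_dim, seed=1):
    """Linear layer (no bias) mapping the hidden state to out_dim values."""
    rng = random.Random(seed)
    w = [[rng.uniform(-0.1, 0.1) for _ in h] for _ in range(out_dim)]
    return [sum(a * b for a, b in zip(row, h)) for row in w]


seq3 = [[0.5] * 100 for _ in range(3)]  # 3 time steps, 100d each
seq5 = [[0.5] * 100 for _ in range(5)]  # 5 time steps, 100d each

print(len(rnn_last_state(seq3, hidden_dim=1)))    # 1
print(len(rnn_last_state(seq5, hidden_dim=10)))   # 10
print(len(dense(rnn_last_state(seq5, hidden_dim=10), out_dim=1)))  # 1
```

The output size is set by `hidden_dim` (or by the final dense layer), whatever the number of time steps or the dimension of each input.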



          The word "stateful" in Keras (source)

          LSTMs and RNNs are stateful by definition; this [badly named] argument in Keras means:

          If stateful=True, the last state for each sample at index i in a batch will be
          used as initial state for the sample of index i in the following
          batch. François Chollet

          For example, if each batch has 24 samples indexed from 0 to 23 (each sample could have the form $(X_{t-2}, X_{t-1}, X_t, Y_{t+1})$), then the last hidden state $h$ of the 8th sample will be used as the initial hidden state of the 8th sample in the next batch. Unless there is a temporal order between the batches and their samples, this should be set to False.
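The carry-over described in the quote can be mimicked with a toy recurrence. This is not how Keras is implemented; it is a hypothetical simulation in which the "state" is a running sum, so the effect of the flag is easy to read off.

```python
def run_batches(batches, stateful):
    """Toy recurrence h <- h + x over each sample's time steps.

    Returns the final state of every sample in every batch. With
    stateful=True, sample i of a batch starts from the final state of
    sample i of the previous batch; otherwise every sample starts at 0.
    """
    results = []
    states = None
    for batch in batches:
        if states is None or not stateful:
            states = [0.0] * len(batch)
        new_states = []
        for i, sample in enumerate(batch):
            h = states[i]
            for x in sample:
                h += x
            new_states.append(h)
        states = new_states
        results.append(new_states)
    return results


batches = [
    [[1, 2, 3], [10, 20, 30]],   # batch 0: two samples, three steps each
    [[4, 5, 6], [40, 50, 60]],   # batch 1
]

print(run_batches(batches, stateful=False))  # [[6.0, 60.0], [15.0, 150.0]]
print(run_batches(batches, stateful=True))   # [[6.0, 60.0], [21.0, 210.0]]
```

In the stateful run, the second batch continues from where the first one stopped, which only makes sense when sample i of batch 1 really is the temporal continuation of sample i of batch 0.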

















          $endgroup$








          • 1




            $begingroup$
            Thank you, this is clear! So the choice among those problems only depends on how I arrange the inputs and outputs, right? The network does not need any additional argument to be trained for one problem or the other. If you don't mind answering this too: do any of these problems need a stateful LSTM, or should any preferably be addressed with one? I am very confused by this argument, and nothing I have read so far makes it clear whether I should use it or not (and if so, how I should modify the formulas above).
            $endgroup$
            – Yo B.
            Mar 26 at 9:23











          • $begingroup$
            Thanks for the edit and the reference! Just to be clear, currently I am not using stateful=TRUE. Does this mean that the network is not building links between $X_{t-1}$ and $X_t$? If so, what is the point of feeding the network this time-batch structure when stateful=FALSE? I get that it's a different question, so you don't have to answer. Thanks again!
            $endgroup$
            – Yo B.
            Mar 26 at 10:53







          • 1




            $begingroup$
            Well thanks again for this last edit, you really helped me there! All the best
            $endgroup$
            – Yo B.
            Mar 26 at 11:36










          • $begingroup$
            If you don't mind answering this last question: in the predictive case $(X_{t-2}|Y_{t-2} ; X_{t-1}|Y_{t-1} ; X_t|Y_t) \rightarrow Y_{t+1}$, how should I input $X_{t+1}$? I'm asking because lots of my covariates are static and can therefore be useful in predicting the process in the years to come.
            $endgroup$
            – Yo B.
            Mar 26 at 15:45







          • 1




            $begingroup$
            @YoB. if $t+1$ denotes "next year", we have no access to $X_{t+1}$ in real-time cases. Btw, since each time step should be 101-dimensional, you can use $(\dots ; X_t|Y_t ; X_{t+1}|0) \rightarrow (Y_{t+1})$ with a dummy 0, but I think it may hurt performance; try it. You can also use the previous year's value: $(\dots ; X_t|Y_t ; X_{t+1}|Y_t) \rightarrow (Y_{t+1})$.
            $endgroup$
            – Esmailian
            Mar 26 at 15:55
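The two placeholder options from the last comment can be sketched as follows, assuming $X_{t+1}$ is actually available (for example because the covariates are static). The helper name `window_with_future_x` and the toy values are hypothetical.

```python
def window_with_future_x(X, Y, t, timesteps=3, fill="zero"):
    """Window whose first `timesteps` steps carry X_s|Y_s and whose extra
    last step carries X_{t+1}|placeholder; the target is Y_{t+1}.

    fill="zero" uses a dummy 0; fill="previous" repeats Y_t.
    """
    window = [list(X[s]) + [Y[s]] for s in range(t - timesteps + 1, t + 1)]
    placeholder = 0.0 if fill == "zero" else Y[t]
    window.append(list(X[t + 1]) + [placeholder])
    return window, Y[t + 1]


# Toy data (arbitrary values): 6 time steps, 2 covariates per step.
X = [[float(t), 1.0] for t in range(6)]
Y = [10.0, 11.0, 13.0, 16.0, 20.0, 25.0]

window, target = window_with_future_x(X, Y, t=3, fill="previous")
print(len(window), len(window[0]))  # 4 3
print(window[-1], target)           # [4.0, 1.0, 16.0] 20.0
```

Every step keeps the same width (covariates plus one slot for $Y$), which is what a recurrent layer with a fixed input dimension requires.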











          1 Answer
          1






          active

          oldest

          votes








          1 Answer
          1






          active

          oldest

          votes









          active

          oldest

          votes






          active

          oldest

          votes









          1












          $begingroup$

          On time-series models



          All models that you have mentioned are correct and practical depending on the problem (the index $n$ is not required). The second one however produces redundant results which is a waste of computation. Even
          $$
          (X_t ; X_t+1 ; X_t+2) rightarrow (Y_t-1)
          $$

          is correct, if you are fitting on an archive and want to predict a year given the covariates from the next three years.



          But only the last model
          $$
          (X_t-2 ; X_t-1 ; X_t) rightarrow (Y_t+1)
          $$

          is a forecasting model. So in general, if you want to interpolate into the next $k$-th year from now $t$, you should use:
          $$
          (X_t-2 ; X_t-1 ; X_t) rightarrow (Y_t+k)
          $$



          or



          $$
          (X_t-2 ; X_t-1 ; X_t) rightarrow (Y_t+1,...,Y_t+k)
          $$

          Even a better model that takes advantage of known $Y$'s in the past would be:



          $$
          (X_t-2|Y_t-2 ; X_t-1|Y_t-1 ; X_t|Y_t) rightarrow (Y_t+k)
          $$



          where $|$ denotes vector concatenation to produce a 100 + 1 dimensional vector for each known year.



          As a personal opinion, for the time-series prediction task, 24 data points per year is very small compared to the dimension of $X$, which is 100. 1200 samples for $X rightarrow Y$ regression (ignoring the time) is more practical; if selecting 10 from 100 covariates is possible even better.



          Because of the small data set, I would suggest:



          1. $(Y_t-m ;...; Y_t-1 ; Y_t) rightarrow (Y_t+k)$ for time series prediction, and


          2. $X rightarrow Y$ regression for estimating the relation between X and Y.


          Relation to LSTM and RNN



          If we use LSTM/RNN to model time-series, they would be stateful. That is, when input $X_t-2$ is fed to an LSTM, it keeps an internal state (hidden state) to be combined with the next input $X_t-1$ and so on. Regarding the input/output dimension, here is an RNN animation from a post on medium by Raimi Karim that shows an arbitrary step among 3 steps of feeding $(X_t-2 ; X_t-1 ; X_t)$ to the network:





          As you see, dimension and number of inputs are independent of output. We can feed 5 inputs $X_t-4$ to $X_t$, each 100 dimension (100d) and receive a 1d output by setting the dimension of hidden states to 1d, or setting it to 10d and use an extra dense layer at the end to convert 10d to 1d, or receive a 50d output, or a 150d (three 50d) output, etc.



          Word "stateful" in Keras (source)



          LSTM and RNN are stateful by definition, this [badly named] variable in Keras means




          If stateful=True, the last state for each sample at index i in a batch will be
          used as initial state for the sample of index i in the following
          batch. Fabien Chollet




          For example, if each batch has 24 samples indexed from 0 to 23 (each sample could have the form $(X_t-2, X_t-1, X_t, Y_t+1)$), then the last hidden state $h$ from 8th sample will be used as the initial hidden state for 8th sample in the next batch. Except for special cases that there is a temporal order between batches and their samples, this must be set to False.






          share|improve this answer











          $endgroup$








          • 1




            $begingroup$
            Thank you, this is clear! So the choice for any of those problems would only depend on how I arrange inputs and outputs, right? So the network does not need any more argument to be trained for one problem or the other. If you mind answering that also: does any of this problem needs stateful LSTM? Or should preferentially be addresed with stateful LSTM? I am very confused by this argument, and nothing I have read so far makes it clear wether I should use those or not (and if so, how should I modify the formulas above).
            $endgroup$
            – Yo B.
            Mar 26 at 9:23











          • $begingroup$
            Thanks for the edit and the reference! Just to be clear, currently I am not using stateful=TRUE. Does this mean that the network is not building links between $X_t-1$ and $X_t$? If so, what is the interest of feeding the network with this time-batch structure when stateful=FALSE? I get it's a different question, you don't have to answer. Thanks again!
            $endgroup$
            – Yo B.
            Mar 26 at 10:53







          • 1




            $begingroup$
            Well thanks again for this last edit, you really helped me there! All the best
            $endgroup$
            – Yo B.
            Mar 26 at 11:36










          • $begingroup$
            If you don't mind answering this last question: in the predictive case $(X_t-2|Y_t-2 ; X_t-1|Y_t-1 ; X_t|Y_t) rightarrow Y_t+1$, how should I input $X_t+1$? I'm asking that because lots of my covariates are static and can therefore be useful in predicting the process in the years to come.
            $endgroup$
            – Yo B.
            Mar 26 at 15:45







          • 1




            $begingroup$
            @YoB. if $t+1$ denotes "next year" we have no access to $X_t+1$ in real-time cases. Btw, since each timestamp should be 101 dimension, you can use $(...X_t|Y_t;X_t+1|0) rightarrow (Y_t+1)$ just using a dummy 0, but I think it may cause under-performance, try it. You can also use the previous year: $(...X_t|Y_t;X_t+1|Y_t) rightarrow (Y_t+1)$.
            $endgroup$
            – Esmailian
            Mar 26 at 15:55















          1












          $begingroup$

          On time-series models



          All models that you have mentioned are correct and practical depending on the problem (the index $n$ is not required). The second one however produces redundant results which is a waste of computation. Even
          $$
          (X_t ; X_t+1 ; X_t+2) rightarrow (Y_t-1)
          $$

          is correct, if you are fitting on an archive and want to predict a year given the covariates from the next three years.



          But only the last model
          $$
          (X_t-2 ; X_t-1 ; X_t) rightarrow (Y_t+1)
          $$

          is a forecasting model. So in general, if you want to interpolate into the next $k$-th year from now $t$, you should use:
          $$
          (X_t-2 ; X_t-1 ; X_t) rightarrow (Y_t+k)
          $$



          or



          $$
          (X_t-2 ; X_t-1 ; X_t) rightarrow (Y_t+1,...,Y_t+k)
          $$

          Even a better model that takes advantage of known $Y$'s in the past would be:



          $$
          (X_t-2|Y_t-2 ; X_t-1|Y_t-1 ; X_t|Y_t) rightarrow (Y_t+k)
          $$



          where $|$ denotes vector concatenation to produce a 100 + 1 dimensional vector for each known year.



          As a personal opinion, for the time-series prediction task, 24 data points per year is very small compared to the dimension of $X$, which is 100. 1200 samples for $X rightarrow Y$ regression (ignoring the time) is more practical; if selecting 10 from 100 covariates is possible even better.



          Because of the small data set, I would suggest:



          1. $(Y_t-m ;...; Y_t-1 ; Y_t) rightarrow (Y_t+k)$ for time series prediction, and


          2. $X rightarrow Y$ regression for estimating the relation between X and Y.


          Relation to LSTM and RNN



          If we use LSTM/RNN to model time-series, they would be stateful. That is, when input $X_t-2$ is fed to an LSTM, it keeps an internal state (hidden state) to be combined with the next input $X_t-1$ and so on. Regarding the input/output dimension, here is an RNN animation from a post on medium by Raimi Karim that shows an arbitrary step among 3 steps of feeding $(X_t-2 ; X_t-1 ; X_t)$ to the network:





          As you see, dimension and number of inputs are independent of output. We can feed 5 inputs $X_t-4$ to $X_t$, each 100 dimension (100d) and receive a 1d output by setting the dimension of hidden states to 1d, or setting it to 10d and use an extra dense layer at the end to convert 10d to 1d, or receive a 50d output, or a 150d (three 50d) output, etc.



          Word "stateful" in Keras (source)



          LSTM and RNN are stateful by definition, this [badly named] variable in Keras means




          If stateful=True, the last state for each sample at index i in a batch will be
          used as initial state for the sample of index i in the following
          batch. Fabien Chollet




          For example, if each batch has 24 samples indexed from 0 to 23 (each sample could have the form $(X_t-2, X_t-1, X_t, Y_t+1)$), then the last hidden state $h$ from 8th sample will be used as the initial hidden state for 8th sample in the next batch. Except for special cases that there is a temporal order between batches and their samples, this must be set to False.






          share|improve this answer











          $endgroup$








          • 1




            $begingroup$
            Thank you, this is clear! So the choice for any of those problems would only depend on how I arrange inputs and outputs, right? So the network does not need any more argument to be trained for one problem or the other. If you mind answering that also: does any of this problem needs stateful LSTM? Or should preferentially be addresed with stateful LSTM? I am very confused by this argument, and nothing I have read so far makes it clear wether I should use those or not (and if so, how should I modify the formulas above).
            $endgroup$
            – Yo B.
            Mar 26 at 9:23











          • $begingroup$
            Thanks for the edit and the reference! Just to be clear, currently I am not using stateful=TRUE. Does this mean that the network is not building links between $X_t-1$ and $X_t$? If so, what is the interest of feeding the network with this time-batch structure when stateful=FALSE? I get it's a different question, you don't have to answer. Thanks again!
            $endgroup$
            – Yo B.
            Mar 26 at 10:53







          • 1




            $begingroup$
            Well thanks again for this last edit, you really helped me there! All the best
            $endgroup$
            – Yo B.
            Mar 26 at 11:36










          • $begingroup$
            If you don't mind answering this last question: in the predictive case $(X_t-2|Y_t-2 ; X_t-1|Y_t-1 ; X_t|Y_t) rightarrow Y_t+1$, how should I input $X_t+1$? I'm asking that because lots of my covariates are static and can therefore be useful in predicting the process in the years to come.
            $endgroup$
            – Yo B.
            Mar 26 at 15:45







          • 1




            $begingroup$
            @YoB. if $t+1$ denotes "next year" we have no access to $X_t+1$ in real-time cases. Btw, since each timestamp should be 101 dimension, you can use $(...X_t|Y_t;X_t+1|0) rightarrow (Y_t+1)$ just using a dummy 0, but I think it may cause under-performance, try it. You can also use the previous year: $(...X_t|Y_t;X_t+1|Y_t) rightarrow (Y_t+1)$.
            $endgroup$
            – Esmailian
            Mar 26 at 15:55













          1












          1








          1





          $begingroup$

          On time-series models



          All models that you have mentioned are correct and practical depending on the problem (the index $n$ is not required). The second one however produces redundant results which is a waste of computation. Even
          $$
          (X_t ; X_t+1 ; X_t+2) rightarrow (Y_t-1)
          $$

          is correct, if you are fitting on an archive and want to predict a year given the covariates from the next three years.



          But only the last model
          $$
          (X_t-2 ; X_t-1 ; X_t) rightarrow (Y_t+1)
          $$

          is a forecasting model. So in general, if you want to interpolate into the next $k$-th year from now $t$, you should use:
          $$
          (X_t-2 ; X_t-1 ; X_t) rightarrow (Y_t+k)
          $$



          or



          $$
          (X_t-2 ; X_t-1 ; X_t) rightarrow (Y_t+1,...,Y_t+k)
          $$

          An even better model, which takes advantage of the known past values of $Y$, would be:

          $$
          (X_{t-2}|Y_{t-2} ; X_{t-1}|Y_{t-1} ; X_t|Y_t) \rightarrow (Y_{t+k})
          $$

          where $|$ denotes vector concatenation, producing a (100 + 1)-dimensional vector for each known year.
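The windowed formulations above can be sketched in plain Python. This is only an illustration under assumptions: the helper name `make_windows`, the toy data (24 time points, 100 covariates with arbitrary values), and the choices `window=3`, `k=1` are all hypothetical and not taken from the question. The resulting nesting (samples, time steps, features) matches the 3D layout that Keras recurrent layers expect as input.

```python
def make_windows(X, Y, window=3, k=1):
    """Build samples (X_{t-w+1}|Y_{t-w+1}; ...; X_t|Y_t) -> Y_{t+k}.

    X: list of per-step covariate vectors (lists of floats).
    Y: list of per-step scalar targets, same length as X.
    """
    inputs, targets = [], []
    # t is the index of the last known time step in each window.
    for t in range(window - 1, len(X) - k):
        # Concatenate each covariate vector with its known target (the "|" above).
        steps = [X[s] + [Y[s]] for s in range(t - window + 1, t + 1)]
        inputs.append(steps)
        targets.append(Y[t + k])
    return inputs, targets


# Toy data (assumption): 24 time points, 100 covariates each, arbitrary values.
X = [[float(step)] * 100 for step in range(24)]
Y = [float(10 * step) for step in range(24)]

inputs, targets = make_windows(X, Y, window=3, k=1)

# Number of samples, time steps per sample, features per time step.
print(len(inputs), len(inputs[0]), len(inputs[0][0]))  # 21 3 101
```

Dropping `X[s] +` from the inner list gives the $Y$-only variant, and collecting `Y[t + 1 : t + k + 1]` as the target gives the multi-step output.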



          In my opinion, for the time-series prediction task, 24 data points per year is very few compared to the dimension of $X$, which is 100. Using the 1200 samples for $X \rightarrow Y$ regression (ignoring time) is more practical; if you can select 10 of the 100 covariates, even better.

          Because of the small data set, I would suggest:

          1. $(Y_{t-m} ; \ldots ; Y_{t-1} ; Y_t) \rightarrow (Y_{t+k})$ for time-series prediction, and

          2. $X \rightarrow Y$ regression for estimating the relation between $X$ and $Y$.


          Relation to LSTM and RNN



          If we use an LSTM/RNN to model a time series, it is stateful: when the input $X_{t-2}$ is fed to an LSTM, it keeps an internal state (hidden state) that is combined with the next input $X_{t-1}$, and so on. Regarding the input/output dimension, an RNN animation from a Medium post by Raimi Karim shows an arbitrary step among the 3 steps of feeding $(X_{t-2} ; X_{t-1} ; X_t)$ to the network.

          As the animation shows, the dimension and number of inputs are independent of the output. We can feed 5 inputs, $X_{t-4}$ to $X_t$, each 100-dimensional (100d), and receive a 1d output by setting the dimension of the hidden state to 1d; or set it to 10d and use an extra dense layer at the end to convert 10d to 1d; or receive a 50d output, or a 150d (three 50d) output, etc.
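To make the independence of input and output dimensions concrete, here is a minimal sketch that uses a plain tanh RNN cell in pure Python instead of an LSTM (the gating is omitted for brevity; the dimension bookkeeping is the same). The weights are random and untrained, and all names (`rnn_forward`, `W_in`, `W_rec`, `w_out`) are hypothetical.

```python
import math
import random


def rnn_forward(sequence, W_in, W_rec, bias):
    """Run a plain tanh RNN over a sequence and return the final hidden state."""
    hidden_dim = len(bias)
    h = [0.0] * hidden_dim  # initial hidden state
    for x in sequence:
        # New state mixes the current input with the previous hidden state.
        h = [
            math.tanh(
                sum(W_in[j][i] * x[i] for i in range(len(x)))
                + sum(W_rec[j][i] * h[i] for i in range(hidden_dim))
                + bias[j]
            )
            for j in range(hidden_dim)
        ]
    return h


def rand_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
steps, input_dim, hidden_dim = 5, 100, 10  # 5 inputs of 100d, hidden state of 10d

W_in = rand_matrix(hidden_dim, input_dim)
W_rec = rand_matrix(hidden_dim, hidden_dim)
bias = [0.0] * hidden_dim
w_out = [random.uniform(-0.1, 0.1) for _ in range(hidden_dim)]  # dense layer: 10d -> 1d

sequence = [[random.random() for _ in range(input_dim)] for _ in range(steps)]
h_last = rnn_forward(sequence, W_in, W_rec, bias)
y_hat = sum(w * h for w, h in zip(w_out, h_last))  # scalar (1d) output

print(len(sequence), len(sequence[0]), len(h_last))  # 5 100 10
```

Changing `steps` or `input_dim` does not affect the size of `h_last`; only `hidden_dim` and the final dense layer decide the output dimension.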



          Word "stateful" in Keras (source)



          LSTMs and RNNs are stateful by definition; this [badly named] parameter in Keras means:

          If stateful=True, the last state for each sample at index i in a batch will be
          used as initial state for the sample of index i in the following
          batch. (François Chollet)

          For example, if each batch has 24 samples indexed from 0 to 23 (each sample could have the form $(X_{t-2}, X_{t-1}, X_t, Y_{t+1})$), then the last hidden state $h$ of the 8th sample will be used as the initial hidden state of the 8th sample in the next batch. Unless there is a temporal order between the batches and their samples, which is a special case, this should be set to False.
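The difference can be illustrated with a toy recurrent cell in plain Python. This is not Keras code: the update rule `state <- decay * state + input`, the function `run_batch`, and the tiny batches are assumptions chosen so that the numbers can be checked by hand.

```python
def run_batch(batch, initial_states, decay=0.5):
    """Toy recurrent cell: state <- decay * state + input, applied step by step.

    batch: list of sequences, one per sample.
    initial_states: one float per sample.
    Returns the final state of each sample.
    """
    final_states = []
    for sequence, state in zip(batch, initial_states):
        for x in sequence:
            state = decay * state + x
        final_states.append(state)
    return final_states


batch_1 = [[1.0, 1.0], [2.0, 2.0]]
batch_2 = [[1.0, 1.0], [2.0, 2.0]]
zeros = [0.0, 0.0]

# Analogue of stateful=False: every batch starts from a fresh (zero) state.
fresh = run_batch(batch_2, zeros)

# Analogue of stateful=True: sample i of batch 2 starts from the final
# state of sample i of batch 1.
carried = run_batch(batch_2, run_batch(batch_1, zeros))

print(fresh)    # [1.5, 3.0]
print(carried)  # [1.875, 3.75]
```

In both cases the state is carried across the time steps inside each sample; only the hand-over between batches differs. Carrying state between batches only makes sense when sample i of the next batch really continues sample i of the previous one.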

















          $endgroup$











          edited Mar 26 at 12:41

























          answered Mar 25 at 17:52









          Esmailian







          • 1




            $begingroup$
            Thank you, this is clear! So the choice between those problems only depends on how I arrange the inputs and outputs, right? The network does not need any additional argument to be trained for one problem or the other. If you don't mind answering this as well: do any of these problems need a stateful LSTM? Or should they preferably be addressed with a stateful LSTM? I am very confused by this argument, and nothing I have read so far makes it clear whether I should use one or not (and if so, how I should modify the formulas above).
            $endgroup$
            – Yo B.
            Mar 26 at 9:23











          • $begingroup$
            Thanks for the edit and the reference! Just to be clear, I am currently not using stateful=TRUE. Does this mean that the network is not building links between $X_{t-1}$ and $X_t$? If so, what is the point of feeding the network with this time-batch structure when stateful=FALSE? I get that it's a different question, so you don't have to answer. Thanks again!
            $endgroup$
            – Yo B.
            Mar 26 at 10:53







          • 1




            $begingroup$
            Well thanks again for this last edit, you really helped me there! All the best
            $endgroup$
            – Yo B.
            Mar 26 at 11:36










          • $begingroup$
            If you don't mind answering this last question: in the predictive case $(X_{t-2}|Y_{t-2} ; X_{t-1}|Y_{t-1} ; X_t|Y_t) \rightarrow Y_{t+1}$, how should I input $X_{t+1}$? I'm asking because lots of my covariates are static and can therefore be useful in predicting the process in the years to come.
            $endgroup$
            – Yo B.
            Mar 26 at 15:45







          • 1




            $begingroup$
            @YoB. If $t+1$ denotes "next year", we have no access to $X_{t+1}$ in real-time cases. By the way, since each time step should be 101-dimensional, you can use $(\ldots ; X_t|Y_t ; X_{t+1}|0) \rightarrow (Y_{t+1})$ with a dummy 0, but I think it may cause under-performance; try it. You can also use the value from the previous year: $(\ldots ; X_t|Y_t ; X_{t+1}|Y_t) \rightarrow (Y_{t+1})$.
            $endgroup$
            – Esmailian
            Mar 26 at 15:55






















          Yo B. is a new contributor. Be nice, and check out our Code of Conduct.








