
Sentence representation in Transformer



I am trying my hand at BERT, and I have gotten as far as feeding a sentence into BertTokenizer and running it through the BERT model, which gives me the output layers back. The code below is modified from the PyTorch code over at HuggingFace.



import logging

import torch
from torch.utils.data import TensorDataset, DataLoader, SequentialSampler

from pytorch_pretrained_bert.tokenization import BertTokenizer
from pytorch_pretrained_bert.modeling import BertModel

logging.basicConfig(format='%(asctime)s - %(levelname)s - %(name)s - %(message)s',
                    datefmt='%m/%d/%Y %H:%M:%S',
                    level=logging.INFO)
logger = logging.getLogger(__name__)


class InputFeatures(object):
    def __init__(self, tokens, input_ids, input_mask, input_type_ids):
        self.tokens = tokens
        self.input_ids = input_ids
        self.input_mask = input_mask
        self.input_type_ids = input_type_ids


def convert_sentences_to_features(sentences, max_seq_length, tokenizer):
    features = []
    for sentence in sentences:
        # the tokenizer will also separate on punctuation
        # see https://github.com/google-research/bert#tokenization
        tokens = tokenizer.tokenize(sentence)

        # limit the number of tokens, leaving room for [CLS] and [SEP]
        if len(tokens) > max_seq_length - 2:
            tokens = tokens[0:(max_seq_length - 2)]

        # add [CLS] and [SEP], as expected by BERT
        tokens = ['[CLS]', *tokens, '[SEP]']

        input_type_ids = [0] * len(tokens)
        input_ids = tokenizer.convert_tokens_to_ids(tokens)

        # The mask has 1 for real tokens and 0 for padding tokens. Only real
        # tokens are attended to.
        input_mask = [1] * len(input_ids)

        # Zero-pad up to the sequence length.
        while len(input_ids) < max_seq_length:
            input_ids.append(0)
            input_mask.append(0)
            input_type_ids.append(0)

        features.append(InputFeatures(tokens=tokens,
                                      input_ids=input_ids,
                                      input_mask=input_mask,
                                      input_type_ids=input_type_ids))
    return features


def main(sentences, layers='-1, -2, -3, -4', max_seq_length=512, bert_model='bert-large-uncased',
         do_lower_case=True, batch_size=32, no_cuda=False):
    device = torch.device('cuda' if torch.cuda.is_available() and not no_cuda else 'cpu')

    # 'layers' indicates which layers we want to concatenate
    layer_idxs = [int(l) for l in layers.split(',')]

    # init tokenizer
    tokenizer = BertTokenizer.from_pretrained(bert_model, do_lower_case=do_lower_case)

    # returns a list of 'InputFeatures'
    features = convert_sentences_to_features(sentences, max_seq_length, tokenizer)

    # init model and move to device
    model = BertModel.from_pretrained(bert_model)
    model.to(device)

    # extract IDs and mask from the features
    all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)
    all_input_mask = torch.tensor([f.input_mask for f in features], dtype=torch.long)

    # prepare dataset and dataloader
    eval_data = TensorDataset(all_input_ids, all_input_mask)
    eval_sampler = SequentialSampler(eval_data)
    eval_dataloader = DataLoader(eval_data, sampler=eval_sampler, batch_size=batch_size)

    model.eval()

    for input_ids, input_mask in eval_dataloader:
        input_ids = input_ids.to(device)
        input_mask = input_mask.to(device)

        all_encoder_layers, _ = model(input_ids, token_type_ids=None, attention_mask=input_mask)

        # put the layers to concatenate in a list, and use torch.cat
        layers_to_concat = [all_encoder_layers[idx] for idx in layer_idxs]
        concat = torch.cat(layers_to_concat, dim=-1)

        logger.info(concat.size())
        logger.info(concat)


if __name__ == "__main__":
    proc_args = {
        'sentences': ['I saw Bert today !',
                      'Do you like bananas ?',
                      'Some sentences are really horrendous to parse .'],
        'max_seq_length': 32
    }

    main(**proc_args)


This works and gives me an output of size (3, 32, 4096) in this case, which corresponds to (batch_size, seq_length, layers * hidden_size). In practice this means that I have a representation of each token in its context. But what I would like is a representation of the whole sentence. In a (bidirectional) RNN you would typically take the output of the topmost node at the last time step, which holds the latest hidden state and output, but I am not sure whether the same holds for Transformers, since the architecture is different.

What is the best way to extract a sentence representation, of size (batch_size, hidden_size), from such a sequence of token representations in a Transformer?
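For concreteness, here is a minimal sketch of the two pooling strategies I have been considering, assuming the concat tensor and input_mask from the loop above; the function name and the choice of strategies are only illustrative, not taken from any library. Option 1 takes the vector at position 0, i.e. the [CLS] token; option 2 averages over the real (non-padding) tokens using the attention mask. (As far as I understand pytorch_pretrained_bert, the second return value that I discard in the model call is a "pooled output" derived from the final-layer [CLS] vector and trained for next-sentence prediction, which may or may not be what I want here.)

import torch

def pool_token_representations(token_reprs, input_mask):
    """Collapse per-token representations (batch, seq, dim) into (batch, dim).

    'token_reprs' would be the 'concat' tensor from the loop above and
    'input_mask' the corresponding attention mask; both strategies are
    just candidates, not necessarily the recommended approach.
    """
    # Option 1: the vector at position 0, i.e. the [CLS] token
    cls_repr = token_reprs[:, 0, :]

    # Option 2: mean over the real (non-padding) tokens, weighted by the mask
    mask = input_mask.unsqueeze(-1).float()  # (batch, seq, 1)
    mean_repr = (token_reprs * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)

    return cls_repr, mean_repr

Note that with four concatenated layers both of these come out as (batch_size, 4 * hidden_size) rather than (batch_size, hidden_size), so I would restrict to a single layer if the exact size matters.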










      deep-learning pytorch natural-language-process transformer bert






asked Mar 22 at 11:37 by Bram Vanroy








