Input data of variable length - two scenarios
I'm trying to figure out how to train a neural network on inputs of variable length. The issue comes up in the following two scenarios I'm trying to solve.
Scenario 1:
I have a long list of running distances for various runners, with 3 columns: runner, date, distance.
Obviously some runners have a lot of entries and others don't. I'm trying to predict the number of miles a given runner will run next. So I'm guessing I need to transform my data to have one row per runner, which gives me variable-length features. How can I deal with this in an ML application?
Scenario 2:
I'd like to take various strings ("teststring", "P@ssword", "NotAPassword123", etc.) and classify each one as a password or not. I guess I'm trying to figure out a) how to convert strings to numbers to train on and b) how to deal with the fact that they have variable length.
Thanks for reading this far...
text mlp features
asked Mar 6 at 20:22
Joe R
2 Answers
Scenario 1: It seems like you're dealing with columns that may lack data. You have a few options for assigning values to rows that have no information in certain columns, and each has advantages and drawbacks that depend on your dataset. One example is to assign the mean or median of that column to NaN entries, which has the drawback of reducing the variance in your data. Here's an article on the topic that should help you.
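To make the mean/median idea concrete, here is a minimal sketch using only the Python standard library. The `impute` helper and the sample distances are made up for illustration; in practice you would more likely use the imputation tools of your data library.

```python
import math
import statistics

def impute(values, strategy="median"):
    """Replace NaN entries with the mean or median of the observed entries."""
    observed = [v for v in values if not math.isnan(v)]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    if strategy == "mean":
        fill = statistics.mean(observed)
    else:
        fill = statistics.median(observed)
    return [fill if math.isnan(v) else v for v in values]

# Hypothetical distance column (miles); NaN marks a missing entry.
distances = [3.0, float("nan"), 5.0, 10.0, float("nan")]
observed = [v for v in distances if not math.isnan(v)]
filled = impute(distances, strategy="mean")

# The filled column is less spread out than the observed values alone,
# which is the variance-reduction drawback mentioned above.
variance_before = statistics.pvariance(observed)
variance_after = statistics.pvariance(filled)
```

Comparing `variance_before` with `variance_after` on your own data is a quick way to see how much the imputation flattens the column.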
Scenario 2: For part "b", a common approach is to pick a length that should be "big enough" and pad the sequences (or, in your case, strings) that are "too short". For part "a", a very simple approach would be to apply bag of words at the character level. Alternatively, you could experiment with training a character embedding model on your password text; such a model creates a vectorized representation of your text that you can feed to whatever model you use for password classification.
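As a rough sketch of both parts, the snippet below maps characters to integer indices, pads (or truncates) every string to a fixed length, and also builds a character-level bag-of-words vector. The helper names, the choice of 0 as the padding index, and `MAX_LEN = 16` are assumptions for the example, not a prescribed setup.

```python
PAD = 0  # index reserved for padding positions

def build_vocab(strings):
    """Map every character seen in the training strings to an index >= 1."""
    chars = sorted({ch for s in strings for ch in s})
    return {ch: i + 1 for i, ch in enumerate(chars)}

def encode(s, vocab, max_len):
    """Turn a string into exactly max_len integers: truncate, then right-pad."""
    unk = len(vocab) + 1  # shared index for characters not in the vocabulary
    ids = [vocab.get(ch, unk) for ch in s[:max_len]]
    return ids + [PAD] * (max_len - len(ids))

def bag_of_chars(s, vocab):
    """Character-level bag of words: one count per vocabulary entry."""
    counts = [0] * len(vocab)
    for ch in s:
        if ch in vocab:
            counts[vocab[ch] - 1] += 1
    return counts

strings = ["teststring", "P@ssword", "NotAPassword123"]
vocab = build_vocab(strings)
MAX_LEN = 16  # assumed "big enough" length for this toy data
padded = [encode(s, vocab, MAX_LEN) for s in strings]
bags = [bag_of_chars(s, vocab) for s in strings]
```

The padded index sequences are what you would feed to an embedding layer, while the bag-of-characters vectors already have a fixed size and can go straight into an MLP. Note that the bag representation discards character order.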
edited Mar 7 at 15:01
answered Mar 7 at 6:15
Andrei Ungur
For scenario 1, if you have enough data for each runner, you could build a separate model per runner; otherwise, you can add the runner as a categorical variable by one-hot encoding the runner IDs and then try out your model.
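A minimal sketch of the one-hot option, with made-up runner names and a hypothetical `one_hot_runners` helper, could look like this:

```python
def one_hot_runners(runner_ids):
    """Return the sorted categories and one indicator row per input entry."""
    categories = sorted(set(runner_ids))
    index = {runner: i for i, runner in enumerate(categories)}
    rows = []
    for runner in runner_ids:
        row = [0] * len(categories)
        row[index[runner]] = 1  # exactly one column is "hot" per row
        rows.append(row)
    return categories, rows

# Hypothetical runner column from the (runner, date, distance) table.
runners = ["alice", "bob", "alice", "carol"]
categories, encoded = one_hot_runners(runners)
```

Each row of `encoded` can then be concatenated with that entry's other features (for example the date and previous distances), so every training example has the same width regardless of how many entries a runner has.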
For scenario 2, you can create fixed-size vectors for each string by handcrafting features such as the count of consonants, the count of vowels, the presence of special characters, etc.
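For instance, a feature extractor along these lines always returns six numbers, whatever the length of the string. The exact feature set is only an illustration, and it assumes mostly ASCII input: any alphabetic character that is not an English vowel is counted as a consonant.

```python
VOWELS = set("aeiouAEIOU")

def string_features(s):
    """Map a string of any length to a fixed-size list of six features."""
    letters = [ch for ch in s if ch.isalpha()]
    vowels = sum(ch in VOWELS for ch in letters)
    return [
        len(s),                                   # total length
        vowels,                                   # count of vowels
        len(letters) - vowels,                    # count of consonants
        sum(ch.isdigit() for ch in s),            # count of digits
        sum(ch.isupper() for ch in s),            # count of uppercase letters
        int(any(not ch.isalnum() for ch in s)),   # 1 if any special character
    ]

features = [string_features(s) for s in ["teststring", "P@ssword", "NotAPassword123"]]
```

Because every string maps to the same number of features, the result can be fed directly to an MLP or any other fixed-input classifier.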
answered Mar 7 at 16:03
Atif Hassan