How do I use TF*IDF scores for my machine learning model? The Next CEO of Stack Overflow2019 Community Moderator ElectionDocument classification: tf-idf prior to or after feature filtering?Choosing an algorithm with normalized data(Classification)Image Feature VectorsHow to improve the accuracy of Random Forest for Text CategorizationHow to find categorical features from a vector representation of text?Low number of inputs compared to outputs (per row) in neural networkFeeding machine learning model with different matrixWhy Decision trees performs better than logistic regressionWhen to use an ordinal logistic regression modelHow exactly do I go about extracting features from timestamps for machine learning?
Which one is the true statement?
How many extra stops do monopods offer for tele photographs?
Why isn't the Mueller report being released completely and unredacted?
Is it possible to replace duplicates of a character with one character using tr
What was the first Unix version to run on a microcomputer?
Flying from Cape Town to England and return to another province
Would a completely good Muggle be able to use a wand?
What can we do to stop prior company from asking us questions?
Why is quantifier elimination desirable for a given theory?
Should I tutor a student who I know has cheated on their homework?
Proper way to express "He disappeared them"
Is French Guiana a (hard) EU border?
Reference request: Grassmannian and Plucker coordinates in type B, C, D
The exact meaning of 'Mom made me a sandwich'
Solving system of ODEs with extra parameter
Why this way of making earth uninhabitable in Interstellar?
I believe this to be a fraud - hired, then asked to cash check and send cash as Bitcoin
WOW air has ceased operation, can I get my tickets refunded?
A Man With a Stainless Steel Endoskeleton (like The Terminator) Fighting Cloaked Aliens Only He Can See
Why, when going from special to general relativity, do we just replace partial derivatives with covariant derivatives?
Is there a way to bypass a component in series in a circuit if that component fails?
Is this "being" usage is essential?
What is the value of α and β in a triangle?
Running a General Election and the European Elections together
How do I use TF*IDF scores for my machine learning model?
The Next CEO of Stack Overflow2019 Community Moderator ElectionDocument classification: tf-idf prior to or after feature filtering?Choosing an algorithm with normalized data(Classification)Image Feature VectorsHow to improve the accuracy of Random Forest for Text CategorizationHow to find categorical features from a vector representation of text?Low number of inputs compared to outputs (per row) in neural networkFeeding machine learning model with different matrixWhy Decision trees performs better than logistic regressionWhen to use an ordinal logistic regression modelHow exactly do I go about extracting features from timestamps for machine learning?
$begingroup$
I have applied TF*IDF on the 'Ad-topic line' column of my dataset.
For every ad-topic line, I get the same output:
Firstly, I am unable to make sense of the output. The TF*IDF values are mentioned to the right, but what exactly are the numbers in brackets?
I plan to use these for my logistic regression model for classification. How exactly do I feed these values to the algorithm?
machine-learning feature-extraction tfidf
New contributor
$endgroup$
add a comment |
$begingroup$
I have applied TF*IDF on the 'Ad-topic line' column of my dataset.
For every ad-topic line, I get the same output:
Firstly, I am unable to make sense of the output. The TF*IDF values are mentioned to the right, but what exactly are the numbers in brackets?
I plan to use these for my logistic regression model for classification. How exactly do I feed these values to the algorithm?
machine-learning feature-extraction tfidf
New contributor
$endgroup$
add a comment |
$begingroup$
I have applied TF*IDF on the 'Ad-topic line' column of my dataset.
For every ad-topic line, I get the same output:
Firstly, I am unable to make sense of the output. The TF*IDF values are mentioned to the right, but what exactly are the numbers in brackets?
I plan to use these for my logistic regression model for classification. How exactly do I feed these values to the algorithm?
machine-learning feature-extraction tfidf
New contributor
$endgroup$
I have applied TF*IDF on the 'Ad-topic line' column of my dataset.
For every ad-topic line, I get the same output:
Firstly, I am unable to make sense of the output. The TF*IDF values are mentioned to the right, but what exactly are the numbers in brackets?
I plan to use these for my logistic regression model for classification. How exactly do I feed these values to the algorithm?
machine-learning feature-extraction tfidf
machine-learning feature-extraction tfidf
New contributor
New contributor
New contributor
asked Mar 23 at 21:18
ApolloApollo
61
61
New contributor
New contributor
add a comment |
add a comment |
3 Answers
3
active
oldest
votes
$begingroup$
The numbers on the left are also important, they are basically the indexes in the following format
(document_number, token_number)
TF-IDF is computed for all the unique tokens in all the documents.
Let me know if you have any further doubt.
Vote me if i was able to help ;)
New contributor
$endgroup$
add a comment |
$begingroup$
First of all, I'm not sure how you have applied tfidf vectorizers on the data as no code snippet has been attached. Tfidf vectorizers are applied on text to convert the text into numerical vectors. Speciality of tfidf vectorization is that it gives more importance to rarely occuring words than the words which occur a lot of time ex: stop words or filler words which occur a lot of times yet they add no special meaning to a sentence. Nevertheless, I've applied tfidf vectorization on the dataset and I have posted them here and here. Try to recreate this. Hope it helps
$endgroup$
$begingroup$
I did the exact same thing. However, all tfidf values for each ad-topic line turn out to be the same. Please have a look at these: (1) Before fit_transfrom: imgur.com/NdOlwQL (2) Code: imgur.com/zuhGpJe (3) After fit_transform: imgur.com/YfFzIQr
$endgroup$
– Apollo
Mar 24 at 11:50
add a comment |
$begingroup$
From the output you have shared it can be understood that you have around 1000 rows of data and 300+ features/columns/words which the tfidf function created based on your selection of ngrams parameter.
Now, in the brackets (x,y) signifies:
X as the number of row of your data
And
Y as the nth feature in the features list.
Assuming you have written something similar to get the above output-
tfidf_matrix = tf.fit_transform(data)
this will give you a list of feature names
feature_names = tf.get_feature_names()
And now you can check any Y value in the brackets, for example lets take the first value from your output- (0,53)
feature_names[53]
This will give the name of the feature(which is basically a word or a combination of words) and right side value is the the tfidf score of that feature in the 0th row of your data.
$endgroup$
$begingroup$
Thank you for the response. Could you please tell me why every ad-topic line gives the same output? Shouldn't the 0th row give me (0,53), (0,4) and (0,228) with their respective tfidf values? Instead, for every row, I get the same tfidf value list for all the rows(the image in the question)
$endgroup$
– Apollo
Mar 24 at 8:25
$begingroup$
for (0,53) its .61, (0,4) its .53 and (0,228) its .58
$endgroup$
– Cini09
Mar 25 at 18:59
add a comment |
Your Answer
StackExchange.ifUsing("editor", function ()
return StackExchange.using("mathjaxEditing", function ()
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix)
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
);
);
, "mathjax-editing");
StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "557"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);
else
createEditor();
);
function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);
);
Apollo is a new contributor. Be nice, and check out our Code of Conduct.
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f47862%2fhow-do-i-use-tfidf-scores-for-my-machine-learning-model%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
3 Answers
3
active
oldest
votes
3 Answers
3
active
oldest
votes
active
oldest
votes
active
oldest
votes
$begingroup$
The numbers on the left are also important, they are basically the indexes in the following format
(document_number, token_number)
TF-IDF is computed for all the unique tokens in all the documents.
Let me know if you have any further doubt.
Vote me if i was able to help ;)
New contributor
$endgroup$
add a comment |
$begingroup$
The numbers on the left are also important, they are basically the indexes in the following format
(document_number, token_number)
TF-IDF is computed for all the unique tokens in all the documents.
Let me know if you have any further doubt.
Vote me if i was able to help ;)
New contributor
$endgroup$
add a comment |
$begingroup$
The numbers on the left are also important, they are basically the indexes in the following format
(document_number, token_number)
TF-IDF is computed for all the unique tokens in all the documents.
Let me know if you have any further doubt.
Vote me if i was able to help ;)
New contributor
$endgroup$
The numbers on the left are also important, they are basically the indexes in the following format
(document_number, token_number)
TF-IDF is computed for all the unique tokens in all the documents.
Let me know if you have any further doubt.
Vote me if i was able to help ;)
New contributor
New contributor
answered Mar 24 at 0:16
William ScottWilliam Scott
1063
1063
New contributor
New contributor
add a comment |
add a comment |
$begingroup$
First of all, I'm not sure how you have applied tfidf vectorizers on the data as no code snippet has been attached. Tfidf vectorizers are applied on text to convert the text into numerical vectors. Speciality of tfidf vectorization is that it gives more importance to rarely occuring words than the words which occur a lot of time ex: stop words or filler words which occur a lot of times yet they add no special meaning to a sentence. Nevertheless, I've applied tfidf vectorization on the dataset and I have posted them here and here. Try to recreate this. Hope it helps
$endgroup$
$begingroup$
I did the exact same thing. However, all tfidf values for each ad-topic line turn out to be the same. Please have a look at these: (1) Before fit_transfrom: imgur.com/NdOlwQL (2) Code: imgur.com/zuhGpJe (3) After fit_transform: imgur.com/YfFzIQr
$endgroup$
– Apollo
Mar 24 at 11:50
add a comment |
$begingroup$
First of all, I'm not sure how you have applied tfidf vectorizers on the data as no code snippet has been attached. Tfidf vectorizers are applied on text to convert the text into numerical vectors. Speciality of tfidf vectorization is that it gives more importance to rarely occuring words than the words which occur a lot of time ex: stop words or filler words which occur a lot of times yet they add no special meaning to a sentence. Nevertheless, I've applied tfidf vectorization on the dataset and I have posted them here and here. Try to recreate this. Hope it helps
$endgroup$
$begingroup$
I did the exact same thing. However, all tfidf values for each ad-topic line turn out to be the same. Please have a look at these: (1) Before fit_transfrom: imgur.com/NdOlwQL (2) Code: imgur.com/zuhGpJe (3) After fit_transform: imgur.com/YfFzIQr
$endgroup$
– Apollo
Mar 24 at 11:50
add a comment |
$begingroup$
First of all, I'm not sure how you have applied tfidf vectorizers on the data as no code snippet has been attached. Tfidf vectorizers are applied on text to convert the text into numerical vectors. Speciality of tfidf vectorization is that it gives more importance to rarely occuring words than the words which occur a lot of time ex: stop words or filler words which occur a lot of times yet they add no special meaning to a sentence. Nevertheless, I've applied tfidf vectorization on the dataset and I have posted them here and here. Try to recreate this. Hope it helps
$endgroup$
First of all, I'm not sure how you have applied tfidf vectorizers on the data as no code snippet has been attached. Tfidf vectorizers are applied on text to convert the text into numerical vectors. Speciality of tfidf vectorization is that it gives more importance to rarely occuring words than the words which occur a lot of time ex: stop words or filler words which occur a lot of times yet they add no special meaning to a sentence. Nevertheless, I've applied tfidf vectorization on the dataset and I have posted them here and here. Try to recreate this. Hope it helps
answered Mar 24 at 10:51
karthikeyan mgkarthikeyan mg
19510
19510
$begingroup$
I did the exact same thing. However, all tfidf values for each ad-topic line turn out to be the same. Please have a look at these: (1) Before fit_transfrom: imgur.com/NdOlwQL (2) Code: imgur.com/zuhGpJe (3) After fit_transform: imgur.com/YfFzIQr
$endgroup$
– Apollo
Mar 24 at 11:50
add a comment |
$begingroup$
I did the exact same thing. However, all tfidf values for each ad-topic line turn out to be the same. Please have a look at these: (1) Before fit_transfrom: imgur.com/NdOlwQL (2) Code: imgur.com/zuhGpJe (3) After fit_transform: imgur.com/YfFzIQr
$endgroup$
– Apollo
Mar 24 at 11:50
$begingroup$
I did the exact same thing. However, all tfidf values for each ad-topic line turn out to be the same. Please have a look at these: (1) Before fit_transfrom: imgur.com/NdOlwQL (2) Code: imgur.com/zuhGpJe (3) After fit_transform: imgur.com/YfFzIQr
$endgroup$
– Apollo
Mar 24 at 11:50
$begingroup$
I did the exact same thing. However, all tfidf values for each ad-topic line turn out to be the same. Please have a look at these: (1) Before fit_transfrom: imgur.com/NdOlwQL (2) Code: imgur.com/zuhGpJe (3) After fit_transform: imgur.com/YfFzIQr
$endgroup$
– Apollo
Mar 24 at 11:50
add a comment |
$begingroup$
From the output you have shared it can be understood that you have around 1000 rows of data and 300+ features/columns/words which the tfidf function created based on your selection of ngrams parameter.
Now, in the brackets (x,y) signifies:
X as the number of row of your data
And
Y as the nth feature in the features list.
Assuming you have written something similar to get the above output-
tfidf_matrix = tf.fit_transform(data)
this will give you a list of feature names
feature_names = tf.get_feature_names()
And now you can check any Y value in the brackets, for example lets take the first value from your output- (0,53)
feature_names[53]
This will give the name of the feature(which is basically a word or a combination of words) and right side value is the the tfidf score of that feature in the 0th row of your data.
$endgroup$
$begingroup$
Thank you for the response. Could you please tell me why every ad-topic line gives the same output? Shouldn't the 0th row give me (0,53), (0,4) and (0,228) with their respective tfidf values? Instead, for every row, I get the same tfidf value list for all the rows(the image in the question)
$endgroup$
– Apollo
Mar 24 at 8:25
$begingroup$
for (0,53) its .61, (0,4) its .53 and (0,228) its .58
$endgroup$
– Cini09
Mar 25 at 18:59
add a comment |
$begingroup$
From the output you have shared it can be understood that you have around 1000 rows of data and 300+ features/columns/words which the tfidf function created based on your selection of ngrams parameter.
Now, in the brackets (x,y) signifies:
X as the number of row of your data
And
Y as the nth feature in the features list.
Assuming you have written something similar to get the above output-
tfidf_matrix = tf.fit_transform(data)
this will give you a list of feature names
feature_names = tf.get_feature_names()
And now you can check any Y value in the brackets, for example lets take the first value from your output- (0,53)
feature_names[53]
This will give the name of the feature(which is basically a word or a combination of words) and right side value is the the tfidf score of that feature in the 0th row of your data.
$endgroup$
$begingroup$
Thank you for the response. Could you please tell me why every ad-topic line gives the same output? Shouldn't the 0th row give me (0,53), (0,4) and (0,228) with their respective tfidf values? Instead, for every row, I get the same tfidf value list for all the rows(the image in the question)
$endgroup$
– Apollo
Mar 24 at 8:25
$begingroup$
for (0,53) its .61, (0,4) its .53 and (0,228) its .58
$endgroup$
– Cini09
Mar 25 at 18:59
add a comment |
$begingroup$
From the output you have shared it can be understood that you have around 1000 rows of data and 300+ features/columns/words which the tfidf function created based on your selection of ngrams parameter.
Now, in the brackets (x,y) signifies:
X as the number of row of your data
And
Y as the nth feature in the features list.
Assuming you have written something similar to get the above output-
tfidf_matrix = tf.fit_transform(data)
this will give you a list of feature names
feature_names = tf.get_feature_names()
And now you can check any Y value in the brackets, for example lets take the first value from your output- (0,53)
feature_names[53]
This will give the name of the feature(which is basically a word or a combination of words) and right side value is the the tfidf score of that feature in the 0th row of your data.
$endgroup$
From the output you have shared it can be understood that you have around 1000 rows of data and 300+ features/columns/words which the tfidf function created based on your selection of ngrams parameter.
Now, in the brackets (x,y) signifies:
X as the number of row of your data
And
Y as the nth feature in the features list.
Assuming you have written something similar to get the above output-
tfidf_matrix = tf.fit_transform(data)
this will give you a list of feature names
feature_names = tf.get_feature_names()
And now you can check any Y value in the brackets, for example lets take the first value from your output- (0,53)
feature_names[53]
This will give the name of the feature(which is basically a word or a combination of words) and right side value is the the tfidf score of that feature in the 0th row of your data.
edited Mar 25 at 18:57
answered Mar 24 at 5:58
Cini09Cini09
166
166
$begingroup$
Thank you for the response. Could you please tell me why every ad-topic line gives the same output? Shouldn't the 0th row give me (0,53), (0,4) and (0,228) with their respective tfidf values? Instead, for every row, I get the same tfidf value list for all the rows(the image in the question)
$endgroup$
– Apollo
Mar 24 at 8:25
$begingroup$
for (0,53) its .61, (0,4) its .53 and (0,228) its .58
$endgroup$
– Cini09
Mar 25 at 18:59
add a comment |
$begingroup$
Thank you for the response. Could you please tell me why every ad-topic line gives the same output? Shouldn't the 0th row give me (0,53), (0,4) and (0,228) with their respective tfidf values? Instead, for every row, I get the same tfidf value list for all the rows(the image in the question)
$endgroup$
– Apollo
Mar 24 at 8:25
$begingroup$
for (0,53) its .61, (0,4) its .53 and (0,228) its .58
$endgroup$
– Cini09
Mar 25 at 18:59
$begingroup$
Thank you for the response. Could you please tell me why every ad-topic line gives the same output? Shouldn't the 0th row give me (0,53), (0,4) and (0,228) with their respective tfidf values? Instead, for every row, I get the same tfidf value list for all the rows(the image in the question)
$endgroup$
– Apollo
Mar 24 at 8:25
$begingroup$
Thank you for the response. Could you please tell me why every ad-topic line gives the same output? Shouldn't the 0th row give me (0,53), (0,4) and (0,228) with their respective tfidf values? Instead, for every row, I get the same tfidf value list for all the rows(the image in the question)
$endgroup$
– Apollo
Mar 24 at 8:25
$begingroup$
for (0,53) its .61, (0,4) its .53 and (0,228) its .58
$endgroup$
– Cini09
Mar 25 at 18:59
$begingroup$
for (0,53) its .61, (0,4) its .53 and (0,228) its .58
$endgroup$
– Cini09
Mar 25 at 18:59
add a comment |
Apollo is a new contributor. Be nice, and check out our Code of Conduct.
Apollo is a new contributor. Be nice, and check out our Code of Conduct.
Apollo is a new contributor. Be nice, and check out our Code of Conduct.
Apollo is a new contributor. Be nice, and check out our Code of Conduct.
Thanks for contributing an answer to Data Science Stack Exchange!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
Use MathJax to format equations. MathJax reference.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f47862%2fhow-do-i-use-tfidf-scores-for-my-machine-learning-model%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown