What are the ways to partition a large file that does not fit into memory so it can later be fed as training data?
Is there any way, other than using Spark or Hadoop, to partition a large file that does not fit into memory so it can be fed as training data?

machine-learning bigdata

asked Mar 26 at 21:02 by edunlimit
– honar.cs (Mar 27 at 6:55): What is the size of your data, and how much memory does your computer have?
– edunlimit (Mar 27 at 23:06): @honar.cs Oh, I'm not using actual data. I was just curious.
1 Answer
Yes, of course, although it is usually not worth the effort, because Spark and Hadoop already handle this well.

Here is the idea. Suppose your memory can hold 100,000 examples. Split your data set into files of fewer than 100,000 examples each.

The key (and most complex) step is how to train a classifier on those files. For gradient-descent-style optimization (batch gradient descent, SGD and so on), most model families (SVM, GBDT, naive Bayes, logistic regression, deep learning and so on) support it: load one file into RAM at a time and feed it to the classifier until you find good parameters.
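(For reference, a minimal sketch of the splitting step itself, streaming the source file line by line so it never has to fit in memory. The file names and chunk size are placeholders, not part of the original answer.)

def split_file(path, chunk_size=100_000, prefix="chunk"):
    # Split `path` into files of at most `chunk_size` lines each,
    # reading one line at a time so the whole file is never in memory.
    # (A CSV header line, if present, would need separate handling.)
    out, part, written = None, 0, 0
    with open(path) as src:
        for line in src:
            if out is None or written == chunk_size:
                if out is not None:
                    out.close()
                out = open(f"{prefix}_{part:05d}.txt", "w")
                part += 1
                written = 0
            out.write(line)
            written += 1
    if out is not None:
        out.close()

# split_file("big_dataset.csv", chunk_size=100_000)  # hypothetical input file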
My code below is a very simple illustration. Before each iteration, re-shuffling the order of the samples and re-splitting the data set will help the classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: 100 points in 2D, label 1 when the first feature exceeds the second.
X = np.random.random((100, 2))
y = np.array([1 if x[0] > x[1] else 0 for x in X])

lr_cly = LogisticRegression()

def stop_train(cly, X_s, y_s, threshold):
    # Stop once the mean accuracy over all chunks exceeds the threshold.
    scores = [cly.score(X_i, y_i) for X_i, y_i in zip(X_s, y_s)]
    return np.mean(scores) > threshold

def iter_train(cly, X, y, threshold=0.99, max_iter=10):
    # Simulate the on-disk files by splitting the data into two chunks.
    X_s = [X[:50, :], X[50:, :]]
    y_s = [y[:50], y[50:]]
    for _ in range(max_iter):
        print("--------------")
        for X_i, y_i in zip(X_s, y_s):
            cly.fit(X_i, y_i)          # refit on one chunk at a time
            print(cly.score(X_i, y_i))
        if stop_train(cly, X_s, y_s, threshold):
            break

iter_train(lr_cly, X, y)
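A note not in the original answer: estimators that implement scikit-learn's partial_fit (for example SGDClassifier or MultinomialNB) can be updated one chunk at a time instead of being refit from scratch, which is the usual out-of-core pattern. A rough sketch under that assumption; chunk_files and load_chunk are hypothetical stand-ins for however the chunks were written to disk:

import numpy as np
from sklearn.linear_model import SGDClassifier

def load_chunk(path):
    # Hypothetical loader: assumes each chunk is a numeric CSV whose last column is the label.
    data = np.loadtxt(path, delimiter=",")
    return data[:, :-1], data[:, -1]

chunk_files = ["chunk_00000.txt", "chunk_00001.txt"]  # placeholder list of chunk files
clf = SGDClassifier()                 # linear model trained with SGD (hinge loss by default)
classes = np.array([0, 1])            # full label set, required on the first partial_fit call

for epoch in range(5):                # a few passes over all chunks
    for path in chunk_files:
        X_chunk, y_chunk = load_chunk(path)
        clf.partial_fit(X_chunk, y_chunk, classes=classes)

Shuffling the order of the chunk files between epochs, as the answer suggests for the samples, tends to help SGD-style training.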
answered Mar 27 at 11:12 by Happy Boy; edited Mar 27 at 14:53 by Glorfindel