What are the ways to partition a large file that does not fit into memory so it can later be fed as training data?
Is there any way, other than using Spark or Hadoop, to partition a large file that does not fit into memory so it can be fed as training data?

machine-learning bigdata

asked Mar 26 at 21:02 by edunlimit
– honar.cs (Mar 27 at 6:55): What is the size of your data, and how much memory does your computer have?
– edunlimit (Mar 27 at 23:06): @honar.cs Oh, I'm not using actual data. I was just curious.
1 Answer
Yes, of course, although it is usually not worth the effort, because Spark and Hadoop already handle this well.

Here is the idea. Suppose your memory can hold 100,000 examples. Split your data set into files of fewer than 100,000 examples each.

The key (and most complex) step is how to train a classifier on those files. For gradient-descent-style optimization (batch gradient descent, SGD and so on), most model families (SVM, GBDT, naive Bayes, logistic regression, deep learning and so on) support it: load one file into RAM at a time and feed it to the classifier until you find good parameters.
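(For reference, a minimal sketch of the splitting step itself, streaming the source file line by line so it never has to fit in memory. The file names and chunk size are placeholders, not part of the original answer.)

def split_file(path, chunk_size=100_000, prefix="chunk"):
    # Split `path` into files of at most `chunk_size` lines each,
    # reading one line at a time so the whole file is never in memory.
    # (A CSV header line, if present, would need separate handling.)
    out, part, written = None, 0, 0
    with open(path) as src:
        for line in src:
            if out is None or written == chunk_size:
                if out is not None:
                    out.close()
                out = open(f"{prefix}_{part:05d}.txt", "w")
                part += 1
                written = 0
            out.write(line)
            written += 1
    if out is not None:
        out.close()

# split_file("big_dataset.csv", chunk_size=100_000)  # hypothetical input file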
My code below is a very simple illustration. Before each iteration, re-shuffling the order of the samples and re-splitting the data set will help the classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: 100 points in 2D, label 1 when the first feature exceeds the second.
X = np.random.random((100, 2))
y = np.array([1 if x[0] > x[1] else 0 for x in X])

lr_cly = LogisticRegression()

def stop_train(cly, X_s, y_s, threshold):
    # Stop once the mean accuracy over all chunks exceeds the threshold.
    scores = [cly.score(X_i, y_i) for X_i, y_i in zip(X_s, y_s)]
    return np.mean(scores) > threshold

def iter_train(cly, X, y, threshold=0.99, max_iter=10):
    # Simulate the on-disk files by splitting the data into two chunks.
    X_s = [X[:50, :], X[50:, :]]
    y_s = [y[:50], y[50:]]
    for _ in range(max_iter):
        print("--------------")
        for X_i, y_i in zip(X_s, y_s):
            cly.fit(X_i, y_i)          # refit on one chunk at a time
            print(cly.score(X_i, y_i))
        if stop_train(cly, X_s, y_s, threshold):
            break

iter_train(lr_cly, X, y)
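A note not in the original answer: estimators that implement scikit-learn's partial_fit (for example SGDClassifier or MultinomialNB) can be updated one chunk at a time instead of being refit from scratch, which is the usual out-of-core pattern. A rough sketch under that assumption; chunk_files and load_chunk are hypothetical stand-ins for however the chunks were written to disk:

import numpy as np
from sklearn.linear_model import SGDClassifier

def load_chunk(path):
    # Hypothetical loader: assumes each chunk is a numeric CSV whose last column is the label.
    data = np.loadtxt(path, delimiter=",")
    return data[:, :-1], data[:, -1]

chunk_files = ["chunk_00000.txt", "chunk_00001.txt"]  # placeholder list of chunk files
clf = SGDClassifier()                 # linear model trained with SGD (hinge loss by default)
classes = np.array([0, 1])            # full label set, required on the first partial_fit call

for epoch in range(5):                # a few passes over all chunks
    for path in chunk_files:
        X_chunk, y_chunk = load_chunk(path)
        clf.partial_fit(X_chunk, y_chunk, classes=classes)

Shuffling the order of the chunk files between epochs, as the answer suggests for the samples, tends to help SGD-style training.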
answered Mar 27 at 11:12 by Happy Boy; edited Mar 27 at 14:53 by Glorfindel