Replacing matching entries in one column of a file by another column from a different file
I have two tab-separated files which look as follows:
file1:
NC_008146.1 WP_011558474.1 1155234 1156286 44173
NC_008146.1 WP_011558475.1 1156298 1156807 12
NC_008146.1 WP_011558476.1 1156804 1157820 -3
NC_008705.1 WP_011558474.1 1159543 1160595 42748
NC_008705.1 WP_011558475.1 1160607 1161116 12
NC_008705.1 WP_011558476.1 1161113 1162129 -3
NC_009077.1 WP_011559727.1 2481079 2481633 8
NC_009077.1 WP_011854835.1 1163068 1164120 42559
NC_009077.1 WP_011854836.1 1164127 1164636 7
file2:
NC_008146.1 GCF_000014165.1_ASM1416v1_protein.faa
NC_008705.1 GCF_000015405.1_ASM1540v1_protein.faa
NC_009077.1 GCF_000016005.1_ASM1600v1_protein.faa
I want to match column 1 of file1 against column 1 of file2 and replace it with the corresponding column 2 entry of file2.
The output would look like this:
GCF_000014165.1_ASM1416v1_protein.faa WP_011558474.1 1155234 1156286 44173
GCF_000014165.1_ASM1416v1_protein.faa WP_011558475.1 1156298 1156807 12
GCF_000014165.1_ASM1416v1_protein.faa WP_011558476.1 1156804 1157820 -3
GCF_000015405.1_ASM1540v1_protein.faa WP_011558474.1 1159543 1160595 42748
GCF_000015405.1_ASM1540v1_protein.faa WP_011558475.1 1160607 1161116 12
GCF_000015405.1_ASM1540v1_protein.faa WP_011558476.1 1161113 1162129 -3
GCF_000016005.1_ASM1600v1_protein.faa WP_011559727.1 2481079 2481633 8
GCF_000016005.1_ASM1600v1_protein.faa WP_011854835.1 1163068 1164120 42559
GCF_000016005.1_ASM1600v1_protein.faa WP_011854836.1 1164127 1164636 7
awk
It looks like you might also be interested in our sister site: Bioinformatics.
– terdon♦
Apr 5 at 12:54
Thank you for the link @terdon!
– BhushanDhamale
Apr 5 at 12:57
edited Apr 5 at 12:56 by Rui F Ribeiro
asked Apr 5 at 12:32 by BhushanDhamale
3 Answers
You can do this very easily with awk:
$ awk 'NR==FNR{a[$1]=$2; next}{$1=a[$1]; print}' file2 file1
GCF_000014165.1_ASM1416v1_protein.faa WP_011558474.1 1155234 1156286 44173
GCF_000014165.1_ASM1416v1_protein.faa WP_011558475.1 1156298 1156807 12
GCF_000014165.1_ASM1416v1_protein.faa WP_011558476.1 1156804 1157820 -3
GCF_000015405.1_ASM1540v1_protein.faa WP_011558474.1 1159543 1160595 42748
GCF_000015405.1_ASM1540v1_protein.faa WP_011558475.1 1160607 1161116 12
GCF_000015405.1_ASM1540v1_protein.faa WP_011558476.1 1161113 1162129 -3
GCF_000016005.1_ASM1600v1_protein.faa WP_011559727.1 2481079 2481633 8
GCF_000016005.1_ASM1600v1_protein.faa WP_011854835.1 1163068 1164120 42559
GCF_000016005.1_ASM1600v1_protein.faa WP_011854836.1 1164127 1164636 7
Or, since that looks like a tab-separated file:
$ awk -vOFS="\t" 'NR==FNR{a[$1]=$2; next}{$1=a[$1]; print}' file2 file1
GCF_000014165.1_ASM1416v1_protein.faa WP_011558474.1 1155234 1156286 44173
GCF_000014165.1_ASM1416v1_protein.faa WP_011558475.1 1156298 1156807 12
GCF_000014165.1_ASM1416v1_protein.faa WP_011558476.1 1156804 1157820 -3
GCF_000015405.1_ASM1540v1_protein.faa WP_011558474.1 1159543 1160595 42748
GCF_000015405.1_ASM1540v1_protein.faa WP_011558475.1 1160607 1161116 12
GCF_000015405.1_ASM1540v1_protein.faa WP_011558476.1 1161113 1162129 -3
GCF_000016005.1_ASM1600v1_protein.faa WP_011559727.1 2481079 2481633 8
GCF_000016005.1_ASM1600v1_protein.faa WP_011854835.1 1163068 1164120 42559
GCF_000016005.1_ASM1600v1_protein.faa WP_011854836.1 1164127 1164636 7
This assumes that every RefSeq (NC_*) id in file1 has a corresponding entry in file2.
Explanation
NR==FNR: NR is the line number counted across all input files, FNR is the line number within the current file. The two will be identical only while the 1st file (here, file2) is being read.
a[$1]=$2; next: if this is the first file (see above), save the 2nd field in an array whose key is the 1st field. Then, move on to the next line. This ensures the next block isn't executed for the 1st file.
$1=a[$1]; print: now, in the second file, set the 1st field to whatever value was saved in the array a for the 1st field (so, the associated value from file2) and print the resulting line.
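If some ids in file1 might not have an entry in file2, one possible variation is to leave column 1 untouched for those lines instead of blanking it. This is only a sketch and not part of the answer above: the function name replace_col1, the temporary file names, and the second sample row (NC_999999.1) are made up for illustration.

```shell
# Sample inputs in a temporary directory (file names are hypothetical)
tmp=$(mktemp -d)
printf 'NC_008146.1\tGCF_000014165.1_ASM1416v1_protein.faa\n' > "$tmp/map.tsv"
printf 'NC_008146.1\tWP_011558474.1\t1155234\t1156286\t44173\nNC_999999.1\tWP_000000000.1\t1\t2\t3\n' > "$tmp/data.tsv"

# replace_col1 MAPFILE DATAFILE
# Replace column 1 of DATAFILE only when the key exists in MAPFILE.
replace_col1() {
    awk 'BEGIN { FS = OFS = "\t" }
         NR == FNR { a[$1] = $2; next }   # first file: remember key -> replacement
         $1 in a   { $1 = a[$1] }         # second file: swap only known keys
         { print }' "$1" "$2"
}

result=$(replace_col1 "$tmp/map.tsv" "$tmp/data.tsv")
printf '%s\n' "$result"
rm -rf "$tmp"
```

The "$1 in a" test checks for the key without creating an empty array entry, so lines with unknown ids pass through unchanged.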
NR == FNR doesn't work correctly when the first file is empty. See this and the associated answer for a workaround
– iruvar
Apr 5 at 12:44
@iruvar nothing will work well if the first file is empty, so I don't really see why that's relevant. The entire point here is to combine the data from the two files. If either file is empty, the whole exercise is pointless.
– terdon♦
Apr 5 at 12:45
Sorry, I should have said that in this particular case file2 and not file1 is empty. Sane behaviour when file2 is empty is to report the contents of file1. The problem with NR == FNR is that code associated with it executes on the contents of file1 when file2 is empty
– iruvar
Apr 5 at 12:51
@iruvar there is no sane behavior here if either file is empty. That's what I'm saying :) So trying to make it deal with that case gracefully is pointless. And, in any case, when either file is empty here, nothing is printed. Which actually seems like the sanest approach, I'd rather get no data than wrong data.
– terdon♦
Apr 5 at 12:54
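To make the empty-file case from these comments concrete, here is a small sketch. It compares the NR==FNR form with one commonly used alternative, testing FILENAME against ARGV[1]. That alternative is an assumption on my part and not necessarily the workaround the comment linked to; the temporary file names are made up, and the guarded version also keeps unmatched lines, which differs from the answer's command.

```shell
tmp=$(mktemp -d)
: > "$tmp/empty_map.tsv"     # empty lookup file, standing in for an empty file2
printf 'NC_008146.1\tWP_011558474.1\t1155234\t1156286\t44173\n' > "$tmp/data.tsv"

# NR==FNR form: with an empty first file, the condition is true for the
# second file, so its lines are stored in the array and nothing is printed.
plain=$(awk 'NR==FNR{a[$1]=$2; next}{$1=a[$1]; print}' "$tmp/empty_map.tsv" "$tmp/data.tsv")

# FILENAME form: identify the lookup file by name rather than by line count,
# and only replace column 1 when a replacement is known.
guarded=$(awk 'BEGIN { FS = OFS = "\t" }
     FILENAME == ARGV[1] { a[$1] = $2; next }
     $1 in a { $1 = a[$1] }
     { print }' "$tmp/empty_map.tsv" "$tmp/data.tsv")

printf 'plain:   [%s]\n' "$plain"
printf 'guarded: [%s]\n' "$guarded"
rm -rf "$tmp"
```

Whether passing the data through unchanged or printing nothing is the better outcome is exactly the disagreement in the comments; the sketch only shows how each behaviour comes about.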
No need for awk. Assuming the files are sorted, you can use coreutils join:
join -o '2.2 1.2 1.3 1.4 1.5' file1 file2
Output:
GCF_000014165.1_ASM1416v1_protein.faa WP_011558474.1 1155234 1156286 44173
GCF_000014165.1_ASM1416v1_protein.faa WP_011558475.1 1156298 1156807 12
GCF_000014165.1_ASM1416v1_protein.faa WP_011558476.1 1156804 1157820 -3
GCF_000015405.1_ASM1540v1_protein.faa WP_011558474.1 1159543 1160595 42748
GCF_000015405.1_ASM1540v1_protein.faa WP_011558475.1 1160607 1161116 12
GCF_000015405.1_ASM1540v1_protein.faa WP_011558476.1 1161113 1162129 -3
GCF_000016005.1_ASM1600v1_protein.faa WP_011559727.1 2481079 2481633 8
GCF_000016005.1_ASM1600v1_protein.faa WP_011854835.1 1163068 1164120 42559
GCF_000016005.1_ASM1600v1_protein.faa WP_011854836.1 1164127 1164636 7
If your files aren't sorted, you can either sort them first (sort file1 > file1.sorted; sort file2 > file2.sorted) and then run the command above on the sorted files, or, if your shell supports the <() construct (bash does), you can do:
join -o '2.2 1.2 1.3 1.4 1.5' <(sort file1) <(sort file2)
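Since the inputs are tab-separated, it may be worth telling sort and join about the delimiter so the output stays tab-separated too. The following is a sketch under assumptions, not part of the answer above: it uses temporary files instead of <() so it also runs in a plain POSIX shell, the two sample rows are taken from the question in deliberately unsorted order, and LC_ALL=C is set so that sort and join agree on collation.

```shell
tmp=$(mktemp -d)
tab=$(printf '\t')

# Two rows from the question, deliberately out of order
printf 'NC_008705.1\tWP_011558475.1\t1160607\t1161116\t12\nNC_008146.1\tWP_011558474.1\t1155234\t1156286\t44173\n' > "$tmp/file1"
printf 'NC_008146.1\tGCF_000014165.1_ASM1416v1_protein.faa\nNC_008705.1\tGCF_000015405.1_ASM1540v1_protein.faa\n' > "$tmp/file2"

# Sort both files on the join field (column 1), using tab as the separator
LC_ALL=C sort -t "$tab" -k1,1 "$tmp/file1" > "$tmp/file1.sorted"
LC_ALL=C sort -t "$tab" -k1,1 "$tmp/file2" > "$tmp/file2.sorted"

# -t sets the field separator for both input and output
joined=$(LC_ALL=C join -t "$tab" -o '2.2 1.2 1.3 1.4 1.5' "$tmp/file1.sorted" "$tmp/file2.sorted")
printf '%s\n' "$joined"
rm -rf "$tmp"
```

Note that the output follows the sorted key order rather than the original line order of file1, and that lines of file1 without a match in file2 are dropped by default.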
Tested with the command below, and it worked fine:
cp f1 f3; for i in `awk '{print $1}' f2`; do k=`awk -v i="$i" '$1==i {print $2}' f2`; sed "s/$i/$k/g" f3 > f3.tmp && mv f3.tmp f3; done; cat f3
Output:
GCF_000014165.1_ASM1416v1_protein.faa WP_011558474.1 1155234 1156286 44173
GCF_000014165.1_ASM1416v1_protein.faa WP_011558475.1 1156298 1156807 12
GCF_000014165.1_ASM1416v1_protein.faa WP_011558476.1 1156804 1157820 -3
GCF_000015405.1_ASM1540v1_protein.faa WP_011558474.1 1159543 1160595 42748
GCF_000015405.1_ASM1540v1_protein.faa WP_011558475.1 1160607 1161116 12
GCF_000015405.1_ASM1540v1_protein.faa WP_011558476.1 1161113 1162129 -3
GCF_000016005.1_ASM1600v1_protein.faa WP_011559727.1 2481079 2481633 8
GCF_000016005.1_ASM1600v1_protein.faa WP_011854835.1 1163068 1164120 42559
GCF_000016005.1_ASM1600v1_protein.faa WP_011854836.1 1164127 1164636 7
edited Apr 5 at 12:50
answered Apr 5 at 12:38 by terdon♦
1
NR == FNR
doesn't work correctly when the first file is empty. See this and the associated answer for a workaround
– iruvar
Apr 5 at 12:44
2
@iruvar nothing will work well if the first file is empty, so I don't really see why that's relevant. The entire point here is to combine the data from the two files. If either file is empty, the whole exercise is pointless.
– terdon♦
Apr 5 at 12:45
sorry I should have said in this particular casefile2
and notfile1
is empty. Sane behaviour whenfile2
is empty is to report the contents offile1
. The problem withNR == FNR
is that code associated with it executes on the contents offile1
whenfile2
is empty
– iruvar
Apr 5 at 12:51
2
@iruvar there is no sane behavior here if either file is empty. That's what I'm saying :) So trying to make it deal with that case gracefully is pointless. And, in any case, when either file is empty here, nothing is printed. Which actually seems like the sanest approach, I'd rather get no data than wrong data.
– terdon♦
Apr 5 at 12:54
add a comment |
1
NR == FNR
doesn't work correctly when the first file is empty. See this and the associated answer for a workaround
– iruvar
Apr 5 at 12:44
2
@iruvar nothing will work well if the first file is empty, so I don't really see why that's relevant. The entire point here is to combine the data from the two files. If either file is empty, the whole exercise is pointless.
– terdon♦
Apr 5 at 12:45
sorry I should have said in this particular casefile2
and notfile1
is empty. Sane behaviour whenfile2
is empty is to report the contents offile1
. The problem withNR == FNR
is that code associated with it executes on the contents offile1
whenfile2
is empty
– iruvar
Apr 5 at 12:51
2
@iruvar there is no sane behavior here if either file is empty. That's what I'm saying :) So trying to make it deal with that case gracefully is pointless. And, in any case, when either file is empty here, nothing is printed. Which actually seems like the sanest approach, I'd rather get no data than wrong data.
– terdon♦
Apr 5 at 12:54
1
1
NR == FNR
doesn't work correctly when the first file is empty. See this and the associated answer for a workaround– iruvar
Apr 5 at 12:44
NR == FNR
doesn't work correctly when the first file is empty. See this and the associated answer for a workaround– iruvar
Apr 5 at 12:44
2
2
@iruvar nothing will work well if the first file is empty, so I don't really see why that's relevant. The entire point here is to combine the data from the two files. If either file is empty, the whole exercise is pointless.
– terdon♦
Apr 5 at 12:45
@iruvar nothing will work well if the first file is empty, so I don't really see why that's relevant. The entire point here is to combine the data from the two files. If either file is empty, the whole exercise is pointless.
– terdon♦
Apr 5 at 12:45
sorry I should have said in this particular case
file2
and not file1
is empty. Sane behaviour when file2
is empty is to report the contents of file1
. The problem with NR == FNR
is that code associated with it executes on the contents of file1
when file2
is empty– iruvar
Apr 5 at 12:51
No need for awk: assuming both files are sorted on their first column, you can use coreutils join:
join -o '2.2 1.2 1.3 1.4 1.5' file1 file2
Output:
GCF_000014165.1_ASM1416v1_protein.faa WP_011558474.1 1155234 1156286 44173
GCF_000014165.1_ASM1416v1_protein.faa WP_011558475.1 1156298 1156807 12
GCF_000014165.1_ASM1416v1_protein.faa WP_011558476.1 1156804 1157820 -3
GCF_000015405.1_ASM1540v1_protein.faa WP_011558474.1 1159543 1160595 42748
GCF_000015405.1_ASM1540v1_protein.faa WP_011558475.1 1160607 1161116 12
GCF_000015405.1_ASM1540v1_protein.faa WP_011558476.1 1161113 1162129 -3
GCF_000016005.1_ASM1600v1_protein.faa WP_011559727.1 2481079 2481633 8
GCF_000016005.1_ASM1600v1_protein.faa WP_011854835.1 1163068 1164120 42559
GCF_000016005.1_ASM1600v1_protein.faa WP_011854836.1 1164127 1164636 7
If your files aren't sorted, you can either sort them first (sort file1 > file1.sorted; sort file2 > file2.sorted) and then run the command above on the sorted copies, or, if your shell supports the <() construct (bash does), you can do:
join -o '2.2 1.2 1.3 1.4 1.5' <(sort file1) <(sort file2)
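As a rough illustration of what the -o list selects, here is a small sketch that avoids <() so it also runs in a plain POSIX sh. It assumes the data file has five whitespace-separated columns, the map file has two, and column 1 is the shared key in both. The function name join_replace and the temporary file names are hypothetical.

```shell
#!/bin/sh
# Hypothetical wrapper around the join call from the answer.
join_replace() {
    data=$1     # five columns, key in column 1 (file1 in the answer)
    map=$2      # two columns: key, replacement (file2 in the answer)
    tmpdir=$(mktemp -d) || return 1
    # join needs both inputs sorted on the join field; using the same
    # locale for sort and join keeps their ordering consistent.
    LC_ALL=C sort -k1,1 "$data" > "$tmpdir/data.sorted"
    LC_ALL=C sort -k1,1 "$map"  > "$tmpdir/map.sorted"
    # -o takes FILE.FIELD pairs: 2.2 is column 2 of the map file,
    # 1.2 through 1.5 are columns 2 to 5 of the data file.
    LC_ALL=C join -o '2.2 1.2 1.3 1.4 1.5' \
        "$tmpdir/data.sorted" "$tmpdir/map.sorted"
    rm -rf "$tmpdir"
}
```

By default join prints only lines whose key appears in both files, so data rows without an entry in the map file are dropped rather than passed through.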
edited Apr 5 at 13:00 by terdon♦
answered Apr 5 at 12:39 by Thor
Tested with the command below, and it worked fine:
for i in `awk '{print $1}' f2`; do k=`awk -v i="$i" '$1==i {print $2}' f2`; sed "/$i/s/$i/$k/g" f1 >f3; done
Output:
GCF_000014165.1_ASM1416v1_protein.faa WP_011558474.1 1155234 1156286 44173
GCF_000014165.1_ASM1416v1_protein.faa WP_011558475.1 1156298 1156807 12
GCF_000014165.1_ASM1416v1_protein.faa WP_011558476.1 1156804 1157820 -3
GCF_000015405.1_ASM1540v1_protein.faa WP_011558474.1 1159543 1160595 42748
GCF_000015405.1_ASM1540v1_protein.faa WP_011558475.1 1160607 1161116 12
GCF_000015405.1_ASM1540v1_protein.faa WP_011558476.1 1161113 1162129 -3
GCF_000016005.1_ASM1600v1_protein.faa WP_011559727.1 2481079 2481633 8
GCF_000016005.1_ASM1600v1_protein.faa WP_011854835.1 1163068 1164120 42559
GCF_000016005.1_ASM1600v1_protein.faa WP_011854836.1 1164127 1164636 7
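One caveat with the loop as written: every iteration runs sed on the original f1 and overwrites f3, so when f2 holds several keys only the last key's substitution is left in f3. sed also treats the dots in a key as regex wildcards. Below is a minimal sketch of a variant that feeds each pass into the next and compares keys literally. It assumes, as in the join answer, that the key is column 1 of both files and the replacement is column 2 of f2. The function name replace_loop and the .tmp suffix are made up for illustration.

```shell
#!/bin/sh
# Hypothetical rewrite of the loop: apply every key from the map in turn.
replace_loop() {
    src=$1      # file whose first column is rewritten (f1 in the answer)
    map=$2      # two columns: key, replacement (f2 in the answer)
    out=$3      # result file (f3 in the answer)
    cp "$src" "$out" || return 1
    while read -r key val rest; do
        [ -n "$key" ] || continue          # skip blank lines in the map
        # Compare column 1 with the key as plain values, no regex,
        # and write the result back so the next pass builds on it.
        awk -v k="$key" -v v="$val" '$1 == k { $1 = v } { print }' \
            "$out" > "$out.tmp" && mv "$out.tmp" "$out"
    done < "$map"
}
```

Usage would be replace_loop f1 f2 f3. Because the passes are chained, a replacement value that is itself a key in f2 would be replaced again on a later pass; a single awk pass or join does not have that problem.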
answered Apr 7 at 13:47 by Praveen Kumar BS
It looks like you might also be interested in our sister site: Bioinformatics.
– terdon♦
Apr 5 at 12:54
Thank you for the link @terdon!
– BhushanDhamale
Apr 5 at 12:57