Terminal Lesson 25
Hey guys, this is Maads 101, and today I'm going to be talking about three terminal commands: sort, cut, and xargs. These commands are really great for dealing with large lists of file names or strings, so I think you'll be glad that you learned them.
What I'm going to do is show each command in isolation, give some examples of its usage, stuff like that. Then I'm going to move on to give some more in-depth examples of how you could actually use these commands together because they play really nicely together, and I want to give some examples of that.
But first, let's get started with the sort command. So, I have this file on my desktop here, and it contains a couple of lines of text. We could also read it through the terminal with cat, and what I want to do is sort it alphabetically.
To do this, we can just pipe the contents of lines.txt into the sort command. So we just do cat lines.txt, the vertical bar for pipe, and then sort. This outputs the sorted file. You can see "another sentence" comes first; it starts with "A". "This is a sentence" comes last; it starts with "T", so it's later in the alphabet.
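If you want to try this yourself, here's a tiny runnable sketch; the contents of lines.txt are made up here as a stand-in for the file in the video:

```shell
# Recreate a small lines.txt (hypothetical contents, just for illustration).
printf 'This is a sentence\nAnother sentence\nCandy is tasty\n' > lines.txt

# Pipe it into sort: the lines come back in ascending alphabetical order.
cat lines.txt | sort
# "Another sentence" prints first and "This is a sentence" prints last.
```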
Already, this is pretty useful, you know? If we wanted to save the sorted lines into another file, we could just redirect the output into lines_sorted.txt. If I drag it over to this display, you can see it actually contains the sorted version. So already, this is pretty useful; you could see how you might apply this.
But there's a lot more that the sort command can do, so I'm just going to go ahead and delete lines_sorted.txt. Something else we can do is reverse the order of everything that's been sorted. Maybe I wanted it in descending alphabetical order instead of ascending alphabetical order. So I can just pass the -r flag, and that'll do it. You can see it's in the opposite order from before.
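As a quick sketch of the -r flag, with some made-up input:

```shell
# -r reverses the comparison, giving descending alphabetical order.
printf 'apple\ncherry\nbanana\n' | sort -r
# → cherry, banana, apple
```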
Another thing that sort is actually capable of doing, which you might not expect, is removing duplicates. You can see "this is a sentence" appears twice, and maybe I want to remove one of those instances. Well, when a list is sorted, it's really easy to remove duplicates because the duplicates are all right next to each other; you can even remove them visually. Sort just has that built in: you can do sort -u for unique, and it will remove duplicates. You can see "this is a sentence" only appears one time. We could, of course, combine that with -r and get it in the opposite order.
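Here's the same idea as a runnable sketch, again with made-up input lines:

```shell
# -u keeps only one copy of each repeated line.
printf 'banana\napple\nbanana\n' | sort -u
# → apple, banana

# Combined with -r for descending order:
printf 'banana\napple\nbanana\n' | sort -ur
# → banana, apple
```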
So that's something else you can do with sort, and these have just been two really simple flags. There's one last thing that I think is worth knowing how to use the sort command for. If we look at lines.txt, you'll notice that not only is a line completely repeated, as we saw with -u, but there are actually three lines that start with "candy is". These three lines right here, and when we're removing duplicates, these three lines are counted as different lines because, you know, after "candy is", they're different.
But maybe in some cases when we're removing duplicates, we only want to look at the first two words, and so we would count "candy is tasty" and "candy is bad for you" as duplicates. This will actually have a practical application that I'll show you later in this video.
So, how do we get sort to look at just the first two words of each line, without looking at the rest, when removing duplicates? Well, all we have to do is our old -u, and we're just going to add -k 1,2. This tells sort that when it's sorting or removing duplicates, it should start at the first word and read up through the second word. In this case, it's kind of silly because they're right next to each other, but you can imagine doing something like -k 2,4, which would start at the second word and read through the third and fourth words. But in this case, we're doing 1,2, so sort will just look at the first two words when removing duplicates.
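Here's a sketch of that, assuming GNU or BSD sort, where -u judges uniqueness by the -k key (fields are blank-separated by default); the input lines are made up:

```shell
# With -k 1,2 the sort key runs from field 1 through field 2, so -u
# treats lines with the same first two words as duplicates.
printf 'candy is tasty\ncandy is bad for you\nanother sentence\n' \
  | sort -u -k 1,2
# Only one "candy is ..." line survives, plus "another sentence".
```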
So, there's only one "candy is" line now, and the other two were removed. But this only works if the words you're interested in are separated by spaces. You know, it only knows what a word is because I used spaces; "candy space is space tasty", right? But if I have a different file, and instead of spaces, there are underscores everywhere, well then this command is not going to work. You can see there are still three "candy is" lines because it thinks that this is all one word; it doesn't know that underscores are a magical character for separating words.
So, to tell it what the delimiter is, we can add -t followed by the delimiter, in this case an underscore. That just tells it words are separated by underscores, and we still have this -k thing telling it to look at the first two words only. Now you can see it did what it did before for the spaces.
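A quick sketch of -t with underscore-separated words (made-up input again):

```shell
# -t sets the field separator, so -k counts underscore-separated words.
printf 'candy_is_tasty\ncandy_is_bad_for_you\nanother_sentence\n' \
  | sort -t '_' -u -k 1,2
# Again only one "candy_is_..." line survives.
```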
So, this has just been a really quick overview of what you can do with sort. It is by far not exhaustive; I didn't go over all the options. In fact, if you want to sort by something other than alphabetical order, say by date or numerically, there are options for that. You can find more information if you do man sort and scroll around; you can see all the options there.
Now I'd like to move on to the cut command. This command basically lets you extract a certain part of every line in standard input. Say I had a file with a bunch of sentences, and I wanted a new file with just the second word of each of those sentences. Well, we could use the cut command for that; we would just cut out the second word.
So I'm just going to give a couple of examples, and I think they'll make it pretty clear how this command works and what you can do with it. Let's go ahead and review what's in our lines.txt file. Say we just want to get the first five characters from each line, and that's all we want to output. Well, we can pipe that into cut -c for character, followed by the range of characters we want, in this case 1-5: start at the first character and go all the way up to the fifth character. This will do what we want; it'll give us the first five characters of every single line in the standard input.
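As a runnable sketch, with made-up lines:

```shell
# -c 1-5 keeps characters one through five of every line.
printf 'candy is tasty\nanother sentence\n' | cut -c 1-5
# → "candy" and "anoth"
```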
Now, I want to talk a little bit about this 1-5 we gave for the range. It's a little different from the sort -k flag, which took 1,2, something like that. Ranges are actually a little interesting. In this case, I've given a start and an end index, but I could also do something like 2-5. And I can actually leave out either end. I could say 2-, which basically means start at character 2 and read all the way to the end of the line, so we'd get the entire line minus the first character. I can do it the other way, -5, which says read up to character five starting at the beginning of the line; that's, of course, the same as just doing 1-5. And I can also just specify a single character: if I want the third letter of every line, cut -c 3 would do.
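All the range forms in one runnable sketch:

```shell
printf 'abcdef\n' | cut -c 2-   # from character 2 to the end → bcdef
printf 'abcdef\n' | cut -c -5   # from the start up to 5      → abcde
printf 'abcdef\n' | cut -c 3    # just character 3            → c
```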
So, that's the idea of a range: it can be a number, a dash, and another number (1-5); just a dash and a number (-5); a number and a dash (2-); or just a single number. Now, nothing is complete unless we can deal with words. So once again, maybe we just want the second word of each of these lines, something like that.
Well, unlike the sort command, cut doesn't automatically assume that words are separated by spaces. So we use -f to specify which field we want; it stands for field, and you can just think of a field as a word, so -f 2 tells it to read the second word. Now, let's think about what our delimiter is. In this case, words are separated by spaces, so we do -d followed by the delimiter. We can't just type a bare space, because in the shell a space just separates arguments; we have to put it in quotes to indicate that we really want the space there. Now you can see, wow, I didn't actually realize almost all of these lines have "is" as the second word, except for one of them. But that's pretty cool; we get the second word.
We could also get the first two words, or everything from the second word onward, or 2-3, something like that. So this has been how to use cut with characters or fields. There are a lot of other options; for instance, you can use -b for bytes instead of characters. You can, of course, have a look at the man page and read over all the options.
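The field variants in one runnable sketch, with a made-up line:

```shell
line='candy is tasty'
echo "$line" | cut -d ' ' -f 2     # second word only      → is
echo "$line" | cut -d ' ' -f 1-2   # first two words       → candy is
echo "$line" | cut -d ' ' -f 2-    # second word onward    → is tasty
```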
So last but not least, I'm going to be talking about the xargs command. This command basically lets you take all of the lines or words from standard input and pass them as arguments to some other command. To give a really simple example, I've created a folder on my desktop called touched_file, and what I want to do is create an empty file in here for every line in lines.txt. I want a file called "This is a sentence", one called "Another sentence", "Candy is tasty", "Candy is bad for you"; a file with each of those names. I could do this manually by typing touch, a space, and then the first file name, the second file name, and so on. You can see that would create all the files; I'll just delete them.
But using xargs, it's a little easier. I can cat the file and pipe that into xargs touch, and what this will do is run the touch command and pass all of the lines from lines.txt to it as arguments. So I just run that, and you can see it created one file for each of the lines in lines.txt.
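A small runnable sketch of the same idea. One caveat worth knowing: plain xargs splits its input on any whitespace, so the underscored names below sidestep that; names containing spaces (like the ones in the video) need something like GNU xargs -d '\n' or the -I flag shown later.

```shell
cd "$(mktemp -d)"           # work in a scratch directory
# Feed three made-up names to touch via xargs.
printf 'file_one\nfile_two\nfile_three\n' | xargs touch
ls                          # all three files now exist
```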
Anyway, this is the most basic way to use xargs: just a way to get a bunch of lines to be arguments to some command like touch. But we can actually use xargs in cases where you might not have expected to need anything fancy.
So, to illustrate why xargs is actually extremely useful in a lot of cases, I've created a folder on my desktop here with about 40,000 files in it. I'm not going to open it with Finder, because Finder doesn't like that kind of thing, but I'll go ahead and cd into it and run an ls. This folder contains a bunch of files whose names start with "A" and a bunch whose names start with "Z", plus a couple of others starting with "W" or "Y", something like that.
So it's mostly these two categories, starting with "A" or starting with "Z". Let's say I want to do something pretty basic: I want to delete all the files whose names start with the letter "A". Normally you would do something like rm A*, which just means remove all the files that match the pattern "A" followed by anything else.
The problem is that when you run a command like this, your shell (in this case, I'm using Bash) will expand A* into a list of all the files that match that pattern, in this case tens of thousands of files, and it'll pass all of those files as arguments to rm. So rm will get tens of thousands of arguments, and the problem is that UNIX and Linux actually have a limit on the number of arguments a command can get.
So if I run this, it says "argument list too long." I can't delete all the files whose names start with "A" like this, because it just requires too many arguments. So I need some alternative way to do it, and the answer, of course, is going to be xargs, because that's what I'm talking about.
We can get a list of all the files whose names start with "A" pretty easily: we do find . -name 'A*', and you can see that prints out all the files we want. In fact, we can count how many there are by piping it into wc -l. So, 19,000; that's about half of all the files in this folder.
So we get this list of files, and now it's the perfect thing for xargs, right? Because xargs can just pass each of these into rm. So we can pipe that into xargs rm, and the nice thing xargs will do is that it won't try to pass all of the names to rm at once, because they don't fit; they can't all be arguments to rm at the same time.
What it'll do is chunk them: it'll break them up into smaller batches of arguments and run rm multiple times, with some subset of the names each time. The result is that we can delete all these files. So if we run this, it's going to take a little while. As you can see here, it shows each command it's running; you can see it's running multiple rm commands, and each one gets multiple arguments.
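Here's a tiny runnable stand-in for that 40,000-file demo:

```shell
cd "$(mktemp -d)"
# A handful of files standing in for the huge folder.
touch A1 A2 A3 Z1 Z2
# xargs batches the names into argument lists under the system limit
# and runs rm as many times as needed.
find . -name 'A*' | xargs rm
ls
# → Z1  Z2
```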
But this is just an idea of how you could delete a massive number of files at once. You could also use it to rename them, something like that. So now, if we do an ls, you can see nothing starts with "A". So that's pretty great.
One thing that's a little tricky is figuring out how to use xargs with commands that you want to pass certain other arguments to. Let's use mv as an example. I have a folder inside this folder called "W". Right now, it's empty, and I want to move all of the files whose names start with "W" into this folder. The naive way would be mv W* W/. Well, I can't do that, because there are too many files whose names start with "W"; we have the argument-list problem again.
But it's not obvious how to do this with xargs as I've just shown you. We can get all the files the same way we did before with find, but how do we pipe this in? We can't just do xargs mv, because that'll pass all these files as arguments to mv, but then how does it know about the "W" directory that we want to move stuff into?
Well, really, when xargs runs something for one of these files, we want it to put the file name here, and then we want "W" to come after it; we want it to put the argument in between mv and "W". And actually, we can pretty much just do this. xargs has a -I flag: we give it a string, and in the command, it'll look for that string and replace it with whatever argument it wants to pass. I could have made this string anything; I could call it "monkey" if I want to, and I'll just change this to "monkey". It basically says replace "monkey" with the argument, and this will move all of the files whose names start with "W" into the "W" directory. If I run it, you can see it's working really hard, and this time it's only passing, it looks like, one file at a time, and it goes between mv and "W".
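A small runnable sketch of the -I trick, using "monkey" as the placeholder just like in the video (with -I, xargs runs the command once per input line):

```shell
cd "$(mktemp -d)"
mkdir W
touch W1 W2 other
# Substitute each file name where "monkey" appears, between mv and W/.
find . -maxdepth 1 -type f -name 'W*' | xargs -I monkey mv monkey W/
ls W
# → W1  W2
```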
So it's working really well, and once it's done, I can just do an ls in "W" so you can see. And if I ls in the main directory now, nothing starts with "W", because it's all been moved. So that was basically just an intro to xargs, and, of course, as always, check out the man page for more information.
So now I just want to move on to a couple of practical applications of the things we just learned. I want to start by making something that goes through this large directory with a bunch of files and tells me how many unique files there are; that is, how many files have no other file in the directory with the same exact contents. This is actually not that easy. For instance, you might have a folder with a million images, and you want to delete all the duplicate images; how do you do that in a reasonable amount of time?
Well, the way we're going to do it is by hashing the files first. So if I hash a file with the md5 command, md5 -r filename, and I'll do a different one, how about this? What this does is basically give a code that uniquely identifies the contents of that file. So if two files have a different code, they're definitely different.
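Here's a runnable sketch of that property. On Linux the equivalent tool is md5sum from coreutils; macOS's md5 -r prints the same "hash then file name" layout (md5sum just uses two spaces between them):

```shell
cd "$(mktemp -d)"
printf 'hello\n' > a.txt
printf 'hello\n' > b.txt   # identical contents to a.txt
printf 'world\n' > c.txt
md5sum a.txt b.txt c.txt
# a.txt and b.txt show the same hash; c.txt's hash differs.
```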
We're going to be using this in conjunction with sort and xargs to count how many unique files there are. So how are we going to do this? Well, first off, we need the MD5 hashes of all the files in here. So I'm just going to run find . -type f, and then I will pipe that into xargs md5 -r.
So this will compute the MD5 hash of all the files in this directory; you know, a unique identifier for each file's contents. We get something like this, and I'm just going to kill it so you can see. The output basically looks like a unique identifier followed by a file name, and if there are any duplicates in this list, they'll have the same unique identifier. I don't see any off the top of my head, but, you know, we would have something else that says "94023" and then some different file name somewhere further up in here. So now we can start using our sort command, and what we want to do is find the entries in this giant output that are unique in their hash. The hash is the first word of each line; the file name is the second word. So we can just pipe this into sort -k 1,1 -u, and now it'll run. It has to wait until the whole pipeline before it finishes, and now, of course, all the hashes are sorted, as you can tell, and there should be no duplicates.
And now, finally, we want to count this, so let's pipe it into wc -l, and this will tell us how many files with unique hashes there are in this directory. It's going to have to run; we got 16,000 unique files. If we just want to count the files in general, we can get rid of the whole MD5-and-sort part. There are actually 18,000 files, so there are a lot of duplicates: about 2,000 duplicate files in this directory. And we found that just by running MD5, sorting the result by hash, removing the duplicates, and counting the output. But now say we want to do something with each of the unique files: we want to run a command for every single file, but we don't want to run it twice if there are, you know, two copies of that file.
What we really want is a list of all the unique files, but without the hashes next to them anymore. If you remember, this right here will have no duplicates, but it'll still output the hash on the left and the file name on the right. To get rid of the hash on the left, we can use our good old pal cut. We can do cut -d ' ' -f 2-, using a delimiter of space because that's what separates the two, and asking for everything from the second field onward. If I run this, I get a list of files within this directory, where none of the files has the same contents as any other file in the list.
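The whole pipeline fits in a few lines. This sketch uses Linux's md5sum; note that md5sum separates the hash and the name with two spaces, so with a single-space delimiter the name starts at field 3 (macOS's md5 -r uses one space, which is why the video uses -f 2-):

```shell
cd "$(mktemp -d)"
printf 'hello\n' > a.txt
printf 'hello\n' > b.txt   # exact duplicate of a.txt
printf 'world\n' > c.txt

# Count unique file contents: hash everything, dedupe by hash, count.
find . -type f | xargs md5sum | sort -k 1,1 -u | wc -l
# → 2

# List one file name per unique contents by cutting off the hash column.
find . -type f | xargs md5sum | sort -k 1,1 -u | cut -d ' ' -f 3-
```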
So now, theoretically, I could pipe this into xargs and do whatever I want with this list of files. This is a really practical use case; I actually use this occasionally to manipulate a large list of files. The three commands xargs, sort, and cut work really well together, and usually find is involved too if you're working with files.
But anyway, I hope this was enlightening; I hope you learned something. Thanks for watching, subscribe, and goodbye!