
File System commands
File System commands can be used to interact directly with HDFS. These commands can also be executed against other File Systems that Hadoop supports, such as WebHDFS, S3, and so on. Let's walk through a few basic, important commands:
- -ls: The ls command lists all the directories and files within a specified path:
hadoop fs -ls /user/packt/
The ls command returns the following information:
File_Permission numberOfReplicas userid groupid filesize last_modification_date last_modification_time filename/directory_name
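For example, a listing might return lines like the following (the names, sizes, and timestamps here are illustrative; directories show - in place of the replica count):
-rw-r--r--   3 packt hadoop  242018304 2018-05-12 10:15 /user/packt/abc.txt
drwxr-xr-x   - packt hadoop          0 2018-05-12 10:20 /user/packt/dir1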
A few options are also available for the ls command, such as sorting the output by size or displaying file sizes in a human-readable format:
hadoop fs -ls -h /user/packt
The -h option displays file sizes in a human-readable format. For example, it would show 230.8 MB or 1.24 GB instead of the file size in bytes. You can check out the other options by using the --help option.
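For example, in recent Hadoop releases, the -S flag sorts the output by file size and -r reverses the sort order, which is handy for finding the largest files in a directory:
hadoop fs -ls -S -h /user/packt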
- -copyFromLocal and -put: Users may want to copy data from a local File System to HDFS, which is possible if you use the -copyFromLocal and -put commands:
hadoop fs -copyFromLocal /home/packt/abc.txt /user/packt
hadoop fs -put /home/packt/abc.txt /user/packt
You can also use these commands to copy data from a local File System to any HDFS supported File System. Both commands accept the -f option, which overwrites the file at the destination if it already exists.
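For example, to overwrite abc.txt at the destination if it already exists:
hadoop fs -put -f /home/packt/abc.txt /user/packt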
- -copyToLocal and -get: These commands are used to copy data from an HDFS supported File System to a local File System:
hadoop fs -copyToLocal /user/packt/abc.txt /home/packt
hadoop fs -get /user/packt/abc.txt /home/packt
- -cp: You can copy data from one HDFS location to another HDFS location by using the -cp command:
hadoop fs -cp /user/packt/path1/file1 /user/packt/path2/
- -du: The du command displays the size of the files and directories contained in the given path. It is a good idea to include the -h option so that you can view the sizes in a human-readable format:
hadoop fs -du -h /user/packt
- -getmerge: The getmerge command takes all the files from a specified source directory and concatenates them into a single file before storing it in a local File System:
hadoop fs -getmerge /user/packt/dir1 /home/sshuser
The -skip-empty-file option can be used with getmerge to skip empty source files.
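Conceptually, getmerge behaves like concatenating a directory's files into one local file. The following plain-shell sketch (it uses a local temporary directory, not HDFS; all paths and file names are illustrative) mimics that behavior:

```shell
# Local analogy of getmerge -- no Hadoop cluster involved; paths are illustrative
mkdir -p /tmp/getmerge_demo
printf 'line1\n' > /tmp/getmerge_demo/part-00000
printf 'line2\n' > /tmp/getmerge_demo/part-00001
: > /tmp/getmerge_demo/part-00002   # an empty part file
# Concatenate every part file into a single local file, as getmerge does
cat /tmp/getmerge_demo/part-* > /tmp/merged.txt
```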
- -mkdir: Users often create directories on HDFS, and the mkdir command serves this purpose. The -p option creates all the missing parent directories along the path. For example, suppose you want to create a directory called /user/packt/dir1/dir2/dir3, but dir1 and dir2 do not exist yet; with the -p option, mkdir will create dir1 and dir2 before creating dir3:
hadoop fs -mkdir -p /user/packt/dir1/dir2/dir3
If you don't use the -p option, then you have to create each level of the path explicitly:
hadoop fs -mkdir /user/packt/dir1 /user/packt/dir1/dir2 /user/packt/dir1/dir2/dir3
- -rm: The rm command is used to remove a file or directory from an HDFS supported File System. By default, deleted files are moved to the trash (when the trash feature is enabled), but if you use the -skipTrash option, the file is deleted immediately and does not go into the trash. The -r option can be used to recursively delete all the files and subdirectories within a specified path:
hadoop fs -rm -r -skipTrash /user/packt/dir1
- -chown: Sometimes, you may want to change the owner of a file or directory; -chown is used for this purpose. Changing the owner requires superuser privileges, while a file's owner can change its group. The -R option can be used to change the ownership of all the files within a specified path recursively (the owner and group names here are illustrative):
hadoop fs -chown -R newowner:newgroup /user/packt/dir1
- -cat: This command is similar to its Linux counterpart; it copies file data to standard output:
hadoop fs -cat /user/packt/dir1/file1
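Because cat writes to standard output, it can be piped into standard Linux tools; for example, to view only the first few lines of a large file:
hadoop fs -cat /user/packt/dir1/file1 | head -n 5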