Mapping Directories


If you are like me, you grew up on the command line, and you love data integrity! To keep track of my backups, which are spread over different discs and places, I wrote a shell script called mapdir that maps out directory structures and file hierarchies. Whenever you copy or back up data to external storage, or transmit it over the network, there is a risk that data gets lost or corrupted, or that sectors of your disc are broken.

The idea behind the script is as follows: run it once on the original file or directory to record its structure and the files it contains. Mapdir saves a report of what it found in the user's home directory ~/ in a file called "mapdir_path_to_folder_mapped_date.txt", where path_to_folder_mapped is the full path of the file/folder mapdir was invoked with (each "/" replaced by "_") and date is the date of the run. If no argument is passed to the script, mapdir examines the current directory by default. After copying your files/directories to another machine or drive, run it once more there, which produces a second report file. The two report files can then be passed to the diff utility to check data integrity. If diff exits with an exit status of 0 (check with echo $?), the directory structures are in sync. If not, some further investigation is necessary.
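
To make the naming scheme concrete, here is a minimal sketch of how a report name can be derived from a path. The function name report_name is made up for this illustration and is not part of mapdir.sh; it only mirrors the script's sed call and its date format.

```shell
#!/bin/sh
# Hypothetical helper for illustration only; mapdir.sh builds the name inline.
# $1 is an absolute path, $2 is a date string in the form MMDDYYYY.
report_name()
{
	# every "/" in the path becomes "_", as in the script's sed call
	printf 'mapdir%s_%s.txt\n' "$(printf '%s' "$1" | sed 's#/#_#g')" "$2"
}

# with today's date, like the script does it
report_name /home/om/testdir "$(date +%m%d%Y)"

# with a fixed date; prints mapdir_home_om_testdir_01112016.txt
report_name /home/om/testdir 01112016
```

Because the leading "/" of the absolute path is replaced as well, the name always starts with "mapdir_".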

As we are primarily interested in regular files, the size and md5 checksum are logged next to the filename and type only for this kind of file. All other common UNIX file types, such as directories, pipes and links, are logged with their filename and type only.
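
As a rough sketch of what such a log line for a regular file involves, consider the following. The function describe_file is hypothetical and not taken from mapdir.sh, and it reads the size with wc -c instead of parsing ls -l, which is simply one possible alternative.

```shell
#!/bin/sh
# Hypothetical helper for illustration only; not part of mapdir.sh.
describe_file()
{
	if [ -f "$1" ] && [ -r "$1" ]; then
		# wc -c prints the byte count; tr removes padding that some wc versions add
		_size=$(wc -c < "$1" | tr -d ' ')
		# md5sum prints "<digest>  <name>"; keep only the digest
		_md5=$(md5sum "$1" | awk '{ print $1 }')
		echo "$1 - regular file - Size: $_size bytes - MD5: $_md5"
	else
		echo "$1 - not a readable regular file"
	fi
}

tmp=$(mktemp)
echo "hello world" > "$tmp"
describe_file "$tmp"
rm -f "$tmp"
```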

So much for the mission statement.

Here goes mapdir.sh:

#!/bin/sh

# mapdir is a simple shell script that maps the structure and filetype of folders.
# If the file is a regular file, then besides mapping it in the folder hierarchy, 
# mapdir also reports its size in bytes and the file's md5 checksum.
# mapdir is Copyright 12.2015/01.2016 by Oliver Mahmoudi

# variables
READLINK=
DATE=$(date +%m%d%Y)		# today
SAVEFILE=
STRLEN=				# for nice output at the end
# variables that are used for the statistics at the end of the program
DIRS=0
UNREADABLE_DIRS=0
FILES=0
UNREADABLE_FILES=0
BLOCK_SPECIALS=0
CHARACTER_SPECIALS=0
PIPES=0
SOCKETS=0
SYMBOLIC_LINKS=0
UNKNOWN=0
SAVEIFS=$IFS			# save the current Internal Field Separator
IFS=$(printf '\n\b')		# and set a new one (printf is more portable than echo -en)
export LC_COLLATE=C
f_flag=0			# flag for getopts

readfile() 
{
	local _dir _contents _entry _file _newpath

	_file=$1

	if [ -b $_file ]; then
		if [ $f_flag -eq 0 ]; then
			echo $_file - block special file
			echo $_file - block special file >> $SAVEFILE
		else
			echo $(get_filename $_file) - block special file
			echo $(get_filename $_file) - block special file >> $SAVEFILE
		fi
		BLOCK_SPECIALS=$((BLOCK_SPECIALS+1))			# gather statistics
	elif [ -c $_file ]; then
		if [ $f_flag -eq 0 ]; then
			echo $_file - character special file
			echo $_file - character special file >> $SAVEFILE
		else
			echo $(get_filename $_file) - character special file
			echo $(get_filename $_file) - character special file >> $SAVEFILE
		fi
		CHARACTER_SPECIALS=$((CHARACTER_SPECIALS+1))		# gather statistics
	elif [ -L $_file ]; then
		if [ $f_flag -eq 0 ]; then
			echo $_file - symbolic link
			echo $_file - symbolic link >> $SAVEFILE
		else
			echo $(get_filename $_file) - symbolic link
			echo $(get_filename $_file) - symbolic link >> $SAVEFILE
		fi
		SYMBOLIC_LINKS=$((SYMBOLIC_LINKS+1))			# gather statistics
	elif [ -d $_file ]; then
		if [ -r $_file ] && [ -x $_file ]; then
			if [ $f_flag -eq 0 ]; then
				echo $_file - directory
				echo $_file - directory >> $SAVEFILE
			else
				if [ $2 -eq 0 ]; then
					echo $(get_filename $_file) - directory
					echo $(get_filename $_file) - directory >> $SAVEFILE
				fi
			fi
			DIRS=$((DIRS+1))				# gather statistics
			_dir=$_file
			cd $_dir
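			_contents=*
			for _entry in $_contents ; do
				_newpath=$_dir/"$_entry"
				# in an empty directory the "*" pattern stays unexpanded; skip it
				if [ ! -e "$_newpath" ] && [ ! -L "$_newpath" ]; then
					continue
				fi
				readfile $_newpath 0
			done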
		else 
			if [ $f_flag -eq 0 ]; then
				echo $_file - directory is not readable
				echo $_file - directory is not readable >> $SAVEFILE
			else
				if [ $2 -eq 0 ]; then
					echo $(get_filename $_file) - directory is not readable
					echo $(get_filename $_file) - directory is not readable >> $SAVEFILE
				fi
			fi
			UNREADABLE_DIRS=$((UNREADABLE_DIRS+1))		# gather statistics
		fi
	elif [ -f $_file ]; then
		if [ -r $_file ]; then
			if [ $f_flag -eq 0 ]; then
				echo $_file - regular file - Size: `ls -l $_file | awk '{ print $5 }'` \
                                bytes - MD5: `md5sum $_file | awk '{ print $1 }'`
				echo $_file - regular file - Size: `ls -l $_file | awk '{ print $5 }'` \
                                bytes - MD5: `md5sum $_file | awk '{ print $1 }'` >> $SAVEFILE
			else
				echo $(get_filename $_file) - regular file - Size: `ls -l $_file | \
                                awk '{ print $5 }'` bytes - MD5: `md5sum $_file | awk '{ print $1 }'`
				echo $(get_filename $_file) - regular file - Size: `ls -l $_file | \
                                awk '{ print $5 }'` bytes - MD5: `md5sum $_file | awk '{ print $1 }'` \
                                >> $SAVEFILE
			fi
			FILES=$((FILES+1))				# gather statistics
		else 
			if [ $f_flag -eq 0 ]; then
				echo $_file - regular file - Size: `ls -l $_file | awk '{ print $5 }'` \
                                bytes - MD5: not readable
				echo $_file - regular file - Size: `ls -l $_file | awk '{ print $5 }'` \
                                bytes - MD5: not readable >> $SAVEFILE
			else
				echo $(get_filename $_file) - regular file - Size: `ls -l $_file | \
                                awk '{ print $5 }'` bytes - MD5: not readable
				echo $(get_filename $_file) - regular file - Size: `ls -l $_file | \
                                awk '{ print $5 }'` bytes - MD5: not readable >> $SAVEFILE
			fi
			UNREADABLE_FILES=$((UNREADABLE_FILES+1))	# gather statistics
		fi
	elif [ -p $_file ]; then
		if [ $f_flag -eq 0 ]; then
			echo $_file - pipe
			echo $_file - pipe >> $SAVEFILE
		else
			echo $(get_filename $_file) - pipe
			echo $(get_filename $_file) - pipe >> $SAVEFILE
		fi
		PIPES=$((PIPES+1))					# gather statistics
	elif [ -S $_file ]; then
		if [ $f_flag -eq 0 ]; then
			echo $_file - socket
			echo $_file - socket >> $SAVEFILE
		else
			echo $(get_filename $_file) - socket
			echo $(get_filename $_file) - socket >> $SAVEFILE
		fi
		SOCKETS=$((SOCKETS+1))					# gather statistics
	else
		if [ $f_flag -eq 0 ]; then
			echo $_file - unknown file type
			echo $_file - unknown file type >> $SAVEFILE
		else
			echo $(get_filename $_file) - unknown file type
			echo $(get_filename $_file) - unknown file type >> $SAVEFILE
		fi
		UNKNOWN=$((UNKNOWN+1))
	fi
}

get_filename()
{
	local _filename

	# use awk to get the last "/" character and get the right part of it	
	_filename=$(awk -v filename=$1 'BEGIN {
		n = split(filename, a, "/");
		print a[n];
	}')
	
	echo $_filename
}

usage()
{
	echo "usage: mapdir.sh [-f] [file]/[directory]"
	exit 1
}

### Point of entry ###


while getopts ":f" opt ; do
	case $opt in
		f)
			f_flag=1
			;;
		\?)
			echo "unknown flag: -$OPTARG."
			usage
			exit
			;;
	esac
done

shift $((OPTIND-1))

#
# if an argument is given, take it. otherwise process the current directory
#

if [ $# -eq 1 ]; then
	if [ ! -e $(readlink -f $1) ]; then
 		echo The file `readlink -f $1` doesn\'t exist.
		usage
	fi
	READLINK=$(readlink -f $1)
	SAVEFILE=~/mapdir$(readlink -f $1 | sed s#/#_#g)_$DATE.txt
	: > $SAVEFILE
else
	READLINK=$(readlink -f ./)
	SAVEFILE=~/mapdir$(pwd | sed s#/#_#g)_$DATE.txt
	:> $SAVEFILE
fi

echo Mapping structure of: $READLINK
if [ $f_flag -eq 0 ]; then
	echo Mapping structure of: $READLINK >> $SAVEFILE
fi

# When calling the readfile function for the first time, we pass a second argument of "1" to it.
# This serves the purpose of pleasing the diff utility when invoking mapdir with the -f switch and
# having a directory as the first argument. If we mapped the starting directory to the $SAVEFILE
# and later compared it with another $SAVEFILE, the diff utility would obviously exit
# with a return value other than 0, even though the contents of the directories may be truly equivalent.
# Consider for example the folders /media/filesystem_a and /media/filesystem_b that have the same content.
#
# The logic is as follows: if the file is a folder, then the readfile function detects this in the
# "is directory" part and skips mapping its occurrence to the $SAVEFILE. For subsequent calls to readfile
# we will pass a second argument of "0" to the function, which this time maps it.
#
# Start processing the file/folder...

readfile $READLINK 1

echo
echo >> $SAVEFILE

STRLEN="########## Statistics for $READLINK ##########"
echo $STRLEN
if [ $f_flag -eq 0 ]; then
	echo $STRLEN >> $SAVEFILE
else
	echo "########## Statistics ##########" >> $SAVEFILE
fi

if [ $DIRS -ne 0 ]; then 
	echo Number of directories: $DIRS
	echo Number of directories: $DIRS >> $SAVEFILE
fi
if [ $UNREADABLE_DIRS -ne 0 ]; then 
	echo Number of unreadable directories: $UNREADABLE_DIRS
	echo Number of unreadable directories: $UNREADABLE_DIRS >> $SAVEFILE
fi
if [ $FILES -ne 0 ]; then
	echo Number of regular files: $FILES
	echo Number of regular files: $FILES >> $SAVEFILE
fi
if [ $UNREADABLE_FILES -ne 0 ]; then
	echo Number of unreadable files: $UNREADABLE_FILES
	echo Number of unreadable files: $UNREADABLE_FILES >> $SAVEFILE
fi
if [ $BLOCK_SPECIALS -ne 0 ]; then
	echo Number of block special files: $BLOCK_SPECIALS
	echo Number of block special files: $BLOCK_SPECIALS >> $SAVEFILE
fi
if [ $CHARACTER_SPECIALS -ne 0 ]; then
	echo Number of character special files: $CHARACTER_SPECIALS
	echo Number of character special files: $CHARACTER_SPECIALS >> $SAVEFILE
fi
if [ $PIPES -ne 0 ]; then
	echo Number of pipes: $PIPES
	echo Number of pipes: $PIPES >> $SAVEFILE
fi
if [ $SOCKETS -ne 0 ]; then 
	echo Number of sockets: $SOCKETS
	echo Number of sockets: $SOCKETS >> $SAVEFILE
fi
if [ $SYMBOLIC_LINKS -ne 0 ]; then
	echo Number of symbolic links: $SYMBOLIC_LINKS
	echo Number of symbolic links: $SYMBOLIC_LINKS >> $SAVEFILE
fi
if [ $UNKNOWN -ne 0 ]; then
	echo Number of unknown file types: $UNKNOWN
	echo Number of unknown file types: $UNKNOWN >> $SAVEFILE
fi

#
# pretty output
#
STRLEN=${#STRLEN}
while [ $STRLEN -gt 0 ]
do
	echo -n "#"
	if [ $f_flag -eq 0 ]; then
		echo -n "#" >> $SAVEFILE
	fi
	STRLEN=$((STRLEN-1))
done
IFS=$SAVEIFS			# reset the old IFS
echo
if [ $f_flag -eq 0 ]; then
	echo >> $SAVEFILE
else
	echo "################################" >> $SAVEFILE
fi
echo Done!
echo Done! >> $SAVEFILE
exit 0

The above script should work on most UNIX-like systems that provide the md5sum utility and a readlink that supports the -f option, which includes the common Linux distributions.
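
On a system where md5sum is missing, a small wrapper along the following lines could stand in for it. This is only a sketch: the function md5_of is hypothetical and not part of mapdir.sh, and the fallback assumes a BSD-style md5 command whose -q option prints nothing but the digest.

```shell
#!/bin/sh
# Hypothetical helper for illustration only; not part of mapdir.sh.
# Prints only the MD5 digest of the file given as $1.
md5_of()
{
	if command -v md5sum > /dev/null 2>&1; then
		md5sum "$1" | awk '{ print $1 }'
	elif command -v md5 > /dev/null 2>&1; then
		# assumption: BSD-style md5, where -q prints only the digest
		md5 -q "$1"
	else
		echo "no md5 tool found" >&2
		return 1
	fi
}

tmp=$(mktemp)
: > "$tmp"		# create an empty file
md5_of "$tmp"		# prints d41d8cd98f00b204e9800998ecf8427e
rm -f "$tmp"
```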

Let's watch it in action:

First, let's create a directory structure:

[om@pc192-168-2-119 ~]$ mkdir testdir
[om@pc192-168-2-119 ~]$ cd testdir/
[om@pc192-168-2-119 testdir]$ mkdir a_dir
[om@pc192-168-2-119 testdir]$ mkdir b_dir
[om@pc192-168-2-119 testdir]$ mkdir c_dir
[om@pc192-168-2-119 testdir]$ cd a_dir/
[om@pc192-168-2-119 a_dir]$ touch file_1.txt
[om@pc192-168-2-119 a_dir]$ echo "hello world" > file_2.txt
[om@pc192-168-2-119 a_dir]$ cd ../b_dir/
[om@pc192-168-2-119 b_dir]$ touch file_3.txt
[om@pc192-168-2-119 b_dir]$ echo "hello world number 2" > file_4.txt
[om@pc192-168-2-119 b_dir]$ cd ../c_dir/
[om@pc192-168-2-119 c_dir]$ echo "more stuff in this file" > file_5.txt
[om@pc192-168-2-119 c_dir]$ echo "yet more stuff here" > file_6.txt

Now we can run mapdir.sh on the testdir directory:

[om@pc192-168-2-119 ~]$ mapdir.sh testdir
Mapping structure of: /home/om/testdir
/home/om/testdir - directory
/home/om/testdir/a_dir - directory
/home/om/testdir/a_dir/file_1.txt - regular file - Size: 0 bytes - MD5: d41d8cd98f00b204e9800998ecf8427e
/home/om/testdir/a_dir/file_2.txt - regular file - Size: 12 bytes - MD5: 6f5902ac237024bdd0c176cb93063dc4
/home/om/testdir/b_dir - directory
/home/om/testdir/b_dir/file_3.txt - regular file - Size: 0 bytes - MD5: d41d8cd98f00b204e9800998ecf8427e
/home/om/testdir/b_dir/file_4.txt - regular file - Size: 21 bytes - MD5: 1d64c0e7aa142fe642b94eac89c52388
/home/om/testdir/c_dir - directory
/home/om/testdir/c_dir/file_5.txt - regular file - Size: 24 bytes - MD5: 05a0df1800afb8f3fc30460c74ac21a3
/home/om/testdir/c_dir/file_6.txt - regular file - Size: 20 bytes - MD5: 66df841f6f18d6deb62f767ba1ae884a

########## Statistics for /home/om/testdir ##########
Number of directories: 4
Number of regular files: 6
#####################################################
Done!
[om@pc192-168-2-119 ~]$ 

This created the following report file in our home directory:

[om@pc192-168-2-119 ~]$ ll mapdir_home_om_testdir_01112016.txt 
-rw-rw-r--. 1 om om 974 Jan 11 21:50 mapdir_home_om_testdir_01112016.txt
[om@pc192-168-2-119 ~]$ 

One thing to note here is that the md5 sum of a file does not depend on the file's name but only on its contents: /home/om/testdir/a_dir/file_1.txt and /home/om/testdir/b_dir/file_3.txt are obviously two different files, but both are empty and therefore have the same md5 sum.
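
This is easy to verify by hand. The following sketch works in a throwaway directory created with mktemp (the directory and variable names are chosen for this example only) and compares the checksums of two empty files and one non-empty file.

```shell
#!/bin/sh
# Illustration only: the checksum follows the contents, not the name.
dir=$(mktemp -d)
: > "$dir/file_1.txt"				# empty
: > "$dir/file_3.txt"				# empty, different name
echo "hello world" > "$dir/file_2.txt"		# not empty

sum1=$(md5sum "$dir/file_1.txt" | awk '{ print $1 }')
sum2=$(md5sum "$dir/file_2.txt" | awk '{ print $1 }')
sum3=$(md5sum "$dir/file_3.txt" | awk '{ print $1 }')

if [ "$sum1" = "$sum3" ]; then
	echo "file_1.txt and file_3.txt have the same md5 sum"
fi
if [ "$sum1" != "$sum2" ]; then
	echo "file_1.txt and file_2.txt have different md5 sums"
fi

rm -rf "$dir"
```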

However, this kind of report file is not suitable for passing to diff in order to compare it with the same directory structure located in another folder or mount point, because every line contains the full path. For this reason mapdir provides the -f switch. When invoked with -f, the script does essentially the same, but logs only the name of each file instead of its entire path.
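
The effect of stripping the path can be sketched with plain parameter expansion. The function strip_path below is hypothetical; mapdir.sh itself does this job in get_filename with awk.

```shell
#!/bin/sh
# Hypothetical helper for illustration only; mapdir.sh uses get_filename instead.
strip_path()
{
	# remove the longest prefix that ends in "/"
	echo "${1##*/}"
}

# the same file below two different mount points yields the same name
strip_path /home/om/testdir/a_dir/file_1.txt	# prints file_1.txt
strip_path /home/om/testdir2/a_dir/file_1.txt	# prints file_1.txt
```

Since both lines come out identical, the reports of two copies no longer differ merely because the copies live in different places.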

[om@pc192-168-2-119 ~]$ rm mapdir_home_om_testdir_01112016.txt 
[om@pc192-168-2-119 ~]$ mapdir.sh -f testdir
Mapping structure of: /home/om/testdir
a_dir - directory
file_1.txt - regular file - Size: 0 bytes - MD5: d41d8cd98f00b204e9800998ecf8427e
file_2.txt - regular file - Size: 12 bytes - MD5: 6f5902ac237024bdd0c176cb93063dc4
b_dir - directory
file_3.txt - regular file - Size: 0 bytes - MD5: d41d8cd98f00b204e9800998ecf8427e
file_4.txt - regular file - Size: 21 bytes - MD5: 1d64c0e7aa142fe642b94eac89c52388
c_dir - directory
file_5.txt - regular file - Size: 24 bytes - MD5: 05a0df1800afb8f3fc30460c74ac21a3
file_6.txt - regular file - Size: 20 bytes - MD5: 66df841f6f18d6deb62f767ba1ae884a

########## Statistics for /home/om/testdir ##########
Number of directories: 4
Number of regular files: 6
#####################################################
Done!
[om@pc192-168-2-119 ~]$ 

Now let's create another folder with the same directory structure:

[om@pc192-168-2-119 ~]$ mkdir testdir2
[om@pc192-168-2-119 ~]$ cd testdir
[om@pc192-168-2-119 testdir]$ cp -r * ../testdir2
[om@pc192-168-2-119 testdir]$ 

And run mapdir on that directory as well:

[om@pc192-168-2-119 testdir]$ cd ~/
[om@pc192-168-2-119 ~]$ mapdir.sh -f testdir2
Mapping structure of: /home/om/testdir2
a_dir - directory
file_1.txt - regular file - Size: 0 bytes - MD5: d41d8cd98f00b204e9800998ecf8427e
file_2.txt - regular file - Size: 12 bytes - MD5: 6f5902ac237024bdd0c176cb93063dc4
b_dir - directory
file_3.txt - regular file - Size: 0 bytes - MD5: d41d8cd98f00b204e9800998ecf8427e
file_4.txt - regular file - Size: 21 bytes - MD5: 1d64c0e7aa142fe642b94eac89c52388
c_dir - directory
file_5.txt - regular file - Size: 24 bytes - MD5: 05a0df1800afb8f3fc30460c74ac21a3
file_6.txt - regular file - Size: 20 bytes - MD5: 66df841f6f18d6deb62f767ba1ae884a

########## Statistics for /home/om/testdir2 ##########
Number of directories: 4
Number of regular files: 6
######################################################
Done!
[om@pc192-168-2-119 ~]$ 

Now we can run the diff utility on the two report files, to check whether or not they, and along with them the two directory structures, are truly equivalent:

[om@pc192-168-2-119 ~]$ diff mapdir_home_om_testdir_01112016.txt mapdir_home_om_testdir2_01112016.txt 
[om@pc192-168-2-119 ~]$ echo $?
0
[om@pc192-168-2-119 ~]$

This is the exit status we wanted; it confirms that the two directory structures are equivalent.
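
If you prefer a message over checking $? by hand, the comparison can be wrapped in a small function. The following is only a sketch; compare_reports is a made-up name and not part of mapdir.sh. It relies on diff exiting with 0 for identical files, 1 for differing files and a larger value when something went wrong.

```shell
#!/bin/sh
# Hypothetical helper for illustration only; not part of mapdir.sh.
# $1 and $2 are two report files written by mapdir.sh -f.
compare_reports()
{
	diff "$1" "$2"
	case $? in
		0)
			echo "in sync"
			;;
		1)
			echo "reports differ"
			return 1
			;;
		*)
			echo "diff could not compare the reports" >&2
			return 2
			;;
	esac
}

# demonstration with two throwaway files that have the same content
a=$(mktemp)
b=$(mktemp)
echo "a_dir - directory" > "$a"
echo "a_dir - directory" > "$b"
compare_reports "$a" "$b"	# prints: in sync
rm -f "$a" "$b"
```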

Finally, to map the entire system and obtain as much information as possible in the process, run "mapdir.sh /" as the root user.