Wednesday, March 6, 2019

How to determine the number of elements of each directory in a directory tree

Imagine you have a directory tree with multiple subdirectories and files, and you want to determine the number of files and directories contained in each directory of this hierarchy, i.e. not just its direct elements but all elements of its subdirectories as well. Here I present a solution that uses only the UNIX tools find and awk.

Creating an example directory tree

With these commands I am creating a simple directory tree with some directories and files which I will use further on.
mkdir -p somedir/d1/b somedir/d1/c somedir/d2 somedir/d3;    # Create some directories
touch somedir/d1/b/file1 somedir/d3/file2;                   # Create some files
ln -s file2 somedir/d3/z;                                    # Create symbolic links (a relative target is resolved from the link's own directory)
(cd somedir; ln -s d1 d4_link ;)

find somedir
somedir
somedir/d1
somedir/d1/b
somedir/d1/b/file1
somedir/d1/c
somedir/d2
somedir/d3
somedir/d3/file2
somedir/d3/z
somedir/d4_link
I want some code to show me that somedir contains 9 elements, that somedir/d1 contains 3 elements, and so on.

List the complete directory tree with types of files

On Linux systems with GNU find (unfortunately this does not work with the find on my Mac) one can easily list all elements together with their file types by using the -printf action and its %y format directive. My example contains six directories (including the topmost somedir), two files and two symbolic links (one to a file and one to a directory).
find somedir -printf '%y %p\n'

d somedir
d somedir/d1
d somedir/d1/b
f somedir/d1/b/file1
d somedir/d1/c
d somedir/d2
d somedir/d3
f somedir/d3/file2
l somedir/d3/z
l somedir/d4_link
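
On systems whose find lacks -printf, a similar listing can probably be produced with POSIX find and a small shell loop. This is only a minimal sketch and not part of my original solution: the function name list_types is made up for illustration, and every type other than directory, regular file and symbolic link is lumped together as o.

```shell
# list_types DIR: print "<type> <path>" for every element of the tree at DIR,
# roughly imitating GNU find's -printf '%y %p\n' with POSIX tools only.
# (Hypothetical helper name; "o" stands for any other file type.)
list_types() {
  find "$1" -exec sh -c '
    for p in "$@"; do
      if   [ -L "$p" ]; then t=l    # test for a symlink first: -d and -f follow links
      elif [ -d "$p" ]; then t=d
      elif [ -f "$p" ]; then t=f
      else                   t=o
      fi
      printf "%s %s\n" "$t" "$p"
    done' sh {} +
}

# Usage: list_types somedir
```

The symbolic link test has to come first because the -d and -f tests follow links, so d4_link would otherwise be reported as a directory.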

Calculating the number of entries per directory

First of all, the awk command consumes a slightly modified version of the find output from above:
I put a slash between the type and the path so that I can use the slash as the awk field delimiter.
find somedir -printf '%y/%p\n'

...
d/somedir/d1/b
...
The awk command:
awk -F/ '
{ 
  x = $2; 
  count[x]++; 
  for(i=3;i<=NF;i++) { x = x FS $i; count[x]++ } 
  type[x] = $1;
} 
END {
  for(i in count)  printf "%s %2d %s\n", type[i],count[i]-1,i
}'
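What happens in each step: the first field is the type of the element (d for directory, f for file, etc.); the count array is incremented for each occurrence of a path, including every occurrence as the leading part of a longer path. The type is stored only for the full path of the current line.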
d/somedir

  x = "somedir"
  count["somedir"] = 1
  # the for loop is not executed for "d/somedir" since NF=2
  type["somedir"] = "d"

d/somedir/d1
  x = "somedir"
  count["somedir"] = 2      # increase count by 1
 
  # NF = 3. The for loop is executed once.
  x = x FS "d1" = "somedir/d1"
  count["somedir/d1"] = 1    # first time: 1

  type["somedir/d1"] = "d"

d/somedir/d1/b
  x = "somedir"
  count["somedir"] = 3      # increase count by 1
 
  # NF = 4. The for loop is executed twice.
  x = x FS "d1" = "somedir/d1"
  count["somedir/d1"] = 2    # increase count by 1

  x = x FS "b" = "somedir/d1/b"
  count["somedir/d1/b"] = 1  # first time: 1

  type["somedir/d1/b"] = "d"
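
The walkthrough above can be reproduced without find by feeding the three example lines to a variant of the awk program that prints every counter update. This is only a tracing sketch: the function name trace_counts and the added printf statements are my own illustration and are not part of the actual command.

```shell
# trace_counts: read "type/path" lines from stdin and print each counter update.
# (Hypothetical helper; same logic as the awk program above plus trace output.)
trace_counts() {
  awk -F/ '
  {
    x = $2
    count[x]++
    printf "%s: count[%s] = %d\n", $0, x, count[x]
    for (i = 3; i <= NF; i++) {
      x = x FS $i
      count[x]++
      printf "%s: count[%s] = %d\n", $0, x, count[x]
    }
    type[x] = $1
  }'
}

# Usage with the three lines discussed above:
printf '%s\n' 'd/somedir' 'd/somedir/d1' 'd/somedir/d1/b' | trace_counts
```

Each output line names the input line being processed and the counter that was just incremented, so the values can be compared with the steps listed above.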

Here is the combined command sequence, followed by a sort statement for better readability:

find somedir -printf '%y/%p\n'  | 
awk -F/ '{ x=$2; count[x]++; for(i=3;i<=NF;i++) { x=x FS $i; count[x]++ } type[x]=$1;} 
END {for(i in count)  printf "%s %2d %s\n", type[i],count[i]-1,i } '|
sort  -k3,3

d  9 somedir
d  3 somedir/d1
d  1 somedir/d1/b
f  0 somedir/d1/b/file1
d  0 somedir/d1/c
d  0 somedir/d2
d  2 somedir/d3
f  0 somedir/d3/file2
l  0 somedir/d3/z
l  0 somedir/d4_link
So somedir contains 9 elements altogether: its 4 direct elements d1, d2, d3 and d4_link plus all the elements contained in those.
Note that the END block prints the count minus one: each count starts at one for the line of the directory itself, but I only want the number of elements it contains, so the directory itself has to be excluded.
Note also that somedir/d4_link (the symbolic link to directory somedir/d1) is not followed and is listed as having zero elements. If you follow symbolic links to directories with find somedir -follow, the counts become misleading since, in this example, the elements of d1 would be counted twice: once under d1 and once under d4_link.
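
To see the double counting, the directories-only variant of the pipeline can be run once without and once with link following. This is a sketch under some assumptions: dir_counts is a hypothetical wrapper name, it needs GNU find, and it uses the option -L, which GNU find documents as the current spelling of -follow. In the example tree the total for somedir should rise from 9 to 12, because b, b/file1 and c are then reached a second time through d4_link.

```shell
# dir_counts MODE DIR: element counts for all directories of the tree at DIR.
# MODE is -P (never follow symbolic links, the default behaviour of find)
# or -L (follow symbolic links). Hypothetical wrapper around the pipeline above.
dir_counts() {
  find "$1" "$2" -printf '%y/%p\n' |
  awk -F/ '{ x=$2; count[x]++; for(i=3;i<=NF;i++) { x=x FS $i; count[x]++ } type[x]=$1 }
  END { for (i in count) if (type[i]=="d") printf "%2d %s\n", count[i]-1, i }' |
  sort -k2,2
}

# Usage:
#   dir_counts -P somedir    # links are not followed
#   dir_counts -L somedir    # links are followed, elements of d1 show up twice
```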

The counts for non-directory file types are always zero, so these entries can be excluded from the output completely:

find somedir -printf '%y/%p\n'  | 
awk -F/ '{ x=$2; count[x]++; for(i=3;i<=NF;i++) { x=x FS $i; count[x]++ } type[x]=$1;} 
END {for(i in count) if( type[i]=="d") printf "%2d %s\n", count[i]-1,i } '| sort  -k2,2

 9 somedir
 3 somedir/d1
 1 somedir/d1/b
 0 somedir/d1/c
 0 somedir/d2
 2 somedir/d3

Usages

Empty directories

Add a grep '^d  0' (note the two spaces produced by the %2d format) or adjust the awk code with a condition such as count[i]==1:
find somedir -printf '%y/%p\n'  | 
awk -F/ '{ x=$2; count[x]++; for(i=3;i<=NF;i++) { x=x FS $i; count[x]++ } type[x]=$1;} 
END {for(i in count)  printf "%s %2d %s\n", type[i],count[i]-1,i } '|
grep '^d  0'

d  0 somedir/d1/c
d  0 somedir/d2
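
The awk alternative mentioned above could look like the following sketch. The function name empty_dirs is hypothetical; the idea is that a directory occurring exactly once in the find output (count[i]==1) contains no elements.

```shell
# empty_dirs DIR: print all directories of the tree at DIR that contain no elements.
# A directory whose path occurs exactly once in the find output is empty.
# (Hypothetical helper name; requires GNU find for -printf.)
empty_dirs() {
  find "$1" -printf '%y/%p\n' |
  awk -F/ '{ x=$2; count[x]++; for(i=3;i<=NF;i++) { x=x FS $i; count[x]++ } type[x]=$1 }
  END { for (i in count) if (type[i]=="d" && count[i]==1) print i }' |
  sort
}

# Usage: empty_dirs somedir
```

Doing the test inside awk avoids depending on the exact spacing of the formatted output.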

Non-empty directories

find somedir -printf '%y/%p\n'  | 
awk -F/ '{ x=$2; count[x]++; for(i=3;i<=NF;i++) { x=x FS $i; count[x]++ } type[x]=$1;} 
END {for(i in count)  printf "%s %2d %s\n", type[i],count[i]-1,i } '|
grep '^d  *[1-9]' | sort -k 3,3

d  9 somedir
d  3 somedir/d1
d  1 somedir/d1/b
d  2 somedir/d3
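
Filtering the formatted text with grep gets awkward as soon as the actual numbers matter, e.g. for "all directories with at least N elements". A sketch that compares numerically inside awk instead; the function name dirs_with_at_least and the awk variable min are my own illustrative choices.

```shell
# dirs_with_at_least MIN DIR: directories of the tree at DIR that contain
# at least MIN elements, with their counts.
# (Hypothetical helper name; requires GNU find for -printf.)
dirs_with_at_least() {
  find "$2" -printf '%y/%p\n' |
  awk -F/ -v min="$1" '{ x=$2; count[x]++; for(i=3;i<=NF;i++) { x=x FS $i; count[x]++ } type[x]=$1 }
  END { for (i in count) if (type[i]=="d" && count[i]-1 >= min+0) printf "%2d %s\n", count[i]-1, i }' |
  sort -k2,2
}

# Usage: dirs_with_at_least 1 somedir    # all non-empty directories
```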