Tuesday, September 20, 2011

September 19, 2011 - PeDALS

With RPM’s instructions and tip video open for reference, I opened my ARST5100 virtual machine in VirtualBox. Under the Places dropdown menu, I opened the Home Folder and created a new folder named PeDALS. I then opened a Firefox browser and went to the address http://mas.clayton.edu/collections/PeDALS/PeDALS.zip. A dialog box opened giving me the option to either Open or Save the file. I chose the Save File option, which downloaded the compressed file to the Downloads folder. I used the cut command to remove the zip file from the Downloads folder and pasted it into the PeDALS folder located in my home (hoswald) folder.

I double-clicked on the zip file in the PeDALS folder in order to open it with Archive Manager. This gave me a set of options at the top of the window, including Extract. I chose to extract all files, and unchecked the Overwrite existing files box under Actions. I then clicked the Extract button. This process created a folder named PeDALS within the /home/hoswald/PeDALS directory. In order to make the files as accessible as possible, without multiple layers of folders, I moved all of the files within the inner PeDALS folder into the upper PeDALS directory and deleted the extraneous folder. This made the full path /home/hoswald/PeDALS.

I opened the command line by going to the Applications dropdown menu and opening Terminal. I entered the following commands in order to create an inventory of file names in the PeDALS folder.

$ cd ~

$ find PeDALS -exec basename {} \; > inv_filenames_hlo.txt

Because I copied and pasted the second command from RPM’s instructions, I had to edit the line so that the dash before exec was a plain hyphen rather than an en dash.
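A pasted en dash is hard to spot by eye, so it can help to let the shell find it. The following is only a sketch, assuming GNU grep and sed as found on a typical Linux virtual machine; the scratch file and the `fixed` variable are made up for illustration.

```shell
# Hypothetical pasted command saved to a scratch file; the character before
# "exec" is an en dash (UTF-8 bytes E2 80 93), written here as octal escapes.
scratch=$(mktemp)
printf 'find PeDALS \342\200\223exec basename {} \\;\n' > "$scratch"

# In the C locale grep works byte by byte, so [^ -~] matches any byte
# outside printable ASCII and flags the line holding the en dash.
LC_ALL=C grep -n '[^ -~]' "$scratch"

# Replace the three en dash bytes with a plain hyphen.
fixed=$(LC_ALL=C sed "s/$(printf '\342\200\223')/-/g" "$scratch")
echo "$fixed"
```

If grep prints nothing, the line contains only plain ASCII characters and should be safe to run as typed.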

I then attempted to create a list of the directories into which the files are organized using the following commands.

$ cd ~

$ find PeDALS -exec > dirname {} \; > inv_dirlist_hlo.txt

However, this time I received a long list of documents followed by the notation: Permission denied. I opened the text file this process created, and it was blank. At this point, I reviewed both the assignment and the video from the September 7 ARST 5100 class. I noticed there was an extra > in the command that was not used in the command to create an inventory of filenames. I removed the extra > and the command worked.
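The most likely explanation for the error is that the shell read the stray > as a redirection to a file called dirname. That left find with nothing but {} after -exec, so it tried to run each file as if it were a program, which produces Permission denied. A minimal sketch of the corrected command follows; the demo tree and its file names are hypothetical stand-ins for the PeDALS folder.

```shell
# Build a small hypothetical tree to stand in for PeDALS.
workdir=$(mktemp -d)
cd "$workdir"
mkdir -p demo/reports demo/letters
touch demo/reports/a.doc demo/reports/b.xls demo/letters/c.txt

# Corrected form: dirname is the program that -exec runs on each path,
# and the single > sends all of the output to the list file.
find demo -exec dirname {} \; > dirlist.txt

# Every path found contributes one line: its parent directory.
cat dirlist.txt
```

Because find also visits the folders themselves, the list has one line per file and one per folder, with many repeats.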

In order to sort and find only unique file directories, I used the command:

find PeDALS -exec dirname {} \; |sort |uniq > inv_dirlist_hlo.txt

I also realized it would be useful to see the inventory of filenames in some sort of order, so I created a second, sorted text file using the command:

find PeDALS -exec basename {} \; |sort > inv_filenames_hlo1.txt

Now that I had only unique directories and a sorted inventory, I determined the number of lines in each of the files, using the commands:

$ wc -l inv_filenames_hlo1.txt

$ wc -l inv_dirlist_hlo.txt

I found 598 lines in the inv_filenames_hlo1.txt file and 203 lines in the inv_dirlist_hlo.txt file.
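The two counts behave differently: sorting never changes the number of lines, while uniq collapses repeated lines, which is why a directory list shrinks once duplicates are removed. A small sketch with made-up directory names (dirs.txt, `total`, and `unique` are hypothetical):

```shell
# A hypothetical directory list with repeats, as find and dirname would produce.
workdir=$(mktemp -d)
printf '%s\n' demo/b demo/a demo/b demo/a demo/c > "$workdir/dirs.txt"

# Count every line, then count the lines left after sorting and removing repeats.
total=$(wc -l < "$workdir/dirs.txt")
unique=$(sort "$workdir/dirs.txt" | uniq | wc -l)
echo "total lines: $total, unique lines: $unique"
```

uniq only removes repeats that sit next to each other, so the sort step has to come first.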

To characterize the types of files found in the collection, I worked through a number of steps, following RPM’s directions as seen below.

1. I switched to the parent directory of PeDALS.

$ cd /home/hoswald

2. I then created a list of the filenames without the directory.

$ find PeDALS -exec basename {} \; > inv1.txt

3. I eliminated (imperfectly) lines that are likely not file names.

$ cat inv1.txt | grep '\.' > inv2.txt

4. Some files used a period elsewhere in the filename, so I used the stream editor command (sed) to strip everything up to and including the first period.

$ cat inv2.txt | sed 's/^[^.]*\.//g' > inv3.txt

5. I ran the same sed command a second time to strip the rest of the file name, hopefully leaving just the extension.

$ cat inv3.txt | sed 's/^[^.]*\.//g' > inv4.txt

6. I sorted the list.

$ sort inv4.txt > inv5.txt

7. I removed duplicates.

$ uniq inv5.txt > inv6.txt

8. I changed the name inv6.txt to inv_hlo.txt using the right-click, Rename option.
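For comparison, steps 2 through 7 can be approximated in a single pipeline. This is only a sketch of an alternative, not RPM’s method: it assumes GNU find, and the demo tree and extensions.txt are hypothetical names.

```shell
# Hypothetical tree: one file with no extension and one with several periods.
workdir=$(mktemp -d)
cd "$workdir"
mkdir -p demo/sub
touch demo/plain demo/a.doc demo/sub/b.xls demo/sub/v1.2.final.doc

# -type f skips folders, and -name '*.*' keeps only names containing a period.
# The sed pattern is greedy, so it deletes everything through the LAST period
# in one pass, however many periods the name has.
# sort -u sorts and removes duplicates in one step.
find demo -type f -name '*.*' -exec basename {} \; | sed 's/.*\.//' | sort -u > extensions.txt

cat extensions.txt
```

Matching through the last period, rather than the first, means a name with three or more periods does not need extra passes.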

At this point, I opened the txt file to view my completed work. It showed two lines that appeared to be errors.

Both the highlighted 3.xls line and the 1_2009-11-14).doc line were left in because these filenames contained more than two ‘.’s, and the two sed passes strip text only through the first two. In order to present the most accurate final product, I copied the file and named the copy inv_hlotest.txt.

I then repeated the command line:

$ cat inv_hlotest.txt | sed 's/^[^.]*\.//g' > inv_hlotest2.txt

This stripped the remainder of these two filenames, but I still needed to sort and then ensure only unique file extensions were listed. I used the following command lines:

$ sort inv_hlotest2.txt > inv_hlotest3.txt

$ uniq inv_hlotest3.txt > inv_hlofinal.txt

Finally, I did a quality control check by running a line count on both inv_hlo.txt and inv_hlofinal.txt using the following command lines:

hoswald@ARST5100:~$ wc -l inv_hlo.txt

29 inv_hlo.txt

hoswald@ARST5100:~$ wc -l inv_hlofinal.txt

27 inv_hlofinal.txt
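The drop from 29 to 27 lines fits the two leftover lines reducing to extensions that were already in the list. A line count shows how many lines changed but not which ones; comm can list them. The sketch below uses made-up stand-in files (before.txt, after.txt) rather than my real inventories.

```shell
# Hypothetical stand-ins for the list before and after the extra sed pass.
workdir=$(mktemp -d)
cd "$workdir"
printf '%s\n' '1_2009-11-14).doc' '3.xls' doc pdf xls | LC_ALL=C sort > before.txt
printf '%s\n' doc pdf xls | LC_ALL=C sort > after.txt

# comm needs sorted input; -23 hides lines unique to the second file and
# lines common to both, leaving only lines that appear in before.txt alone.
LC_ALL=C comm -23 before.txt after.txt > dropped.txt

cat dropped.txt
```

The same LC_ALL=C setting is used for sort and comm so that both agree on the sort order.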