Teija Alasalmi

Command-line Tools for Linguists

This document is a description of what I have learned during the Command-line Tools for Linguists 2022. The course was organised by University of Helsinki as a web course. It is a course designed for students interested in command-line tools that can be used for example in language text file processsing. The course covered the following topics:

Course syllabus

Week 1 Introduction to Command Line Environments

The first week was an introduction to command-line environments. The first task was to install Windows Subsystem for Linux (WSL) for your own computer. This was an easy task for me since I had just done the same thing when I started at my new project at work.

The course material provided introduction on the commands used for listing the contents of your current directory, changing your current directory, downloading content from the internet, displaying a text file using less, creating and removing files and directories, copying, renaming and moving files, as well as opening and editing a file in text editor called emacs. Most of these commands, as well as different commands for quitting applications were familiar to me due to my previous studies in Computer Science, and I use them daily at my work.

However, I had not previously used the command for downloading content, even though it is a very simple command. Thus, I learned how to fetch books from gutenberg.org using wget. For example, the following command fetches book “Don Quijote” in plain text format:

wget https://www.gutenberg.org/files/2000/2000-0.txt

Week 2 Navigating a UNIX System

The second week deepened understanding about the UNIX system as well as on working on a remote server using the programs ssh and scp. Again I was familiar with most of the commands used in UNIX for copying, moving and deleting files and folders, or how to visit the root of the system, or change the read and write permissions of a file. However, this week I learned how to compress a folder directory into compressed.tgz package using tar:

tar -czvf compressed.tgz my_folder

I also learned how to find the PID of a process and how to kill it, as well as how to run commands in the background.

ps aux               // list all processes
kill -9 process_id   // kill a certain process by giving the id

In addition, I learned how to form a remote connection to Puhti server and how to copy files to and from a server using scp.

from local to remote home dir:
scp  my_folder/life_of_bee.txt puhti:~

bring a copy from remote to local:
scp puhti:~/life_of_bee.txt .

Week 3 Basic Corpus Processing

This week was probably the most interesting, as I learned how to process corpus files with command line tools. For example, you can count the number of lines, words and characters in a text file with wc command. Some file utilities that I learned are explained in the following table.

Command Usage
cat -s concatenate files to the standard output
comm -3 compare two sorted files, line by line
cut -s remove section from file
diff -q file compare, line by line
head -n output the first part of files
join -i join lines of two files on a common field
more, less view file content, one page at a time
sort -n sort lines in text file
split -l split file into pieces
tail -f output last part of the file
tr -d translate or delete character
uniq -c -u report or omit repeated lines
wc -w print newline, word, and byte counts for each file

I also re-learned the differences between the ASCII, Latin-1 and utf-8 encoding systems, although I had studies about them during my Computer Science studies. You can convert Latin-1 encoded text into utf-8 by

iconv -f latin1 -t utf- file.txt > file.utf8.txt

You can also convert between Windows and UNIX text file formats with two convenient commands dos2unix and unix2dos. Examples:

dos2unix mywindowsfile.txt

unix2dos myunixfile.txt

Week 4 Advanced Corpus Processing

This week continued where week 3 material was left off. I learned how to make frequency lists from a text file, and pipe UNIX commands using “the pipe” character. I also learned how to generate sentence per line format from a text file. Probably the most difficult thing learned on this course was how to transform a text file into a list of word n-grams. For example, the following command lists words from a book text file to a frequency order, with most frequent on top:

cat book.txt | tr -s '\n\r\t ' '\n' | tr -dc "A-Za-z0-9\n'" | sort | uniq -c | sort -nr | less > lista.txt

In addition to these, I learned how to use the sed command, which is used for deleting lines, finding patters or replacing patterns in a text file. The most common usage of sed is apparently to substitute patterns with new strings, but the find option is also very useful in testing if your pattern works as it should be. For example, you can remove empty lines from the file and redirect to new file:

sed '/^$/d' book.txt > book_noemptylines.txt

Week 5 Scripting and Configuration Files

On the fifth week I learned the basics of scripting. I had run -sh scripts at work, but I had not written them by myself, so it was fun to learn how to write a .sh script from scratch. I learned how to write simple scripts and how to access command line parameters in scripts. I also learned how to implement if statement in the script and how to use command substitution. This week was probably the hardest one on the course, since there were a lot of material to go through in terms of scripting tutorial.

Also, the modification of .bashrc / .bash_profile turned out to be very handy especially with aliases. The modified profile can be taken into use with command

source ~/.bashrc

Week 6 Installing and Running Programs

This week covered topics related to installing and running programs. I have used pip, apt and brew before at work as test automation engineer, so I was familiar with what you can do with them. However, I had not used make before, so it was all new to me. Make is used for building other programs, libraries or other projects like corpus collections. In this course I learned to write a Makefile and run it for the first time ever. It was fun to browse the GNU Make manual for learning how to use it, although it contained a whopping 230 pages!

Another tool that was new to me was called meld. It is a nicely visual diff tool, which shows the difference between files in color, and can compare more than two files or directories at the same time. You can install the tool as sudo user with apt by typing

sudo apt-get install meld

Week 7 Version Control

The last week of this course was all about git version control program. As a test automation engineer, I use git daily at work to pull other developers’ changes and to push my changes, so practically this week’s material and quiz was just a good revise of all git commands. I also have been creating repositories for my hobby coding projects at GitHub, so everything that we were meant to study this week was rather easy for me.

Final project - Building Web Pages using GitHub Pages

By doing the final project I learned how to

In Conclusion

This course was one of the best and useful courses I have taken during the past ten years! I learned so much on this course. Even though I already knew how to use the basic command-line tools such as Windows PowerShell, Command Prompt or Git Bash, I also got to practise the skills acquired previously. Moreover, I learned the basic use of WSL (Ubuntu).

I also learned to form regular expressions at a basic level and how to use them to process text corpora with command-line tools. I learned how to run and install simple programs from the WSL, and I learned to write simple scripts. Finally, I learned to work on the server.

I can warmly recommend Command-line Tools for Linguists web course to anyone interested in processing language files with simple command-line tools. The course format was fantastic regarding students working full time. The course material was super good and the videos were really great and easy to follow.

List of Interesting Links