Project Organization

Project Organization #

Working with the terminal requires and instructs you about the file structure of your computer, knowledge of which is important for keeping things organized while you do your research and writing. Project organization refers to two separate and sometimes competing tasks: one is organizing the material you need to learn about the world in a way that is conceptually coherent and helpful to you. The second task is organizing actual computer files in such a way that they are accessible to you and to the software you are working with. These are distinct tasks and sometimes they compete with one another, as the first is meant to help you think better and the second is meant to help you do things faster. For example, it makes sense to keep distinct activities related to your research in different directories (sort of like keeping one notebook of notes for one class, and another notebook for another class). This organization is in part a matter of taste but there are certain organizing principles you should almost certainly avoid. For example, you could just have one directory that has all the files for your project in a flat structure, which makes remembering the paths to each file pretty simple (it would just be /Users/timothyelder/Documents/project-dir/FILENAMEHERE.txt). That makes everything accessible but doesn’t help to keep conceptually or practically separable parts of your project organized. On the other extreme, you could have a byzantine file structure, where edits to documents are organized by activity (such as having folders for interview_transcripts, field_notes, field_notes_reflections, interview_notes, draft_papers), and then subdirectories organized by the date in which the files in that directory were edited or created. That would certainly heep thins organized but fairly messy when it came time to put them altogether in a write-up.

Finding a healthy compromise between the conceptual ordering of research material and the organization of computer files will be a matter of personal preference. With that said, I have a few principles that I find helpful and encourage you to emulate:

Each project should have its own directory. #

A project can be as small as a side-project that goes nowhere, or your dissertation or even your magnum opus. A rule of thumb that I use is that once something takes on a distinct character, and is relying upon more than 5 or so files to be coherent, it probably needs its own directory.

Each Area of a Project should have its own Sub-Directory. #

Again what constitutes an “Area” is ambiguous and you have to use your own judgement and the final rule of thumb is if you are making progress and getting things done. I usually end up having the same directories at the top of my project directory. These include the following:

README.md
research_log.md
/analysis
/data
/drafts
/figures
/memos
/misc
/scripts

The two files at the top of this list are what are called Markdown files, a simple markup language. The README.md file includes an explanation of what the project is, what data it uses, and any required software while the research_log.md includes entries about what is going on in the project and my laments about the world and my own research. The rest are sub-directories that contain different parts of the project. In the data directory naturally is all my data (both quantitative and qualitative), while the drafts directory has all the drafts for the different formal pieces of writing coming from the project. Informal pieces of writing that are likely only to be seen by myself, my advisors, or collaborators are stored in the memos directory. figures includes all the images that are generated from the data or that might be included in papers, slides, etc.. Whether the project is purely qualitative or not, I inevitably write some scripts in R, python or shell and they go in the scripts folder and misc is a garbage can of all the things I don’t need now but am not confident I will never need again.

Hierarchical is Better than Horizontal #

This goes hand in hand with the first two points but I want to emphasize that creating directories and sub-directories is helpful for keeping everything organized and that nesting directories is particularly helpful as it allows for more and more fine grained conceptual categorizations.

Literal is Better than Symbolic #

There are only two hard things in Computer Science: cache invalidation and naming things.

— Phil Karlton

When you are naming directories and files, it is always best to make things explicit. If the directory holds data call it data. If the directory contains images of documents from the Florida Office of Insurance Regulation in Tallahassee, call the directory images_fl_tallahassee_insurance_regulation. If the file is your dissertation latex file call it dissertation.tex. Choosing symbolic names, or anything cryptic makes parsing files difficult, particularly if you take a break from a project for a few weeks and need to get back to it.

Naming things
It is really important that when you are naming things, including files and directories, that you don’t use spaces or slashes in the names. Slashes are used in paths to delimit directories from each other and so will lead to errors and spaces are hard to include on the terminal when writing commands.

Some Examples #

Here are some examples that exemplify and defy the paradigm I articulated above. All of these are examples from my own computer and time in graduate school. Figure 1 is the directory that contains all the files for my Qualifying Paper, my first independently researched and written paper from graduate school. You will note two things: I was doing all my writing with Word and didn’t abide by my rules about keeping different directories for different areas of the project. I do this a little bit having a drafts directory but I am not certain what the Print and Material directories are and the Writing Seminar Submission directory should probably be in drafts. Clearly, I was not being very thoughtful when it comes to how I organized my work, but this is a lot better than the directory in Figure 2.

A poorly organized directory.

Figure 1: A poorly organized directory.

This is a directory for class prep for a sociology course I helped to design along with Profs. Jenny Trinitapoli and John Levi Martin. It is a complete mess with obscure directory names and files that are all at the top of the directory without much organization at all. Files are misspelled and there is a particularly egregious organizational error. I have two files that are indistinguishable in their name except for the suffix “.v2” being included in the file name. A short anecdote will heighten the importance of the naming conventions I articulate here.

A very poorly organized directory.

Figure 2: A very poorly organized directory.

I have a colleague who has had a paper under review for a couple of years now due to a variety of factors related to the pandemic. On the second R&R, after months of working on other projects and getting ready for job talks, he went back to the project files to address the concerns of a particularly scrupulous Reviewer #3. The reviewer was asking that they re-run some analyses with a different method and so my colleague needed to go back and figure out how a few figures were generated and how the original analyses were specifically conducted. Opening the directory with his code he had endless files all with nearly indistinguishable names like:

pol_gss_multiimput.R
pol_gss_multiimput_v2.R
pol_gss_multiimput_v2_1.R
pol_gss_multiimput_v2_2.R
pol_gss_multiimput_v2_2_THIS_ONE.R
pol_gss_multiimput_v2_2_No_REally_THIS_ONE.R
pol_gss_multiimput_v2_2_No_REally_THIS_ONE_final.R

He spent weeks and lots of tireless hours figuring out what was what. Instead of being like my friend, be more like me and do something like you see in Figure 3. This is a directory for a project that I am working on that includes lots of different scripts and data. Code for processing the data is kept separate from code for doing analyses and files are given explicit and non-redundant names. Certainly, this takes some amount of effort and energy to do and my project directories don’t always look like this while working on them, but it is worth cultivating good habits like this to do your research and writing. You will thank yourself later if you are ever in the same position as my poor sweet friend being harassed by pesky Reviewer #3.

A more orderly directory.

Figure 3: A more orderly directory.

Summary #

The wisdom then for organizing a directory in a way that successfully achieves the two different goals of organization (conceptual clarity and organizational productivity) can be condensed into:

  1. Project in Every Directory and a Directory for Every Project
  2. A Sub-Directory for every Area of a Project
  3. Hierarchical is better than horizontal organization
  4. Literal names are always favored over cryptic or symbolic names

This is an area that is going to be much more dependent on your particular disposition to doing work, but at a certain point you will run up against the need to typeset documents and keep conceptually distinct areas actually digitally distinct, so if you don’t buy into my dogma that is fine, but adopt some other regular process of your own and stick to it as you do your work.

Previous
Next