3rd Year Workshop

Data workflow and version control

Ian McCarthy, Emory University

Economics PhD Professionalism Workshop

Keeping track

These slides are all about transparency and reproducibility (for your future self and for others). Some other resources…

Why Bother with Workflow?

Reproducibility

  • You’ll need to come back to this one day!
  • Others will (hopefully) ask you about this
  • “I lost my hard drive” is unacceptable!

Automation

One benefit of good tracking and workflow is that you partially automate your projects…

  • No repeating commands in the command line
  • No copying and pasting values into tables
  • No copying and pasting values into the text (harder to do)

hint: consider pytask for efficient automation
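
As a minimal sketch of the "no copying and pasting values into tables" point, the Python snippet below writes a small LaTeX table fragment straight from a list of results, so the paper can pull in the file instead of hand-typed numbers. The function name `write_latex_table`, the example numbers, and the output file name are all hypothetical and chosen only for illustration.

```python
from pathlib import Path
import tempfile


def write_latex_table(results, out_path):
    """Write (label, estimate, std. error) rows as a LaTeX tabular fragment."""
    lines = [
        r"\begin{tabular}{lrr}",
        r"Variable & Estimate & Std. Error \\",
        r"\hline",
    ]
    for label, estimate, std_error in results:
        # Format numbers in code so the table never drifts from the results.
        lines.append(rf"{label} & {estimate:.3f} & {std_error:.3f} \\")
    lines.append(r"\end{tabular}")

    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text("\n".join(lines) + "\n")
    return out_path


if __name__ == "__main__":
    # Made-up numbers, purely for illustration.
    example_results = [("treatment", 0.12345, 0.045), ("age", -0.002, 0.001)]
    with tempfile.TemporaryDirectory() as tmp:
        table_file = write_latex_table(
            example_results, Path(tmp) / "tables" / "main_results.tex"
        )
        print(table_file.read_text())
```

Rerunning the script after the analysis changes regenerates the table, which is the partial automation described above.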

Some Basic Tips

Naming files

Avoid spaces in file names. Avoid them at all costs. DO NOT PUT SPACES IN YOUR FILE NAMES.

“A space in a file name is a space in your soul.”

Naming files

Avoid things like this…

great-research-idea
|
|---analysis
|   |   final_analysis.R
|   |   final_final_analysis.R
|   |   last_analysis.R
|
|---data
|   |   clean_data.csv
|   |   extra_clean_data.csv
|
|---paper
|   |   draft1.tex
|   |   final_draft.tex
|   |   final_final_draft.tex

Naming files

  • Use dates in your filenames if necessary (YYYYMMDD format)
  • Minimize the use of filename dates and instead try official version control
  • Dates are useful in log files and similar output (which are not under version control)
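
A short sketch of the YYYYMMDD convention for log files, in Python. The helper name `dated_filename` and the example stem are hypothetical.

```python
from datetime import date


def dated_filename(stem, extension, on=None):
    """Build a name like 'analysis_20240131.log' with a YYYYMMDD date stamp."""
    on = on or date.today()
    return f"{stem}_{on:%Y%m%d}.{extension}"


print(dated_filename("analysis", "log", on=date(2024, 1, 31)))
# prints: analysis_20240131.log
```

One reason to prefer YYYYMMDD over other date formats is that an alphabetical sort of the file names is also a chronological sort.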

Related point: use sensible (but short) variable names without spaces

Data workflow

Some quick thoughts on workflow…

  1. Avoid absolute path names!

  2. If you use the same data across projects, try a “path” script that you add to your .gitignore file, or use symbolic links: ln -s ~/base_location ~/new_location_with_link

  3. Separate your analysis code from your markdown write-up (combining them is not practical for a full project)

  4. NEVER delete or directly change raw data files
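
The first and last tips above can be sketched in Python as follows. This is only one possible setup: it assumes scripts live in an analysis folder directly under the project root, and that data sit in data/raw and data/derived. Those folder names and the helpers `project_root` and `save_derived` are hypothetical.

```python
from pathlib import Path


def project_root(script_file):
    """Project root, assuming the script sits in <project>/analysis/."""
    return Path(script_file).resolve().parent.parent


def save_derived(root, name, text):
    """Write output under data/derived only, so raw files are never touched."""
    derived = (Path(root) / "data" / "derived").resolve()
    target = (derived / name).resolve()
    if derived not in target.parents:
        raise ValueError(f"{target} is outside {derived}")
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(text)
    return target


# Typical use inside analysis/clean.py (paths are built from the script's own
# location, so nothing depends on one person's machine):
#   ROOT = project_root(__file__)
#   raw_file = ROOT / "data" / "raw" / "survey.csv"    # read only
#   save_derived(ROOT, "survey_clean.csv", cleaned_text)
```

Because every path is built from the project root, the same code runs unchanged after the repository is cloned to a different machine or folder.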

Version Control

Why bother with version control?

  1. Internal validity (replicability for your future self)
  2. External validity (replicability for others)

note: particularly important on the modern job market

Version control

How do you track your versions?

  • Don’t keep two versions of the same thing
  • Dropbox, Google Drive, etc. offer some form of version history (but incomplete)
  • Overleaf has built-in version history
  • Commit fully to Git/GitHub

Git is ideal since it is easily shareable, so you get external validity too

Virtual environments

  • Let’s assume your code works on your system
  • You (should) also want it to work on other computers
  • How do you get other computers to look like your own?

Virtual Environments!

Virtual environments

Lots of ways to do this in practice:

  1. Use Docker to basically mimic your entire machine

  2. Use renv to record the versions of all R packages (in the renv.lock lockfile)

  3. renv also works with Python virtual environments via renv::use_python()
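
To illustrate the idea behind a lockfile (this is not renv itself, and not how renv is implemented), the Python sketch below records the interpreter version and the version of every installed package in a JSON file. The function names and the file name environment.lock.json are hypothetical.

```python
from importlib import metadata
import json
import sys


def snapshot_environment():
    """Collect the Python version and the version of each installed package."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip entries whose metadata is missing a name
            packages[name] = dist.version
    return {
        "python": sys.version.split()[0],
        "packages": dict(sorted(packages.items())),
    }


def write_lockfile(path):
    """Save the snapshot as JSON so it can be committed alongside the code."""
    snapshot = snapshot_environment()
    with open(path, "w") as handle:
        json.dump(snapshot, handle, indent=2)
    return snapshot


if __name__ == "__main__":
    snapshot = write_lockfile("environment.lock.json")
    print(f"Recorded {len(snapshot['packages'])} packages")
```

Committing a file like this with the project gives another machine an exact list of versions to install. Dedicated tools such as renv go further and restore the environment from the lockfile.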

My Workflow

My basic workflow

  1. Create “empty” GitHub repo online

  2. Clone to my system as a new project with version control

  3. Initialize my R environment with renv (more details here). If using Stata, change the ado paths to call a project-specific packages folder

  4. Create project “scaffolding” with initial directories
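
The scaffolding step can be scripted. The Python sketch below assumes a hypothetical folder layout (`DEFAULT_FOLDERS`), not the layout of any particular template.

```python
from pathlib import Path

# Hypothetical layout; adjust the folder names to your own conventions.
DEFAULT_FOLDERS = [
    "data/raw",
    "data/derived",
    "analysis",
    "output/tables",
    "output/figures",
    "paper",
]


def scaffold_project(root, folders=DEFAULT_FOLDERS):
    """Create the initial directories for a new project under `root`."""
    root = Path(root)
    created = []
    for folder in folders:
        path = root / folder
        path.mkdir(parents=True, exist_ok=True)
        # Git does not track empty folders, so add a placeholder file.
        (path / ".gitkeep").touch()
        created.append(path)
    return created


if __name__ == "__main__":
    for path in scaffold_project("great-research-idea"):
        print(path)
```

The function is safe to rerun: existing folders are left in place, so it can also be used to add missing directories to an older project.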

My basic workflow

  • Consider cookiecutter to create project from template
  • I have a template on GitHub if you want to use it: R research template
  • AI coding assistants (GitHub Copilot, Claude Code) can help with boilerplate and debugging

Analysis workflow

  • Shared Google Doc with co-authors
  • Acts as a shared scratch notebook
  • Basic headers and lists…but minimal formatting (no one else will ever see this)
  • To-do lists and other goals go here, removed after completing

My solo-author analysis and writing workflow

  • I write everything in Quarto (.qmd), including paper and presentations
  • Tables and figures in output folders (automate results)
  • Compile with Quarto (Pandoc under the hood)

My co-authored analysis and writing workflow

  • Write in LaTeX with Overleaf
  • Compile with Overleaf using output folders (tables and figures)
  • Sync bibliography via cloud storage
  • Template on Overleaf: Overleaf template