3rd Year Workshop

Data workflow and version control

Ian McCarthy, Emory University

Economics PhD Professionalism Workshop

Keeping track

These slides are all about transparency and reproducibility (for your future self and for others). Some other resources…

Why Bother with Workflow?

Reproducibility

  • You’ll need to come back to this one day!
  • Others will (hopefully) ask you about this
  • “I lost my hard drive” is unacceptable!

Automation

One benefit of good tracking and workflow is that you partially automate your projects…

  • No repeating commands in the command line
  • No copying and pasting values into tables
  • No copying and pasting values into the text (this one is harder to automate)

Hint: consider pytask for efficient automation.
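
As a rough sketch of the "no copying and pasting values into tables" point, the script below computes summary statistics and writes them straight to a LaTeX fragment that the paper could pull in with \input. It uses plain Python rather than pytask, and the data, the function name write_summary_table, and the file name summary_rows.tex are all hypothetical, chosen only for illustration.

```python
import statistics
import tempfile
from pathlib import Path


def write_summary_table(data, out_path):
    """Write one LaTeX table row (name, mean, std. dev.) per variable."""
    rows = []
    for name, values in data.items():
        mean = statistics.mean(values)
        sd = statistics.stdev(values)  # sample standard deviation
        rows.append(f"{name} & {mean:.2f} & {sd:.2f} \\\\")
    body = "\n".join(rows) + "\n"
    Path(out_path).write_text(body)
    return body


# Hypothetical data, for illustration only
example = {"wage": [10.0, 12.0, 14.0], "hours": [35.0, 40.0, 45.0]}

# Written to a temporary folder here; in a real project this would be a
# path inside the project, referenced from the paper.
out_file = Path(tempfile.mkdtemp()) / "summary_rows.tex"
table_rows = write_summary_table(example, out_file)
```

If the data change, rerunning the script regenerates the rows, so the numbers in the paper cannot drift away from the analysis.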

Some Basic Tips

Naming files

Avoid spaces in file names. Avoid them at all costs. DO NOT PUT SPACES IN YOUR FILE NAMES.

“A space in a file name is a space in your soul.”
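
If a project already has spaces in its file names, one possible cleanup is sketched below: it walks a folder and swaps spaces for underscores. The function name rename_without_spaces and the demo files are hypothetical, and the sketch only touches files (not folder names) and skips any rename that would overwrite an existing file.

```python
import tempfile
from pathlib import Path


def rename_without_spaces(folder):
    """Rename files under `folder` so that spaces become underscores."""
    renamed = []
    # sorted() builds the full list first, so renaming does not disturb the walk
    for path in sorted(Path(folder).rglob("*")):
        if path.is_file() and " " in path.name:
            target = path.with_name(path.name.replace(" ", "_"))
            if target.exists():
                continue  # skip rather than overwrite an existing file
            path.rename(target)
            renamed.append(target.name)
    return renamed


# Demo in a temporary folder with hypothetical file names
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "clean data.csv").write_text("x\n")
(demo_dir / "analysis.py").write_text("# placeholder\n")
renamed_files = rename_without_spaces(demo_dir)
```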

Avoid things like this…

great-research-idea
|
|---analysis
|   |   final_analysis.R
|   |   final_final_analysis.R
|   |   last_analysis.R
|
|---data
|   |   clean_data.csv
|   |   extra_clean_data.csv
|
|---paper
|   |   draft1.tex
|   |   final_draft.tex
|   |   final_final_draft.tex

  • Use dates in your filenames if necessary (YYYYMMDD format)
  • Minimize the use of filename dates and rely on proper version control instead
  • Dates are useful in log files and similar output (files that are not under version control)
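
A minimal sketch of the YYYYMMDD convention for log files follows. The helper name dated_log_name and the stem build_data are hypothetical. One reason to prefer this format is that the file names then sort chronologically when sorted alphabetically.

```python
from datetime import date


def dated_log_name(stem, day=None):
    """Return a log file name such as 'stem_YYYYMMDD.log' (today by default)."""
    day = day or date.today()
    return f"{stem}_{day.strftime('%Y%m%d')}.log"


# A fixed date is used here so the result is predictable
example_name = dated_log_name("build_data", date(2024, 3, 5))
```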

Related point: use sensible (but short) variable names without spaces.

Data workflow

Some quick thoughts on workflow…

  1. Avoid absolute path names!