Data workflow and version control
Economics PhD Professionalism Workshop
These slides are all about transparency and reproducibility (for your future self and for others). Some other resources…
One benefit of good tracking and workflow is that you partially automate your projects…
hint: consider pytask for efficient automation
Avoid spaces in file names. Avoid them at all costs. DO NOT PUT SPACES IN YOUR FILE NAMES.
“A space in a file name is a space in your soul.”
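One way to enforce the no-spaces rule is to scan a project for offending names. The following is a minimal sketch, not part of the original slides: the function names `find_names_with_spaces` and `suggested_name` are hypothetical, and replacing spaces with underscores is just one possible convention.

```python
from pathlib import Path


def find_names_with_spaces(root: Path) -> list[Path]:
    """Return every file or directory under `root` whose name contains a space."""
    return sorted(p for p in root.rglob("*") if " " in p.name)


def suggested_name(path: Path) -> Path:
    """Suggest a replacement with spaces turned into underscores (an assumed convention)."""
    return path.with_name(path.name.replace(" ", "_"))


# Example use from the project root:
# for offender in find_names_with_spaces(Path(".")):
#     print(f"{offender} -> {suggested_name(offender)}")
```

The sketch only reports suggestions; renaming is left to you so that nothing under version control moves without your say-so.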
Avoid things like this…
great-research-idea
|
|---analysis
| | final_analysis.R
| | final_final_analysis.R
| | last_analysis.R
|
|---data
| | clean_data.csv
| | extra_clean_data.csv
|
|---paper
| | draft1.tex
| | final_draft.tex
| | final_final_draft.tex
Related point: use sensible (but short) variable names without spaces
Some quick thoughts on workflow…
Avoid absolute path names!
If you use the same data across projects, try a “path” script that you add to your .gitignore file, or use symbolic links: ln -s ~/base_location ~/new_location_with_link
Separate your analysis and your markdown (not practical otherwise)
NEVER delete or directly change raw data files
note: particularly important on the modern job market
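The advice on paths and raw data can be sketched in code. This is an illustration under stated assumptions rather than a prescribed layout: the helper names `find_project_root` and `project_paths` are hypothetical, the `.git` folder is used as the marker for the project root, and the `data/raw` and `data/derived` subfolders are invented here to show raw inputs kept apart from anything the code writes.

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = ".git") -> Path:
    """Walk upward from `start` until a directory containing `marker` is found."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"no {marker!r} found at or above {start}")


def project_paths(root: Path) -> dict[str, Path]:
    """Build every path the analysis needs from the project root."""
    return {
        # Hypothetical layout: raw inputs are only ever read, never rewritten.
        "raw": root / "data" / "raw",
        # Hypothetical layout: cleaned or derived files are written here instead.
        "derived": root / "data" / "derived",
    }
```

Because every path is built from the detected root, the same script should run unchanged on a coauthor's machine, and cleaned files never overwrite the raw inputs.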
How do you track your versions?
Git is ideal since it is easily shareable, so you get external validity too
Virtual Environments!
Lots of ways to do this in practice:
Use Docker to basically mimic your entire machine
Use renv to make sure you note versions of all packages (in a .lock file)
renv also works with virtual Python environments: renv::use_python()
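To make concrete what a lock file records, here is a rough Python analogue. It is not how renv works internally and it is not a substitute for renv or Docker; it simply writes the name and version of every installed Python distribution to a JSON file. The function names `snapshot_versions` and `write_snapshot` are hypothetical.

```python
import json
from importlib.metadata import distributions


def snapshot_versions() -> dict[str, str]:
    """Map each installed distribution name to its version string."""
    versions = {}
    for dist in distributions():
        name = dist.metadata.get("Name")
        version = dist.metadata.get("Version")
        if name and version:  # skip distributions with incomplete metadata
            versions[name] = version
    return dict(sorted(versions.items()))


def write_snapshot(path: str) -> None:
    """Write the snapshot as JSON so it can be committed alongside the code."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(snapshot_versions(), fh, indent=2)
```

Committing a file like this next to the analysis gives a future reader a record of which versions produced the results, which is the same idea behind the renv .lock file.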
Create “empty” GitHub repo online
Clone to my system as a new project with version control
Initialize my R environment with renv; if using Stata, change the ado paths to point to a project-specific packages folder
Create project “scaffolding” with initial directories
.qmd), including paper and presentations
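The scaffolding step can be scripted so that every new project starts with the same layout. A minimal sketch, assuming the three top-level folders from the example tree earlier in the slides (analysis, data, paper); the function name `create_scaffold` is hypothetical, and the `.gitkeep` placeholder is a common convention rather than a Git feature.

```python
from pathlib import Path

# Folder names taken from the example tree earlier in the slides.
SCAFFOLD = ("analysis", "data", "paper")


def create_scaffold(root: Path, subdirs: tuple[str, ...] = SCAFFOLD) -> list[Path]:
    """Create the initial project directories under `root` and return them."""
    created = []
    for name in subdirs:
        folder = root / name
        folder.mkdir(parents=True, exist_ok=True)
        # Git does not track empty directories, so add a placeholder file.
        (folder / ".gitkeep").touch()
        created.append(folder)
    return created
```

Running `create_scaffold(Path("."))` inside a freshly cloned repository is safe to repeat, since existing folders are left as they are.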