class: center, middle, inverse, title-slide .title[ # 3rd Year Workshop ] .subtitle[ ## Data workflow and version control ] .author[ ### Ian McCarthy, Emory University ] .date[ ### Economics PhD Professionalism Workshop ] --- <!-- Adjust some CSS code for font size, maintain R code font size --> <style type="text/css"> .remark-slide-content { font-size: 30px; padding: 1em 2em 1em 2em; } .remark-code, .remark-inline-code { font-size: 20px; } </style> <!-- Set R options for how code chunks are displayed and load packages --> # Keeping track These slides are all about transparency and reproducibility (for your future self and for others). Some other resources... - [Berkeley Initiative for Transparency in the Social Sciences](https://www.bitss.org/) - [Julian Reif's Guide for Replication with Stata](https://reifjulian.github.io/guide/) --- class: inverse, center, middle name: why # Why Bother with Workflow? <html><div style='float:left'></div><hr color='#EB811B' size=1px width=1055px></html> --- # Reproducibility - You'll need to come back to this one day! - Others will (hopefully) ask you about this - "I lost my hard drive" is unacceptable! --- # Automation One benefit of good tracking and workflow is that you partially automate your projects... - No repeating commands in the command line - No copying and pasting values into tables - No copying and pasting values into the text (harder to do) -- **hint: consider [pytask](https://econ-project-templates.readthedocs.io/en/stable/index.html) for efficient automation** --- class: inverse, center, middle name: basic # Some Basic Tips <html><div style='float:left'></div><hr color='#EB811B' size=1px width=1055px></html> --- # Naming files Avoid spaces in file names. Avoid them at all costs. *DO NOT PUT SPACES IN YOUR FILE NAMES.* >*"A space in a file name is a space in your soul."* --- count: false # Naming files Avoid things like this... .pre[ ``` great-research-idea | |---analysis | | final_analysis.R | | final_final_analysis.R | | last_analysis.R | |---data | | clean_data.Rds (or csv) | | extra_clean_data.Rds (or csv) | |---paper | | draft1.tex | | final_draft.tex | | final_final_draft.tex ``` ] --- count: false # Naming files - Use dates in your filenames if necessary (YYYYMMDD format) - Minimize the use of filename dates and instead try official version control - Dates useful in log files and similar output (not under version control) -- Related point...use common sense (but short) variable names without spaces --- # Data workflow Some quick thoughts on workflow... 1. Avoid absolute path names! <br> .center[ ![](https://media.giphy.com/media/3oz8xt6aO8l4UjcnXa/giphy.gif) ] --- count: false # Data workflow Some quick thoughts on workflow... 1. Avoid absolute path names! 2. If you use the same data across projects, try a "path" script that you add to your `.gitignore` file, or use symbolic links `ln -s ~/base_location ~/new_location_with_link` .center[ ![](https://media.giphy.com/media/z8rEcJ6I0hiUM/giphy.gif) ] --- count: false # Data workflow Some quick thoughts on workflow... 1. Avoid absolute path names! 2. If you use the same data across projects, try a "path" script that you add to your `.gitignore` file, or use symbolic links `ln -s ~/base_location ~/new_location_with_link` 3. Separate your analysis and your markdown (not practical otherwise) --- count: false # Data workflow Some quick thoughts on workflow... 1. Avoid absolute path names! 2. If you use the same data across projects, try a "path" script that you add to your `.gitignore` file, or use symbolic links `ln -s ~/base_location ~/new_location_with_link` 3. Separate your analysis and your markdown (not practical otherwise) 4. **NEVER** delete or directly change raw data files --- class: inverse, center, middle name: version-control # Version Control <html><div style='float:left'></div><hr color='#EB811B' size=1px width=1055px></html> --- # Why bother with version control? 1. Internal validity (replicability for your future self) 2. External validity (replicability for others) -- **note**: particularly important on the modern job market --- count: false # Version control How do you track your versions? - Don't keep two versions of the same thing - Dropbox, Google Drive, etc. offer some form of version history (but incomplete) - Overleaf has built-in version history - Commit fully to Git/GitHub -- Git is ideal since it is easily shareable, so you get external validity too --- # Virtual environments - Let's assume your code works on your system - You (should) also want it to work on other computers - How do you get other computers to look like your own? -- <br> **Virtual Environments!** --- count: false # Virtual environments Lots of ways to do this in practice: 1. Use `Docker` to basically mimic your entire machine 2. Use `renv` to make sure you note versions of all packages (in a .lock file) 3. `renv` also works with virtual Python environments `renv::use_python()` --- class: inverse, center, middle name: example # My Workflow <html><div style='float:left'></div><hr color='#EB811B' size=1px width=1055px></html> --- # My basic workflow 1. Create "empty" GitHub repo online 2. Clone to my system in `RStudio` as a new project with version control 3. Initialize my `R` environment with `renv` (more details [here](https://rstudio.github.io/renv/articles/renv.html)), if using stata...change ado paths in `Stata` to call specific packages folder ``` adopath + "stata_packages" net set ado "stata_packages" ``` 4. Create project "scaffolding" with initial directories --- # My basic workflow - Be sure to add "stata_packages" to gitignore file - Consider [cookiecutter](https://github.com/cookiecutter/cookiecutter) to create project from template - I have a template on Github if you want to use it: [R research template](https://github.com/imccart/research-template) --- # Analysis workflow - Shared Google Doc with co-authors - Acts as a shared scratch notebook - Basic headers a lists...but minimal formatting (no one else will ever see this) - To-do lists and other goals go here, removed after completing --- # My solo-author analysis and writing workflow - I write everything in `R Markdown`, including paper and presentations - Tables and figures in output folders (automate results) - Compile with `R` (Pandoc) --- # My co-authored analysis and writing workflow - Write in LaTeX with Overleaf - Compile with Overleaf using output folders (tables and figures) - Sync bibliography via Google Drive - Upload from URL, `https://drive.google.com/uc?export=download&id=name`, where name is the alphanumeric string from Google Drive link - Template on Overleaf: [Overleaf template](https://www.overleaf.com/read/tfnkdhnndxck)