Workflow and Version Control

Ian McCarthy | Emory University

Outline for Today

  1. Basic Software Requirements
  2. Understanding Version Control
  3. Folder Structure and Gitignore
  4. Symbolic Links
  5. Practice

Software Requirements

Steps

  1. Download R and/or Python
  2. Download Visual Studio Code
  3. Download Git
  4. Create an account on GitHub
  5. Download GitHub Desktop, authenticate your account, and configure Git for GitHub Desktop
  6. Install the following extensions in VS Code:
    • GitHub Copilot and GitHub Copilot Chat
    • Python and Jupyter
    • Quarto
    • R and R Extension Pack

Checklist (for R)

  1. Download R
  2. Install R Extension Pack in VS Code
    • Confirm R version:
version$version.string
[1] "R version 4.4.1 (2024-06-14 ucrt)"
    • Update R packages:
update.packages(ask = FALSE, checkBuilt = TRUE, repos = 'https://cran.us.r-project.org')

Checklist (for Python)

  1. Download Python. During installation, ensure the option Add Python to PATH is checked

  2. Install Python Extension in VS Code

  3. Set Python Interpreter in VS Code (e.g., the version installed in step 1)

    • Open a Python file
    • Click on the Python interpreter selector in the status bar at the bottom of the VS Code window (or use Ctrl+Shift+P and search for “Python: Select Interpreter”).
  4. Install Python libraries by running pip install <library> from a terminal

    • Confirm Python version:
import sys
print(f"Python version: {sys.version}")
Python version: 3.10.11 (tags/v3.10.11:7d4cc5a, Apr  5 2023, 00:38:17) [MSC v.1929 64 bit (AMD64)]

Checklist (for Git)

  • Which version of Git have you installed?

  • Did you register an account in GitHub?

  • Did you authenticate your account and configure your local Git for GitHub Desktop?
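
The first and third checklist questions can also be answered from a script. The following is a minimal sketch, assuming Git is on your PATH; the helper names git_version and git_config_value are hypothetical, and the sketch only reads settings without changing them.

```python
import shutil
import subprocess


def git_version():
    """Return the output of `git --version`, or None if Git is not on the PATH."""
    if shutil.which("git") is None:
        return None
    result = subprocess.run(
        ["git", "--version"], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


def git_config_value(key):
    """Return a global Git setting such as user.name, or None if it is not set."""
    if shutil.which("git") is None:
        return None
    # `git config --get` exits with a non-zero code when the key is missing,
    # so the exit code is deliberately not checked here.
    result = subprocess.run(
        ["git", "config", "--global", "--get", key],
        capture_output=True,
        text=True,
    )
    return result.stdout.strip() or None


print("Git version:", git_version())
print("user.name:  ", git_config_value("user.name"))
print("user.email: ", git_config_value("user.email"))
```

If user.name or user.email prints None, the local Git configuration is probably not finished yet.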

What about Open OnDemand?

  • Alternative environment to access high-performance computing (HPC) resources in a web-based application
  • Simplified interface and setup
    • Installation already handled
    • Data already uploaded
    • Significantly more computing resources than available on individual computers
    • Can still render PDFs using Quarto from within a Jupyter Notebook
  • Costs?
    • Monetary and personnel costs to Emory
    • Sacrifice some flexibility (not necessarily bad when learning)
    • No VS Code or GitHub Desktop (may require some work at the command line)

Steps with Open OnDemand

  • Access via a standard web browser
  • Authenticate GitHub [TBD]
  • Familiarize yourself with the command line and basic git commands
    • git status
    • git add
    • git commit -m 'comment'
    • git push
    • git pull (if needed)
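
The commands above can be strung together into one routine. The following is a sketch only, assuming you are inside a cloned repository with a configured remote; the function names and the commit message are hypothetical. It places git pull first, following the advice to pull before starting work.

```python
import shlex
import subprocess


def git_workflow_commands(message, files=(".",), pull_first=False):
    """Build the basic command sequence as argument lists (nothing is run here)."""
    commands = []
    if pull_first:
        commands.append(["git", "pull"])
    commands.append(["git", "status"])
    commands.append(["git", "add", *files])
    commands.append(["git", "commit", "-m", message])
    commands.append(["git", "push"])
    return commands


def run_git_workflow(message, repo_dir=".", **kwargs):
    """Run each command in order and stop at the first failure."""
    for command in git_workflow_commands(message, **kwargs):
        # check=True raises CalledProcessError on a non-zero exit code,
        # for example when there is nothing new to commit.
        subprocess.run(command, cwd=repo_dir, check=True)


# Print the commands instead of running them, so this is safe to try anywhere.
for command in git_workflow_commands("Add cleaning script", pull_first=True):
    print(shlex.join(command))
```

Typing the same commands directly in the terminal is equivalent; the sketch just makes the order explicit.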

Basics of Version Control

Heads up

  • Folders are not files…Git tracks files, so a folder with no file in it has no content. You can’t commit or push changes without content.
  • If you’re working across devices on your own repo, be sure to pull before starting and push afterward.
  • Avoid spaces in file names. Avoid them at all costs. DO NOT PUT SPACES IN YOUR FILE NAMES.

“A space in a file name is a space in your soul.”
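
A quick way to enforce the no-spaces rule is to scan a project folder for offenders. This is a minimal sketch; the function name names_with_spaces and the example file names are hypothetical.

```python
import tempfile
from pathlib import Path


def names_with_spaces(root):
    """Return relative paths of files and folders whose own name contains a space."""
    root = Path(root)
    return sorted(
        path.relative_to(root).as_posix()
        for path in root.rglob("*")
        if " " in path.name
    )


# Demonstrate on a throwaway folder with one good and one bad file name.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "clean_data.py").touch()
    (Path(tmp) / "final draft.qmd").touch()
    flagged = names_with_spaces(tmp)

print(flagged)
```

Pointing names_with_spaces at your own project folder lists anything that should be renamed, for example with underscores or hyphens.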

Ideal workflow

Until you are a Git(Hub) expert…

  1. Start project on GitHub (fork from another repo if needed)
  2. Clone to local computer with GitHub Desktop
  3. Set up structure and basic files

Folder Structure

Idea

  • Need a uniform structure that you can deploy across projects
  • For this class…we also need a good naming convention for assignments
  • Also need to optimize storage space and avoid pushing things that shouldn’t be under version control
    • e.g., large data sets or documents with significant “code overhead”

File paths

  • Two types of file paths:
    • Absolute paths, like C:\Users\username\projects\...
    • Relative paths, like data/code/...
  • Slashes…
    • Windows defaults to the backslash \ but also understands the forward slash /. The reason is historical: MS-DOS had already used the forward slash for command-line options.
    • Mac and Linux (basically everything else) use forward slashes by default.
    • So…use forward slashes in your file paths
  • Some relative path tricks
    • a new slash / implies a subfolder or subdirectory
    • .. says go “up” one level into the parent directory, e.g., ../data says to leave the current folder for its parent directory and then go into that directory’s data subfolder
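
The path rules above can be checked with Python's standard library. This is a small sketch; the folder names (ma-project, code, data) are placeholders chosen for illustration, and the helper names are hypothetical.

```python
import posixpath
from pathlib import PureWindowsPath


def to_forward_slashes(windows_path):
    """Rewrite a Windows-style path with forward slashes."""
    return PureWindowsPath(windows_path).as_posix()


def follow_relative(start_dir, relative_path):
    """Show where a relative path ends up when followed from start_dir."""
    return posixpath.normpath(posixpath.join(start_dir, relative_path))


# An absolute Windows path written with forward slashes.
print(to_forward_slashes(r"C:\Users\username\projects\ma-project"))

# From inside the code folder, ../data goes up one level and then into data.
print(follow_relative("ma-project/code", "../data"))
```

The second call returns ma-project/data: the .. cancels out the code folder before data is appended.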

Why use a .gitignore?

  • Keeps unnecessary or sensitive files out of version control
  • Located in the root directory of the project
  • Avoids cluttering your repo with:
    • Large generated files
    • Machine-specific files
    • Temporary or secret files (e.g., tokens, API keys)

Examples of files to ignore

  • Editor / IDE / OS files
    • .Rproj.user/, .Rhistory, .RData
    • .vscode/, .DS_Store
    • These files change constantly and are machine-specific
  • Large or generated data files or entire folders
    • data/output/
    • *.log
    • *.tmp

gitignore example

# R / RStudio
.Rproj.user/
.Rhistory
.RData

# OS / editor
.DS_Store
.vscode/

# Generated data / logs
data/output/
*.log
*.tmp

# Quarto build files
*_files/
*_cache/
_freeze/
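
To build intuition for what these patterns catch, here is a deliberately simplified matcher. It is not Git's actual matching algorithm: it assumes only the three pattern shapes used in the example (a folder name ending in /, a folder path from the project root ending in /, and a file-name wildcard), and is_ignored is a hypothetical helper.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

PATTERNS = [
    ".Rproj.user/", ".Rhistory", ".RData",   # R / RStudio
    ".DS_Store", ".vscode/",                 # OS / editor
    "data/output/", "*.log", "*.tmp",        # generated data / logs
    "*_files/", "*_cache/", "_freeze/",      # Quarto build files
]


def is_ignored(path, patterns=PATTERNS):
    """Approximate check of a file path (relative to the project root)."""
    parts = PurePosixPath(path).parts
    if not parts:
        return False
    for pattern in patterns:
        if pattern.endswith("/"):
            name = pattern.rstrip("/")
            if "/" in name:
                # Folder path from the project root, e.g. data/output/
                prefix = PurePosixPath(name).parts
                if len(parts) > len(prefix) and parts[:len(prefix)] == prefix:
                    return True
            elif any(fnmatchcase(part, name) for part in parts[:-1]):
                # Folder name at any depth, e.g. .vscode/ or *_files/
                return True
        elif fnmatchcase(parts[-1], pattern):
            # File-name pattern, e.g. *.log
            return True
    return False


for candidate in ["code/analysis.py", "code/run.log", "data/output/results.csv",
                  "data/input/raw.csv", "paper_files/figure-1.png"]:
    print(candidate, "->", "ignored" if is_ignored(candidate) else "tracked")
```

For an authoritative answer inside a real repository, git check-ignore -v <path> reports which pattern, if any, applies to a path.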

gitignore and large data analysis

We need to be careful with version control when working with large datasets:

  • Definitely add raw data folder(s) to .gitignore
  • What about .ipynb files?
    • Tend to embed images and other outputs into the notebook (not what you want under version control)
    • Be sure to “Clear All Outputs” before committing (or use Jupytext to do this automatically)
    • One reason why separating analysis from output/presentation is good practice
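
To see what clearing outputs does to a notebook file, here is a minimal sketch. It assumes the standard notebook JSON layout (a top-level "cells" list in which code cells carry "outputs" and "execution_count"); strip_outputs is a hypothetical helper, not part of Jupyter or Jupytext.

```python
import copy
import json


def strip_outputs(notebook):
    """Return a copy of a notebook dict with code-cell outputs removed."""
    cleaned = copy.deepcopy(notebook)
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


# A tiny in-memory notebook: one markdown cell and one executed code cell.
example = {
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": 3,
            "source": ["1 + 1"],
            "outputs": [
                {
                    "output_type": "execute_result",
                    "execution_count": 3,
                    "data": {"text/plain": ["2"]},
                    "metadata": {},
                }
            ],
        },
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
}

cleaned = strip_outputs(example)
print(json.dumps(cleaned["cells"][1], indent=2))
```

For a real .ipynb file, load it with json.load, pass the result through strip_outputs, and write it back with json.dump. The existing command jupyter nbconvert --clear-output --inplace <notebook> does the same job without custom code.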

Analysis versus “presentation”

  • With large datasets, keep a data/analysis layer (scripts, notebooks, pipelines)
    • Reads/wrangles data
    • Produces cleaned data, tables, and figures as outputs
  • Keep a separate presentation layer (Quarto/LaTeX/qmd)
    • Uses the outputs (tables, figures, summaries)
    • Does not store heavy data or binary state
  • Tools like Jupytext follow this principle
    • Strip outputs / state
    • Store only the code and structure in plain text for version control
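
The split between the two layers can be sketched in a few lines. This is an illustration under assumed names: the results folder, the summary.csv file, and both functions are hypothetical, and the toy data stands in for a large dataset.

```python
import csv
import tempfile
from pathlib import Path


def run_analysis(rows, output_dir):
    """Analysis layer: summarize the data and save a small table to disk."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    totals = {}
    for group, value in rows:
        totals[group] = totals.get(group, 0) + value
    table_path = output_dir / "summary.csv"
    with table_path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["group", "total"])
        for group in sorted(totals):
            writer.writerow([group, totals[group]])
    return table_path


def load_summary(table_path):
    """Presentation layer: read only the small output table, never the raw data."""
    with Path(table_path).open(newline="") as f:
        return list(csv.DictReader(f))


# Use a throwaway folder so the sketch leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    table = run_analysis([("a", 1), ("b", 2), ("a", 3)], Path(tmp) / "results")
    summary = load_summary(table)

print(summary)
```

The presentation document only ever touches the small summary table, so it stays light even when the raw data is large.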

Symbolic Links

Examples

In macOS/Linux:

ln -s /Users/username/data/claims_100pct \
      /Users/username/projects/ma-project/data
  • First path is the real data folder
  • Second path is the location of the new link

Windows:

mklink /D "C:\projects\ma-project\data" "D:\shared\data\claims_100pct"
  • /D creates a directory symbolic link (i.e., a link to an entire folder instead of a file)
  • First path is the location of the new link
  • Second path is the real data folder
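
The same link can be created from Python on either operating system. This is a sketch using temporary folders in place of the real paths; link_data_folder is a hypothetical helper. On Windows, creating symbolic links may require an elevated prompt or Developer Mode.

```python
import tempfile
from pathlib import Path


def link_data_folder(real_data_dir, link_path):
    """Create a symbolic link at link_path that points to real_data_dir."""
    link_path = Path(link_path)
    if link_path.is_symlink() or link_path.exists():
        raise FileExistsError(f"{link_path} already exists")
    link_path.parent.mkdir(parents=True, exist_ok=True)
    # target_is_directory only matters on Windows; it is ignored elsewhere.
    link_path.symlink_to(Path(real_data_dir), target_is_directory=True)
    return link_path


# Stand-ins for the real data folder and the project folder.
with tempfile.TemporaryDirectory() as tmp:
    real = Path(tmp) / "storage" / "claims_100pct"
    real.mkdir(parents=True)
    (real / "claims.csv").write_text("id,amount\n1,100\n")

    link = link_data_folder(real, Path(tmp) / "ma-project" / "data")
    link_is_symlink = link.is_symlink()
    linked_files = sorted(p.name for p in link.iterdir())

print(link_is_symlink, linked_files)
```

Files in the real data folder are visible through the link, so code in the project can use the relative path data/ while the data itself lives elsewhere.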

Practice!

  1. Start a repository on GitHub
  2. Clone repository to your local computer or Open OnDemand
  3. Create a basic folder structure
  4. Create a README and a .gitignore file
  5. Create a symbolic link
  6. Commit changes to git
  7. Push changes to GitHub
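
Steps 3 and 4 can be scripted so the same structure is reused across projects. This is a minimal sketch: the folder names (code, results), the README title, and the scaffold_project function are hypothetical choices, and the .gitignore lines are a subset of the earlier example.

```python
import tempfile
from pathlib import Path

FOLDERS = ["code", "results"]  # hypothetical layout; adjust to your project
GITIGNORE_LINES = [".DS_Store", ".vscode/", "data/output/", "*.log", "*.tmp"]


def scaffold_project(root, title="my-project"):
    """Create basic folders, a README, and a .gitignore inside root."""
    root = Path(root)
    for folder in FOLDERS:
        (root / folder).mkdir(parents=True, exist_ok=True)
    readme = root / "README.md"
    if not readme.exists():
        readme.write_text(f"# {title}\n")
    gitignore = root / ".gitignore"
    if not gitignore.exists():
        gitignore.write_text("\n".join(GITIGNORE_LINES) + "\n")
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*"))


# Try it in a throwaway folder; pass your cloned repository path instead.
with tempfile.TemporaryDirectory() as tmp:
    created = scaffold_project(tmp)
    gitignore_text = (Path(tmp) / ".gitignore").read_text()

print(created)
```

The empty code and results folders will not show up on GitHub until each contains at least one file, since Git only tracks files.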