Managing project files
Monday, March 27, 2006
As I accumulate projects (both new and completed), the maintenance and storage of the project files become increasingly important. There are two important goals for a file structure: find things quickly and don’t lose anything. My current strategy is as follows:
Every project has a consistent folder structure:
|-- analysis
|-- data
|-- db
|-- doc
|-- fig
`-- utils
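A minimal sketch of how this skeleton could be created for each new project. The function name make_project is hypothetical and used only for illustration; the six folder names are the ones listed above.

```shell
#!/bin/sh
# make_project is a hypothetical helper name, chosen for illustration only.
# It creates the six standard folders under the given project root.
make_project() {
    root=$1
    if [ -z "$root" ]; then
        echo "usage: make_project DIR" >&2
        return 1
    fi
    for dir in analysis data db doc fig utils; do
        # -p creates missing parents and is a no-op if the folder exists
        mkdir -p "$root/$dir" || return 1
    done
}
```

Calling make_project with a directory name creates that directory with the standard layout inside it. Running it again on an existing project is harmless, since mkdir -p leaves existing folders untouched.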
analysis holds R source files, typically experiments and snippets of code. The main analyses live in the doc directory.
data contains data files. Generally, these are CSV files from clients.
db is for SQL dumps of databases and SQLite files. I prefer working with databases over flat text files or Excel spreadsheets, so the original flat files are kept in the data folder and converted to SQL databases for analysis.
doc holds the analysis and writeup as a Sweave file. This combines R and LaTeX to create a complete document from one source file.
fig is for diagrams and plots. Many of these are generated when processing the Sweave file, but some are constructed from other sources.
utils holds scripts and binaries that are required to run the analysis.
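The conversion from a CSV file in data to a database in db could be sketched as follows, assuming the sqlite3 command-line shell is installed. The function name csv_to_sqlite and any file or table names passed to it are hypothetical; this is one possible approach, not necessarily the one used for these projects.

```shell
#!/bin/sh
# csv_to_sqlite is a hypothetical helper name, chosen for illustration only.
# Usage: csv_to_sqlite CSV_FILE DB_FILE TABLE
# Assumes the sqlite3 command-line shell is on the PATH.
csv_to_sqlite() {
    csv=$1
    db=$2
    table=$3
    if [ ! -f "$csv" ]; then
        echo "no such file: $csv" >&2
        return 1
    fi
    # In csv mode, .import creates the table if it does not exist yet
    # and takes the column names from the first row of the file.
    sqlite3 "$db" <<EOF
.mode csv
.import "$csv" "$table"
EOF
}
```

All imported columns are stored as text, so numeric columns may need to be cast in later queries or when the data are read into R.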
This entire directory structure is maintained with Subversion, so I have a record of changes and can access the project files from any networked computer.
Finally, once a project is complete, I archive the project as a zip file and compute a SHA-1 checksum of the archive.
openssl dgst -sha1 -out checksums.txt archive.zip
This checksum allows me to verify that the archive remains stable over time. Coupled with a good backup routine, this should keep the project files safe.
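Checking an archive against its recorded checksum later could look like the sketch below. It reruns the same openssl command into a temporary file and compares the result with the stored one. The function name verify_archive is hypothetical, and the sketch assumes the same version of openssl is used for both runs so that the output format matches.

```shell
#!/bin/sh
# verify_archive is a hypothetical helper name, chosen for illustration only.
# Usage: verify_archive ARCHIVE CHECKSUM_FILE
# Returns 0 if the archive still matches, 1 if it differs, 2 on other errors.
# The openssl output includes the archive path as typed, so pass the same
# path that was used when the checksum file was first written.
verify_archive() {
    archive=$1
    recorded=$2
    fresh=$(mktemp "${TMPDIR:-/tmp}/sha1check.XXXXXX") || return 2
    if ! openssl dgst -sha1 -out "$fresh" "$archive"; then
        rm -f "$fresh"
        return 2
    fi
    # cmp -s is silent and only reports through its exit status
    if cmp -s "$recorded" "$fresh"; then
        result=0
    else
        result=1
    fi
    rm -f "$fresh"
    return "$result"
}
```

Run from the folder that holds the archive, verify_archive archive.zip checksums.txt succeeds only if the file is byte-for-byte what it was when the checksum was recorded.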
This may seem elaborate, but data and their analyses are too important to be left scattered around a laptop’s hard drive.
One other approach I’ve considered is using the R package structure to maintain projects. This is a useful guide, but the process seems too involved for my purposes.