The Plain Person’s Guide to Plain Text Social Science

Table of Contents

1. The Plain Person’s Guide to Plain Text Social Science

1.1. Introduction

The short version is:

  • Use tools that give you more control over the process of data analysis and writing.
    Write prose and code using a good text editor; analyze quantitative data with R and RStudio, or use Stata.
  • Minimize error by storing your work in a simple format (plain text is best).
  • Make a habit of documenting what you’ve done.
  • For data analysis, consider using a format like RMarkdown and tools like knitr to make your work more easily reproducible for your future self.
  • Use Pandoc to turn your plain-text documents into PDF, HTML, or Word files to share with others.
  • Keep your projects in a version control system.
  • Back everything up regularly.
  • Make your computer work for you by automating as many of these steps as you can.
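
The Pandoc step above can be run from the command line. A minimal sketch, using a hypothetical file named notes.md and skipping gracefully if pandoc is not installed:

```shell
# Create a small example Markdown file (hypothetical content).
printf '# Notes\n\nSome *plain text* prose.\n' > notes.md

# Convert it with pandoc if available; -s produces a standalone file.
# (PDF output additionally requires a LaTeX engine.)
if command -v pandoc >/dev/null 2>&1; then
  pandoc -s notes.md -o notes.html   # HTML page
  pandoc -s notes.md -o notes.docx   # Word document
else
  echo "pandoc not installed"
fi
```

The same source file feeds every output format, which is the point: you edit one plain-text master and regenerate the rest.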

1.1.1. Motivation

You can do productive, maintainable and reproducible work with all kinds of different software set-ups; there is no One True Way to organize things. Still, two points are worth keeping in mind:

  1. First, the transition to graduate school is a good time to make changes.
    • there’s less inertia and cost associated with switching things around than later.
  2. Second, in the social sciences, text and data management skills are usually not taught to students explicitly ⇒ you may end up adopting the practices you first encounter, which may not be optimal

1.1.2. Two Revolutions in Computing

Two ongoing computing revolutions are tending to pull in opposite directions.

  • On one side, the mobile, cloud-centered, touch-screen, phone-or-tablet model brings powerful computing to more people than ever before.
    This revolution is the one everyone is talking about, because it is happening on a huge scale and is where all the money is.
    It puts single-purpose applications in the foreground.
    It hides the workings of the operating system from the user, and
    it goes out of its way to simplify or completely hide the structure of the file system where items are stored and moved around.
  • On the other side, open-source tools for plain-text coding, data analysis, and writing are also better and more accessible than they have ever been.
    This has happened on a smaller scale than the first revolution, of course.
    But still, these tools really have revolutionized the availability and practice of data analysis and scientific computing generally, and they continue to do so as people work to make them better.
    They mostly work by joining separate, specialized widgets into a reproducible workflow, because the process of data analysis is that way, too.
    They do much less to hide the operating system layer (instead they often directly mesh with it), and they often presuppose a working knowledge of the file system underpinning the organization of everything the researcher uses, from data files to code to figures and final papers.

1.1.3. The Office Model and the Engineering Model

1.1.3.1. Office Model

Office solutions tend towards a cluster of tools where something like Microsoft Word is at the center of your work.
A Word file or set of files is the most “real” thing in your project.
Changes are tracked inside the file. Citation and reference managers, and the outputs of data analyses (tables, figures, …), are also inside that file or files.
The master document may be passed around from person to person to be edited and updated.
The final output is exported from it, but maybe most often the final output just is the .docx file, cleaned up and with the track changes feature turned off.

1.1.3.2. Engineering Model

In the engineering model, plain-text files are at the center of your work.
The most “real” thing in your project will either be those files or, more likely, the version control repository that stores the project.
Changes are tracked outside of files, again using a version control system.
Data analysis is managed in code that produces outputs in (ideally) a known and reproducible manner.
Citation and reference management will likely also be done in plain text, as with a BibTeX .bib file.
Final outputs are assembled from the plain text and turned into .tex, .html, or .pdf using some kind of typesetting or conversion tool.
Very often, because of some unavoidable facts about the world, the final output of this kind of solution is also a .docx file.
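
Citation management in plain text, as mentioned above, typically means keeping a BibTeX .bib file. A sketch of what one entry might look like (the entry itself is a made-up placeholder, not a real reference):

```bibtex
@article{smith2020,
  author  = {Smith, Jane},
  title   = {An Example Article},
  journal = {Journal of Placeholder Studies},
  year    = {2020}
}
```

Cite keys like smith2020 are then referenced from the text, and the typesetting or conversion tool formats the bibliography automatically.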

1.2. Keep a Record

1.2.1. Make Sure You Know What You Did

  1. do your work in a way that leaves a coherent record of your actions
    • write down what you did as a documented piece of code
    • Rather than figuring out but not recording a solution to a problem you might have again, write down the answer as an explicit procedure
    • Instead of copying out some archival material without much context, file the source properly, or at least record a precise reference to it.
  2. a document, file or folder should always be able to tell you what it is
    • Beyond making your work reproducible, you will also need some method for organizing and documenting
    • this may mean little more than keeping your work in plain text and giving it a descriptive name
    • It should generally not mean investing time creating some elaborate classification scheme that becomes an end in itself to maintain.
  3. repetitive and error-prone processes should be automated if possible
    • Rather than copying and pasting code, write a general function
    • Instead of retyping and reformatting the bibliography, use software that can manage this for you automatically.
  4. software applications are not all created equal, and some make it easier than others to do the Right Thing.
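
The "write a general function" point can be sketched in shell terms: rather than pasting the same extraction pipeline for every file, define it once and call it by name (the file names and column layout here are hypothetical):

```shell
# Instead of copy-pasting the same pipeline for each data file,
# define it once as a function and reuse it.
extract_column () {
  # $1 = file, $2 = column number; fields assumed comma-separated
  cut -d, -f"$2" "$1" | tail -n +2   # drop the header row
}

# Example data (hypothetical).
printf 'id,score\n1,10\n2,20\n' > a.csv
printf 'id,score\n3,30\n' > b.csv

extract_column a.csv 2   # prints 10 and 20
extract_column b.csv 2   # prints 30
```

If the extraction logic ever needs to change, it changes in exactly one place.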

1.2.2. Use Version Control

  • Writing involves a lot of editing and revision. Data analysis involves cleaning files, visualizing information, running models, and repeatedly re-checking your code for mistakes. You need to keep track of this work.
  • As projects grow and change, and as you explore different ideas or lines of inquiry, the task of documenting your work at the level of particular pieces of code or edits to paragraphs in individual files can become more involved over time
  • A good version control system allows you to “rewind the tape”

1.2.2.1. Benefits of using version control

  • It combines the virtues of “track changes” with those of backups.
    • Every repository is a complete, self-contained, cryptographically signed copy of the project, with a log of every recorded step in its development by all of its participants.
  • It puts you in the habit of committing changes to a file or project piecemeal as you work on it, and (briefly) documenting those changes as you go.
  • It allows you to easily test out alternative lines of development or thinking by creating “branches” of a project.
  • It allows collaborators to work on a project at the same time without sending endless versions of the “master” copy back and forth via email.
  • It provides powerful tools that allow you to automatically merge or (when necessary) manually compare changes that you or others have made.
  • It lets you revisit any stage of a project’s development at will and reconstruct what it was you were doing.
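
The habits above can be sketched with git in a scratch directory (the project and file names are hypothetical; the -c flags set a throwaway identity so the commit works anywhere):

```shell
set -e
mkdir -p demo-project
git -C demo-project init -q                 # start a repository
printf 'Draft of section 1.\n' > demo-project/paper.md
git -C demo-project add paper.md            # stage the change
git -C demo-project -c user.name=Me -c user.email=me@example.com \
    commit -q -m "Add first draft of paper" # record it, with a note
# Test out an alternative line of development on a branch.
git -C demo-project branch alternative-intro
# "Rewind the tape": the log records every committed step.
git -C demo-project log --oneline
```

Each commit message is the brief piecemeal documentation mentioned above; the log is what lets you reconstruct what you were doing.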

1.2.3. Back Up Your Work

Regardless of whether you choose to use a formal revision control system, you should have some kind of systematic method for keeping track of versions of your files.
Offsite backup means that in the event (unlikely, but not unheard of) that your computer and your local backups are stolen or destroyed, you will still have copies of your files.
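
A sketch of one simple snapshot step (directory names are hypothetical; a real offsite backup would additionally copy the archive to another machine or service):

```shell
set -e
mkdir -p project backups
printf 'data\n' > project/notes.txt          # stand-in project content
# Make a date-stamped archive of the whole project directory.
stamp=$(date +%Y-%m-%d)
tar -czf "backups/project-$stamp.tar.gz" project
ls backups
```

Running a step like this on a schedule, rather than by hand, is one of the automation habits mentioned earlier.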

1.3. Write and Edit

1.3.1. Use a Text Editor

working with any highly structured document subject to a lot of revision ⇒ write using a good text editor.
Text editors focus on working with text efficiently while keeping it in a plain and portable format, as opposed to binary file formats like .docx.
Good editors offer syntax highlighting, automatic formatting, linting, …
Combined with the right document-preparation tools, your editor can automatically take care of things like entries in your bibliography, the labelling of tables and figures, and cross-references and other paraphernalia.
The best editors can closely integrate with the tools you use to do the various pieces of your work.

1.3.2. Use Markdown
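
Markdown marks up plain text with a small set of readable conventions, which tools like Pandoc can then convert to other formats. A short sketch of a fragment:

```markdown
# A Heading

Some *emphasized* and **bold** prose, with a [link](https://example.com).

- A bullet point
- Another bullet point

> A block quotation.
```

The source stays legible as plain text even before conversion, which is much of the appeal.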

1.4. Reproduce Work

1.4.1. Minimize Error

minimizing error means addressing two related problems:

  1. Find ways to further reduce the opportunity for errors to creep in without you noticing.
    This is especially important when it comes to coding and analyzing data.
  2. Find a way to figure out, retrospectively, what it was you did to generate a particular result.
    Using a revision control system gets us a good distance down this road.
    But there is more we can do at the level of particular reports or papers.


  • When you write code, it is often in the process of doing some analysis on the fly. As a rule, you should try to document your work as you go.
    This usually means:
    1. adding (brief, but useful) comments to your work to explain what a piece of code is meant to do
    2. trying to write your code so that it is readable
    3. not repeating yourself (avoid copy-paste; define functions and utilities)
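
The first two habits can be sketched in a short, documented script (the file names and data are entirely hypothetical):

```shell
# A hypothetical, documented cleaning step: comments explain intent,
# variable names explain content.
set -e

raw_file="survey-raw.csv"
clean_file="survey-clean.csv"

printf 'id,age\n1,34\n2,\n3,29\n' > "$raw_file"   # example input

# Drop rows with a missing age field (second column empty),
# keeping the header row.
awk -F, 'NR == 1 || $2 != ""' "$raw_file" > "$clean_file"

cat "$clean_file"
```

A reader (including your future self) can see from the comments what each step was meant to do, not just what it happens to do.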

1.4.2. From Server Farm to Data Table

Errors in data analysis often well up out of the gap that typically exists between the procedure used to produce a figure or table in a paper and the subsequent use of that output later (usually stored separately).
Each of these transitions introduces the opportunity for error. In particular, it is easy for a table of results to get detached from the sequence of steps that produced it.

1.4.3. Use RMarkdown and knitr (Literate Programming)

Integrate code and written prose in a single plain-text document.
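
An RMarkdown document interleaves prose with executable chunks; knitr runs the code and splices the results into the output, so tables and figures cannot get detached from the code that produced them. A sketch of a minimal .Rmd fragment (hypothetical content):

````markdown
---
title: "A Reproducible Report"
output: pdf_document
---

The mean is computed directly from the data each time the
document is rendered:

```{r}
x <- c(1, 2, 3)
mean(x)
```
````

Rendering the file (for example with rmarkdown::render in R) regenerates every result from scratch.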

1.5. Pull it Together

1.6. An Emacs Starter Kit

1.7. Do I Have To Use This Stuff?

1.8. Links to Other Resources

Author: Julian Lopez Carballal

Created: 2024-09-16 Mon 05:18