Data Files - Text Manipulations and Data Importing

Data files come in many different flavors and shapes, and your success in obtaining meaningful results depends on your ability to process them into a form that your stats program of choice understands. Although the requirements for importing data vary greatly among programs, the most widely recognized structure separates the column entries within a line with a "tab" character (ASCII name: "ht"; C escape sequence: "\t"; hexadecimal character code 0x09).

Each line (one row of data points) is separated from the next by an "end of line" character. Unfortunately, these vary depending on the computer platform. Text files formatted for classic Mac OS (before OS X) end a line with a "carriage return" character (ASCII name: "cr"; C escape sequence: "\r"; hexadecimal character code 0x0D). Files formatted for Unix (including Mac OS X) start a new line with a "line feed" character (ASCII name: "lf" or "nl"; C escape sequence: "\n"; hexadecimal character code 0x0A). DOS/Windows files use both characters in sequence ("cr" followed by "lf"). A good text editor will allow you to convert between these.

Take a good look at the file first: How many rows are there? Are there multiple entries per line? How are they separated? What line endings are there? Are there missing data points? Does the first line contain data or variable names? Based on these considerations, design a strategy to process the file, if needed. The following exercises are meant to familiarize you with some common tasks.
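If no suitable text editor is at hand, the line-ending check and conversion described above can also be scripted. The following is a minimal Python sketch, not a prescribed method for this course; the function names and the sample bytes are made up for illustration.

```python
def count_line_endings(data: bytes) -> dict:
    """Count each end-of-line style in raw file bytes."""
    crlf = data.count(b"\r\n")        # DOS/Windows: "cr" followed by "lf"
    cr = data.count(b"\r") - crlf     # classic Mac OS: lone "cr"
    lf = data.count(b"\n") - crlf     # Unix: lone "lf"
    return {"crlf": crlf, "cr": cr, "lf": lf}


def normalize_to_lf(data: bytes) -> bytes:
    """Convert every line ending to a Unix "lf"."""
    # Replace the two-character sequence first so it is not converted twice.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


# Hypothetical sample: a tab-delimited file with DOS/Windows line endings.
sample = b"name\tstate\r\nAnn\tOH\r\nBob\tMI\r\n"
print(count_line_endings(sample))
print(normalize_to_lf(sample))
```

For a real file, read it in binary mode (open(path, "rb")) so that Python does not translate the line endings before you can inspect them.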

Exercise 1: Make sure you can replicate these steps

Know how to read in and manipulate data and how to obtain descriptive statistics in your stats program.
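As one possible illustration of these steps, here is a small Python sketch that reads tab-delimited text and computes simple descriptives. It assumes Python stands in for the stats program; the variable names ("state", "score") and the data values are invented for the example and do not come from any course file.

```python
import csv
import io
import statistics
from collections import Counter

# Hypothetical tab-delimited data; the first line holds the variable names.
sample = "state\tscore\nOH\t88\nMI\t92\nOH\t75\n"


def read_tab_delimited(handle):
    """Read tab-delimited text into a list of dicts keyed by variable name."""
    return list(csv.DictReader(handle, delimiter="\t"))


rows = read_tab_delimited(io.StringIO(sample))

# Categorical variable: a frequency table is the natural descriptive.
state_counts = Counter(row["state"] for row in rows)

# Numeric variable: convert from text before computing summary statistics.
scores = [float(row["score"]) for row in rows]
mean_score = statistics.mean(scores)

print(state_counts.most_common())
print(mean_score)
```

The same pattern applies to a file on disk by passing an open file handle instead of the io.StringIO object.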

Exercise 3: Mainframe-style Files

  1. Download the file "Students.txt". Save the content of the file as a TEXT file and you will be able to view and edit it with any text editor, such as BBEdit. It contains data on students who have taken biology courses. The data are arranged in column format as exported from a mainframe. Consult the file "StudentsLayout.txt" for information about what variables are present and what columns they occupy.
  2. Create a GREP search/replace pattern to tear the columns apart into separate entries. There are many manuals that explain GREP-style arguments. Here is some info on the following topics: What is GREP?, GREP Search Patterns, Replacement Patterns, Examples, and Advanced GREP.
  3. Import the file into your stats program of choice.
  4. Obtain descriptive data for the students' state of residence.
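The idea behind step 2 can be sketched in Python, whose re module uses a pattern syntax close to the GREP syntax of most text editors. The layout here is HYPOTHETICAL (an 8-character ID, a 20-character name, and a 2-character state code); the real column positions are the ones given in "StudentsLayout.txt", so the pattern must be adapted before it is of any use on "Students.txt".

```python
import re

# HYPOTHETICAL layout, for illustration only:
#   columns 1-8 ID, columns 9-28 name, columns 29-30 state.
# Each parenthesized group captures one fixed-width column.
PATTERN = re.compile(r"^(.{8})(.{20})(.{2})$", re.MULTILINE)


def columns_to_tabs(text: str) -> str:
    """Turn fixed-width lines into tab-delimited lines."""
    # The replacement pattern re-emits the groups with tabs between them.
    tabbed = PATTERN.sub(r"\1\t\2\t\3", text)
    # Drop the space padding left at the end of each column.
    return re.sub(r" +(?=\t|$)", "", tabbed, flags=re.MULTILINE)


# Two made-up records, padded to the assumed column widths.
records = "\n".join([
    "10000001" + "Smith, Ann".ljust(20) + "OH",
    "10000002" + "Lee, Bo".ljust(20) + "MI",
])
print(columns_to_tabs(records))
```

In a text editor the same search pattern would go into the "find" field and the replacement pattern into the "replace" field, although some editors write the tab and the group references slightly differently.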

Competences earned this week:

  1. Process and import data as text files into your statistics program
  2. Obtain and interpret descriptive statistics for different variable types
  3. Start to collect your tools for manipulating text, analyzing data, etc.

Special Reward:

  1. The shortest GREP pattern for Exercise 3 (measured in number of characters) will automatically earn 10% towards the 100% class grade.

last modified: 1/20/15