In Good Shape – How to Successfully Prepare Your Data in R, Stata, and SPSS


In the "In Good Shape" series, various methods for optimally preparing data for analysis will be demonstrated over the coming weeks. Implementation will be shown separately in R, Stata, and SPSS, highlighting the advantages and disadvantages of each software.
Data Import and Export
No matter how good your methodological skills are, if your data aren't in the desired format, even the simplest statistical procedure can't be applied. Even worse, using incorrectly formatted data might lead you to apply the wrong statistical test entirely. For example, this can become problematic when differentiating between paired and independent samples, as shown in the post about the t-test. Such errors can drastically alter results and interpretations—sometimes without the analyst even noticing.
This entry first addresses methods for importing data into statistical software and exporting them again (after processing). While not considered a core part of data preparation, this step forms the initial and final phases surrounding data preparation. Future entries will discuss topics such as data types, data restructuring, data merging, and handling strings and date formats.
Dataset
The dataset used for demonstrating data preparation should naturally fit the "In Good Shape" theme. Therefore, we've tracked our candy consumption over two weeks. To avoid embarrassing anyone, we've used fictional names - specifically, those of our favorite statisticians. The first few rows of the dataset "Sweets," sorted by day, are shown below.

The structure and variable types will be discussed in later posts. First, let's explore how to access these imported data.
Data Import
Before beginning any data preparation, the dataset must first be imported into the appropriate software. Regardless of the software used, the data format must be considered. There are countless formats and import methods, but in practice, the most common challenge is importing data from XLS(X), CSV, or the native format of the chosen software. The process for importing these three formats into R, Stata, and SPSS is presented below. For consistency, we assume that the file—regardless of format—is named "sweets".
After each import, it is crucial to verify that the data were imported correctly. The number of variables and cases should match the original dataset. Therefore, for each software, the data view after importing will also be shown.
R
The simplest method for importing data in R is using an RData file. Typically, no additional arguments need to be specified beyond the file name and path.
Importing a CSV file is easiest with the read.csv2()
function, which is a specialized version of read.table()
. It's important to check the delimiter and decimal separator, which can be adjusted using the sep
and dec
arguments if necessary. Additionally, the argument header
should be set to TRUE
if the first row contains variable names.
For importing an Excel file (XLSX format), the read.xlsx2()
function can be used. Here, the sheet name and an optional specific range must be specified.
The functions mentioned above are just one of many possible ways to import each file type into R.
# Import RData-File
load(file = "dateipfad/sweets.RData")
# Import csv-File
read.csv2(file = "dateipfad/sweets.csv")
#Import xlsx-File, sofern Daten im 1. Tabellenblatt
library(xlsx)
read.xlsx2(file = "dateipfad/sweets.xlsx“, sheetIndex = 1)

The above figure shows the imported data. The number of variables and cases (not visible here) matches the original dataset. However, whether the variables have the correct data type for analysis will be addressed later.
Stata
In Stata, only one dataset can be open at a time. If new data need to be loaded, the current dataset must be closed first. When importing data into Stata, it is important to note that an error message will be displayed if a dataset is already open and another one is being loaded. If you are sure that you no longer need the current dataset, you can use the clear option.
The data format in Stata is called "dta" and can be loaded using the use command. Only the file path needs to be specified.
Other data types can be imported using import. With import delimited, CSV (and TXT) files can be loaded. The delimiter can be adjusted if needed using the delimiters option.
Importing an Excel file is done using import excel, where the sheet name can be specified with sheet and the cell range with cellrange. Variable names in the first row must also be defined using firstrow.
* Import dta-File
use "dateipfad/sweets.dta”, clear
* Import csv-File
import delimited "dateipfad/sweets.csv”, firstrow clear
* Import xlsx-File, sofern Daten in 1. Tabellenblatt
import excel "dateipfad/sweets.xlsx", sheet("Tabelle1") firstrow clear

In Stata, the data type can be roughly identified based on the color-coded display of variables. Red text indicates strings, black text represents all types of numeric values, and blue text is used for factors. While data types will be discussed later, it is important during import to ensure that (decimal) numbers are recognized correctly. In this specific case, the variable "obst" should be displayed in black rather than red to ensure it is interpreted as numeric.
SPSS
In this blog series, SPSS will be explained using syntax rather than dialog boxes. The native SPSS data format is called "sav". As with the previous programs, importing this format requires only specifying the file path.
The syntax for importing a CSV file is slightly more complex, especially when dealing with a large number of variables in the dataset. In this case, the delimiter must be explicitly specified. If only a single file needs to be imported, it is recommended to generate the syntax using the dialog boxes.
Similarly, the command for importing XLSX files is significantly longer compared to the other statistical software. Therefore, using dialog boxes can also simplify the import process in SPSS.
* Import sav-File.
GET
FILE='dateipfad/sweets.sav'.
* Import csv-File.
GET DATA
/TYPE=TXT
/FILE'dateipfad/sweets.csv’
/ENCODING='Locale'
/DELCASE=LINE
/DELIMITERS=","
/ARRANGEMENT=DELIMITED
/FIRSTCASE=2
/IMPORTCASE=ALL
/VARIABLES=
Tag
Mitarbeiter
Obst
Gummibärchen
Snickers
PickUp
CACHE.
EXECUTE.
* Import xlsx-File (unter der Annahme, dass die Daten im Blatt „Tabelle 1“ liegen.
GET DATA /TYPE=XLSX
/FILE='dateipfad/sweets.xlsx'
/SHEET=name 'Tabelle1'
/CELLRANGE=full
/READNAMES=on
/ASSUMEDSTRWIDTH=32767.
EXECUTE.

The resulting data corresponds (in appearance) to the original data. It can be seen that SPSS automatically converts the decimal separator if the basic settings are available.
Overview
Datenexport
After data preparation, the dataset must be exported to the desired format to avoid repeating the data preparation process each time. To ensure clear distinction, it is recommended to rename the dataset after processing. Here, the name "sweets_final" will be used.
R
Objects in R can be saved using the save() function. This requires specifying both the object to be saved and the file path.
To save the data as a CSV file, use write.csv2(), which complements the read.csv2() function. Alternatively, write.table() can also be used.
For Excel export, the complementary write.xlsx2() function is available. Specific results or plots can also be exported individually to Excel using workbook functions.
Additionally, the functions write.foreign() and write.dta() allow exporting data to formats used by other statistical software.
Note: When using the last-mentioned functions, row numbering is automatically stored as the first variable (without a name). To prevent this, set the argument row.names to FALSE.
Stata
Exporting data to the "dta" format in Stata is done using the save command, where only the file path needs to be specified. To export data in other formats, the export command is used. Similar to the import process, export delimited and export excel are used for CSV and XLSX formats, respectively.
Similar to workbook functions in R, Stata allows exporting specific results with formatting to an Excel workbook using the putexcel command. An introduction to this command can be found here.
SPSS
Data can be saved in the SAV format using the save outfile command. This also allows specifying which variables should be kept or removed.
To export data in CSV or XLSX format, use the save translate outfile command, specifying the file type as an argument.
Overview
Summary
The correct data structure is essential for the proper application of statistical methods. Before the data structure can be adjusted to meet methodological requirements, the data must first be imported correctly. Similarly, exporting data into various formats is crucial to facilitate further analysis. The approach differs between statistical software such as R, Stata, and SPSS. The easiest way to import data is by using the software's native format (RData, dta, or sav). For any questions regarding data preparation, our experts are happy to assist at info@statworx.com.