Loading large JSON files
By Jasmine Siswandjo in data
April 12, 2024
Tackling a sizable JSON dataset (~30 million rows, 10GB) in R?
Ever tried working with a massive JSON file and had your machine wave a white flag? That’s where I found myself when I wanted to dig into New York’s Google business reviews. I tried loading it as usual into RStudio with purrr::map and jsonlite::fromJSON, but the file was so large that the progress bar wouldn’t even show up. The data is from the Google local review data, created and uploaded by Tianyang Zhang and Jiacheng Li from UCSD.
After some Googling, I realized I would have to split the data up, because my computer’s vector memory couldn’t handle reading the entire 10GB file in one go. Existing solutions online involved reading the entire file and then splitting it, but that didn’t work on my laptop for the same reason: not enough memory. My solution ended up being to read the file line by line and split it as I went.
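For context, the one-shot approach that kept failing looked roughly like this (a sketch of the idea, not the exact code I ran):

library(purrr)
library(jsonlite)

# Read every line of the 10GB file into memory at once, then parse each
# line as JSON. This is the step that exhausted my vector memory.
review_json <- map(
  readLines("review_data/review-New_York.json"),
  fromJSON,
  .progress = TRUE
)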
Slice and dice
I first used Python to split the 10GB file into 100 files of ~335,000 lines each. Why 335,000? Because I knew that reading a file with about 200,000 lines took around a minute, so this felt like a good balance between time and size.
Splitting the massive JSON file into more manageable chunks was necessary simply because of its size: a single 10GB file with 30 million rows overwhelms both memory and processing on a laptop, so working in smaller pieces keeps each step tractable.
import json
import os

input_file_path = "review_data/review-New_York.json"
output_dir = "review_data/review_data_split"
os.makedirs(output_dir, exist_ok=True)

max_lines_per_file = 335000
current_lines = 0
output_file_index = 1

with open(input_file_path, "r") as input_file:
    output_file = None
    for line in input_file:
        data = json.loads(line)
        if output_file is None:
            output_file = open(
                os.path.join(
                    output_dir,
                    f"output_{output_file_index}.json"), "w"
            )
        output_file.write(json.dumps(data) + '\n')
        current_lines += 1
        # close current output file and open a new one
        # once max lines is reached
        if current_lines >= max_lines_per_file:
            output_file.close()
            output_file_index += 1
            current_lines = 0
            output_file = open(
                os.path.join(
                    output_dir,
                    f"output_{output_file_index}.json"), "w"
            )

# close last output file
if output_file:
    output_file.close()
This created output_1.json to output_100.json. But I still needed to load them into R. All the output files are in review_data/review_data_split, and I wanted to make RData files because they’ve generally been my go-to when working with big files.
The reviews are in this format:
- user_id - ID of the reviewer
- name - name of the reviewer
- time - time of the review (unix time)
- rating - rating of the business
- text - text of the review
- pics - pictures of the review
- resp - business response to the review including unix time and text of the response
- gmap_id - ID of the business
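Before committing to the full loop, it can help to parse a single line and look at the resulting named list. The snippet below is just a quick sanity check (it assumes output_1.json from the split above exists at that path):

library(jsonlite)

# Parse only the first review of the first split file to inspect its fields
one_review <- fromJSON(
  readLines("review_data/review_data_split/output_1.json", n = 1)
)
names(one_review)
# fields include user_id, name, time, rating, text, pics, resp, gmap_id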
The code runs through two loops. The outer loop goes through the 100 output files, five at a time; the inner loop reads each line (with a handy named progress bar, thanks to the map function!), extracts the four columns I want (time, rating, text and gmap_id), and turns them into a data frame. Each batch of five files becomes one data frame, which is then saved into an RData file.
library(foreach)
library(tidyverse)

filepaths <- list.files(path = "review_data/review_data_split")

foreach(i = seq(1, 100, by = 5)) %do% {
  reviews_by_5 <- foreach(
    current_file = filepaths[i:(i + 4)], .combine = "rbind"
  ) %do% {
    # parse every line of the current file into a list of reviews
    review_json <- purrr::map(
      readLines(
        paste0("review_data/review_data_split/", current_file)
      ),
      jsonlite::fromJSON,
      .progress = paste("loading file", current_file)
    )
    # keep only the four columns of interest, with NA for missing values
    l <- map(review_json, function(x) {
      cbind(
        x$time,
        ifelse(is.null(x$rating), NA, x$rating),
        ifelse(is.null(x$text), NA, x$text),
        x$gmap_id
      )
    })
    r <- data.frame(matrix(unlist(l), ncol = 4, byrow = TRUE))
    colnames(r) <- c("time", "rating", "text", "gmap_id")
    r
  }
  # remove duplicates
  reviews_by_5 <- reviews_by_5 |> distinct()
  save(reviews_by_5, file = paste0("reviews_", ceiling(i / 5), ".RData"))
  rm(reviews_by_5)
}
Now we can combine them into one large data frame, and analyse the reviews!
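A sketch of that last step, assuming the twenty reviews_*.RData files were saved to the working directory (each one contains an object named reviews_by_5):

library(dplyr)
library(purrr)

# Load each RData file into its own environment, pull out the
# reviews_by_5 data frame it contains, and stack everything together
all_reviews <- map(1:20, function(i) {
  e <- new.env()
  load(paste0("reviews_", i, ".RData"), envir = e)
  e$reviews_by_5
}) |>
  bind_rows()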