Loading large JSON files

By Jasmine Siswandjo

April 12, 2024

Tackling a sizable JSON (~30 million rows, 10GB) dataset in R?

Here was my approach, combining Python and R.

I wanted to work on New York’s Google business reviews, and just loading the data was a challenge. The JSON file had 30 million rows and was originally 10GB. I tried loading it as usual into R with purrr::map and jsonlite::fromJSON, but it was so large that the progress bar wouldn’t even show up. The data comes from the Google Local review dataset, created and uploaded by Tianyang Zhang and Jiacheng Li from UCSD.
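
For reference, that first attempt looked roughly like this: read all 30 million lines into memory with readLines(), then parse each line with jsonlite::fromJSON. (This is a sketch of the naive approach; the file path is the same one used in the Python step below.)

# naive approach: pull the entire 10GB file into memory, then parse each line
review_json <- purrr::map(
  readLines("review_data/review-New_York.json"),
  jsonlite::fromJSON,
  .progress = TRUE # this is the progress bar that never showed up
)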

After some Googling, I realized I would have to split the data up, because my computer’s vector memory couldn’t handle reading the entire 10GB file in one go. Existing solutions online involved reading the entire file and then splitting it, but that didn’t work on my laptop for the same reason: not enough memory. My solution ended up being to read the file line by line, splitting it into smaller files as I went.

Slice and dice

I first used Python to split the 10GB file into 100 files of ~335,000 lines each. I aimed for about 335,000 lines per file because I knew it took about a minute to read the metadata file, which has about 200,000 lines, so each chunk would only take a minute or two to load.

Slicing and dicing the massive JSON file into more manageable chunks was necessary purely because of its size: a single 10GB file with 30 million rows is overwhelming for both memory and processing, especially on a machine with limited resources.

import json
import os

input_file_path = "review_data/review-New_York.json"
output_dir = "review_data/review_data_split"
os.makedirs(output_dir, exist_ok=True)

max_lines_per_file = 335000
current_lines = 0
output_file_index = 1

with open(input_file_path, "r") as input_file:
    output_file = None
    
    for line in input_file:
        # parse each line (this will raise an error on any line that isn't valid JSON)
        data = json.loads(line)
        
        if output_file is None:
            output_file = open(
              os.path.join(
                output_dir, 
                f"output_{output_file_index}.json"), "w"
              )
        
        output_file.write(json.dumps(data) + '\n')
        current_lines += 1
        
        # close the current output file once max lines is reached;
        # the next line (if any) will open a new file via the check
        # above, so we never end up with an empty trailing file
        if current_lines >= max_lines_per_file:
            output_file.close()
            output_file_index += 1
            current_lines = 0
            output_file = None

# close last output file
if output_file:
    output_file.close()

This created output_1.json to output_100.json. But I still needed to load them into R. All the output files are in review_data/review_data_split, and I wanted to make RData files because they’ve generally been my go-to when working with big files.

The reviews are in this format (a quick sketch for inspecting one record follows the list):

  • user_id - ID of the reviewer
  • name - name of the reviewer
  • time - time of the review (unix time)
  • rating - rating of the business
  • text - text of the review
  • pics - pictures of the review
  • resp - business response to the review including unix time and text of the response
  • gmap_id - ID of the business
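
Before writing the full loading loop, it helps to sanity-check one record. This is a minimal sketch, assuming output_1.json from the Python step is in place: it parses the first line of that file and prints its structure.

# parse the first line of one split file and look at its fields
one_review <- jsonlite::fromJSON(
  readLines("review_data/review_data_split/output_1.json", n = 1)
)
str(one_review)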

The code runs through two loops. The first loop goes through the 100 output files, working on 5 at a time; the second loop reads each line (with a handy named progress bar, thanks to the map function!), extracts the 4 columns I want (time, rating, text and gmap_id), and turns each review into a row of a data frame. Each batch of 5 files is then combined, de-duplicated, and saved into its own RData file, giving 20 RData files in total.

library(foreach)
library(tidyverse)

filepaths <- list.files(path = "review_data/review_data_split")

foreach(i = seq(1, 100, by = 5)) %do% {
  reviews_by_5 <- foreach(
    current_file = filepaths[i:(i + 4)], .combine = "rbind"
  ) %do% {
    review_json <- purrr::map(
      readLines(
        paste0("review_data/review_data_split/", current_file)
      ),
      jsonlite::fromJSON,
      .progress = paste("loading file", current_file)
    )
    l <- map(review_json, function(x) {
      cbind(
        x$time,
        ifelse(is.null(x$rating), NA, x$rating),
        ifelse(is.null(x$text), NA, x$text),
        x$gmap_id
      )
    })
    r <- data.frame(matrix(unlist(l), ncol = 4, byrow = TRUE))
    colnames(r) <- c("time", "rating", "text", "gmap_id")
    r
  }
  # remove duplicates
  reviews_by_5 <- reviews_by_5 |> distinct()
  save(reviews_by_5, file = paste0("reviews_", ceiling(i / 5), ".RData"))
  rm(reviews_by_5)
}

Now we can combine them into one large data frame, and analyse the reviews!
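
Here’s a minimal sketch of that last step, assuming the 20 RData files written by the loop above (reviews_1.RData through reviews_20.RData) are sitting in the working directory:

library(tidyverse)

# each RData file contains a `reviews_by_5` data frame; load them all and stack them
all_reviews <- purrr::map(1:20, function(i) {
  load(paste0("reviews_", i, ".RData")) # loads reviews_by_5 into this function's environment
  reviews_by_5
}) |>
  bind_rows() |>
  distinct() # drop any duplicates that span batches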
