Loading large JSON files

By Jasmine Siswandjo

April 12, 2024

Tackling a sizable JSON (~30 million rows, 10GB) dataset in R?

Here was my approach, combining Python and R.

I wanted to work on New York’s Google business reviews, and just loading the data was a challenge. The JSON file had 30 million rows and was originally 10GB. I tried loading it as usual into R with purrr::map and jsonlite::fromJSON, but it was so large that the progress bar wouldn’t even show up. The data comes from the Google Local review dataset, created and uploaded by Tianyang Zhang and Jiacheng Li from UCSD.
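
For reference, that first attempt looked roughly like this: read all 30 million lines into memory with readLines(), then parse each line with jsonlite::fromJSON. (This is a sketch of the naive approach; the file path is the same one used in the Python step below.)

# naive approach: pull the entire 10GB file into memory, then parse each line
review_json <- purrr::map(
  readLines("review_data/review-New_York.json"),
  jsonlite::fromJSON,
  .progress = TRUE # this is the progress bar that never showed up
)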

After some Googling, I realized I would have to split the data up, because my computer’s vector memory couldn’t handle reading the entire 10GB file in one go. Existing solutions online involved reading the entire file and then splitting it, but that didn’t work on my laptop for the same reason: not enough memory. My solution ended up being to read the file line by line, splitting it into smaller files as I went.

Slice and dice

I first used Python to split the 10GB file into 100 files of ~335,000 lines each. I aimed for about 335,000 lines per file because I knew it took about a minute to read the metadata file, which has about 200,000 lines, so each chunk would only take a minute or two to load.

Slicing and dicing the massive JSON file into more manageable chunks was necessary purely because of its size: a single 10GB file with 30 million rows is overwhelming for both memory and processing, especially on a machine with limited resources.

import json
import os

input_file_path = "review_data/review-New_York.json"
output_dir = "review_data/review_data_split"
os.makedirs(output_dir, exist_ok=True)

max_lines_per_file = 335000
current_lines = 0
output_file_index = 1

with open(input_file_path, "r") as input_file:
    output_file = None
    
    for line in input_file:
        # parse each line (this will raise an error on any line that isn't valid JSON)
        data = json.loads(line)
        
        if output_file is None:
            output_file = open(
              os.path.join(
                output_dir, 
                f"output_{output_file_index}.json"), "w"
              )
        
        output_file.write(json.dumps(data) + '\n')
        current_lines += 1
        
        # close the current output file once max lines is reached;
        # the next line (if any) will open a new file via the check
        # above, so we never end up with an empty trailing file
        if current_lines >= max_lines_per_file:
            output_file.close()
            output_file_index += 1
            current_lines = 0
            output_file = None

# close last output file
if output_file:
    output_file.close()

This created output_1.json to output_100.json. But I still needed to load them into R. All the output files are in review_data/review_data_split, and I wanted to make RData files because they’ve generally been my go-to when working with big files.

The reviews are in this format (a quick sketch for inspecting one record follows the list):

  • user_id - ID of the reviewer
  • name - name of the reviewer
  • time - time of the review (unix time)
  • rating - rating of the business
  • text - text of the review
  • pics - pictures of the review
  • resp - business response to the review including unix time and text of the response
  • gmap_id - ID of the business
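
Before writing the full loading loop, it helps to sanity-check one record. This is a minimal sketch, assuming output_1.json from the Python step is in place: it parses the first line of that file and prints its structure.

# parse the first line of one split file and look at its fields
one_review <- jsonlite::fromJSON(
  readLines("review_data/review_data_split/output_1.json", n = 1)
)
str(one_review)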

The code runs through two loops. The first loop goes through the 100 output files, working on 5 at a time; the second loop reads each line (with a handy named progress bar, thanks to the map function!), extracts the 4 columns I want (time, rating, text and gmap_id), and turns each review into a row of a data frame. Each batch of 5 files is then combined, de-duplicated, and saved into its own RData file, giving 20 RData files in total.

library(foreach)
library(tidyverse)

filepaths <- list.files(path = "review_data/review_data_split")

foreach(i = seq(1, 100, by = 5)) %do% {
  reviews_by_5 <- foreach(
    current_file = filepaths[i:(i + 4)], .combine = "rbind"
  ) %do% {
    review_json <- purrr::map(
      readLines(
        paste0("review_data/review_data_split/", current_file)
      ),
      jsonlite::fromJSON,
      .progress = paste("loading file", current_file)
    )
    l <- map(review_json, function(x) {
      cbind(
        x$time,
        ifelse(is.null(x$rating), NA, x$rating),
        ifelse(is.null(x$text), NA, x$text),
        x$gmap_id
      )
    })
    r <- data.frame(matrix(unlist(l), ncol = 4, byrow = TRUE))
    colnames(r) <- c("time", "rating", "text", "gmap_id")
    r
  }
  # remove duplicates
  reviews_by_5 <- reviews_by_5 |> distinct()
  save(reviews_by_5, file = paste0("reviews_", ceiling(i / 5), ".RData"))
  rm(reviews_by_5)
}

Now we can combine them into one large data frame, and analyse the reviews!
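
Here’s a minimal sketch of that last step, assuming the 20 RData files written by the loop above (reviews_1.RData through reviews_20.RData) are sitting in the working directory:

library(tidyverse)

# each RData file contains a `reviews_by_5` data frame; load them all and stack them
all_reviews <- purrr::map(1:20, function(i) {
  load(paste0("reviews_", i, ".RData")) # loads reviews_by_5 into this function's environment
  reviews_by_5
}) |>
  bind_rows() |>
  distinct() # drop any duplicates that span batches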
