Often when writing functions or packages, you might need to create some sample data to test your functions or for users to play with. Before ChatGPT, I always wrote my sample data by hand, or hacked together some function to generate it.
Today, I used ChatGPT to generate sample events for a ggplot2 calendar extension I’ve been building for fun. It was surprisingly easy and effective to prompt the model to:
- write 5 YAML nodes with the structure shown below,
- mock up example links and a relevant emoji for each event, and
- modify the events to fall within the same 3-month window, with durations varying between 1 and 5 days:
data-raw/demo-events-gpt.yml
- event_id: EVT001
  startDate: "2024-05-15"
  endDate: "2024-05-16"
  event_title: "TechFest 2024"
  event_descr: "TechFest 2024 brings together innovators, startups, and tech enthusiasts for two days of keynote speeches, panel discussions, and workshops covering the latest trends in technology, artificial intelligence, and blockchain."
  event_emoji: "🤖"
  event_link: "https://techfest2024.example.com"
- event_id: EVT002
  startDate: "2024-05-25"
  endDate: "2024-05-27"
  event_title: "Global Health Summit"
  event_descr: "The Global Health Summit convenes healthcare professionals, policymakers, and researchers from around the world to discuss pressing issues in public health, disease prevention, and healthcare accessibility."
  event_emoji: "🌍"
  event_link: "https://globalhealthsummit.example.com"
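Since the model was asked to keep events inside one 3-month window with durations of 1 to 5 days, a quick check of the output can be worthwhile. The following is only a sketch: it re-types the two events shown above by hand, counts both the start and end date towards the duration, and uses 92 days as a stand-in for "3 months". Both of those readings are assumptions, not something specified in the prompt.

```r
# Hypothetical check: the two events shown above, re-typed by hand
# as a base R data frame (only the ID and date fields are needed).
events <- data.frame(
  event_id  = c("EVT001", "EVT002"),
  startDate = as.Date(c("2024-05-15", "2024-05-25")),
  endDate   = as.Date(c("2024-05-16", "2024-05-27"))
)

# Duration in days, counting both the start and the end date
# (an assumption about how "1 to 5 days" should be read).
events$duration_days <- as.integer(events$endDate - events$startDate) + 1L

# Number of days spanned by all events together; 92 days is an
# assumed upper bound for "3 months".
window_days <- as.integer(max(events$endDate) - min(events$startDate)) + 1L

durations_ok <- all(events$duration_days >= 1 & events$duration_days <= 5)
window_ok <- window_days <= 92

# Stop with an error if either constraint is violated.
stopifnot(durations_ok, window_ok)
```

Running the same check on all five events would only require adding the remaining rows.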
That’s all well and good, but I wanted to include this data in my ggplot2 extension package, and it’s generally good practice to document your data and where it comes from. This is not a strict requirement, especially for demo datasets, but it got me thinking about when we should be “citing” generative AI. The R Packages textbook provides pretty clear guidance on documenting data in R packages.
Here’s what I came up with for adapting that advice to cite ChatGPT as a data source:
- explicitly including “gpt” in the name of the data frame created from the above YAML file: demo_events_gpt
- including a link to the chat in the data-raw/ script and in the @source documentation tag
- explicitly mentioning ChatGPT 3.5 in the data documentation
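As a rough illustration of the first two points, a data-raw/ script could look something like the sketch below. The script file name is hypothetical, the yaml package is an assumed choice of parser, and only four of the seven fields are included so the example stays short and runs on its own. Dates are left as character strings here; whether the packaged data stores them as Date is not stated above.

```r
# data-raw/demo-events-gpt.R (hypothetical file name)
# Source chat: https://chat.openai.com/share/c68b7a82-5378-45c8-bf41-7fa134f0b74a

library(yaml)

# In the package this would read the file directly, e.g.
#   events <- read_yaml("data-raw/demo-events-gpt.yml")
# An abbreviated inline copy is parsed here so the sketch is self-contained.
events <- yaml.load('
- event_id: EVT001
  startDate: "2024-05-15"
  endDate: "2024-05-16"
  event_title: "TechFest 2024"
- event_id: EVT002
  startDate: "2024-05-25"
  endDate: "2024-05-27"
  event_title: "Global Health Summit"
')

# Each YAML node is parsed into a named list; turn each into a
# one-row data frame and bind them so there is one row per event.
demo_events_gpt <- do.call(
  rbind,
  lapply(events, as.data.frame, stringsAsFactors = FALSE)
)

str(demo_events_gpt)

# Final step inside the package (commented out because it needs a
# package project to write to):
# usethis::use_data(demo_events_gpt, overwrite = TRUE)
```

Keeping the chat link as a comment at the top of the script means the provenance travels with the code that builds the dataset, not only with the rendered help page.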
R/data.R
#' 5 Sample Events generated by ChatGPT 3.5
#'
#' A set of 5 demo events generated by ChatGPT 3.5 on 13 Apr, 2024.
#' The following field descriptions were also generated
#' in the same session.
#'
#' @format
#' A data frame with 5 rows and 7 columns:
#' \describe{
#'   \item{event_id}{Unique identifier for the event.}
#'   \item{startDate}{Start date of the event.}
#'   \item{endDate}{End date of the event.}
#'   \item{event_title}{Title or name of the event.}
#'   \item{event_descr}{Description of the event.}
#'   \item{event_emoji}{Emoji representing the event theme or type.}
#'   \item{event_link}{URL link to the event's website or page.}
#' }
#'
#' @source https://chat.openai.com/share/c68b7a82-5378-45c8-bf41-7fa134f0b74a
"demo_events_gpt"
Citation
@online{huang2024,
  author = {Huang, Cynthia},
  title = {Documenting Data Produced Using {ChatGPT} in {R} {Packages}},
  date = {2024-04-13},
  url = {https://www.cynthiahqy.com/posts/documenting-gpt-generated-data},
  langid = {en}
}