## Note: Examples in this vignette are set to not run on CRAN. If you
## would like to build this vignette locally, you can do so by first
## setting the environmental variable 'NOT_CRAN' to 'true' on your
## computer and then rebuilding the vignette.
This vignette provides more details on how the noaastormevents package interacts with the online NOAA Storm Events database to pull storm event listings based on user queries.
The NOAA Storm Events data is available online at https://www.ncdc.noaa.gov/stormevents/. That website includes documentation on the data, as well as a page that allows bulk data download of yearly csv files either through ftp or http (https://www.ncdc.noaa.gov/stormevents/ftp.jsp). Data is available from January 1950 and tends to be updated to within a few months of present.
Data is stored in bulk by year in compressed comma-separated files (.csv.gz files). Each year has three compressed files available: a “Details” file, a “Fatalities” file, and a “Locations” file.
The name of each file includes both the year of the data (e.g., “1950”) and the date the file was last modified (e.g., “20170120”). Apart from these two components, file names follow a regular pattern. This regular naming scheme allows code within the noaastormevents package to apply regular expressions to all listed file names to identify the exact name of the file for a specific year, as explained in the next section.
The size of all three file types has increased with time (see figure below; note that the y-axis is log 10). The largest file for any given year is the “Details” file. Most file sizes increased substantially in 1996 (dotted vertical line), when the database dramatically expanded the types of events it included. Before 1996, the database covered tornadoes and, for some years, a few other types of events. From 1996, the database expanded to include events like floods, tropical storms, snow storms, etc. While “Locations” files exist in the database for early years, they contain no information until 1996. See the documentation at the NOAA Storm Events database website for more information on the coverage of the database at different times across its history.
Data in the database is stored in files separated by year, so the file for an entire year is identified and downloaded when a user asks for event listings for any time period or any type of event within that year. For example, if a user wants to list flood events from the week of Hurricane Floyd in 1999, functions in the noaastormevents package would first identify and download the full “Details” data file for 1999 and then filter down to flood events starting in the correct week.
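The "download the whole year, then filter" idea can be sketched with a small toy data frame. This is only an illustration: the column names (`event_type`, `begin_date`), the toy rows, and the choice of 13 to 19 September 1999 as the week of interest are all assumptions made for the example, not output from the package.

```r
# Toy stand-in for a full year's "Details" file (hypothetical rows and columns)
details_1999 <- data.frame(
  event_id = 1:4,
  event_type = c("Flood", "Tornado", "Flash Flood", "Flood"),
  begin_date = as.Date(c("1999-09-16", "1999-09-16", "1999-09-17", "1999-05-02")),
  stringsAsFactors = FALSE
)

# Assumed week of interest for this example
week_start <- as.Date("1999-09-13")
week_end <- as.Date("1999-09-19")

# Keep only flood-type events that begin within the chosen week
in_week <- details_1999$begin_date >= week_start &
  details_1999$begin_date <= week_end
is_flood <- details_1999$event_type %in% c("Flood", "Flash Flood")
floyd_floods <- details_1999[in_week & is_flood, ]
```

Only the subset in `floyd_floods` would need to be kept for later analysis; the full-year data is just an intermediate step.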
To identify the online file path for a specific year, the find_file_name function in the noaastormevents package uses the htmltab function (from the package of the same name) to create a dataframe listing all files available for download from the NOAA Storm Events database. The function then uses regular expressions to identify the file name in that listing for the requested year. For example, the name of the file with “Details” information for 1999 can be determined with:
find_file_name(year = "1999", file_type = "detail")
To see the full definition of the find_file_name function, run find_file_name (i.e., the function name, without parentheses after).
Typically, this function will only be used internally rather than called directly by a user.
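As a rough illustration of how a regular expression can pick one file out of a listing, here is a minimal sketch. It is not the package's implementation: the helper `pick_file_name` is hypothetical, and the file names are made-up examples that only follow the year and last-modified pattern described earlier.

```r
# Hypothetical listing of file names (illustrative, not a real directory listing)
all_file_names <- c(
  "StormEvents_details-ftp_v1.0_d1998_c20170717.csv.gz",
  "StormEvents_details-ftp_v1.0_d1999_c20170717.csv.gz",
  "StormEvents_fatalities-ftp_v1.0_d1999_c20170717.csv.gz",
  "StormEvents_locations-ftp_v1.0_d1999_c20170717.csv.gz"
)

# Hypothetical helper: find the file for one year and one file type
pick_file_name <- function(file_names, year, file_type = "details") {
  # The year follows "_d"; the eight-digit last-modified date follows "_c"
  pattern <- paste0("^StormEvents_", file_type, ".*_d", year,
                    "_c[0-9]{8}\\.csv\\.gz$")
  matches <- grep(pattern, file_names, value = TRUE)
  if (length(matches) == 0) {
    stop("No file found for that year and / or file type.")
  }
  matches
}

picked <- pick_file_name(all_file_names, year = 1999, file_type = "details")
```

Because the last-modified date is matched by a pattern rather than written out, the same code keeps working when a file is updated and its date stamp changes.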
Once the file name has been determined, a function in the package then downloads that file to the user’s computer. For some years, files are very large, so this download can take a little while. To avoid downloading data from the same year more than once within an R session, the downloading function stores the downloaded data for that year in a temporary environment in the R user’s session. In later requests for the same year, the function will first check for data from this year in the temporary environment and only download the data from the online database if it is not already available on the user’s computer.
This environment is created to be temporary, which means that it is deleted at the end of the current R session. While some packages that access online databases cache any downloaded data in a way that persists between R sessions, we chose not to do that and instead only cache within an R session, but delete all data at the close of the R session. This is because some of the Storm Event files are very large, and most users will likely only want to keep a small subset of the data for a given year (e.g., only flood events during the week of Hurricane Floyd). It would be wasteful of memory to cache all the 1999 data indefinitely on the user’s computer in this case; instead, the user should use our package to create the desired subset of the data and then explicitly store that subset locally to use in future analysis.
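The within-session caching idea can be sketched with a plain R environment. This is a simplified illustration under stated assumptions, not the package's code: `session_cache`, `get_year_data`, and `fake_download` are hypothetical names, and the "download" is simulated so the example runs offline.

```r
# Environment used as a cache; it lives only as long as the R session
session_cache <- new.env(parent = emptyenv())

# Stand-in for a slow download; counts how many times it is called
download_count <- 0
fake_download <- function(year) {
  download_count <<- download_count + 1
  data.frame(year = year, n_events = 1)
}

# Return cached data for a year, downloading only if it is not cached yet
get_year_data <- function(year, downloader = fake_download) {
  key <- paste0("details_", year)
  if (!exists(key, envir = session_cache, inherits = FALSE)) {
    assign(key, downloader(year), envir = session_cache)
  }
  get(key, envir = session_cache, inherits = FALSE)
}

first_call <- get_year_data(1999)
second_call <- get_year_data(1999)  # served from the cache, no new download
```

After both calls, `download_count` is 1, since the second request finds the data already stored in the environment.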
The function for downloading the file for a year is called download_storm_data. Here is its full definition:
The noaastormevents package allows a user to query storm events either by a date range or by a named historical tropical storm, rather than a year. The create_storm_data function inputs either a date range or a storm name, as well as the requested file type, and downloads data for the appropriate year or years. If the user requests a date range, the function will download yearly data files for all years included in that range. If the user requests a tropical storm, the function will pull the data for the year in which that storm occurred. To see the full definition of create_storm_data, run the function name without parentheses.
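The step from a date range to the set of yearly files can be sketched as follows. The helper `years_in_range` is hypothetical and is only meant to illustrate the idea that every year touched by the range needs its own file; it is not taken from the package.

```r
# Hypothetical helper: which yearly files would a date range require?
years_in_range <- function(date_range) {
  dates <- as.Date(date_range)
  stopifnot(length(dates) == 2, !anyNA(dates), dates[1] <= dates[2])
  first_year <- as.integer(format(dates[1], "%Y"))
  last_year <- as.integer(format(dates[2], "%Y"))
  seq(first_year, last_year)
}

# A range that crosses a year boundary needs two yearly files
years_needed <- years_in_range(c("1999-09-10", "2000-01-15"))
```

A range that falls entirely within one year returns a single year, so only one file would be needed in that case.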
As a note, many of the functions in the noaastormevents package that allow linking events with tropical storms rely on historical data for the storms, including storm tracks, estimated distances to eastern U.S. counties, and dates when the storm was closest to each county. The package pulls this historical data from the hurricaneexposuredata package, through the interfacing package hurricaneexposure. The hurricane data covers 1988 through (currently) 2015 and includes all Atlantic basin tropical storms that came within 250 km of at least one U.S. county. The following storms are included in that package and so available to be used for functions in noaastormevents:
While the noaastormevents package focuses on higher-level functions, which result in a simplified and cleaned version of this storm events data, a user can use the create_storm_data function to pull the full dataset for a year into R and work with the raw, uncleaned version. For example, here is a call that pulls the raw data for 2015 into an R object called events_2015 and then prints its first three rows:
library(noaastormevents)
library(dplyr)
events_2015 <- create_storm_data(date_range = c("2015-01-01", "2015-12-31"))
slice(events_2015, 1:3)
This raw data has 51 columns. This includes:
Event and episode identifiers (EPISODE_ID, EVENT_ID). Note that there are more unique event IDs (57,779 for the 2015 events data) than unique episode IDs (9,511 for the 2015 events data).
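Counts like these can be checked with a few lines of base R. The sketch below uses a small made-up data frame in place of the real 2015 data, so the numbers it produces are illustrative only.

```r
# Toy data: six events grouped into three episodes (made-up values)
toy_events <- data.frame(
  EPISODE_ID = c(101, 101, 101, 102, 103, 103),
  EVENT_ID = c(1, 2, 3, 4, 5, 6)
)

# Number of distinct events and distinct episodes
n_events <- length(unique(toy_events$EVENT_ID))
n_episodes <- length(unique(toy_events$EPISODE_ID))

# Number of events listed under each episode
events_per_episode <- table(toy_events$EPISODE_ID)
```

Running the same three calculations on the real data would use `events_2015` in place of `toy_events`.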
The following sections provide some summary statistics for data from this database for a single year (2015), to help users better understand the available data. Users may want to conduct similar data analysis themselves with the set of data they pull from the NOAA Storm Events database relevant to a particular research project. The code from this vignette (available at the package’s GitHub repository) can serve as a starting point for that.
In the 2015 event listings, here are the types of events and the number of reported events for each:
Here are how the start dates for listings for each event type are distributed over the year (event types are ordered by decreasing total count during the year; note that the y-axes vary depending on the range of events by date for each event type):
Many event types are clearly seasonal (e.g., winter weather, winter storms, heavy snow, cold, extreme cold, blizzards, ice storms, lake-effect snow, and avalanches are all much more common during winter months, while tropical depressions and tropical storms are limited to the hurricane season). However, for some events, seasonal patterns in reporting might reflect not just the true pattern of events but also the timing of important exposures and impacts of the events. For example, rip currents have many more listings during the spring and summer, which may be related to events being more likely to be listed when more people are swimming. Frost event listings are particularly high at the start and end of the frost season, rather than in the middle of winter, which may be related to the impacts of frost on crops being higher in spring and fall than during the winter. When working with this data, it is important to keep in mind that the data are based on reporting, so the probability that an event is reported and included in the data may be influenced by factors like these, unlike data from a source such as a weather station.
“Episodes” seem to collect related “events”, where events can vary in the type or location of the event, while an “episode” collects events that belong to the same large system. The following graph shows, for each episode listed in 2015, the number of events listed for the episode (x-axis) and the size (in days) of the range of begin dates across events in the episode.
An episode will never include events in more than one state, so a large weather system could potentially be described by multiple episodes in different states:
# For each episode, count the distinct states, then take the maximum
events_2015 %>%
  select(EPISODE_ID, STATE) %>%
  group_by(EPISODE_ID) %>%
  summarize(n_states = length(unique(STATE))) %>%
  summarize(max_n_states = max(n_states))
Here are maps with the beginning locations of events in the episodes with the most events in 2015. Note that the beginning latitude and longitude are not listed for every event, resulting in one of the episodes not having any points on the map. From the other maps, it is clear that events within the episode were fairly close together.
For these episodes with the most events in 2015, the following graph shows the number of events reported for the episode. One of the episodes was a winter storm, another was heavy rains and floods, while the rest of the episodes included high winds, hail, tornadoes, rain, and / or flooding.
After removing event types with fewer than 50 listings in 2015, we did a cluster analysis of event types, to group events that are more likely to occur together within an episode. The following plot shows the resulting cluster structure of these event types.
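The clustering method is not specified here, so the following is only one plausible way to set up such an analysis: hierarchical clustering on whether each event type is present or absent in each episode. The toy data, the binary distance, and the average linkage are all assumptions made for this sketch.

```r
# Toy episode listings (made-up): which event types appear in which episode
episode_events <- data.frame(
  episode_id = c(1, 1, 2, 2, 3, 3, 4, 4),
  event_type = c("Hail", "Thunderstorm Wind", "Hail", "Thunderstorm Wind",
                 "Flood", "Flash Flood", "Flood", "Flash Flood"),
  stringsAsFactors = FALSE
)

# Event types in rows, episodes in columns, 1 if the type occurs in the episode
counts <- table(episode_events$event_type, episode_events$episode_id)
presence <- (unclass(counts) > 0) * 1

# Distance between event types based on shared episodes, then cluster
type_dist <- dist(presence, method = "binary")
type_clusters <- hclust(type_dist, method = "average")

# Cut the tree into two groups of event types
groups <- cutree(type_clusters, k = 2)
```

In this toy example, event types that always appear in the same episodes end up in the same group; `plot(type_clusters)` would draw the corresponding dendrogram.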
The next graph shows the number of events of each event type (excluding event types with fewer than 50 total listings in 2015). Each row represents an episode.
The SOURCE column of the raw data gives information on how each event was reported.
The majority of events in this database, at least for 2015, were reported by either a trained spotter or the public.
The following graph shows, for each type of event in 2015, the percent reported by each source. For some types of events, reporting is dominated by a specific source. For example, most high surf reports come from trained spotters, while most drought reports come from drought monitors and most tornado reports come from the NWS Storm Survey. For other types of events, reporting sources are more diversified. Both axes of the plot are ordered by overall frequency (i.e., overall number of each type of event and overall number of reports from each source).
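The percentages behind a plot like this can be computed with a cross-tabulation. The sketch below uses a small made-up set of reports; the column names and values are illustrative assumptions rather than the real 2015 data.

```r
# Toy reports (made-up): one row per event, with its type and reporting source
toy_reports <- data.frame(
  event_type = c("High Surf", "High Surf", "High Surf", "Drought", "Drought",
                 "Tornado", "Tornado", "Tornado", "Tornado"),
  source = c("Trained Spotter", "Trained Spotter", "Public",
             "Drought Monitor", "Drought Monitor",
             "NWS Storm Survey", "NWS Storm Survey", "NWS Storm Survey",
             "Public"),
  stringsAsFactors = FALSE
)

# Count events by type and source, then convert each row to percentages
counts <- table(toy_reports$event_type, toy_reports$source)
percent_by_source <- round(100 * prop.table(counts, margin = 1), 1)
```

Setting `margin = 1` makes each event type's row sum to 100, so the values can be read as the share of that event type reported by each source.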
Each event has a state listed for the event (STATE). The following graph gives the number of reported events in each state for 2015:
Note that “states” include bodies of water (e.g., specific Great Lakes, the Hawaii waters, the Gulf of Mexico) and territories (American Samoa, Guam, Puerto Rico, Virgin Islands).
For some event types, the latitude and longitude of the beginning of the event are included with the event listing.
Of the 2015 events with a latitude and longitude listed for the beginning of the event and that are in the continental U.S., here are those locations by month:
Here are those locations by event type:
Some events have different latitudes and longitudes for the beginning and ending locations. For example, here are maps for one state (Arkansas) of events with different starting and ending locations:
Some events are reported by forecast zone (CZ_TYPE of “Z”) rather than county (CZ_TYPE of “C”). Specific types of events are typically either always reported for a county or always reported for a forecast zone (see table below). Events typically reported by county include floods (“Flash Flood”, “Flood”, “Debris Flow”), tornado-like events (“Tornado”, “Funnel Cloud”, “Dust Devil”), and a few other events often related to thunderstorms (“Thunderstorm Wind”, “Hail”, “Heavy Rain”, “Lightning”). Events typically reported by forecast zone include severe winter weather (“Winter Weather”, “Winter Storm”, “Heavy Snow”, “Cold/Wind Chill”, “Extreme Cold/Wind Chill”, “Blizzard”, “Frost/Freeze”, “Ice Storm”, “Sleet”, “Lake-Effect Snow”, “Avalanche”, “Freezing Fog”), extreme heat (“Heat”, “Excessive Heat”, “Drought”), events related to the water or coast (“Marine Thunderstorm Wind”, “High Surf”, “Coastal Flood”, “Waterspout”, “Astronomical Low Tide”, “Rip Current”, “Tropical Storm”, “Marine High Wind”, “Marine Hail”, “Marine Strong Wind”, “Hurricane”, “Seiche”, “Storm Surge/Tide”, “Tropical Depression”, “Marine Dense Fog”, “Sneakerwave”, “Tsunami”), and a few others (“High Wind”, “Dense Fog”, “Strong Wind”, “Wildfire”, “Dust Storm”, “Dense Smoke”).
For events reported by county, here are maps showing distributions in the number of events reported in 2015:
Here is a sample of events that are instead reported by forecast zone, with the state, CZ_NAME, and event narrative included. Note that the county name is often provided by the CZ_NAME column, although the CZ_FIPS value is the forecast zone for any event listed by forecast zone. We use code to try to match CZ_NAME listings to a table of U.S. county names and associated county FIPS for each event listed by forecast zone, to allow these events to be included in event listings and maps created by functions in noaastormevents.
Within the code in noaastormevents, the function match_forecast_county is used to try to match a county FIPS to each of the events listed by forecast zone. To get the full code for that function, you can run match_forecast_county (i.e., the function name, without parentheses after). To match an event listed by forecast zone to a county, this function takes the following steps to try to match all or part of the state and cz_name columns in the storm events data to the state and county names in the county.fips dataframe that comes with the maps package:
1. Try to match cz_name to the county name in county.fips after removing any periods or apostrophes in cz_name.
2. Next, for names with “county” in them, try to match the word before “county” to the county name in county.fips. Then check the two words before “county”, then the one and two words before “counties”.
3. Next, pull out the last word in cz_name and try to match it to the county name in county.fips. Then check the last two words in cz_name, then check the last three words in cz_name.
4. Finally, try removing anything in parentheses in cz_name before matching.
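A simplified version of this kind of stepwise name matching can be sketched in base R. This is not the code of match_forecast_county: the helper `match_county_sketch` is hypothetical, the small `counties` table stands in for county.fips (with column names assumed for the example), and only three of the matching attempts are shown.

```r
# Toy lookup table standing in for the state and county names in county.fips;
# the column names here are assumptions made for this sketch.
counties <- data.frame(
  state = c("wyoming", "utah", "arkansas"),
  county = c("park", "salt lake", "st francis"),
  fips = c(56029, 49035, 5123),
  stringsAsFactors = FALSE
)

match_county_sketch <- function(cz_name, state, lookup = counties) {
  # Lowercase and drop periods and apostrophes before any matching
  name <- gsub("[.']", "", tolower(cz_name))
  in_state <- lookup[lookup$state == tolower(state), ]
  candidates <- c(
    name,                                                      # whole cleaned name
    sub(".*\\b(\\w+) county\\b.*", "\\1", name, perl = TRUE),  # word before "county"
    sub(".*\\s", "", name)                                     # last word
  )
  # Return the FIPS for the first candidate that matches exactly one county
  for (candidate in candidates) {
    hit <- in_state$fips[in_state$county == candidate]
    if (length(hit) == 1) return(hit)
  }
  NA_real_
}

fips_full <- match_county_sketch("St. Francis", "Arkansas")
fips_county <- match_county_sketch("Eastern Park County", "Wyoming")
fips_none <- match_county_sketch("Wasatch Mountains", "Utah")
```

Trying the most specific match first and falling back to looser ones keeps exact matches from being overridden. The looser steps can also produce false matches: in this sketch, "Yellowstone National Park" in Wyoming would be matched to Park County through its last word.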
In addition, there are a few final steps in cleaning the data. First, all listings with “Utah” in the cz_name are set to missing: while there is a Utah County, Utah, inspection of event listings in 2015 showed that events in Utah with “Utah” in cz_name often referred to parts of the state, rather than to the county. Further, any event with “National Park” listed in cz_name is set to not match with a county FIPS. In Wyoming, Park County was being matched to a cz_name for Yellowstone National Park, and this could be a problem in other states, so this extra check was included.
In the 2015 events data, there are 22,664 events listed by forecast zone rather than county. Once the match_forecast_county function was applied to these events, 16,405 events (72%) were linked to a county, while 6,259 events (28%) could not be matched to a county.
Of the events not matched to a county, 2,997 events were outside the continental U.S. (i.e., in Hawaii, Alaska, U.S. territories, or waters):
This left 3,262 events in the continental U.S. that were listed by forecast zone but could not be linked to a county by the match_forecast_county function. Most of these events had a value for cz_name with words related to mountains, water, adjacency (e.g., “area of”, “vicinity of”), or a few other word types (e.g., “desert”, “hwy”). The following table gives the number of these remaining unmatched events with words from each of these categories (note: the cz_name for an event may have words from more than one of these categories, in which case it would be counted in this table under each of those categories).
The following table summarizes the number of events that could not be linked to a county that contained at least one of each of these types of words:
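A count of this kind can be sketched with `grepl` and a list of word categories. The names in `unmatched` and the word lists in `categories` are made up for illustration (only “area of”, “vicinity of”, “desert”, and “hwy” come from the text above), so the resulting counts do not correspond to the real 2015 data.

```r
# Made-up examples of cz_name values that could not be matched to a county
unmatched <- c("Northern Cascade Mountains", "Lake Michigan Shoreline",
               "Vicinity Of Mt Hood", "Western Mojave Desert")

# Word categories; apart from the examples quoted in the text, these word
# lists are assumptions for this sketch
categories <- list(
  mountains = c("mountain", "mt", "range"),
  water = c("lake", "river", "coast"),
  adjacency = c("area of", "vicinity of"),
  other = c("desert", "hwy")
)

# For each category, count names containing at least one of its words
category_counts <- vapply(categories, function(words) {
  pattern <- paste0("\\b(", paste(words, collapse = "|"), ")s?\\b")
  sum(grepl(pattern, tolower(unmatched), perl = TRUE))
}, integer(1))
```

Here "Vicinity Of Mt Hood" is counted under both the mountains and the adjacency categories, which is why category counts can add up to more than the number of unmatched events.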
For 2015, here are the cz_name values for the events that could not be matched to a county and did not include the words listed above:
For the events listed by forecast zone that could be successfully matched to a county, here are the geographic distributions in event counts in 2015:
When using this function, and for event listings generated using this function, the user may want to hand-check that event listings with names like “Lake” and “Mountain” in the cz_name column are not erroneously matched to counties with names like “Lake County” and “Mountain County”. Code like the following can be used for these checks (in this case, checking a dataframe of event listings named z_events_2015 that is output from match_forecast_county and so has fips added for each event listing, if a match could be found):
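library(dplyr)
library(stringr)
# Forecast-zone events with "Lake" in the name that were matched to a county
z_events_2015 %>%
  filter(cz_type == "Z") %>%
  select(cz_name, state, fips) %>%
  mutate(cz_name = str_to_title(cz_name)) %>%
  filter(str_detect(cz_name, "Lake") & !is.na(fips))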
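The “Details” datasets for each year include six measurements of the impacts of each event: direct injuries, indirect injuries, direct deaths, indirect deaths, property damage, and crop damage.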
Many of the impact values are given using abbreviations for amounts. For example, this listing for a tornado gives the property damage as “5.00K”:
county_events_2015 %>% filter(event_id == "582970")
The noaastormevents package uses a function to pull out these abbreviations and convert associated impact values to numeric values (e.g., 5000 for “5.00K”). The conversions conducted are:
This is done by the function
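As an illustration of this kind of conversion, here is a minimal sketch. The helper `convert_damage` is hypothetical and is not the package's function, and the multiplier table (K for thousands, M for millions, B for billions) is an assumption for the example; only the “5.00K” to 5000 case comes from the text above.

```r
# Hypothetical helper: turn strings like "5.00K" into numeric dollar amounts
convert_damage <- function(x) {
  # Assumed meaning of the letter suffixes
  multipliers <- c(K = 1e3, M = 1e6, B = 1e9)
  # Split each value into its letter suffix and its numeric part
  suffix <- toupper(sub("^[0-9.]+", "", x))
  number <- as.numeric(sub("[A-Za-z]+$", "", x))
  # Values with no suffix are used as-is
  scale <- ifelse(suffix == "", 1, multipliers[suffix])
  unname(number * scale)
}

damage_values <- convert_damage(c("5.00K", "1.70M", "0"))
```

A suffix that is not in the multiplier table produces `NA` rather than a guessed value, which makes unexpected abbreviations easy to spot.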
Usually, it seems that the costs for events within an episode do not overlap, so that the costs from all events in an episode can be summed to generate a total damage cost. In some cases, different event listings within the same county in the same episode have the same damage cost, but it often seems that the total cost was divided across events in these cases, as in this case, where the total estimated cost of $6,186,909 in the county (see the narratives) was spread evenly over the two event listings:
However, there are a few cases where it appears that costs might be duplicated over different events within the same episode in a county. For example, in the following listing, it looks like the total estimated damage cost of $1.7 million (see that narrative) is repeated over the two listed events, which would mean that adding damages across events in the episode would lead to a county damage cost of twice the true amount.
Of the times when a county had two or more events as part of the same episode in 2015, in 3,136 (87%) cases the damage costs were not identical across the event listings for the county and episode, while in 452 cases (13%) the cost listings were identical across event listings (as shown in the cases just given). If using damage cost estimates for research, it may be wise to hand-check for cases where damage cost estimates are duplicated across different events in a way that prevents summing across events to get the cumulative episode total cost.
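One way to find candidates for this kind of hand-check is to flag county and episode combinations where two or more event listings carry exactly the same damage cost. The sketch below uses made-up data and assumed column names; it only flags cases for review and cannot tell whether the cost was split evenly or duplicated.

```r
# Toy event listings (made-up values and assumed column names)
toy_costs <- data.frame(
  episode_id = c(1, 1, 2, 2, 3),
  fips = c(5001, 5001, 5003, 5003, 5005),
  damage_property = c(250000, 250000, 100000, 200000, 500)
)

# Group damage values by episode and county
cost_groups <- split(toy_costs$damage_property,
                     list(toy_costs$episode_id, toy_costs$fips),
                     drop = TRUE)

# Flag groups with two or more listings that all share one damage value
flagged <- vapply(cost_groups,
                  function(d) length(d) > 1 && length(unique(d)) == 1,
                  logical(1))
to_review <- names(flagged)[flagged]
```

Each entry in `to_review` combines an episode ID and a county FIPS, which can then be used to pull up the event narratives for a manual check.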
When costs were summed over all event and county listings in an episode, here are the ten episodes in the 2015 dataset with the highest total damage costs (combining property and crop damage):
Here is a table of the total number of events with non-zero damage (either for property damage or crop damage), the total damage costs across all events of that type in 2015, and the median and maximum damage per event for events of that type with non-zero damage costs. Event types are ordered by total damage costs summed across all events.
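A summary table of this shape can be built in base R along the following lines. The data frame and its column names are made up for the example, so the values are illustrative only.

```r
# Toy events (made-up): property and crop damage per event, in dollars
toy_damage <- data.frame(
  event_type = c("Flood", "Flood", "Flood", "Hail", "Hail", "Tornado"),
  damage_property = c(0, 1000, 2500, 500, 0, 20000),
  damage_crops = c(0, 0, 500, 0, 0, 0),
  stringsAsFactors = FALSE
)

# Combine property and crop damage and keep only events with non-zero damage
toy_damage$damage <- toy_damage$damage_property + toy_damage$damage_crops
nonzero <- toy_damage[toy_damage$damage > 0, ]

# Summarize damage for each event type
by_type <- split(nonzero$damage, nonzero$event_type)
summary_table <- do.call(rbind, lapply(by_type, function(d) {
  data.frame(n_events = length(d), total_damage = sum(d),
             median_damage = median(d), max_damage = max(d))
}))

# Order event types by total damage, largest first
summary_table <- summary_table[order(-summary_table$total_damage), ]
```

Filtering to non-zero damage before summarizing matters for the median, which would otherwise be pulled toward zero by the many events with no listed damage.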
The following graph shows the distribution of events between those with listed damage costs greater than $0 and those without.
The following figure shows the top 50 episodes in terms of damages (property and crop damage combined) in 2015: