R scripts to handle dirty data
Filling missing values with NA: I assume that org_xts in the following code represents a given time-series (xts) object in which we need to handle missing readings.
[code language="r"]
library(xts)
# org_xts represents the xts object with missing readings;
# the original series is assumed to be sampled hourly
timerange = seq(start(org_xts), end(org_xts), by = "hour")
# build an all-NA series covering the complete time range
temp = xts(rep(NA, length(timerange)), timerange)
# merging inserts NA rows at every missing time-stamp;
# keep only the first column (the original data)
complete_xts = merge(org_xts, temp)[, 1]
[/code]
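As a quick check, here is a minimal self-contained sketch; org_xts below is a hypothetical hourly series with readings missing at 02:00 and 04:00, which the merge fills with NA.
[code language="r"]
library(xts)
# hypothetical hourly series with readings missing at 02:00 and 04:00
times = as.POSIXct(c("2016-01-01 00:00", "2016-01-01 01:00",
                     "2016-01-01 03:00", "2016-01-01 05:00"))
org_xts = xts(c(10, 12, 15, 11), times)
timerange = seq(start(org_xts), end(org_xts), by = "hour")
temp = xts(rep(NA, length(timerange)), timerange)
complete_xts = merge(org_xts, temp)[, 1]
nrow(complete_xts)  # 6 rows, with NA at 02:00 and 04:00
[/code]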
Removing duplicate values: here we identify and drop entries that share the same time-stamp.
[code language="r"]
library(xts)
# dummy time-series data, standing in for a series that may contain duplicates
timerange = seq(start(org_xts), end(org_xts), by = "hour")
temp = xts(rep(NA, length(timerange)), timerange)
# identify indexes of duplicate time-stamps
duplicate_entries = which(duplicated(index(temp)))
# data without duplicates; subset only when duplicates exist,
# since negative indexing with an empty vector would drop every row
new_temp = if (length(duplicate_entries) > 0) temp[-duplicate_entries, ] else temp
[/code]
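To see it in action, consider a small hypothetical series in which the 01:00 time-stamp appears twice:
[code language="r"]
library(xts)
# hypothetical series where 01:00 appears twice
times = as.POSIXct(c("2016-01-01 00:00", "2016-01-01 01:00",
                     "2016-01-01 01:00", "2016-01-01 02:00"))
dup_xts = xts(c(1, 2, 2, 3), times)
duplicate_entries = which(duplicated(index(dup_xts)))  # position 3
clean_xts = if (length(duplicate_entries) > 0) dup_xts[-duplicate_entries, ] else dup_xts
nrow(clean_xts)  # 3 rows: the repeated 01:00 reading is gone
[/code]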
Resampling higher-frequency data to a lower frequency: the function below resamples high-frequency data to a lower frequency by taking the mean over each period. Note that the half-hour shifts inside the function account for the timezone offset, currently set for "Asia/Kolkata" (UTC+05:30).
[code language="r"]
library(xts)
resample_data <- function(xts_datap, xminutes) {
  # xts_datap: input xts time series
  # xminutes: required lower-frequency sampling period, in minutes
  # subtract half an hour so that period boundaries line up with the
  # UTC+05:30 offset of Asia/Kolkata
  ds_data = period.apply(xts_datap,
                         INDEX = endpoints(index(xts_datap) - 3600 * 0.5,
                                           on = "minutes", k = xminutes),
                         FUN = mean)
  # align the aggregated time-stamps, again compensating for the
  # half-hour timezone offset
  align_data = align.time(ds_data, xminutes * 60 - 3600 * 0.5)
  return(align_data)
}
[/code]
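A usage sketch under the same assumptions (timezone set to "Asia/Kolkata", and a hypothetical 5-minute series named high_freq), resampling to hourly means:
[code language="r"]
library(xts)
Sys.setenv(TZ = "Asia/Kolkata")  # the half-hour shifts assume this timezone
# hypothetical 5-minute series spanning four hours
times = seq(as.POSIXct("2016-01-01 00:00"), by = "5 min", length.out = 48)
high_freq = xts(rnorm(48), times)
low_freq = resample_data(high_freq, 60)  # hourly means
[/code]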