You can also access the list of shortcuts with Shift+Alt+K, so the sheet may be redundant for many. However, I found that having a physical copy next to my computer helped a lot while I was still learning the shortcuts.
Some favourites of mine are:
Using code sections/chunks – Use Ctrl+Shift+R to insert a code section; a popup box will appear for you to name it. Ctrl+Alt+T runs the current code section. When you are done working on a section you can ‘fold’ it up to improve the readability of your file (Alt+L folds the current code section; Alt+O folds all sections).
Re-running code quickly – Ctrl+Shift+P re-runs the region of code that was previously executed, including any changes you have made since.
Deleting/moving stuff faster – Ctrl+D deletes an entire line. Ctrl+Backspace deletes the current word, as in most word processing software. Alt+Up/Down moves lines of code up and down, while Shift+Alt+Up/Down copies lines up or down.
Switch between plots – To toggle between plots use Ctrl+Shift+PgUp/PgDn (It’s a lot faster than using the arrows above the plots!)
(To see the charts clearly, please click to expand)
3 Peaks 2014 was held over the weekend, in much more pleasant weather than last year’s scorching heat.
Here are some charts that show a bit about this year’s ride, and also how it compared to last year’s.
This year’s 235km ride attracted 1922 entrants, up from 1543 last year. This chart shows that, despite the larger field this year, there were a lot fewer DNFs in 2014. The weather no doubt played a role in this. A total of 1639 people finished.
In percentage terms you can see that the completion rate was about 10% higher this year.
Broken down by gender you can see that women who entered had a slightly higher completion rate than men. Men were more likely to DNS and women were more likely to DNF.
On to finishing times: Here’s a chart showing finish times by 30 minute intervals. Six absolute weapons managed to crack the 8 hour mark and 75 cracked the 9 hour mark.
Times by Gender (stacked histogram) – You can see here that a lot more men participated than women.
The (ecdf) chart below shows the cumulative distribution of times by gender.
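An ECDF like the one charted can be sketched in base R with `ecdf()`. The finish times below are hypothetical, purely to show the mechanics – they are not the actual results data.

```r
# Hypothetical finish times in hours (illustrative only, not the real results)
times <- c(8.2, 8.9, 9.4, 9.4, 10.1, 10.8, 11.5, 12.3)

# ecdf() returns a step function giving the proportion of riders
# at or under a given time
Fn <- ecdf(times)

Fn(10)  # proportion of this toy field finishing in 10 hours or less
```

Evaluating the returned function at any time gives the cumulative share, which is what the chart plots separately for each gender.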
And on to the coveted average speed section. A total of four people broke the holy grail of 30km/h (including stops!). I was lucky enough to suck one of these gents’ wheels in the Sydney Rapha Gentlemen’s Race last year – I cannot imagine how painful attempting that would have been on Sunday.
The ride results contain data on how long it took to do the three monster climbs. I subtracted the time spent climbing the three major climbs from the total time and called this the time ‘descending’ (optimistic, I know). While there is a positive relationship between time spent climbing and ‘descending’, the dispersion of the points shows the enormous amount of variation.
If you divide the time spent climbing by the time ‘descending’ you can see this variation just presented in a different way.
The following charts investigate how a rider’s time up Tawonga Gap relates to their overall performance. Nobody who took more than 30 minutes to get up Tawonga Gap broke 9 hours and nobody who took more than 35 minutes broke 10 hours. Events like this are all about finding the right bunch to be a part of, so you have to make sure you don’t miss the bus on the first climb.
If we focus on just the fastest people up Tawonga Gap (those who did it sub 28 mins) you can see that the top finishers at the end of the day weren’t the fastest up TG. They all rode together, while a few others went up the road (and potentially blew up). EDIT: Andy’s comment tells us nobody went up the road from the front bunch on TG, so the few posting faster times must have been slower on the Falls descent and trying to catch the lead group, or they started in a later wave.
Now, let’s have a look at how people who participated last year fared when they came back to face the challenge again this year.
The following two charts show how last year’s performance is related to this year’s. The first chart shows proportional ranking and the second shows a rider’s placing (there were a lot more entrants this year, so a given place is better (proportionally) in 2014).
This chart compares performance up Tawonga Gap in 2013 vs 2014. On average people improved over the year but the effect size is small.
70 people who did not finish in 2013 came back to tackle the event again in 2014. 9 of these 70 didn’t show up on Sunday but the good news story is that 52 of the 70 finished. The chart below groups people into their race ‘status’ (finished, DNF, DNS) from last year to see what happened to them this year. Those that finished last year were very likely to finish again this year. Those that DNS-ed last year were more likely to DNS in 2014.
Three peaks is a massive event, drawing people from all over Australia.
The following chart shows which state participants come from. Being held where it is, the biggest representation obviously comes from Victorians but A LOT of people come from NSW for this event, most from Sydney as we’ll see in a second. I know that 20+ come down from my old club alone.
Using suburb names I could determine how far people travelled (as the crow flies) to get to 3 peaks. In short, people travelled a long way to put themselves through hell. The four distinct lumps in the following density plots show people travelling from Melbourne, Sydney, Brisbane and WA respectively.
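The post doesn’t show the distance calculation, but a crow-flies distance from geocoded suburb coordinates is typically done with the haversine formula. Below is a minimal sketch; the function name is my own and the coordinates (roughly Sydney to Falls Creek) are approximate and purely illustrative.

```r
# Great-circle ("as the crow flies") distance via the haversine formula
# haversine_km is a hypothetical helper, not from the original post
haversine_km <- function(lat1, lon1, lat2, lon2, r = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
       cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(sqrt(a))
}

# Approximate coordinates: Sydney to Falls Creek (illustrative only)
haversine_km(-33.87, 151.21, -36.87, 147.28)
```

Applied to every entrant’s home suburb, distances like this are what produce the lumps in the density plots above.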
EDIT (17-8-14): I no longer use the XML2R package to pull SDMX data. There is a new package called RSDMX which makes the task described below a lot easier. You can find it and some examples here – https://github.com/opensdmx/rsdmx
OECD.Stat is a commonly used statistics portal in the research world but there are no easy ways (that I know of) to query it straight from R. There are two main benefits of querying OECD.Stat straight from R:
1. Create reproducible analysis (something that is easily lost if you have to download excel files)
2. Make tweaks to analysis easily
There are three main ways I could see to collect data from OECD.Stat:
1. Find the unique name for each dataset and scrape the HTML from the dataset’s landing page (e.g. the unique URL for pension operating expenses is http://stats.oecd.org/Index.aspx?DataSetCode=PNNI_NEW). This probably would have been the easiest way to scrape the data, but it doesn’t offer the flexibility that the other two options do.
2. Use the OECD Open Data API. This was the avenue I explored initially, but it doesn’t seem that the API functionality is fully built yet.
3. Use the SDMX query provided under the export tab on the OECD.Stat site. This URL query can be easily edited to change the selected countries, observation range and even datasets.
I went with option 3, using the SDMX query.
To get the query you need to use in R, navigate to your dataset and click Export -> SDMX (XML) (as per the picture below). Then copy everything in the ‘SDMX DATA URL’ box.
In the example below I am using the trade union density dataset.
Getting the SDMX URL as described above for the trade union dataset gives us a very long URL, as it contains a lot of countries. For clarity, I cut it down in this example to:
The important parts of the URL above are:
- UN_DEN – This is the Trade Union Density dataset code
- AUS+CAN+FRA+DEU+NZL+GBR+USA+OECD – Unsurprisingly, this is the list of countries we are querying. You can delete countries, or add more if you know their ISO country codes.
- startTime=1960&endTime=2012 – Change the date range as you please.
Note that many datasets have a lot more options on offer so there is usually a bunch more junk after the dataset code in the URL.
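Since the URL is just those pieces glued together, it can be assembled programmatically, which makes swapping countries or date ranges easy. A sketch using the dataset code and countries from this example:

```r
# Assemble the SDMX query URL from its parts
base    <- "http://stats.oecd.org/restsdmx/sdmx.ashx/GetData"
dataset <- "UN_DEN"  # Trade Union Density dataset code
ccodes  <- c("AUS", "CAN", "FRA", "DEU", "NZL", "GBR", "USA", "OECD")

url <- paste0(base, "/", dataset, "/",
              paste(ccodes, collapse = "+"),
              "/OECD?startTime=1960&endTime=2012")
url
```

Editing `ccodes` or the start/end times then regenerates the query without touching the URL by hand.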
The following R code creates a melted data frame, ready for use in analysis, ggplot or rCharts. I make use of Carson Sievert’s XML2R package. All you need to do is paste your own SDMX URL into the relevant spot.
library(XML2R)

file <- "http://stats.oecd.org/restsdmx/sdmx.ashx/GetData/UN_DEN/AUS+CAN+FRA+DEU+NZL+GBR+USA+OECD/OECD?startTime=1960&endTime=2012"
obs <- XML2Obs(file)
tables <- collapse_obs(obs)

# The data we care about is stored in the following three nodes
# We only care about the country variable in the keys node
keys   <- tables[["MessageGroup//DataSet//Series//SeriesKey//Value"]]
dates  <- tables[["MessageGroup//DataSet//Series//Obs//Time"]]
values <- tables[["MessageGroup//DataSet//Series//Obs//ObsValue"]]

# Extract the country part of the keys table
# Have to use both COU and COUNTRY as OECD don't use a standard name
country_list <- keys[keys[,1] == "COU" | keys[,1] == "COUNTRY"]

# The country names are stored in the middle third of the above list
country_list <- country_list[(length(country_list)*1/3 + 1):(length(country_list)*2/3)]

# Bind the existing date and value vectors
dat <- cbind.data.frame(as.numeric(dates[,1]), as.numeric(values[,1]))
colnames(dat) <- c('date', 'value')

# Add the country variable
# This code moves to a new country each time diff(dat$date) <= 0,
# as there are a different number of readings for each country
# This is not particularly robust
dat$country <- c(country_list[1], country_list[cumsum(diff(dat$date) <= 0) + 1])

# Created this as too many sig figs make the rChart ugly
dat$value2 <- signif(dat$value, 2)

head(dat)
This should create a data frame for nearly all annualised OECD.Stat data. You will need to do some work with the dates if you want to use more frequently reported data (e.g. quarterly, monthly data).
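For sub-annual data, the date strings need converting before plotting. A sketch for a hypothetical "YYYY-Qn" quarterly format, turned into decimal years (the helper name is mine, and the exact string format varies by dataset, so check what yours returns):

```r
# Convert hypothetical "YYYY-Qn" strings to decimal years
# (check the actual format your dataset returns -- it varies)
quarter_to_year <- function(x) {
  year    <- as.numeric(substr(x, 1, 4))
  quarter <- as.numeric(substr(x, 7, 7))
  year + (quarter - 1) / 4
}

quarter_to_year(c("1990-Q1", "1990-Q3", "1991-Q2"))
```

The resulting numeric dates slot straight into the `date` column the code above builds.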
Once the data is set up like this it is very easy to visualise.
library(ggplot2)

ggplot(dat) + geom_line(aes(date, value, colour = country))
Or, just as easily but a bit cooler, use rCharts to make it interactive:
# library(devtools)
# install_github('rCharts', 'ramnathv', ref = 'dev')
library(rCharts)

n1 <- nPlot(value2 ~ date, group = "country", data = dat, type = "lineChart")
n1$chart(forceY = c(0))

# To publish to a gist on github use this code, it will produce the url for viewing
# (you need a github account)
n1$publish('Union Density over time', host = 'gist')
Link to the interactive is here (or click on the image below)
Following from the .gif I posted yesterday, I had a request to animate Australia’s projected population.
In 2008, the Australian Bureau of Statistics modelled 72 permutations of fertility, migration and life expectancy in order to project Australia’s population out to 2101. Of the resulting 72 series, the ABS provide us with three as annual time series. ‘Series A’ uses high growth (HG) assumptions, ‘Series B’ follows current trends and ‘Series C’ uses low growth (LG) assumptions. The specific assumptions of each series can be seen in the table below.
The ABS also modelled the population with zero net overseas migration, but that scenario exists (thankfully) only to make the impact of NOM on the population apparent.
The assumptions vary significantly under the different scenarios, so much so that the forecast difference in population between the likely scenario and the HG one is more than 15 million by 2101. The difference between HG and LG is enormous with the HG population nearly double that of the LG one in 2101. More than anything, this shows the uncertainty, sensitivity and danger generated by compounding projections over a long time frame.
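The compounding point can be made with a one-liner: over 90 years, even a modest gap in annual growth rates roughly doubles the difference in outcomes. The rates and starting population below are illustrative, not the ABS assumptions.

```r
# Illustrative annual growth rates -- NOT the ABS scenario assumptions
pop_2011 <- 22.3  # millions, approximate 2011 starting point

high <- pop_2011 * (1 + 0.016)^90  # toy "high growth" rate
low  <- pop_2011 * (1 + 0.008)^90  # toy "low growth" rate

high / low  # the ratio between the two projections after 90 years
```

An 0.8 percentage point gap in assumed growth, compounded over nine decades, leaves the high scenario around twice the low one – which is why long-range projections are so sensitive to their inputs.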
Below are two versions of the .gif, one with the three growth scenarios plotted and one with just the likely scenario (I like this one more). The ABS estimated the population in yearly age intervals all the way up to 99 years of age, then bundled everyone older into ‘100 or older’. This causes a problem in the charts, as is apparent once the projected number of 100+ year old residents starts piling up. I stopped the animations at 2080, as that is already plenty far ahead and looking further might detract from the story at hand.
Note that forecasts start from 2011 as I couldn’t find more recent stats in my brief search.
Apart from the obvious ageing of the population, you can also see the bulge either side of the 25-year-old mark, which is due to NOM. Australia is essentially getting working-age adults for free after their home countries have borne the cost of raising and educating them – sounds like a good deal to me.
What does it all mean?
As the chart below (implicitly) shows, the percentage of Australians who are of ‘working age’ will fall rapidly into the future. This will put stress on our health system and the public purse. There will be implications for policy around health, superannuation and the retirement age, as all of these areas will have to accommodate the large shift in the age distribution. As Australia’s population grows we will need to accommodate our new citizens, so we need the foresight to implement effective infrastructure spending and urban planning.
All data sourced from here – http://www.abs.gov.au/AUSSTATS/abs@.nsf/DetailsPage/3222.02006%20to%202101?OpenDocument
Fun facts (you deserve it if you got this far):
Under Scenario A, Melbourne overtakes Sydney as Australia’s biggest city in 2039.
Under the likely growth scenario, Tasmania’s population will start declining from 2031.
This .gif shows the distribution of Australia’s population over time. Values are in one year age intervals.
I’ve seen some creative visualisations of issues surrounding the Australian election recently, though not as many maps as I expected. ggplot2 is the go-to package for plotting in R, so I thought I’d see if I could plot the Australian electoral divisions with it. Using the Australian Electoral Commission’s GIS mapping coordinates and mutilating Hadley Wickham’s tutorial, it was a pretty easy process.
1. Download the AEC boundary GIS data (warning 24mb).
2. Extract the file to your R working directory.
3. Run code…
The data.frame this process creates has 2.5m observations so mapping can take a while. I’m sure there are much more effective ways to map GIS data but I wanted to stick to ggplot2 in this instance.
require("rgdal")  # requires sp, will use proj.4 if installed
require("maptools")
require("ggplot2")
require("plyr")
require("rgeos")

# I upped my memory limit as the file we are going to map is pretty large
memory.limit(6000)

australia = readOGR(dsn = ".", layer = "COM20111216_ELB_region")
australia@data$id = rownames(australia@data)

# This step isn't in the tutorial; needed due to a couple of errors in the AEC GIS data
australia.buffered = gBuffer(australia, width = 0, byid = TRUE)

australia.points = fortify(australia.buffered, region = "id")
australia.df = join(australia.points, australia@data, by = "id")

# This will show you the variables in the dataset
head(australia@data)

ggplot(australia.df) +
  aes(long, lat, group = group, fill = ELECT_DIV) +
  # Don't want a legend with 150 variables so suppress the legend
  geom_polygon(show_guide = FALSE) +
  geom_path(color = "white") +
  # For some reason it maps too much ocean so limit coords (EDIT: due to Christmas Island)
  coord_equal(xlim = c(110, 155))
This gives you
While it’s a nice picture, it’s of little use as it is impossible to see small electorates.
State-by-state mapping might be more useful. Here is some code to map the ACT. I suggest anyone experimenting play around with the ACT data, as it doesn’t take long to process.
ggplot(subset(australia.df, STATE == "ACT")) +
  aes(long, lat, group = group, fill = ELECT_DIV) +
  geom_polygon() +
  geom_path(color = "white") +
  # Include limits to remove Jervis Bay plotting
  coord_equal(xlim = c(148.5, 149.5))
To include your own data for mapping, just add it to the australia@data data frame, merging by australia@data$ELECT_DIV. The charts look good, but to make them really eye-catching I suggest taking them into Inkscape.
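The merge step itself can be sketched with base R’s `merge()`. The data frames and values below are toy stand-ins – `aec` plays the role of `australia@data`, and the swing figures are invented purely to show the mechanics.

```r
# Toy electorate-level data to attach (illustrative values, not real results)
my_data <- data.frame(ELECT_DIV = c("Canberra", "Fraser"),
                      swing     = c(1.2, -0.8))

# Small stand-in for australia@data, keeping the id/ELECT_DIV columns
aec <- data.frame(id        = c("0", "1"),
                  ELECT_DIV = c("Canberra", "Fraser"))

# all.x = TRUE keeps every division even if it has no match in my_data
merged <- merge(aec, my_data, by = "ELECT_DIV", all.x = TRUE)
merged
```

With the real objects, the merged frame’s new column can then be passed to `fill` in the ggplot call in place of `ELECT_DIV`.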