Google allows you to export your data from various products. In this post I show how to run analysis on my Google Fit data and pull out various pieces of information using basic bash command line tools.
Getting The Data
The first thing we need to do is download our data from Google Fit. To do this, go to takeout.google.com. From there you can select which products’ data you want to export. For this project you only need your Google Fit data, so hit the “Select none” button, then select the Fit data as the only product to export. Scroll to the bottom and hit export. Depending on how much data you have it might take a while; Google will email you once the export is complete so you can download your data.
After you’ve downloaded the data, go ahead and unzip it into whatever directory you want. It will extract into the following folder structure:
Takeout
`- Product
   `- Data Folders
Since we’re working with Fit data, we’ll cd Takeout/Fit/Daily Aggregations from the directory we extracted into. Google Fit exports your data in two formats: one is a weird XML format, and the other is plain CSVs. We want the CSVs, because they can be operated on really easily with command line tools.
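From the extraction directory, that looks roughly like this (the folder name comes straight from the Takeout export):

```bash
# Change into the folder holding the daily CSVs and peek at what's there.
cd "Takeout/Fit/Daily Aggregations"
ls *.csv | head        # date-named files plus Daily Summaries.csv
```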
Data Format
In the Daily Aggregations folder we find files named like YYYY-MM-DD.csv, containing the activities of that day, and a Daily Summaries.csv, which contains the aggregated data for every day in the export. The date-named files break a single day down into fifteen-minute windows, while the summary file gives one row per day. So, if you want more granular data, use the date-named files; otherwise just use the daily summaries, since that reduces the amount of data crawled.
You should look at the headers in your CSV files to see what was recorded for your activities. You have your base columns, and appended to them is the duration of each activity during that time period in the date-named files, or the total amount of each activity completed in the summary file.
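A quick way to see those headers, and the position of each column (which we’ll need when picking columns later), is something like:

```bash
# Print the summary file's header, one column name per line, numbered,
# so we know which field number holds the value we care about.
head -1 "Daily Summaries.csv" | tr ',' '\n' | nl
```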
Examining Daily Summaries
Since I’m interested in my min, max, and average calorie use per day, we’ll use the summary file and some awk magic. We’ll start with a basic pipeline that selects a column from a CSV file while excluding the header. From there we can pipe it into our favorite tool for analyzing a bag of numbers, be that awk or R. R gives us more from the get-go, so let’s use that on this data.
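A minimal sketch of what such aliases might look like is below. The column number in calories is a placeholder; check your own header (with the nl trick above) for where the calorie column actually sits.

```bash
# Skip the header row and print one column; naive comma splitting is
# fine for these files since the fields aren't quoted.
alias calories="awk -F, 'NR > 1 { print \$18 }'"

# Read numbers from stdin and print R's summary() of them.
alias summarize="Rscript -e 'summary(as.numeric(readLines(\"stdin\")))'"
```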
Now with these aliases in place we can run a simple analysis on the summaries, or on whatever other bag of numbers you’re looking at. Use it like the below.
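For example, assuming the aliases above:

```bash
# Per-day calorie totals from the summary file, summarized by R.
calories "Daily Summaries.csv" | summarize
```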
The Rscript used just reads in all of the values from stdin, then uses the magic summary function to get some information out of them: the kind of statistics you learned how to compute in third grade, like the min, max, median, and mean. As you can see, I burn more than 2000 calories a day on average.
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  562.8  1948.3  2058.5  2087.0  2192.1  3352.1
Examining Data on a Date-by-Date Basis
Now let’s look at how one might calculate a statistic derived from the files for each date. Each of the date files uses a range-based approach, where the day is broken into fifteen-minute increments. Let’s say I want to figure out how many calories I burn during the average fifteen-minute period. Let’s use the pipeline from above and some find magic.
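A sketch of that idea follows, again with the calorie column number as a placeholder (the date-named files may put it in a different position than the summary file does):

```bash
# Run every date-named file through awk in one go: FNR > 1 skips each
# file's header, and empty calorie cells are dropped before summarizing.
find . -name '20??-??-??.csv' \
  -exec awk -F, 'FNR > 1 && $17 != "" { print $17 }' {} + \
  | summarize
```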
Conclusion
You can build some pretty powerful pipelines using just the basic Unix tools and a little bit of R here and there. Sure, we’re not churning through gigabytes here, but it does give you an idea of how to use small tools to crunch a relatively small amount of data without building a big, heavy cluster.