Posting video access files to Drive with Google's APIs
Using the Google Drive API to load video access files quickly and dirtily
[Note: I wrote the first draft of this just over a year ago, just before COVID. In the interim, the TVTV project launched successfully, Trump is gone, the world has totally changed, etc., etc.]
I’ve been playing around with the different Google APIs recently, partly out of curiosity and partly out of necessity. Probably the most incredible thing about them is their vast scale (I guess that isn’t so surprising) and how labyrinthine the documentation can be. As someone who hadn’t used the Google APIs before last year, it took a lot of head bashing and staring blankly through my computer screen before I started to get the hang of how Google structures the different APIs, and how to interact with them in Python (and, against my better judgement, Ruby).
I’ll try to describe here how I set up a project (proxyDriver) in Google to work with the APIs, authenticate, and write code that does something useful with them, hopefully more or less efficiently.
TVTV
As part of an NEH grant project, we digitized hundreds of 1/2-inch open reel videotapes from the 1972 political conventions, and now the scholars we hired to write contextual essays need to view the access files ahead of the project’s official launch (hopefully before the elections!). Of course, the derivatives will all be streaming on Archive.org eventually, but we aren’t there yet internally and just need to get some version of the videos in place for the writers. And since they are busy with their own schedules and are scattered around the country, streaming was the only viable option… but where?
UC Berkeley has a deal with Google to give us unlimited storage (probably to fuel the coming SkyNet AI revolution), so Drive was a natural choice. Using the UI was clearly a waste of time at this scale, so why not code something that creates low rez (but still legible) copies of the videos and in the same breath also uploads them?
Dependencies
You can go to the proxyDriver
repo linked above and sort of follow along, but there are some dependencies to set up first. Most obviously you need a Google account, and also the Google Python client. Installing these is simple with pip
and Google gives you this command: pip install --upgrade google-api-python-client google-auth-httplib2 google-auth-oauthlib
[n.b.: I use pip3
]. Boom! you’re in.
Setting up a Drive project
All of the cool stuff you’re going to do is going to be associated with a “project” in the Google Developer interface which has its own API tokens and so on. You can make different projects that correspond with different… projects… to do different stuff, which can include not just Drive, but also YouTube and the rest of the Google universe. The process is pretty simple and actually pretty well explained in the documentation. You just need to enable the Drive API as part of your project (the instructions are good) and go from there.
Setting up authentication & a reusable token
This part is a little weird. I was hoping to run this project using an ssh connection to a remote server, but to set up authentication the way I did (with OAuth 2.0, which once set up is simple and reliable) you need to have a browser where you give your consent to authenticate yourself as a user the first time you use it. The output of the authentication process is a JSON
text file that you can copy over to your remote machine with scp
or what have you. The Google Python client then sets up the special sauce that lets you connect.
One thing to keep in mind is that the API requires the calling function to have approved scopes, so I had to experiment a bit to find which ones I needed to work on the files in my project. Definitely the documents
and drive.file
permissions were needed for this project. I also gave my code the drive
scope, which gives broad access to the linked Drive and is maybe overkill (and/or unsafe?).
Using the Python client for the Drive API
The Python client is really powerful and also a little fiddly, but I reused a lot of code from Google’s Python quickstart and their official repo with examples. I feel like I really just scratched the surface of what’s possible, but the examples (upload, replace, etc.) get you pretty far. One minor annoying thing is that for video you have to tell Google that the MIME type is video/mp4
even if it isn’t really. It doesn’t seem to actually care, but it does need that reassurance. Maybe Google re-encodes any video you throw at it as mp4?
The main functions that I used in uploader.py
in my code (aside from login
which is mostly just copied from the Google example) are MediaFileUpload
(a Python class defined by the API client which sets the parameters like local filepath and MIME type) followed by g_drive.files().create().execute()
which actually uploads the thing and handles broken connections, retries, etc. Pretty simple when you look at it! Here’s the line of code in my project that calls these.
A bit more about the API call & response structure
It might be helpful to look at the function I used to see what the create
(i.e., create a new Drive object) API call actually does:
def upload_it(self):
file_metadata = {
'name': self.baseName,
'parents': self.parents
}
media = MediaFileUpload(
self.localPath,
mimetype=self.mimeType,
resumable=True)
uploaded_file = self.g_drive.files().create(
body=file_metadata,
media_body=media,
fields='id',
supportsAllDrives=True
).execute()
This upload_it
function is defined in a class called Upload
that can create an instance of a single file to upload. The first variable, file_metadata
, sets the name of the file you’re uploading along with the parent
folder ID in Drive (the folder ID is an important concept for my purposes; it’s just the string of characters after https://drive.google.com/drive/u/0/folders/
in a Drive folder’s URL). The next variable, media
, creates an instance of the Google API class MediaFileUpload
where you define the local path for the file, its MIME type, and whether you want to allow resumable uploads if the connection is dropped for some reason. Next, uploaded_file
connects to the g_drive
connection instance (defined earlier in the Upload
class; really I should have had a single connection instance that gets passed to each upload… oh well). Then it calls the files().create()
function from the Drive API, using file_metadata
as the HTML body and media
as the binary payload. I actually don’t remember why you need the fields='id'
parameter, but it’s in all the examples… supportsAllDrives
is set to True since I’m using a “shared” Drive belonging to my institution. If you’re using a personal Drive account you may not need this?
Once you execute g_drive.files().create()
and the upload is successful, the HTTP response body includes a Files resource
object, which is basically JSON describing the file and its status. I just used this response to test that the upload was successful (i.e. I run get
on the object and verify that it has an id
property), but you could take it and then modify the object in Drive, say adding metadata, or adding it to other folders, or using it downstream in whatever way your fancy strikes you.
Making it go
The idea for the TVTV project was to upload batches of access files, with each batch related to a production getting its own folder on Drive. For this reason I set up the script to look in a single batch folder where it would go file by file to transcode an access copy to a temporary folder, then do the upload. The transcodes are all done at the same low-rez spec, hard coded into the Migration
class in proxy.py
.
I also realized it would be handy to use this script for other uses, so I added command-line flags you can use to specify Drive folders arbitrarily (rather than setting it in a config file), and also to skip the access file process, and to set a different MIME type if you’re working with images or audio. I’ve actually used it more for this by now than for my original purpose!
ffmprovisr
Also, I have to give a shoutout to ffmprovisr, which I use basically any time I need to do something with FFmpeg. I’m so bad at remembering each detail of the different commands I use on a regular basis, but with this cookbook I can devote my precious, dwindling brain reserves to more grandiose efforts.
What’s next
I’ve used this [in the year since COVID] for lots of different things now and it’s still pretty handy when I have a batch of large files that need to get to Drive. I think if I do anything else it will be a little refactoring to make it more efficient, add some actual tests for each upload, maybe streamline the config/setup and command-line flag processes. In the meantime, I’ll keep using it!