Use cached_path to download a file from GitHub releases?

I was hoping to use cached_path to download a file from a GitHub release. The motivation was to attach my pretrained AllenNLP model (model.tar.gz) to a GitHub release and then use cached_path to download it. This way a user could just supply a model name and I could map it to the actual URL, and then download/cache the model on their system.

However, I get a 403 error when I try this:

from allennlp.common.file_utils import cached_path

cached_path("https://github.com/JohnGiorgi/DeCLUTR/releases/download/v0.1.0rc1/declutr_small.tar.gz")

although I can access it with wget no problem:

wget https://github.com/JohnGiorgi/DeCLUTR/releases/download/v0.1.0rc1/declutr_small.tar.gz

Is there a way cached_path can be used to achieve this?

I don’t know; @petew, any ideas? This sounds like a feature request / bug.

Huh, looks like the 403 error comes from the initial HEAD request to fetch the Etag. I’m guessing it has to do with the fact that the link you have redirects to some resource on S3. We could potentially add a flag to cached_path that tells it to skip checking the Etag.
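To make the failure mode concrete, here is a minimal sketch of an ETag lookup that tolerates a rejected HEAD request. It is not the code in file_utils.py: the function name fetch_etag and its opener parameter are hypothetical, and it uses the standard library's urllib purely for illustration.

```python
import urllib.error
import urllib.request
from typing import Callable, Optional


def fetch_etag(url: str, opener: Callable = urllib.request.urlopen) -> Optional[str]:
    """Return the ETag header for `url`, or None if the HEAD request gets a 403.

    `fetch_etag` and `opener` are hypothetical names used for illustration;
    `opener` exists only so the network call can be swapped out.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request) as response:
            # The header may be absent, in which case this is also None.
            return response.headers.get("ETag")
    except urllib.error.HTTPError as err:
        if err.code == 403:
            # The server refused the HEAD request; carry on without an ETag.
            return None
        raise
```

With something like this, a 403 on the HEAD request would mean the download proceeds without an ETag instead of failing outright. Any other HTTP error is still raised.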

@petew Yup, that looks to be the problem. I set etag = None in file_utils.py and was then able to download the model with:

cached_path("https://github.com/JohnGiorgi/DeCLUTR/releases/download/v0.1.0rc1/declutr_small.tar.gz")

I would be happy to open a PR that adds a flag to skip the ETag check. Another approach would be to match the URL against a regex to see if it looks like a file attached to a GitHub release (e.g. it could look for "github.com" and "/releases/download").
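For reference, the regex idea could look roughly like the sketch below. The pattern and the helper name looks_like_github_release are hypothetical and only cover the URL shape from this thread.

```python
import re

# Hypothetical pattern: matches https://github.com/<owner>/<repo>/releases/download/...
GITHUB_RELEASE_PATTERN = re.compile(
    r"^https?://github\.com/[^/]+/[^/]+/releases/download/"
)


def looks_like_github_release(url: str) -> bool:
    """Return True if `url` looks like an asset attached to a GitHub release."""
    return GITHUB_RELEASE_PATTERN.match(url) is not None
```

A check like this would only tell the caller when to skip the ETag lookup. It would not help with other hosts that reject HEAD requests.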

A PR would be great. I’d rather go with that approach than add regexes for special cases.