Usage¶
The Megu package can be used both as a library and through the provided CLI tool. For the purposes of this “Usage” documentation, we will step through how to use the framework as a library.
Data Descriptors¶
First things first, we need to define some common data descriptors.
- Content: Defines some media content discovered on a site (image, video, audio, etc.). A content instance contains several data descriptors within it.
- Resource: Defines a resource that can be fetched to help reproduce some content locally. This is an abstract definition; concrete implementations such as HttpResource should be used within content instances.
- Manifest: Defines a grouping of locally fetched resources that can be merged to reproduce some content.
Configuration¶
All configuration used internally by the tool is read from the MeguConfig instance.
This object mostly contains metadata, along with some unique temporary directory paths used for storing downloaded artifacts.
Within this config, there are three directories that can be overridden by environment variables, which are defined by MeguEnv.
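For illustration, these directories could be overridden before Megu's configuration is loaded by setting the environment variables from Python. The variable names come from MeguEnv, but the paths below are arbitrary examples:

```python
import os
import tempfile
from pathlib import Path

# Illustrative overrides only: the variable names come from MeguEnv,
# but these paths are arbitrary examples. Set them before Megu's
# configuration is loaded.
base = Path(tempfile.mkdtemp())
os.environ["MEGU_PLUGIN_DIR"] = str(base / "plugins")
os.environ["MEGU_LOG_DIR"] = str(base / "logs")
os.environ["MEGU_CACHE_DIR"] = str(base / "cache")
```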
MEGU_PLUGIN_DIR
The directory where plugins are stored. Defaults to {user config dir}/megu/plugins. The actual path to the user’s config directory depends on the OS being used. Check out appdirs for more information.

MEGU_LOG_DIR
The directory where logs are stored. Defaults to {user log dir}/megu. The actual path to the user’s log directory depends on the OS being used. Check out appdirs for more information.

MEGU_CACHE_DIR
The directory where persistent caches are stored. Defaults to {user cache dir}/megu. The actual path to the user’s cache directory depends on the OS being used. Check out appdirs for more information.

Installing Plugins¶
Plugins are stored as installed packages within a subdirectory in the plugin directory. The subdirectory that these plugins are installed into defines the name of the plugin (not the package).
Take for example the following directory structure:
~/.config/megu/plugins/
└── megu_gfycat/ # This is the plugin name
├── LICENSE
├── README.md
├── megu_gfycat/ # Not this (this is the package name)
│ ├── __init__.py
│ ├── api.py
│ ├── constants.py
│ ├── guesswork.py
│ ├── helpers.py
│ ├── plugins/
│ │ └── ...
│ └── utils.py
├── megu_gfycat-0.1.0.dist-info/
│ ├── INSTALLER
│ ├── LICENSE
│ ├── METADATA
│ ├── RECORD
│ ├── REQUESTED
│ ├── WHEEL
│ └── direct_url.json
└── pyproject.toml
In this example, the name of the top-level megu_gfycat directory (within the plugins/ directory) is the plugin name. The nested megu_gfycat is the same name, but it refers to the installed package.
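Since a plugin's name is just its subdirectory's name, a small sketch like the following (a hypothetical helper, not part of Megu's API) can enumerate installed plugin names by listing the subdirectories of the plugin directory:

```python
from pathlib import Path

# Hypothetical helper (not part of Megu's API): each immediate
# subdirectory of the plugin directory is treated as a plugin name.
def installed_plugin_names(plugin_dir: Path) -> list[str]:
    if not plugin_dir.is_dir():
        return []
    return sorted(p.name for p in plugin_dir.iterdir() if p.is_dir())

# The default path shown here assumes a Linux user config directory.
print(installed_plugin_names(Path("~/.config/megu/plugins").expanduser()))
```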
So installing plugins takes two steps:
1. Create a new directory within the plugins folder.
2. Install a plugin package into the newly created plugin folder.
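These two steps can be sketched manually as follows. This assumes the Linux default plugin directory, and the actual installation would be executed with something like subprocess.run(command, check=True):

```python
import sys
from pathlib import Path

# A sketch of the two steps done manually.
# The plugin directory shown assumes a Linux default; adjust per OS.
plugin_dir = Path("~/.config/megu/plugins/megu_gfycat").expanduser()

# Step 1: create the plugin's directory (uncomment to actually create it).
# plugin_dir.mkdir(parents=True, exist_ok=True)

# Step 2: install the plugin package into that directory with pip,
# e.g. subprocess.run(command, check=True).
command = [
    sys.executable, "-m", "pip", "install",
    "--target", str(plugin_dir),
    "git+https://github.com/stephen-bunn/megu-gfycat.git",
]
```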
A helper method megu.plugin.manage.add_plugin() can automate this for you if you’re not working in a containerized solution. Behind the scenes, this method utilizes pip to install the given package into a newly created plugin directory using the same name as the given package URL. Be sure to use URLs understood by pip if using this method.
from megu.plugin.manage import add_plugin
add_plugin("git+https://github.com/stephen-bunn/megu-gfycat.git")
If you are working in a solution like Docker, you should make use of the environment variable MEGU_PLUGIN_DIR to set a plugin directory and then use pip to install the desired package yourself.
ENV MEGU_PLUGIN_DIR=/.megu/plugins/
RUN mkdir -p $MEGU_PLUGIN_DIR/megu_gfycat
RUN python -m pip install --upgrade git+https://github.com/stephen-bunn/megu-gfycat.git --target $MEGU_PLUGIN_DIR/megu_gfycat
We include a fallback plugin, megu.plugin.generic.GenericPlugin, that assumes the content can be fetched as a single resource using a single HTTP GET request. If no plugins are provided, this generic plugin will always be used. Of course, if the content can’t be fetched using this naive approach, it will fail.
Fetching Content¶
There are 6 steps to fetching content from a URL to the local filesystem in this framework:
Discover the plugin best suited to get content from a given URL.
Iterate over available content from a given URL using the discovered plugin.
Filter down what content should be fetched.
Get the best suited downloader for the filtered content.
Download the content using the downloader to produce a manifest of artifacts.
Merge the downloaded manifest of artifacts to reproduce the content from the given URL.
We provide a module megu.services
which exposes some helpful functions to reduce the boilerplate necessary to implement each of these steps.
Plugin Discovery¶
To discover the best plugin to handle a given URL, you can use the get_plugin() function. Depending on what plugins you have available, it will attempt to eagerly determine whether a plugin can handle the given URL. Otherwise, this function will provide the fallback megu.plugin.generic.GenericPlugin instance as the best suited plugin.
from megu.services import get_plugin
URL = "https://gfycat.com/pepperyvictoriousgalah-wonder-woman-1984-i-like-to-party"
plugin = get_plugin(URL)
# If megu-gfycat is installed, will return "Gfycat Basic" plugin
# Otherwise, will return "Generic Plugin"
Content Iteration¶
Now that you have the best plugin for the given URL, you need to invoke the plugin’s extraction logic to get Content entries that can be downloaded. You can use the iter_content() iterator to invoke a plugin with a URL.
from megu.services import iter_content
for content in iter_content(URL, plugin):
    ...  # iterates over available content from the given URL as lazily as possible
This iterator will yield all content extracted by the plugin within the for loop. These content entries may be many instances of the same content using different qualities. To reduce what content is handled, we need to filter the results down a bit.
Content Filtering¶
Filtering content can be done using the functions provided in the megu.filters module. The simplest filter, best_content(), will keep only unique content with the highest indicated quality. This filter can be applied directly to the call of iter_content to reduce any required nesting.
from megu.services import iter_content
from megu.filters import best_content
for content in best_content(iter_content(URL, plugin)):
    ...  # filters content yielded by the content iterator
Note that, in order to determine which content is best, this filter greedily consumes the iter_content generator.
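To illustrate the greedy behavior, here is a plain-Python analogue of a "best quality per unique content" filter (illustrative only, not Megu's internals). It must consume its entire input before yielding anything, since the best entry for a group may appear last:

```python
def best_per_group(items, key, quality):
    # Must consume the whole iterator before yielding anything, since
    # the best entry of a group may appear last in the stream.
    buckets = {}
    for item in items:
        k = key(item)
        if k not in buckets or quality(item) > quality(buckets[k]):
            buckets[k] = item
    yield from buckets.values()

entries = [("clip", 480), ("clip", 1080), ("other", 720)]
best = list(best_per_group(iter(entries), key=lambda e: e[0], quality=lambda e: e[1]))
print(best)  # → [('clip', 1080), ('other', 720)]
```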
Downloader Discovery¶
With the content extracted from the plugin, we need to get the best suited downloader for the content.
You can use the get_downloader() function to get the most appropriate downloader.
from megu.services import get_downloader
downloader = get_downloader(content)
The downloader is determined by the type of resources specified by the content. By default, get_downloader will fall back to the HttpDownloader if no other downloader can handle the given content.
Content Download¶
The provided downloaders will produce a Manifest instance. Downloading the content can be done directly through the downloader’s download_content() method.
manifest = downloader.download_content(content)
Manifest Merge¶
The final step is to merge the downloaded manifest to a file path. You can use the merge_manifest() function to help you. Note that this step is a little unusual, as the plugin itself provides the manifest merging functionality. For this reason, we need to pass the plugin instance back into the function along with the fetched manifest.
from megu.services import merge_manifest
merge_manifest(plugin, manifest, Path("~/Downloads/", content.filename).expanduser())
Altogether, the full content fetching script can be written as the following:
from pathlib import Path
from megu.services import get_plugin, get_downloader, iter_content, merge_manifest
from megu.filters import best_content
URL = "https://gfycat.com/pepperyvictoriousgalah-wonder-woman-1984-i-like-to-party"
plugin = get_plugin(URL)
for content in best_content(iter_content(URL, plugin)):
    downloader = get_downloader(content)
    manifest = downloader.download_content(content)
    to_path = merge_manifest(
        plugin,
        manifest,
        Path("~/Downloads/", content.filename).expanduser(),
    )
    print(f"Downloaded {content.id} to {to_path}")
This, of course, skips over any kind of content de-duplication and checking whether the content is already present or cached on the local filesystem. But this is a pretty lightweight solution for fetching content from a given URL using this framework.
If you find anything that you think could be improved, you can interact with development in the Megu GitHub repository.