Usage¶
The Megu package can be used both as a library and through the provided CLI tool. For the purposes of this “Usage” documentation, we will step through how to use the framework as a library.
Data Descriptors¶
First things first, we need to define some common data descriptors.
- Content: Defines some media content discovered on a site (image, video, audio, etc.). A content instance contains several data descriptors within it.
- Resource: Defines a resource that can be fetched to help reproduce some content locally. This is an abstract definition; concrete implementations such as HttpResource should be used within content instances.
- Manifest: Defines a grouping of locally fetched resources that can be merged to reproduce some content.
Configuration¶
All configuration used internally by the tool is read from the MeguConfig instance.
This object mostly contains metadata, along with some unique temporary directory paths used for storing downloaded artifacts.
Within this config, there are three directories that can be overridden by environment variables, which are defined by MeguEnv.
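For illustration, these directories could be overridden before Megu's configuration is loaded by setting the environment variables from Python. The variable names come from MeguEnv, but the paths below are arbitrary examples:

```python
import os
import tempfile
from pathlib import Path

# Illustrative overrides only: the variable names come from MeguEnv,
# but these paths are arbitrary examples. Set them before Megu's
# configuration is loaded.
base = Path(tempfile.mkdtemp())
os.environ["MEGU_PLUGIN_DIR"] = str(base / "plugins")
os.environ["MEGU_LOG_DIR"] = str(base / "logs")
os.environ["MEGU_CACHE_DIR"] = str(base / "cache")
```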
MEGU_PLUGIN_DIR
The directory where plugins are stored. Defaults to {user config dir}/megu/plugins. The actual path to the user’s config directory depends on the OS being used. Check out appdirs for more information.

MEGU_LOG_DIR
The directory where logs are stored. Defaults to {user log dir}/megu. The actual path to the user’s log directory depends on the OS being used. Check out appdirs for more information.

MEGU_CACHE_DIR
The directory where persistent caches are stored. Defaults to {user cache dir}/megu. The actual path to the user’s cache directory depends on the OS being used. Check out appdirs for more information.

Installing Plugins¶
Plugins are stored as installed packages within a subdirectory in the plugin directory. The subdirectory that these plugins are installed into defines the name of the plugin (not the package).
Take for example the following directory structure:
~/.config/megu/plugins/
└── megu_gfycat/ # This is the plugin name
├── LICENSE
├── README.md
├── megu_gfycat/ # Not this (this is the package name)
│ ├── __init__.py
│ ├── api.py
│ ├── constants.py
│ ├── guesswork.py
│ ├── helpers.py
│ ├── plugins/
│ │ └── ...
│ └── utils.py
├── megu_gfycat-0.1.0.dist-info/
│ ├── INSTALLER
│ ├── LICENSE
│ ├── METADATA
│ ├── RECORD
│ ├── REQUESTED
│ ├── WHEEL
│ └── direct_url.json
└── pyproject.toml
In this example, the name of the top-level megu_gfycat directory (within the plugins/ directory) is the plugin name. The nested megu_gfycat is the same name, but it refers to the installed package.
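Since a plugin's name is just its subdirectory's name, a small sketch like the following (a hypothetical helper, not part of Megu's API) can enumerate installed plugin names by listing the subdirectories of the plugin directory:

```python
from pathlib import Path

# Hypothetical helper (not part of Megu's API): each immediate
# subdirectory of the plugin directory is treated as a plugin name.
def installed_plugin_names(plugin_dir: Path) -> list[str]:
    if not plugin_dir.is_dir():
        return []
    return sorted(p.name for p in plugin_dir.iterdir() if p.is_dir())

# The default path shown here assumes a Linux user config directory.
print(installed_plugin_names(Path("~/.config/megu/plugins").expanduser()))
```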
So installing plugins takes two steps:
1. Create a new directory within the plugins folder.
2. Install a plugin package into the newly created plugin folder.
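These two steps can be sketched manually as follows. This assumes the Linux default plugin directory, and the actual installation would be executed with something like subprocess.run(command, check=True):

```python
import sys
from pathlib import Path

# A sketch of the two steps done manually.
# The plugin directory shown assumes a Linux default; adjust per OS.
plugin_dir = Path("~/.config/megu/plugins/megu_gfycat").expanduser()

# Step 1: create the plugin's directory (uncomment to actually create it).
# plugin_dir.mkdir(parents=True, exist_ok=True)

# Step 2: install the plugin package into that directory with pip,
# e.g. subprocess.run(command, check=True).
command = [
    sys.executable, "-m", "pip", "install",
    "--target", str(plugin_dir),
    "git+https://github.com/stephen-bunn/megu-gfycat.git",
]
```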
A helper method megu.plugin.manage.add_plugin() can automate this for you if you’re not working in a containerized solution. Behind the scenes, this method utilizes pip to install the given package into a newly created plugin directory using the same name as the given package URL. Be sure to use URLs understood by pip if using this method.
from megu.plugin.manage import add_plugin
add_plugin("git+https://github.com/stephen-bunn/megu-gfycat.git")
If you are working in a solution like Docker, you should make use of the environment variable MEGU_PLUGIN_DIR to set a plugin directory and then use pip to install the desired package yourself.
ENV MEGU_PLUGIN_DIR=/.megu/plugins/
RUN mkdir -p $MEGU_PLUGIN_DIR/megu_gfycat
RUN python -m pip install --upgrade git+https://github.com/stephen-bunn/megu-gfycat.git --target $MEGU_PLUGIN_DIR/megu_gfycat
We include a fallback plugin, megu.plugin.generic.GenericPlugin, that assumes the content can be fetched as a single resource using a single HTTP GET request. If no plugins are provided, this generic plugin will always be used. Of course, if the content can’t be fetched using this naive approach, it will fail.
Fetching Content¶
There are 6 steps to fetching content from a URL to the local filesystem in this framework:
Discover the plugin best suited to get content from a given URL.
Iterate over available content from a given URL using the discovered plugin.
Filter down what content should be fetched.
Get the best suited downloader for the filtered content.
Download the content using the downloader to produce a manifest of artifacts.
Merge the downloaded manifest of artifacts to reproduce the content from the given URL.
We provide a module megu.services
which exposes some helpful functions to reduce the boilerplate necessary to implement each of these steps.
Plugin Discovery¶
To discover the best plugin to handle a given URL, you can use the get_plugin() function. Depending on what plugins you have available, it will attempt to eagerly determine whether a plugin can handle the given URL. Otherwise, this function will provide the fallback megu.plugin.generic.GenericPlugin instance as the best suited plugin.
from megu.services import get_plugin
URL = "https://gfycat.com/pepperyvictoriousgalah-wonder-woman-1984-i-like-to-party"
plugin = get_plugin(URL)
# If megu-gfycat is installed, will return "Gfycat Basic" plugin
# Otherwise, will return "Generic Plugin"
Content Iteration¶
Now that you have the best plugin for the given URL, you need to invoke the plugin’s extraction logic to get Content entries that can be downloaded. You can use the iter_content() iterator to invoke a plugin with a URL.
from megu.services import iter_content
for content in iter_content(URL, plugin):
    ...  # iterates over available content from the given URL as lazily as possible
This iterator will yield all content extracted by the plugin within the for loop. These content entries may be many instances of the same content using different qualities. To reduce what content is handled, we need to filter the results down a bit.
Content Filtering¶
Filtering content can be done using the functions provided in the megu.filters module. The simplest filter, best_content(), will keep only unique content with the highest indicated quality. This filter can be applied directly to the call of iter_content to reduce any required nesting.
from megu.services import iter_content
from megu.filters import best_content
for content in best_content(iter_content(URL, plugin)):
    ...  # filters content yielded by the content iterator
Note that, in order to determine which content is best, this filter greedily consumes the iter_content generator.
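To illustrate the greedy behavior, here is a plain-Python analogue of a "best quality per unique content" filter (illustrative only, not Megu's internals). It must consume its entire input before yielding anything, since the best entry for a group may appear last:

```python
def best_per_group(items, key, quality):
    # Must consume the whole iterator before yielding anything, since
    # the best entry of a group may appear last in the stream.
    buckets = {}
    for item in items:
        k = key(item)
        if k not in buckets or quality(item) > quality(buckets[k]):
            buckets[k] = item
    yield from buckets.values()

entries = [("clip", 480), ("clip", 1080), ("other", 720)]
best = list(best_per_group(iter(entries), key=lambda e: e[0], quality=lambda e: e[1]))
print(best)  # → [('clip', 1080), ('other', 720)]
```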
Downloader Discovery¶
With the content extracted from the plugin, we need to get the best suited downloader for the content.
You can use the get_downloader() function to get the most appropriate downloader.
from megu.services import get_downloader
downloader = get_downloader(content)
The downloader is determined by the type of resources specified by the content. By default, get_downloader will fall back to the HttpDownloader if no other downloader can handle the given content.
Content Download¶
The provided downloaders will produce a Manifest instance. Downloading the content can be done directly through the downloader’s download_content() method.
manifest = downloader.download_content(content)
Manifest Merge¶
The final step is to merge the downloaded manifest to a file path. You can use the merge_manifest() function to help you. Note that this step is a little unusual, as the plugin itself provides the manifest merging functionality. For this reason, we need to pass the plugin instance back into the function along with the fetched manifest.
from megu.services import merge_manifest
merge_manifest(plugin, manifest, Path("~/Downloads/", content.filename).expanduser())
Altogether, the full content fetching script can be written as the following:
from pathlib import Path
from megu.services import get_plugin, get_downloader, iter_content, merge_manifest
from megu.filters import best_content
URL = "https://gfycat.com/pepperyvictoriousgalah-wonder-woman-1984-i-like-to-party"
plugin = get_plugin(URL)
for content in best_content(iter_content(URL, plugin)):
    downloader = get_downloader(content)
    manifest = downloader.download_content(content)
    to_path = merge_manifest(
        plugin,
        manifest,
        Path("~/Downloads/", content.filename).expanduser(),
    )
    print(f"Downloaded {content.id} to {to_path}")
This, of course, skips over any kind of content de-duplication and checking whether the content is already present or cached on the local filesystem. But this is a pretty lightweight solution for fetching content from a given URL using this framework.
If you find anything that you think could be improved, you can interact with development in the Megu GitHub repository.