Writing My First Open Source Package - Content Aggregation CLI
Introduction
A content aggregator is simply an application that gathers content from across the web and gives the user a consolidated way of consuming it.
A content aggregator can also save you a lot of time otherwise wasted on endlessly scrolling news feeds and getting distracted by random posts on, say, your Reddit feed.
Content aggregation helps us optimize our content consumption: instead of scrolling through 5 different websites we only need a single one, and instead of endlessly scrolling to filter the content we care about, we can be presented with content related to our topics of interest immediately.
In this article, you will learn how to create your own customized content aggregator with Python from scratch.
Brief Detour
When writing this post, I had a minimal code example of a content aggregator that I planned to share with you, but while writing I had the thought of expanding it, and eventually I even published it to PyPI as my first open source package.
Ideally, by the end of this post, you'll be able, and hopefully eager, to contribute to Fuse yourself.
Prerequisites
- A local development environment for Python 3.7+
- Familiarity with Python.
Step 1 - Installing Dependencies
In this step, you will install the modules that you will utilize later on. To do so, you will create a file that will hold the requirements for the entire project.
The packages you are going to install are:
- feedparser - An RSS parsing module
- praw - Python Reddit API Wrapper module
- colorama - Enable colored terminal text
- typing - Adding support for type hints (typing has been part of the standard library since Python 3.5, so this entry only matters as a backport)
Create a new file called requirements.txt.
Each line in this file will include the name of a package and, optionally, the required version to install. Copy the following requirements into your requirements.txt file:

```
feedparser==6.0.8
praw
colorama
typing
```
To install all of the packages listed in the requirements.txt file, run the following command:

```
pip3 install -r requirements.txt
```
In this step, you installed all the packages necessary for this tutorial.
Next, you will get a basic understanding of how the project is structured.
Step 2 - High Level Design
In order to support various sources in a convenient way, we will create an abstract base class called Source.
Every source that we wish to add will inherit from it and extend its functionality.
In this post I am going to cover the RedditSource and the MediumSource, both of which are subclasses of Source.
Lastly, we will have a SourceManager, which will be given a list of sources and will trigger each source's fetching mechanism.
In this step, you got a basic understanding of the project’s structure.
Next, you will implement the base abstract class Source.
Step 3 - Implementing the Base Class
In this step, you will implement the base abstract class Source.
Open a new file called models.py and define the Source class.
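The class boils down to an abstract base class with two abstract methods; a minimal sketch (the docstrings are mine):

```python
from abc import ABC, abstractmethod


class Source(ABC):
    """Abstract base class that every content source inherits from."""

    @abstractmethod
    def connect(self):
        """Connect to the source (via an API key, for example), if needed."""

    @abstractmethod
    def fetch(self):
        """Fetch content from the source."""
```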
The above class has two responsibilities: one is to connect to the source if needed (via an API key, for example), and the second is to fetch content from the source.
The implementation stays empty in this base class, and every concrete source will have to implement both methods.
In this step, you implemented the base abstract class Source.
Next, you will implement the SourceManager class.
Step 4 - Implementing the Manager Class
In this step, you will implement the SourceManager class.
Open the file models.py and add the SourceManager class.
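A minimal sketch of the manager; I'm assuming that calling it simply fetches from each source and prints the source itself:

```python
class SourceManager:
    """Holds a list of sources and triggers each one's fetching mechanism."""

    def __init__(self, sources=None):
        self.sources = sources if sources is not None else []

    def add_source(self, source):
        # Currently unused, but handy for adding sources after construction.
        self.sources.append(source)

    def __call__(self):
        # Trigger every source's fetch mechanism and print the results.
        for source in self.sources:
            source.fetch()
            print(source)
```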
As discussed in the high level design step, the SourceManager gets a list of sources, and upon calling it, the SourceManager triggers each source's fetch function and prints the results.
There is also a function to add sources which is currently unused, but might be useful later on.
In this step, you implemented the SourceManager class and basically finished writing the wrapping of this application.
Next, you will learn how to fetch content from Reddit and implement the RedditSource class.
Step 5 - Implementing Reddit Source
In this step, you will implement the RedditSource class.
To start with, you will need to get an API key in order to use the praw library and query Reddit's API.
There's a short guide on Reddit's GitHub on how to do so.
Make sure you have a client id and a client secret.
Once you have the client id and secret, add them as the environment variables REDDIT_CLIENT_ID and REDDIT_CLIENT_SECRET.
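On Linux or macOS, for example, you can export them in your shell (the placeholder values are, of course, yours to replace):

```shell
export REDDIT_CLIENT_ID="your-client-id"
export REDDIT_CLIENT_SECRET="your-client-secret"
```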
Now, create a new file called reddit_source.py and open it.
Let's first take care of the minimal necessary implementation, which is defined by the Source class.
Let's go through the implementation briefly. Starting with the __init__ method: it receives the subreddit you wish to query, the metric you wish to query on, which is either hot or top, and a limit on the number of results you want to see.
Inside __init__, we create a connection to Reddit's API via the praw library.
In order to create the connection, you pass the client id and secret that you generated at the beginning of this step.
Next, in the fetch method, depending on the metric you got, you retrieve the matching results from praw using the connection object.
Lastly, we reformat the results from the API so that results across different sources have a unified representation.
To create a unified representation, open the file models.py and add a Result class.
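A minimal sketch; the green styling is my assumption, and any colorama color would do:

```python
from colorama import Fore, Style


class Result:
    """A unified representation of a single piece of content."""

    def __init__(self, title, url):
        self.title = title
        self.url = url

    def __repr__(self):
        return f"{Fore.GREEN}{self.title}{Style.RESET_ALL}\n{self.url}\n"
```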
The above Result class simply takes the title and the url of the post we queried and prints them to the terminal using the colorama module.
After creating the Result class, come back to the reddit_source.py file and finish the implementation of the RedditSource class.
The reformat_results function is responsible for taking the raw results given by the API and transforming them into the unified representation class you created earlier.
Lastly, by implementing the __repr__ method, you can print all the results that you fetched, and the implementation of the RedditSource is done.
In this step, you implemented the RedditSource class and created a unified representation for all the different sources.
Next, you will get a taste of what’s already implemented by executing the program.
Step 6 - Executing Partial Implementation
In this step, you will execute what you have implemented so far.
To do so, create a file called main.py.
The above code simply creates two reddit sources, the first for the programming subreddit and the second for the shower thoughts subreddit.
After creating these sources, we pass them as a list to the SourceManager and call it in order to execute the program.
Execute your program with:

```
python main.py
```
In this step, you executed what you implemented in the last 5 steps.
Next, you will add an additional source: Medium.
Step 7 - Implementing Medium Source
In this step, you will implement the MediumSource class.
As we did before, let's first take care of the minimal necessary implementation, which is defined by the Source class.
Create a new file called medium_source.py.
As you might have noticed, the MediumSource is slightly different from the RedditSource.
Here, we don't need to connect through an API, so the implementation of connect will remain empty.
To query this source, we will use the feedparser module, which retrieves results from the Medium feed based on tagging.
To complete the implementation, we are missing the reformat_results and __repr__ functions, which will look quite similar to the matching RedditSource functions.
As in the RedditSource class, the reformat_results function is responsible for transforming the raw results we queried into the unified representation class you created in an earlier step.
In this step, you implemented the MediumSource class, and by doing so finished implementing your content aggregator (at least to the scope that I am going to cover).
Next, you will execute the entire program.
Step 8 - Executing The Program
Similarly to step 6, open main.py.
You should already have the implementation from step 6 there.
Now, you can throw another type of source in, which is the MediumSource.
Note: all the new lines, or lines that were changed, are marked with a # new comment.
Now, execute your program with the command:

```
python main.py
```
In this step, you executed your content aggregator and you are ready to add more sources on your own.
What’s Next
As I mentioned earlier, I turned this content aggregator project into an open source tool called Fuse.
If you are excited about adding more sources, I invite you to challenge yourself and contribute to Fuse.
If you are willing to contribute and face some problems, don't hesitate to reach out.
Looking for a powerful, self-hosted backend for forms?
I’m building Collecto — a production-ready tool designed to handle your forms with ease and security. Check it out here and be part of its journey!