Writing an ingester

First we need to write a program that will download the blog posts and "ingest" them, making them work with our system. We call this program an "ingester".

In our ingester, we'll read the Creative Commons blog's RSS feed, download the articles linked from the feed, clean them up so that we can re-theme them to fit with our app, and package them into a "hatch", which is the input to the next step in the app-building process.

An ingester is a Node.js program. There is an NPM package called libingester which provides some facilities that make it easier to write an ingester. We will write our Creative Commons blog ingester using libingester. We'll also use other NPM packages for cleaning up the HTML.

Setting up a Node module for the ingester

First, create a directory where the ingester code will live, and create a package:

mkdir cc-ingester
cd cc-ingester
npm init

The npm init command will ask you some questions in order to set up the package.

Accept the default name of cc-ingester.
Put 0.0.0 as the version.
Write a nice description, such as "Ingester for Creative Commons blog".
Accept the default entry point.
Leave the test command, git repository, and keywords blank.
Put your name as the author.
Put CC0-1.0 as the license, or whatever you choose. (The code in this tutorial is licensed CC0-1.0, which means that anyone can use, copy, and modify it freely.)

Set up any developer tools you might be planning to use, for example:

npm install --save-dev eslint
./node_modules/.bin/eslint --init

Parsing the RSS feed

Create a file in the cc-ingester directory called index.js. We'll write the ingester in here.

First we want to read the Creative Commons blog's RSS feed and make sure we can parse its entries. For this we'll use the utilities included in libingester. Install it like this:

npm install --save libingester

Here's a minimal version of index.js that will parse the RSS feed and show which posts we would ingest:

const Libingester = require('libingester');

const feedURI = 'https://creativecommons.org/blog/feed/';

function ingestArticle({date, title}) {
    console.log(date, title);
}

async function main() {
    const items = await Libingester.util.fetch_rss_entries(feedURI);
    items.forEach(ingestArticle);
}

main();

Add "start": "node index.js" to the scripts dictionary in package.json, and test the script by running npm start!

You will probably see no articles. According to the libingester API documentation, that is because fetch_rss_entries() by default only looks at entries from the previous day. Let's supply some better parameters, putting no maximum on the number of posts and a 3-month maximum on the posts' age:

Libingester.util.fetch_rss_entries(feedURI, Infinity, 90)
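
To make the two numeric arguments concrete, here is a minimal sketch of the kind of filtering they describe: at most a given number of entries, none older than a given number of days. The helper filterRecent() and the sample data are hypothetical and only illustrate the idea; libingester's own implementation may differ.

```javascript
// Hypothetical helper, not part of libingester: keeps at most `maxItems`
// entries that were published within the last `maxDays` days.
function filterRecent(items, maxItems, maxDays, now = new Date()) {
    const msPerDay = 24 * 60 * 60 * 1000;
    const oldest = now.getTime() - maxDays * msPerDay;
    return items
        .filter(({date}) => new Date(date).getTime() >= oldest)
        .slice(0, maxItems);  // slice(0, Infinity) keeps everything
}

// Made-up entries, with a fixed "now" so the result is reproducible
const now = new Date('2018-02-01T00:00:00Z');
const sample = [
    {title: 'Yesterday', date: '2018-01-31T12:00:00Z'},
    {title: 'Last month', date: '2018-01-05T12:00:00Z'},
    {title: 'Last year', date: '2017-06-01T12:00:00Z'},
];

// A one-day window keeps only the newest entry
console.log(filterRecent(sample, Infinity, 1, now).map(i => i.title));
// A 90-day window also keeps the entry from last month
console.log(filterRecent(sample, Infinity, 90, now).map(i => i.title));
```

With a one-day window only the newest sample entry survives, which mirrors the empty output from the first run.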

This time, npm start prints more post titles, but not 90 days' worth. Examining the RSS feed shows why: it only contains 10 items. This is a common problem with RSS feeds, and unfortunately there isn't a standard way to paginate them. However, we can see in the RSS feed that it is generated by WordPress, which gives us an easy solution. WordPress RSS feeds support pagination, and since WordPress is so common, libingester includes a utility for paginating them, which we can use:

const paginator = Libingester.util.create_wordpress_paginator(feedURI);
Libingester.util.fetch_rss_entries(paginator, Infinity, 90)

This works, showing three months of posts! Let's temporarily change the maximum number of items to 3 for further development, in order to save time.
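
If you are curious what paginating a WordPress feed looks like, the sketch below builds the address of a given feed page using WordPress's paged query parameter. The helper wordpressFeedPage() is hypothetical and is not libingester's API; how create_wordpress_paginator() works internally is not shown in this tutorial.

```javascript
// Hypothetical helper: returns the URI of page `page` of a WordPress feed,
// using the `paged` query parameter that WordPress feeds understand.
function wordpressFeedPage(feedURI, page) {
    const uri = new URL(feedURI);
    uri.searchParams.set('paged', String(page));
    return uri.href;
}

console.log(wordpressFeedPage('https://creativecommons.org/blog/feed/', 2));
// prints "https://creativecommons.org/blog/feed/?paged=2"
```

Opening such a URI in a browser is a quick way to check that a feed really is paginated before relying on the paginator.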

NOTE: There are also environment variables that control these parameters. It would be much more convenient to use the environment variables once you are using the ingester in production, but for this tutorial, we'll edit the values in the code.

Obtaining metadata

Now that we have RSS entries, we can download the posts, and get their metadata (such as title, author, and tags).

To see the HTML we're dealing with and get a sense of how much cleaning must be done, let's get our ingestArticle() function to write out some basic metadata from the RSS feed and save a copy of the HTML:

const util = require('util');
const writeFile = util.promisify(require('fs').writeFile);
// ...
async function ingestArticle({link, title, author, date}, ix) {
    let $ = await Libingester.util.fetch_html(link);
    console.log('-'.repeat(40));
    console.log(`TITLE: ${title}`);
    console.log(`AUTHOR: ${author}`);
    console.log(`URL: ${link}`);
    console.log(`DATE PUBLISHED: ${date}`);

    await writeFile(`${ix}.html`, $.html());
}

Libingester.util.fetch_html returns a Cheerio object. Cheerio is a library for DOM manipulation without a browser, and its interface closely follows jQuery's.

Examining the written-out HTML in 0.html, 1.html, and 2.html shows that this blog has easily available metadata in its <head>, which follows the Open Graph protocol:

<meta property="og:locale" content="en_US" />
<meta property="og:type" content="article" />
<meta property="og:title" content="Katherine Maher, Ruth Okediji, Chris Bourg to Keynote Creative Commons Global Summit - Creative Commons" />
<meta property="og:description" content="We’re super excited to announce our keynote speakers for the 2018 CC Global Summit from April 13-15 in Toronto." />
<meta property="og:url" content="https://creativecommons.org/2018/01/29/summit-keynotes-2018/" />

We can quite easily get the metadata using Cheerio:

const imageURI = $('meta[property="og:image"]').attr('content');
const synopsis = $('meta[property="og:description"]').attr('content');
const lastModified = $('meta[property="article:modified_time"]')
    .attr('content');
console.log(`IMAGE URI: ${imageURI}`);
console.log(`SYNOPSIS: ${synopsis}`);
console.log(`LAST MODIFIED: ${lastModified}`);
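
Keep in mind that .attr('content') returns undefined when a page has no matching <meta> tag, and that the value is a plain string when it does. The sketch below shows one way to turn such a value into a Date, or null if it is missing or malformed. The helper parseMetaDate() is hypothetical, and whether libingester's setters expect a Date or accept a string is an assumption you should check against its documentation.

```javascript
// Hypothetical helper: converts the string from a <meta> tag such as
// article:modified_time into a Date. Returns null if the tag was missing
// (undefined) or the value cannot be parsed.
function parseMetaDate(value) {
    if (!value)
        return null;
    const parsed = new Date(value);
    return Number.isNaN(parsed.getTime()) ? null : parsed;
}

console.log(parseMetaDate('2018-01-29T17:00:00+00:00'));  // a valid Date
console.log(parseMetaDate(undefined));                    // null
console.log(parseMetaDate('not a date'));                 // null
```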

Tags are slightly more complicated. WordPress distinguishes predefined "categories" from free-form "tags". The data model for offline content that we use here allows tagging content with an array of strings. These tags can be any string and are not visible to users of the app. Later on, when we define the content structure for the app, we can create a "set" that is visible in the app's UI. A set includes any number of tags, and can optionally be marked "featured" to make it more prominent in the app. (You'll learn more about this later in the walkthrough.)

Later on we will turn WordPress categories into featured sets and WordPress tags into non-featured sets. For now, we will prefix the tag IDs of WordPress tags with tag: and leave the WordPress categories as-is, so that we'll know what to do later.

const wpCategory = $('meta[property="article:section"]')
    .attr('content');
const wpTags = $('meta[property="article:tag"]')
    .map(function () { return $(this).attr('content'); })
    .get();
const tags = wpTags.map(t => `tag:${t}`);
tags.unshift(wpCategory);
console.log(`TAGS: ${tags}`);
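
The same tagging scheme can be written as a small pure function, which makes it easy to try out without downloading anything. This is only a sketch: buildTags() is a hypothetical helper, and skipping a missing category and removing duplicates are additions that the snippet above does not make.

```javascript
// Hypothetical helper: combines a WordPress category and WordPress tags
// into one list of tag IDs. Tags get a "tag:" prefix so that a tag and a
// category with the same name stay distinguishable.
function buildTags(category, wpTags) {
    const tags = wpTags.map(t => `tag:${t}`);
    // A post without an article:section <meta> tag yields undefined here
    if (category)
        tags.unshift(category);
    // Drop duplicates while keeping the original order
    return [...new Set(tags)];
}

console.log(buildTags('Events', ['summit', 'keynote']));
// prints [ 'Events', 'tag:summit', 'tag:keynote' ]
console.log(buildTags(undefined, ['summit']));
// prints [ 'tag:summit' ]
```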

Cleaning up the blog posts

If you're creating an app from a website's content, it's incongruous when it looks like the whole website is embedded inside the app. We want to get rid of all the theming and navigation elements on the web page, so that we can later design a more unified style for our app's UI elements and its content.

Although it would be possible to do this with traditional web scraping and DOM manipulation tools such as Cheerio, we are going to use Fathom to clean up the content. Fathom is a fairly new technology that allows you to interpret an HTML document based on "rules" that you write.

npm install --save cheerio fathom-web jsdom

NOTE: Both libingester and Fathom require a DOM library for processing the HTML. libingester works internally with Cheerio, which is much faster than JSDOM. Unfortunately, Fathom requires some features that Cheerio doesn't provide, so in our ingester we have to convert from one to the other and back again.

We use a rule set based on Fathom's example "Readability" ruleset, which scores paragraphs and paragraph-like HTML nodes based on their likelihood to be part of the main text, then picks the biggest cluster of high-scoring nodes.

const rules = ruleset(
    // Isolate the actual blog post body text. Based on Fathom's example
    // Readability rules
    rule(dom('p,li,ol,ul,code,blockquote,pre,h1,h2,h3,h4,h5,h6'),
        props(scoreByLength).type('paragraphish')),
    rule(type('paragraphish'), score(byInverseLinkDensity)),
    rule(dom('p'), score(4.5).type('paragraphish')),

    // Tweaks for this particular blog
    rule(type('paragraphish'), score(hasAncestor('article', 10))),
    rule(dom('.entry-summary p'), score(0).type('paragraphish')),
    rule(dom('figure'), props(scoreByImageSize).type('paragraphish')),

    // Find the best cluster of paragraph-ish nodes
    rule(
        type('paragraphish').bestCluster({
            splittingDistance: 3,
            differentDepthCost: 6.5,
            differentTagCost: 2,
            sameTagCost: 0.5,
            strideCost: 0,
        }),
        out('content').allThrough(Futils.domSort)));

// ...

const dom = JSDOM.jsdom($.html(), {
    features: {ProcessExternalResources: false},
});
const facts = rules.against(dom);
const html = facts.get('content')
    .filter(fnode => fnode.scoreFor('paragraphish') > 0)
    .map(fnode => fnode.element.outerHTML).join('');
await writeFile(`${ix}.html`, `<article>${html}</article>`);

For brevity, we've omitted the imports and the functions scoreByLength(), scoreByImageSize(), byInverseLinkDensity(), and hasAncestor(). If you're running your own program as you're following along, you can find them in the full code.
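
To give a feel for what such scoring helpers compute, here is a sketch of the underlying arithmetic on plain strings and numbers. These functions are hypothetical stand-ins, not the omitted functions from the full code and not part of Fathom: the real helpers operate on Fathom nodes and may weight things differently.

```javascript
// Hypothetical stand-in: longer runs of text are more likely to be body text.
function lengthScore(text) {
    return text.trim().length;
}

// Hypothetical stand-in: returns a factor between 0 and 1. A node whose text
// is almost entirely inside links (such as a navigation menu) gets a factor
// near 0, while a node with few or no links gets a factor near 1.
function inverseLinkDensity(textLength, linkTextLength) {
    if (textLength === 0)
        return 0;
    return 1 - Math.min(linkTextLength, textLength) / textLength;
}

console.log(inverseLinkDensity(100, 0));    // prints 1
console.log(inverseLinkDensity(100, 25));   // prints 0.75
console.log(inverseLinkDensity(100, 100));  // prints 0
```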

Running our ingester with npm start now shows the HTML in 0.html, 1.html, and 2.html to be quite minimal. We'll do some final cleanup with Cheerio, removing unused DOM attributes, to keep the sizes of the posts as small as possible. After all, our app may be downloaded by people with limited internet connections, so we should try to optimize for that.

$ = Cheerio.load('<article>');
$('article').append(html);

const all = $('*');
all.removeAttr('class');
all.removeAttr('style');
const imgs = $('img');
['attachment-id', 'comments-opened', 'image-description', 'image-meta',
    'image-title', 'large-file', 'medium-file', 'orig-file',
    'orig-size', 'permalink']
    .forEach(data => imgs.removeAttr(`data-${data}`));
imgs.removeAttr('srcset');  // For simplicity, only use one size
imgs.removeAttr('sizes');

Or even better, just use the util.cleanup_body() utility. Its defaults already do all of the above and more, and you can override or extend them if the pages you are ingesting need customized cleanup.

$ = Cheerio.load('<article>');
$('article').append(html);

Libingester.util.cleanup_body($.root());

Creating a hatch

We're now going to render the ingested pages into one of libingester's article formats, and then save the result in a hatch. An article format is a class that we can pass HTML snippets into, and it will render a nice-looking page that's suitable for embedding into our app. A hatch is libingester's term for the packaged-up content: it's a hatch that you open and drop content ("assets") into, and from there it goes for further processing.

To create the hatch, we change our main() function a bit, and change our ingestArticle() function to take the hatch as its first parameter.

async function main() {
    const hatch = new Libingester.Hatch('cc-blog', 'en');
    const paginator = Libingester.util.create_wordpress_paginator(feedURI);
    const items = await Libingester.util.fetch_rss_entries(paginator,
        Infinity, 90);
    await Promise.all(items.map(entry => ingestArticle(hatch, entry)));
    hatch.finish();
}
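
One thing to be aware of: with Promise.all(), a single post that fails to download or parse rejects the whole batch. If you would rather skip the failing post and keep the rest, Promise.allSettled() is one option. The sketch below is a possible variation, not what the tutorial's full code does; ingestAll() and the fake ingest function are hypothetical.

```javascript
// Hypothetical helper: runs `ingestOne` on every entry, waits for all of
// them to finish, and returns the entries that failed with their errors.
async function ingestAll(entries, ingestOne) {
    const results = await Promise.allSettled(
        entries.map(entry => ingestOne(entry)));
    const failures = [];
    results.forEach((result, ix) => {
        if (result.status === 'rejected')
            failures.push({entry: entries[ix], reason: result.reason});
    });
    return failures;
}

// Usage with a fake ingest function that fails for one entry
async function fakeIngest(entry) {
    if (entry === 'broken-post')
        throw new Error('download failed');
}

ingestAll(['post-1', 'broken-post', 'post-2'], fakeIngest)
    .then(failures => {
        failures.forEach(({entry, reason}) =>
            console.log(`Skipped ${entry}: ${reason.message}`));
    });
```

In a real ingester, ingestOne would be something like entry => ingestArticle(hatch, entry), and you would finish the hatch after ingestAll() resolves.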

We're going to use one of libingester's predefined article formats, but it's also possible to write your own. The article format we'll use is called BlogArticle. The BlogArticle is itself an asset which we'll drop into the hatch. Here we create it, and give it all the metadata that we determined earlier:

const postAsset = new Libingester.BlogArticle();
postAsset.set_title(title);
postAsset.set_synopsis(synopsis);
postAsset.set_canonical_uri(link);
if (lastModified)
    postAsset.set_last_modified_date(lastModified);
postAsset.set_date_published(date);
postAsset.set_author(author);
postAsset.set_tags(tags);

We include some static metadata such as the license, which we know because the whole blog is under that license:

postAsset.set_license('CC BY 4.0 International');
postAsset.set_read_more_text(`"${title}" by ${author}, used under CC BY 4.0 International / Reformatted from original`);

We also have to convert all the post's images to assets and drop them into the hatch. After all, we are packaging up content for offline viewing, so all the images have to be offline as well.

We take the imageURI we determined earlier and use that as the post's thumbnail image:

const thumbnailAsset = Libingester.util.download_image(imageURI);
hatch.save_asset(thumbnailAsset);
postAsset.set_thumbnail(thumbnailAsset);

Next, we pick a "main image" for the post. The main image is a feature of the BlogArticle format: one image can be highlighted and given special placement at the top of the post. We'll take a guess that the first <figure> element in the post would make a good main image. We can always amend our guess later when we refine the ingester.

const figures = $('figure');
if (figures.length) {
    const main = figures.first();
    const img = $('img', main);
    const mainImageAsset = Libingester.util.download_img(img, baseURI);
    hatch.save_asset(mainImageAsset);

    postAsset.set_main_image(mainImageAsset);
    postAsset.set_main_image_caption($('figcaption', main).text());

    $(main).remove();
}

Note that we have to remove the main image from the article body, according to the documentation of set_main_image(), or else it will get rendered twice.

Then, we drop all the other image assets into the hatch:

$('figure').each(function () {
    const img = $('img', this);
    const figureAsset = Libingester.util.download_img(img, baseURI);
    hatch.save_asset(figureAsset);
});
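
The baseURI variable used in these snippets is not defined above. The sketch below assumes it holds the address of the post's web page, and shows why a base URI is needed at all: image src attributes are often relative and must be resolved before they can be downloaded. The helper resolveImageURI() is hypothetical, and how download_img() uses its second argument internally is an assumption.

```javascript
// Hypothetical helper: resolves an <img> src attribute against the URI of
// the page it appeared on. Absolute URIs are returned unchanged.
function resolveImageURI(src, baseURI) {
    return new URL(src, baseURI).href;
}

// Assumption: the base URI is the post's own address
const postURI = 'https://creativecommons.org/2018/01/29/summit-keynotes-2018/';

console.log(resolveImageURI('/wp-content/uploads/photo.png', postURI));
// prints "https://creativecommons.org/wp-content/uploads/photo.png"
console.log(resolveImageURI('photo.png', postURI));
// prints "https://creativecommons.org/2018/01/29/summit-keynotes-2018/photo.png"
```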

Finally, we give what remains of the HTML to the post asset, and drop that into the hatch too:

postAsset.set_body($);
postAsset.render();

hatch.save_asset(postAsset);

Putting it all together

Now, since we're done testing, set the maximum number of posts back to Infinity in order to get the full 90 days of content.

If you've followed along and typed in or copied the code examples, you have the full ingester now. If you didn't, or something's not working right, then check the full code here.

When you run this ingester with npm start, you should see the posts being processed one by one, and then the hatch will be saved in the current directory. The hatch is a .tar.gz file, but libingester leaves an uncompressed directory full of .data and .metadata files, and a manifest. You can inspect these files to see what's in them, but in the next section we will use a tool called Hatch Previewer for that.

Further remarks and ideas

If you're packaging up content that isn't organized in an RSS feed, or the RSS feed isn't paginated, you might have to use web scraping. There are other packages available from NPM that can help you write a web scraper.

It's not required to download the content from online! If you have a local archive of content, you can simply write your ingester to process it from your local archive and put it into the hatch.
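
As a starting point for a local archive, the sketch below lists the HTML files in a directory using only Node's standard library. The helper listLocalPosts() and the directory name are hypothetical; turning each file into an asset would then follow the same steps as for downloaded posts.

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical helper: returns the paths of all .html files directly inside
// `archiveDir`, sorted by file name.
function listLocalPosts(archiveDir) {
    return fs.readdirSync(archiveDir)
        .filter(name => path.extname(name).toLowerCase() === '.html')
        .sort()
        .map(name => path.join(archiveDir, name));
}

// Example usage, assuming a directory called "archive" exists:
// listLocalPosts('archive').forEach(file => {
//     const html = fs.readFileSync(file, 'utf8');
//     // ... load `html` with Cheerio and ingest it as before
// });
```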

Make sure that you have the rights to modify and redistribute any content that you turn into an app! Most websites' copyright belongs to the website's owner. To make an app, you should use your own website, or make sure that the content is freely redistributable, like Wikipedia and its affiliated sites.

Subpages

Full code

The full ingester code

License and Copyright info

Documentation in this page is licensed under a Creative Commons Attribution-ShareAlike 4.0 International license, unless otherwise noted.

Code snippets in this page are licensed under a Creative Commons Zero 1.0 Universal license, unless otherwise noted.
