maoserr's comments

maoserr · on Oct 21, 2024

Hey I want to address why this extension is different from other scrapers.

This is for ad hoc generation of EPub from websites that don't have scrape well using traditional scrapers (think standard request based command line scripts or some other chrome extensions that scrape based on open tabs/window) for some reasons:

1. Usually command line scrapers and other extensions have predefined sites they work for, this one's outside of those sites

2. Or they requires nontrivial configuration and/or code

3. Some sites use javascript to dynamically generates/retrieve the text, in which case you need the browser to run the JS - This was the biggest gap for me.

4. This one runs in the browser, so maybe less likely to be detected and blocked

I also don't intend this scraper to be robust or used in a repeated fashion as a background scheduled job, that's why there's a UI for selecting key elements for scraping. It's meant to be more generalized so that you don't have to have a configuration for a site to still be able to scrape it relatively easily with just some mouse clicks.

If the site you're scraping is already handled by the other programs/extensions, then this wouldn't perform better since the other ones are specifically configured for those sites. Otherwise, this extension gives you the tool to scrape something once or twice without spending too much time coding/configuring.

I don't find myself sticking to the same site a lot, so wrote this.

maoserr · on Oct 21, 2024

You can import a csv of all the chapter links, looks like it's just incremental numbering in the url

t-3 · on Oct 21, 2024

The issue is most likely cloudflare blocking most the best scraping methods. If the site can be pulled down with eg. wget or curl without a bunch of options that you definitely aren't writing by hand, pandoc can just be used to directly make an epub.

maoserr · on Oct 21, 2024

It extract the main content using Readability by default (you can configure it with something else). Logins would depend on how you're parsing. It has two modes, it either browses to the page inside the window (for non-refreshing pages), or retrieves it in the background using fetch.

dotancohen · on Oct 21, 2024

Terrific, thank you.

maoserr · on Oct 21, 2024

Works on this site: https://docs.ray.io/en/latest/ for me.

maoserr · on April 15, 2022

I wrote something similar: https://github.com/maoserr/epublifier

It's more geared towards longer web novels with 50+ chapters (I've used it on novels with 500 chapters before). Instead of opening each page as a tab, it fetches chapters from a Table of Contents page.

It was written for jnovel/cnovel/knovel site, but it can handle any generic page that has a list of links.

gfaster · on April 15, 2022

I also wrote an alternative solution (not a public repo), but I found that relying on site maps and other link lists generally gave unsatisfactory results. Instead, my solution navigated as a user and actually used next chapter links. While that slowed it down (+ 10 seconds between requests to be polite), it could handle very large books, with the largest I used being 700+ chapters at the time (5000 pages).

slopdo · on April 15, 2022

This is almost the same aproach I used for Bloxp[0]. I have some common Previous Post link markups and I try to navigate from the last post in a blog, one by one, to the first. I also allow to manually indicate the HTML markup to use for crawling a given blog, in case it is not matching any of the common ones.

I uploaded the site 10 years ago (at first I did it because it was useful to me) and I have made almost no changes since then but many people still use it as a simple way to export a full blog into an ePub.

[0] http://www.bloxp.com

maoserr · on April 15, 2022

Yea I also had that idea before, but I didn't want to maintain a bunch of different "next chapter" finder logic.

But I do agree it would be a more reliable way of doing things.

TakerofVita · on April 15, 2022

I've written similar scripts to do this, but lncrawl replaced most of them https://github.com/dipu-bd/lightnovel-crawler/

maoserr · on Aug 6, 2010

The homepage is here: http://www.mao-yu.com/projects/redshiftgui/, but there's not much useful info there.