klodolph's comments | Hacker News

I already think that markdown is barely ok for writing documentation, and the experience of plugins in mdBook is why I tell people not to use it (edit: it = mdBook). The base flavor of mdBook is minimalistic. Maybe that’s a good thing, that you’re given a minimalistic markdown as a starting point? But if it’s minimalistic, then it’s certainly missing some things that I’d want to use in the documentation I write, and the experience of using plugins is, well, not very good.

My current recommendations are MkDocs (material theme), Jekyll, and Docusaurus. Hugo gets a qualified recommendation, and I only recommend mdBook for people who are doing Rust stuff.


Can you speak to which features were missing out of the box and which plugins gave you trouble? I'm not sure how "people who are doing Rust stuff" would specifically get more out of it either. Are you implying you can't just use the tool/plugins without familiarity with Rust? That's not my experience.

What is missing from markdown? mdBook also uses the GitHub-flavored dialect in some places, so you can create notes [1] and similar. On top of that, you can add support for Mermaid diagrams.

Personally, I don't think you need more than that for 90% of the documentation, but I'm happy to hear more about your use case.

[1]: https://github.com/orgs/community/discussions/16925


Markdown is devalued as a format because of the bizarre shortage of Markdown VIEWERS. You find Markdown documents in every open-source project, and you always wind up viewing them with all the embedded formatting characters. Why?

Why provide documentation in a format that is so poorly supported by READERS? Or, to respect the chicken-&-egg problem here: Why is there such a shortage of Markdown viewers?

Every time this comes up, respondents always cite this or that EDITOR that has Markdown "preview." NO. We're not editing the documentation; we're just reading it. So why do we have to load the document into an editor and then invoke a "preview" of it? Consider how nonsensical the term "preview" is in that case: What are we "previewing" its appearance in, given the dearth of Markdown readers?


If you're looking for a CLI markdown reader then I'd recommend glow.

https://github.com/charmbracelet/glow


Thanks. But that's even stranger! I just want to double-click a Markdown file in my OS's file browser and read it, with formatting applied.

Ahh, if you're on KDE then Okular should do the trick; it should even update live when you edit the markdown file.

Thanks!

I hope it’s like what happened to countries like England, France, and Spain. You see your empire collapse but the country itself remains intact.

England “gave up” scientific and technological leadership during the 20th century. (That’s a tongue-in-cheek take on it, don’t read too much into it.)


It worked out well for Europe because the country that took over its position of leadership post-WW2 (the USA) was aligned with it in all ways (politically, culturally, scientifically, economically), and so (western) European countries could still enjoy all the benefits. It will not be the case this time around, because the next generation of innovation and leadership is going to come from China.

I think that is the most likely outcome. However, if the decline starts occurring too rapidly, I do think violent far-right (and perhaps far-left) paramilitary action could become a major problem, like in 1920s/1930s Germany. Tons of time spent lurking in far-right extremist communities out of morbid curiosity, and the spread of far-right ethnosupremacist sentiment on basically every social media platform, have me concerned.

The good news is those people are fundamentally absolute losers.

Yes, nearly all of them absolutely are. (I have talked to many of them and they really truly are.) That fact does genuinely assuage my concerns. Still, I do wonder if a future charismatic far-right politician who does not come across as a loser could do far better than previous generations ever could have predicted. The worst possible person at the worst possible time.

Yes, but Spain, England, and France all had decades-long declines that reversed. Except, you know, at the end. When it didn't reverse.

We are witnessing the end of... something. Is it the end of the Roman Republic or is this the end of the Roman Empire?

Two very different situations, despite both being so politically fraught and full of change.


> England “gave up” scientific and technological leadership during the 20th century. (That’s a tongue-in-cheek take on it, don’t read too much into it.)

Was forced to give up, due to the economic devastation of WWII, might be more accurate (though of course there were other factors too).


I feel like I grokked Perl enough and I still write Perl code, but I also think there are some technical reasons why it declined in popularity in the 2000s and 2010s: all those differences between $, %, and @, the idea of scalar versus list context, overuse of globals, and references. These features can all make sense if you spend enough time in Perl, and can even be defended, but together they create a never-ending stream of code that looks right but is wrong, and it all seems to add a lot of complexity for very little benefit.


I think a reasonable solution is “people who find the answer should observe that the question was asked eight years ago, and certainly double-check the answer”. If it’s a question about company internal codebases or operations, then you should have access to see the code or resources the answer is talking about.


I have an overly reductive take on this—it’s Unix environment variables.

You have your terminal window and your .bashrc (or equivalent), and that sets a bunch of environment variables. But your GUI runs with, most likely, different environment variables. It sucks.

And here’s my controversial take on things—the “correct” resolution is to reify some higher-level concept of environment. Each process should not have its own separate copy of environment variables. Some things should be handled… ugh, I hate to say it… through RPC to some centralized system like systemd.
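
To make the "separate copy per process" problem concrete, here's a rough Python sketch (DEMO_PATH is a made-up variable name, and the systemd part assumes you're in a systemd user session):

    import os
    import subprocess

    # Each process gets its own snapshot of the environment at spawn time; a
    # child never sees changes the parent makes afterwards, and vice versa.
    os.environ["DEMO_PATH"] = "/opt/demo/bin"
    subprocess.run(["sh", "-c", "echo child sees: $DEMO_PATH"])

    # The closest thing to a "centralized" environment today: push the variable
    # into the systemd user manager, so user services started later inherit it.
    subprocess.run(["systemctl", "--user", "set-environment",
                    "DEMO_PATH=" + os.environ["DEMO_PATH"]])
    subprocess.run(["systemctl", "--user", "show-environment"])

That only helps things started afterwards, though; everything already running (your GUI session included) keeps whatever copy it was born with.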


The Windows registry, just sort of hovering in the background.


Something that is still inheritable, between “there is one and it is global” and “there is a separate copy for each process”.


Bugs can get introduced for other reasons besides “feature not completed”.


> Langan has not produced any acclaimed works of art or science. In this way, he differs significantly from outsider intellectuals like Paul Erdös, Stephen Wolfram, Nassim Taleb, etc.

Paul Erdős is the only outsider intellectual on that list, IMO.

(Also note that ő and ö are different!)


Can you even be called an "outsider" when everyone who recognizes the name associates it with "eccentric but well respected mathematician who was well liked enough in the community that people would regularly let him sleep in their homes for days on end"? According to his wikipedia page, Erdős collaborated with hundreds of other mathematicians. That's the very opposite of being an outsider IMO.


I think, of those two, agentic crawlers are worse.


That sounds like the desired outcome here. Your agent should respect robots.txt, OR it should be designed to not follow links.


An agent acting on my behalf, following my specific and narrowly scoped instructions, should not obey robots.txt because it's not a robot/crawler. Just like how a single cURL request shouldn't follow robots.txt. (It also shouldn't generate any more traffic than a regular browser user)

Unfortunately "mass scraping the internet for training data" and an "LLM powered user agent" get lumped together too much as "AI Crawlers". The user agent shouldn't actually be crawling.


Confused as to what you're asking for here. You want a robot acting out of spec, to not be treated as a robot acting out of spec, because you told it to?

How does this make you any different than the bad faith LLM actors they are trying to block?


robots.txt is for automated, headless crawlers, NOT user-initiated actions. If a human directly triggers the action, then robots.txt should not be followed.


But what action are you triggering that automatically follows invisible links? Especially links that aren't meant to be followed and that come with text saying not to follow them.

This is not banning you for following <h1><a>Today's Weather</a></h1>

If you are a robot that's so poorly coded that it is following links it clearly shouldn't, links that are explicitly enumerated as not to be followed, that's a problem. From an operator's perspective, how is this different from the case you described?

If a googler kicked off the googlebot manually from a session every morning, should they not respect robots.txt either?


I was responding to someone earlier saying a user agent should respect robots.txt. An LLM powered user-agent wouldn't follow links, invisible or not, because it's not crawling.


It very feasibly could. If I made an LLM agent that clicks on a returned element, and that element was this trapdoored link, that would happen.


There's a fuzzy line between an agent analyzing the content of a single page I requested, and one making many page fetches on my behalf. I think it's fair to treat an agent that clicks an invisible link as a robot/crawler since that agent is causing more traffic than a regular user agent (browser).

Just trying to make the point that an LLM powered user agent fetching a single page at my request isn't a robot.


You're equating asking Siri to call your mom to using a robo-dialer machine.


If your specific and narrowly scoped instructions cause the agent, acting on your behalf, to click that link that clearly isn't going to help it--a link that is only being clicked by the scrapers because they are blindly downloading everything they can find without any real goal--then, frankly, you might as well be blocked also, as your narrowly scoped instructions must literally have been something like "scrape this website without paying any attention to what you are doing"; an actual agent--just like an actual human--wouldn't find or click that link (and that this is true has nothing at all to do with robots.txt).


If it's a robot it should follow robots.txt. And if it's following invisible links it's clearly crawling.

Sure, a bad site could use this to screw with people, but bad sites have done that since forever in various ways. But if this technique helps against malicious crawlers, I think it's fair. The only downside I can see is that Google might mark you as a malware site. But again, they should be obeying robots.txt.


Your web browser is a robot, and always has been. Even using netcat to manually type your GET request is a robot in some sense, as you have a machine translating your ASCII and moving it between computers.

The significant difference isn't in whether a robot is doing the actions for you or not, it's whether the robot is a user agent for a human or not.


Should cURL follow robots.txt? What makes browser software not a robot? Should `curl <URL>` ignore robots.txt but `curl <URL> | llm` respect it?

The line gets blurrier with things like OAI's Atlas browser. It's just reskinned Chromium, a regular browser, but you can ask an LLM about the content of the page you just navigated to. The decision to use the LLM on that page is made after the page loads. Doing the same thing without rendering the page doesn't seem meaningfully different.

In general robots.txt is for headless automated crawlers fetching many pages, not software performing a specific request for a user. If there's 1:1 mapping between a user's request and a page load, then it's not a robot. An LLM powered user agent (browser) wouldn't follow invisible links, or any links, because it's not crawling.
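
To make that distinction concrete, here's a rough Python sketch (example.com, the user-agent string, and the crude regex link extraction are all just placeholders):

    import re
    from urllib.parse import urljoin
    from urllib.request import urlopen
    from urllib.robotparser import RobotFileParser

    BASE = "https://example.com/"  # placeholder site

    # User-initiated fetch: one page, mapped 1:1 to a human request
    # (analogous to `curl <URL>`); robots.txt doesn't come into it.
    html = urlopen(urljoin(BASE, "some/page")).read().decode("utf-8", "replace")

    # Crawler behavior: automatically following links discovered on that page.
    # This is what robots.txt governs, so check it before every further fetch.
    robots = RobotFileParser(urljoin(BASE, "robots.txt"))
    robots.read()
    for href in re.findall(r'href="([^"]+)"', html):
        url = urljoin(BASE, href)
        if robots.can_fetch("ExampleCrawler/1.0", url):
            urlopen(url)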


How did you get the URL for curl? Do you personally look for hidden links in pages to follow? This isn't an issue for people looking at the page; it's only a problem for systems that automatically follow all the links on a page.


Yeah, I think the context for my reply got lost. I was responding to someone saying that an LLM-powered user agent (browser) should respect robots.txt. And it wouldn't be clicking the hidden link because it's not crawling.


Maybe your agent is smart enough to determine that going against the wishes of the website owner can be detrimental to your relationship with that owner, and therefore to the likelihood of the website continuing to exist, so it is prioritizing your long-term interests over your short-term ones.


How does a server tell an agent acting on behalf of a real person from the unwashed masses of scrapers? Do agents send a special header or token that other scrapers can't easily copy?

They get lumped together because they're more or less indistinguishable and cause similar problems: server load spikes, increased bandwidth, increased AWS bill ... with no discernible benefit for the server operator such as increased user engagement or ad revenue.

Now all automated requests are considered guilty until proven innocent. If you want your agent to be allowed, it's on you to prove that you're different. Maybe start by slowing down your agent so that it doesn't make requests any faster than the average human visitor would.
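
Something as simple as this would go a long way (a sketch; the five-second delay and the user-agent string are arbitrary placeholders, not any kind of standard):

    import time
    from urllib.request import Request, urlopen

    MIN_DELAY = 5.0   # seconds between requests; roughly human browsing pace
    _last_fetch = 0.0

    def polite_fetch(url: str) -> bytes:
        """Fetch a page no faster than a human reader would, with an
        identifiable user-agent so the operator can tell who's calling."""
        global _last_fetch
        wait = MIN_DELAY - (time.monotonic() - _last_fetch)
        if wait > 0:
            time.sleep(wait)
        _last_fetch = time.monotonic()
        req = Request(url, headers={
            "User-Agent": "ExampleAgent/0.1 (contact: you@example.com)",
        })
        return urlopen(req).read()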


The only real difference is that LLM crawlers tend not to respect /robots.txt, and some of them hammer sites with pretty heavy traffic.

The trap in the article has a link. Bots are instructed not to follow the link. The link is normally invisible to humans. A client that visits the link is probably therefore a poorly behaved bot.
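
If you wanted to build that kind of trap yourself, a minimal sketch with Flask might look something like this (the routes, paths, and in-memory ban list are all made up for illustration; a real setup would presumably ban at the proxy or firewall rather than in application code):

    from flask import Flask, abort, request

    app = Flask(__name__)
    banned = set()  # in-memory ban list, purely for illustration

    ROBOTS_TXT = "User-agent: *\nDisallow: /trap/\n"
    PAGE = (
        '<h1><a href="/weather">Today\'s Weather</a></h1>'
        # Invisible, disallowed link: humans never see it and well-behaved
        # bots are told not to follow it, so whoever fetches it is misbehaving.
        '<a href="/trap/do-not-follow" style="display:none" rel="nofollow"></a>'
    )

    @app.before_request
    def reject_banned_clients():
        if request.remote_addr in banned:
            abort(403)

    @app.route("/robots.txt")
    def robots():
        return ROBOTS_TXT, 200, {"Content-Type": "text/plain"}

    @app.route("/")
    def index():
        return PAGE

    @app.route("/trap/<path:rest>")
    def trap(rest):
        banned.add(request.remote_addr)
        abort(403)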

