
Added features since initial release:

- Bleach-like sanitization built in and enabled by default (see the sketch after this list)

- Transforms API for simple HTML mutations

- Revamped docs

- Playground powered by PyOdide (thanks for the idea, simonw!)
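
For readers who haven't used Bleach: the model is allowlist-based sanitization, where anything not explicitly allowed is escaped or dropped. A minimal sketch using Bleach itself, since it's the model here (JustHTML's built-in API differs; see its docs):

    import bleach

    dirty = '<a href="/ok" onclick="evil()">link</a><script>alert(1)</script>'

    # Only <a> tags survive, and only their href attribute; disallowed
    # tags are escaped (Bleach's default) rather than rendered.
    print(bleach.clean(dirty, tags=["a"], attributes={"a": ["href"]}))
    # -> <a href="/ok">link</a>&lt;script&gt;alert(1)&lt;/script&gt;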


I see the blog post mentions not being able to find the web100k dataset; it's here: https://github.com/EmilStenstrom/web100k

Thanks, Emil! I've updated the post with this link.

As the author, I think it's a stretch to say that JustHTML is a port of html5ever. While you're right that this was part of the initial prompt, the resulting code is very different, which is not what "port" typically means. Your mileage may vary.

To see the actual errors, just paste your HTML into the playground: https://emilstenstrom.github.io/justhtml/playground/ - any parsing errors show up below the input box.

Some tags require end tags, others do not. Personally I find it hard to remember which ones, so I close everything out of caution. That way you’re always spec-correct.
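
For example, </li> is optional per the spec: a conforming parser inserts the implied end tag during tree construction. A quick way to see that with html5lib (output shown approximately):

    import html5lib
    import xml.etree.ElementTree as ET

    # Each <li> is closed implicitly when the next one starts.
    tree = html5lib.parse("<ul><li>one<li>two</ul>", namespaceHTMLElements=False)
    print(ET.tostring(tree, encoding="unicode"))
    # -> <html><head /><body><ul><li>one</li><li>two</li></ul></body></html>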


Data-driven test suites are really good for building trust in a library. Both the html5lib-tests suite and my recent xss-bench are examples of this!
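
A minimal sketch of the pattern with pytest, assuming JSON files of {"input": ..., "expected": ...} cases (the real html5lib-tests formats are richer than this):

    import json
    from pathlib import Path

    import pytest

    def parse(html):
        """Stand-in for the parser under test."""
        raise NotImplementedError

    # Collect every case from every data file; each becomes its own
    # test, so a failure points at exactly one input.
    CASES = [
        case
        for path in sorted(Path("tests/data").glob("*.json"))
        for case in json.loads(path.read_text())
    ]

    @pytest.mark.parametrize("case", CASES)
    def test_data_driven(case):
        assert parse(case["input"]) == case["expected"]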


The reason for this was to be able to build trust in the new sanitization features of my other project: https://friendlybit.com/python/justhtml-sanitization/


Thanks for flagging this. Found multiple errors that are now fixed:

- The quoted test comes from justhtml-tests, a custom test suite added to make sure all parts of the algorithm are tested. It is not part of html5lib-tests.

- html5lib-tests does not support control characters in tests, which is why some of the tests in justhtml-tests exist in the first place. I have added that ability to the justhtml-tests runner to make sure we handle control characters correctly.

- In the INCOMING HTML block above, we are not printing control characters; they get filtered out by the terminal (see the sketch after this list).

- Both the treebuilder and the tokenizer were emitting errors for the control character they found. Neither error was in the right location (reported at flush instead of where the character was found), and they duplicated each other.

- This being my own test suite, I haven't specified the correct errors yet. I should; expected-doctype-but-got-start-tag is reasonable in this case.
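
On the terminal-filtering point above: terminals silently swallow or mangle many control characters, so repr() is the reliable way to make them visible:

    s = "abc\x0cdef"  # contains U+000C (form feed), a control character
    print(s)          # the terminal filters or mangles the control character
    print(repr(s))    # 'abc\x0cdef' -- repr() shows it explicitly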

All of the above bugs are now fixed, and the test suite is in better shape. Thanks again!


Hi! The expected errors are not standardized enough for it to make sense to enable --check-errors by default. If you look at the readme, you'll see that the only thing it checks is that the _number of errors_ is correct.

That said, the example you are pulling out does not match even that. I'll make sure to fix this bug and others like it! https://github.com/EmilStenstrom/justhtml/issues/20
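
In other words, the check is count-only, roughly like this sketch (the names are illustrative, not JustHTML's actual code):

    def errors_match(actual_errors, expected_errors, exact=False):
        """Count-only by default, which is what --check-errors does."""
        if exact:
            # Exact matching would need error codes and positions to be
            # standardized across parsers -- which they currently are not.
            return actual_errors == expected_errors
        return len(actual_errors) == len(expected_errors)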


run_tests.py does not appear to be checking the number of errors or the errors themselves for the tokenizer, encoding or serializer tests from html5lib-tests - which represent the majority of tests.

There's also something off about your benchmark comparison. If one runs pytest on html5lib, which uses html5lib-tests plus its own unit tests and does check if errors match exactly, the pass rate appears to be much higher than 86%:

    $ uv run pytest -v 
    17500 passed, 15885 skipped, 683 xfailed,
These numbers are inflated because html5lib-tests/tree-construction tests are run multiple times in different configurations. Many of the expected failures appear to be script tests similar to the ones JustHTML skips.


I've checked the numbers for html5lib, and they are correct. They are skipping a load of tests for many different reasons, one being that namespacing of svg/math fragments is not implemented. The 88% number listed is correct.


Excellent feedback. I'll have another look at how the html5lib tests are run.


I think the reason this was an evening project for Simon is both the code and the tests in conjunction. Removing either one would at least 10x the effort, is my guess.


The biggest value I got from JustHTML here was the API design.

I think that represents the bulk of the human work that went into JustHTML - it's really nice, and lifting that directly is the thing that let me build my library almost hands-off and end up with a good result.

Without that I would have had to think a whole lot more about what I was doing here!


Do you mind elaborating? By API design, do you mean how they structured their classes, methods, etc. or something else?


I mean the design of the user-facing API: https://github.com/EmilStenstrom/justhtml/blob/main/docs/api...

See also the demo app I vibe-coded against their library here: https://tools.simonwillison.net/justhtml - that's what initially convinced me that the API design was good.

I particularly liked the design of JustHTML's core DOM node: https://github.com/EmilStenstrom/justhtml/blob/main/docs/api... - and the design of the streaming API: https://github.com/EmilStenstrom/justhtml/blob/main/docs/api...


The other way around works as well! “Get me to 100% test coverage using only integration tests” is a fun prompt!
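
A sketch of enforcing that goal in CI, assuming pytest-cov is installed (the package name is illustrative):

    import pytest

    # Fails the run unless coverage reaches 100%.
    exit_code = pytest.main([
        "--cov=justhtml",             # package under test (illustrative)
        "--cov-report=term-missing",  # list any uncovered lines
        "--cov-fail-under=100",
    ])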

