As the author, it's a stretch to say that JustHTML is a port of html5ever. While you're right that this was part of the initial prompt, the code is very different, which typically isn't what counts as a "port". Your mileage may vary.
Some tags require end tags, others do not. Personally I find it hard to remember which ones, so I just close everything out of caution. That way you're always spec-correct.
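For what it's worth, you can see the omission in the raw markup with Python's stdlib `html.parser`, which reports only the tags literally present in the source (a spec-compliant tree builder would then imply the missing `</li>`s):

```python
from html.parser import HTMLParser

class TagCounter(HTMLParser):
    # Records explicit start/end tags exactly as they appear in the source.
    def __init__(self):
        super().__init__()
        self.starts, self.ends = [], []

    def handle_starttag(self, tag, attrs):
        self.starts.append(tag)

    def handle_endtag(self, tag):
        self.ends.append(tag)

p = TagCounter()
p.feed("<ul><li>one<li>two</ul>")
print(p.starts)  # ['ul', 'li', 'li']
print(p.ends)    # ['ul'] - the li end tags were never written
```

The markup is still valid HTML: `</li>` is on the spec's optional end tag list, while tags like `</div>` are not.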
Thanks for flagging this. Found multiple errors that are now fixed:
- The quoted test comes from justhtml-tests, a custom test suite added to make sure all parts of the algorithm are tested. It is not part of html5lib-tests.
- html5lib-tests does not support control characters in tests, which is why some of the tests in justhtml-tests exist in the first place. I've added that ability to the test runner to make sure we handle control characters correctly.
- In the INCOMING HTML block above, we are not printing control characters; they get filtered out by the terminal.
- Both the treebuilder and the tokenizer are outputting errors for the found control character. Neither is emitted at the right location (at flush time instead of where the character was found), and they are also duplicates of each other.
- This being my own test suite, I haven't specified the correct errors yet; I should. `expected-doctype-but-got-start-tag` is reasonable in this case.
All of the above bugs are now fixed, and the test suite is in a better shape. Thanks again!
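As an aside on the control-character point above: one way to represent control characters in plain-text test files is backslash escapes that the runner decodes before feeding input to the parser. This is a hypothetical sketch, not the actual justhtml-tests format:

```python
import re

def decode_escapes(s: str) -> str:
    # Turn literal "\xNN" escapes from a test file into real control characters.
    return re.sub(r"\\x([0-9a-fA-F]{2})",
                  lambda m: chr(int(m.group(1), 16)), s)

raw = r"<p>a\x0cb</p>"          # as written in the test file
decoded = decode_escapes(raw)   # now contains a real U+000C form feed
print(repr(decoded))
```

The same trick works for any scheme where the on-disk test format has to stay printable ASCII.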
Hi! The expected errors are not standardized enough for it to make sense to enable --check-errors by default. If you look at the readme, you'll see that the only thing they're checking is that the _numbers of errors_ are correct.
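In other words, a count-only check amounts to something like the following (a hypothetical sketch for illustration, not run_tests.py's actual code):

```python
def errors_match(actual: list[str], expected: list[str]) -> bool:
    # Count-only comparison: passes whenever the totals agree,
    # even if the error codes or positions differ completely.
    return len(actual) == len(expected)

# These "match" despite being entirely different errors.
print(errors_match(["unexpected-null-character"],
                   ["duplicate-attribute"]))  # True
```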
run_tests.py does not appear to be checking the number of errors, or the errors themselves, for the tokenizer, encoding, or serializer tests from html5lib-tests - which represent the majority of tests.
There's also something off about your benchmark comparison. If one runs pytest on html5lib, which uses html5lib-tests plus its own unit tests and does check whether errors match exactly, the pass rate appears to be much higher than 86%:
These numbers are inflated because html5lib-tests/tree-construction tests are run multiple times in different configurations. Many of the expected failures appear to be script tests similar to the ones JustHTML skips.
I've checked the numbers for html5lib, and they are correct. They are skipping a load of tests for many different reasons, one being that namespacing of svg/math fragments are not implemented. The 88% number listed is correct.
I think the reason this was an evening project for Simon is the code and the tests in conjunction. Removing either one would at least 10x the effort, is my guess.
The biggest value I got from JustHTML here was the API design.
I think that represents the bulk of the human work that went into JustHTML - it's really nice, and lifting it directly is what let me build my library almost hands-off and end up with a good result.
Without that I would have had to think a whole lot more about what I was doing here!
See also the demo app I vibe-coded against their library here: https://tools.simonwillison.net/justhtml - that's what initially convinced me that the API design was good.
- Bleach-like sanitization feature built in and enabled by default
- Transforms API for simple HTML mutations
- Revamped docs
- Playground powered by Pyodide (thanks for the idea simonw!)