big_whack's comments | Hacker News

Does email work on iOS?


This is one of the ostensible guiding principles of the Washington Post: "The newspaper’s duty is to its readers and to the public at large, and not to the private interests of its owners."


Independence from ownership isn't really a left/right principle. The Washington Post has its guiding principles posted online - I think from those it's pretty clear why there is discontent.

https://www.washingtonpost.com/about-the-post


It's not really a problem of there being combinatorially many ways to join table A to table B, but rather that unless the join is fully-specified those ways will mostly produce different results. Your tool would need to sniff out these ambiguous cases and either fail or prompt the user to specify what they mean. In either case the user isn't saved from understanding joins.


> ”but rather that unless the join is fully-specified those ways will mostly produce different results.”

As a SQL non-expert, I think this is why we are averse to joins, because it’s easy to end up with more rows in the result than you intended, and it’s not always clear how to verify you haven’t ended up in this scenario.


I wish there were a way to specify the expected join cardinality syntactically for 1:1 joins.

    SELECT * FROM order JOIN_ONE customer USING (customer_id)
Malloy has join_one: https://docs.malloydata.dev/documentation/language/join
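Absent such syntax in plain SQL, a rough way to assert the cardinality after the fact - just a sketch, with hypothetical orders/customer tables:

    -- should return zero rows if every order matches exactly one customer
    SELECT o.order_id, count(c.customer_id) AS matches
    FROM orders o
    LEFT JOIN customer c ON c.customer_id = o.customer_id
    GROUP BY o.order_id
    HAVING count(c.customer_id) <> 1;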


Don't have a suitable database to test on but I'm pretty sure standard SQL allows you to join on a subselect that limits its result to one.


You probably want to use LATERAL JOIN, which will compute the subquery for each left hand row. Otherwise it's computed just once.
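A sketch of that, assuming Postgres and the same hypothetical orders/customer tables:

    -- the subquery runs once per order row; LIMIT 1 caps it at one match
    SELECT o.*, c.name
    FROM orders o
    LEFT JOIN LATERAL (
        SELECT c2.name
        FROM customer c2
        WHERE c2.customer_id = o.customer_id
        LIMIT 1
    ) c ON true;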


With SQL there are commonly several options available, and the tradeoffs might not be obvious until they're tested against a real data set. Sometimes even an EXPLAIN that looks good still has some drawback in production, though in my experience that's very rare.

I like that, though. I have a preference for REPL-style development and fleshing things out interactively - enough so that I build Java like that, hacking things out directly in tests.


Sorry, but someone who is averse to joins is not a non-expert in SQL; they are a total novice. The answer is the same as for any other programming language: you simply must learn the fundamentals in order to use it.


Not sorry, I’ll stick with SQL non-expert as I’ve only worked with databases for a few decades and sometimes run into people who know more.

Working with a database you built or can control is kind of a simplistic example.

In my experience this most often arises when it’s someone else’s database or API you’re interacting with and do not control.

An upstream ERP system doesn’t have unique keys, or it changes the name of a column, or an accounting person adds a custom field, or the accounting system does claim to have unique identifiers but gets upgraded and changes the unique row identifiers that were never supposed to change, or a user deletes a record and recreates the same record with the same name, which now has a different ID so the data in your reporting database no longer has the correct foreign keys, and some of the data has to be cross-referenced from the ERP system with a CRM that only has a flaky API and the only way to get the data is by pulling a CSV report file from an email, which doesn’t have the same field names to reliably correlate the data with the ERP, and worse the CRM makes these user-editable so one of your 200 sales people decides to use their own naming scheme or makes a typo and we have 10 different ways of spelling “New York”, “new york”, “NY”, “newyork2”, “now york”, and yeah…

Turns out you can sometimes end up with extra rows despite your best efforts and that SQL isn’t always the best tool for joining data, and no I’m not interested in helping you troubleshoot your 7-page SQL query that’s stacked on top of multiple layers of nested SQL views that’s giving you too many rows. You might even say I’m averse.


Yes, when you have duplicated data and inconsistency/integrity issues, you might get duplicated data and inconsistency/integrity issues in your output.

This is a problem of form, not method. JOINs are a fantastic, well-defined method for combining data. If the form of your data is messed up, then naturally the result may be too.
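A tiny illustration of the "form, not method" point, with made-up throwaway tables:

    CREATE TEMP TABLE customer (customer_id int, name text);
    CREATE TEMP TABLE orders   (order_id int, customer_id int);
    INSERT INTO customer VALUES (1, 'Acme'), (1, 'Acme Inc.');  -- duplicated key
    INSERT INTO orders   VALUES (10, 1);

    -- the single order comes back twice: the JOIN is fine, the data isn't
    SELECT o.order_id, c.name
    FROM orders o
    JOIN customer c USING (customer_id);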

> no I’m not interested in helping you troubleshoot your 7-page SQL query that’s stacked on top of multiple layers of nested SQL views that’s giving you too many rows

People say this type of thing but SQL is an incredibly high-level language.

Yes, debugging a big SQL query can suck, but debugging the equivalent code in another language is almost always much worse. I'd rather debug a 30-line query than a 400-line Perl script that does the same thing, because that's the actual alternative.

I have manually aggregated data, messy data, in awk, perl, python... it is much, much worse.


> ”I have manually aggregated data, messy data, in awk, perl, python... it is much, much worse.”

Yes, but with Python etc. you can at least do the same logic in a plain loop - much slower, but it serves as a more reliable reference for what the data is supposed to be, which can then be used to validate the SQL output.

Is there an equivalent in SQL of this “slow and dumb” approach for validating? Like, I’m not sure if a lateral join is essentially doing the same thing under the hood.


Most databases have the concept of temporary tables that automatically disappear when your session ends. For troubleshooting, I would break down each step and save it in a temp table. Validate it, then use it as the input for the next step, and so on.
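A sketch of that workflow, assuming Postgres-style temp tables and made-up table names:

    -- step 1: materialize the first join
    CREATE TEMP TABLE step1 AS
    SELECT o.order_id, o.customer_id, c.name
    FROM orders o
    JOIN customer c USING (customer_id);

    -- validate: should equal the orders row count if the join really is 1:1
    SELECT (SELECT count(*) FROM orders) AS expected,
           (SELECT count(*) FROM step1)  AS actual;

    -- step 2: build on the validated intermediate result, and so on
    CREATE TEMP TABLE step2 AS
    SELECT s.*, p.amount
    FROM step1 s
    LEFT JOIN payments p USING (order_id);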


I have run into this scenario a few times, where a multi-hour process produces an explosion of rows that no one cared to troubleshoot further. They only wanted to de-duplicate the final result.

In practice, I ended up creating a skeleton table that not only had the appropriate constraints for de-duplication, but was also pre-populated with mostly empty rows (empty except for the required key ID columns) matching the exact rows they were expecting. Then I would perform an UPDATE <skeleton> ... FROM <complicated join>. It's a hack, but if there was no desire to rewrite a process written years ago by teams of consultants, I can only do what I can do.
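A rough sketch of that hack, with hypothetical table and column names standing in for the real ones:

    -- skeleton holds exactly the rows the consumers expect; the PK enforces de-duplication
    CREATE TABLE report_skeleton (
        order_id    int NOT NULL,
        customer_id int NOT NULL,
        total       numeric,            -- left NULL until filled in
        PRIMARY KEY (order_id, customer_id)
    );

    INSERT INTO report_skeleton (order_id, customer_id)
    SELECT DISTINCT order_id, customer_id FROM orders;

    -- fill in the blanks; duplicate matches from the join collapse onto the same
    -- skeleton row instead of multiplying it (which match wins isn't guaranteed)
    UPDATE report_skeleton s
    SET total = j.total
    FROM complicated_join_view j        -- stand-in for the real multi-layer join
    WHERE j.order_id = s.order_id
      AND j.customer_id = s.customer_id;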


Nobody is entitled to any of this being easy. If you don't like working with badly-designed databases, why not simply work with people who know how databases work? In the meantime, I have bad news: nobody is going to do the hard work for you.


To verify, you have to think through the problem and arrive at a solution, like all of programming.


I think the problem is the quirkiness on the English side, not the SQL side. You could translate datalog to SQL or vice versa, but understanding intention from arbitrary English is much harder. And often query results must be 100% accurate and reliable.


> I think the problem is the quirkiness on the English side

While that's likely, the question asked whether any improvement was shown with other targets, in order to validate that assumption. There is no benefit in just speculating.

> And often query results must be 100% accurate and reliable.

It seems that is impossible. Even human programmers struggle to reliably convert natural language to SQL, according to the aforementioned study. They are slightly better than the known alternatives, but far from perfect. But if another target can get closer to human-level performance, that is significant.


When I find someone claiming a suspicious data analysis result, I can ask them for the SQL and investigate it to see if there's a bug (or dig further into where the data being queried comes from). If the abstraction layer between the LLM prompt and the data coming back is removed, I'm left with (just like other LLM answers) some words but no way to know whether they're correct.


1. How would the abstraction be removed? Language generation is what LLMs do; a language abstraction is what you are getting out, no matter what. There is no magic involved.

2. The language has to represent a valid computer program. That is as true of SQL as any other target. You can know that it is correct by reading it.


Once you have SQL, you have datalog. Once you have datalog, you have SQL. The problem isn't the target, it is getting sufficiently rigorous and structured output from the LLM to target anything.


So you already claimed, but, still, curiously we have no answer to the question. If you don't know, why not just say so?

That said, if you have ever used these tools to generate code, you will know that they are much better at some languages than others. In the general case, the target really is the problem sometimes. Does that carry into this particular narrow case? I don't know. What do the comparison results show?


I don't really know much about India, but they seem to have more illegal immigration than the US does, and per https://www.bbc.com/news/world-asia-india-50670393 it seems they have discussions about amnesty there too. My assumption would be that any regional economic power with long borders will struggle with illegal immigration.


On the OP's resume, they claim to have written the "highest performing in-memory database in the world". It's here: https://github.com/carrierdb. There are no users.

Perhaps it is the highest-performing in-memory DB in the world, but it's a competitive space and the claim is extremely grandiose. If I were hiring, my assumption would be that it's BS. I would recommend the OP collapse the resume to a single-page PDF and ensure it contains only supportable claims.


Postgres pads tuples to 8-byte alignment, so an indexed single-column int takes the same space as an indexed bigint. That's the usual case for indexed foreign keys.

Differences can appear in multicolumn indexes, because two ints take 8 bytes while two bigints take 16; however, the right layout of columns for an index is not always the layout that minimizes padding.


Postgres doesn't necessarily pad to 8 bytes; it depends on the next column's type. EDB has a good writeup on this (https://www.2ndquadrant.com/en/blog/on-rocks-and-sand/), but also here's a small example:

  CREATE TABLE foo
    (id INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY, iid INT NOT NULL);

  CREATE TABLE bar
    (id INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY, iid BIGINT NOT NULL);

  CREATE TABLE baz
    (id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY, iid BIGINT NOT NULL);

  -- fill each with 1,000,000 rows, then index

  CREATE INDEX {foo,bar,baz}_iid_idx ON {foo,bar,baz}(iid);

  SELECT table_name,
         pg_size_pretty(pg_table_size(quote_ident(table_name))) "table_size",
         pg_size_pretty(pg_indexes_size(quote_ident(table_name))) "index_size"
  FROM information_schema.tables
  WHERE table_schema = 'public';

   table_name | table_size | index_size
  ------------+------------+------------
   foo        | 35 MB      | 43 MB
   bar        | 42 MB      | 43 MB
   baz        | 42 MB      | 43 MB
`foo` has an INT followed by an INT, and its table size is 35 MB. `bar` has an INT followed by a BIGINT, and its table size is 42 MB - the same size as `baz`, despite `baz` being a BIGINT followed by a BIGINT.


You seem to think you're disagreeing with me, but afaict you're just demonstrating my point, unless your point is just about how (int, int) will get packed. That's what I meant about the column order of indexes: if you have two ints and a bigint, but you need to index them as (int, bigint, int), then you aren't gaining anything there either.

As your example shows, there is no difference in index size (e.g. for supporting FKs) between int and bigint for a single key. You end up with the same index size either way, not twice the size, which was what I took your original post to mean.


I misunderstood your post, I think. I re-ran some experiments with a single-column index on SMALLINT, INT, and BIGINT. I'm still not sure why, but there is a significant difference in index size on SMALLINT (7.3 MB for 1E6 rows) vs. INT and BIGINT (21 MB for each), while the latter two are the exact same size. I could get them to differ if I ran large UPDATEs on the table, but that was it.


Reconfiguring tables to use a different kind of unique ID (primary key in this context) can be a much bigger pain than an ordinary column rename if it is in use by foreign key constraints.
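A sketch of why, assuming Postgres and hypothetical parent/child tables keyed by an int id:

    -- the parent grows the new key; the serial column is backfilled automatically
    ALTER TABLE parent ADD COLUMN new_id bigserial;

    -- every referencing table has to grow a matching column and be backfilled too
    ALTER TABLE child ADD COLUMN parent_new_id bigint;
    UPDATE child c
    SET parent_new_id = p.new_id
    FROM parent p
    WHERE p.id = c.parent_id;

    -- then drop the old FK (constraint name assumed), swap the primary key,
    -- recreate the FK, and repeat for every other table referencing parent.id
    ALTER TABLE child DROP CONSTRAINT child_parent_id_fkey;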


Or if the primary key is exported out of the DB, e.g. for constructing URLs.


If you use a surrogate key, you still need a unique constraint in the table (probably the same columns you would otherwise call your natural PK). If your unique constraint isn't sufficient to capture the difference you mention, you need to add more columns.

However, that's strictly better than the natural PK situation, where you would need to not only add new columns to the key, but also add those columns to all referencing tables.
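A minimal sketch of that shape, using a restaurant table like the article's example (column types are my own guess):

    CREATE TABLE restaurant (
        id   BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,  -- surrogate key
        name TEXT NOT NULL,
        city TEXT NOT NULL,
        UNIQUE (name, city)  -- the would-be natural key still gets a unique constraint
    );
    -- if (name, city) later turns out to be insufficient, widen this constraint;
    -- referencing tables keep pointing at id and don't change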


> However, that's strictly better than the natural PK situation, where you would need to not only add new columns to the key, but also add those columns to all referencing tables.

Foreign keys referencing a surrogate key have different semantics than foreign keys referencing a natural key - it's actually a can of worms and can lead to unexpected anomalies.

Let's take the example from the article (with a surrogate key):

Restaurant(id, name, city)

Now let's add the ability to record visits:

Visit(restaurant_id references Restaurant(id), user, date)

We have a procedure to register visits to a restaurant:

    register_visit(restaurant_name, user_name, date_of_visit) {
        INSERT INTO visit
        SELECT id, user_name, date_of_visit
        FROM restaurant
        WHERE name = restaurant_name
    }

I very much enjoy spending time in the "Polish Kielbasa" restaurant in Warsaw and I visit it every day - I don't visit any other restaurant at all.

Now a change of the restaurant's name will lead to the database containing misinformation:

    register_visit('Polish Kielbasa', 'mkleczek', '2024-06-04');
    UPDATE restaurant SET name = 'Old Kielbasa'
        WHERE name = 'Polish Kielbasa' AND city = 'Warsaw';
    INSERT INTO restaurant (name, city) VALUES ('Polish Kielbasa', 'Warsaw');
    register_visit('Polish Kielbasa', 'mkleczek', '2024-06-04');

Question: what restaurants did I visit this year?

This kind of anomaly is avoided by using _natural_ keys and - first of all - by defining a proper _predicate_ for _each_ relation.

The predicate of relation visit(restaurant_name, city, user, date) is quite obvious: "User [user] visited restaurant [restaurant_name] in [city] on [date]"

Question: What is the predicate of relation visit(restaurant_id, user, date)?

