Cool! These are indeed very common graph-building steps.
Thinking out loud here, but some of these were supposed to be solved with RML (https://rml.io/) for the RDF paradigm. I witnessed a bit of their evolution: it started with operations similar to GraFlo's, and eventually they built some support for arbitrary Java code. For example, say you want your node ID to be generated by concatenating the values of the firstName column and the lastName column, but only after some weird string normalization (think of making sure everything is UTF-8)... you wouldn't want to make your schema mappings Turing-complete, so you'd eventually have to allow for calling other functions. Anyway, all of that was for RDF graphs; it's cool to see something like this for property graphs.
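To make the escape-hatch point concrete, here is the kind of transformation I mean, as a plain Python sketch (the function name and the ID scheme are made up for illustration, not anything from RML or GraFlo):

    import unicodedata

    def make_node_id(first_name, last_name):
        # Hypothetical helper, not part of any mapping language.
        def clean(value):
            if isinstance(value, bytes):
                # Tolerate sources that hand us raw bytes instead of text.
                value = value.decode("utf-8", errors="replace")
            # Canonicalize combining characters so the same name always
            # yields the same ID, however the accents were encoded.
            return unicodedata.normalize("NFC", value).strip()
        return f"{clean(first_name)}_{clean(last_name)}"

    print(make_node_id("Ana\u0301", "Souza"))  # 'Aná_Souza'

This is exactly the sort of logic you can't cleanly express in a declarative mapping, hence the escape hatch to arbitrary code.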
Somewhere on HN there's a post about disrupting or revolutionizing the laundromat industry, where some person is showered in praise (and later money) for setting up this lousy system.
And now the LavaWash server could be hacked to steal user data that a washing machine would never need, but that the implementer chose to store without reasonable protection.
In countries that communicate in non-English languages written in the Latin script, Latin-1 is still in very wide use. Even where Latin-1 has been "phased out", there are tons and tons of documents and databases encoded in it, not to mention millions of ill-configured terminals.
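A quick stdlib-only Python demo of why those leftovers bite (nothing project-specific here):

    text = "café"
    utf8_bytes = text.encode("utf-8")      # b'caf\xc3\xa9'
    latin1_bytes = text.encode("latin-1")  # b'caf\xe9'

    # UTF-8 bytes read as Latin-1: silent mojibake.
    print(utf8_bytes.decode("latin-1"))    # 'cafÃ©'

    # Latin-1 bytes read as UTF-8: hard failure.
    try:
        latin1_bytes.decode("utf-8")
    except UnicodeDecodeError as err:
        print(err)  # "can't decode byte 0xe9 in position 3..."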
One question: did you try to replicate the other results table (Table 3)?
If I understand correctly, top-2 accuracy would be 1 if you have only 2 classes, but (on average) it will differ from "normal" accuracy less and less as the number of classes increases. So this shouldn't change the results for Table 3 thaaat much, since the datasets have large numbers of classes (see Table 1).
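For anyone unfamiliar with the metric, a minimal sketch (my own helper, not code from the paper):

    def top_k_accuracy(ranked_predictions, true_labels, k=2):
        # Fraction of examples whose true label appears among the
        # k highest-ranked predicted classes.
        hits = sum(label in ranked[:k]
                   for ranked, label in zip(ranked_predictions, true_labels))
        return hits / len(true_labels)

    # With only 2 classes, the top-2 list always contains the true label:
    ranked = [["a", "b"], ["b", "a"], ["a", "b"]]
    print(top_k_accuracy(ranked, ["b", "b", "a"]))  # 1.0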
In any case, top-2 accuracy of 0.685 on the 20-newsgroups dataset is pretty neat for a method that doesn't even consider characters as characters[1], let alone tokens, n-grams, embeddings and all the nice stuff that those of us working in NLP have been devoting years to.
[1] In my understanding of gzip, it considers only bit sequences, which are not necessarily aligned with bytes (and hence characters).
I haven't yet replicated Table 3 because most of those datasets are much larger and it will take a while to run (they said the YahooAnswers dataset took them 6 days).
Also, I have only tried the "gzip" row because that is all that is in the GitHub repo they referenced.
Yeah, you're right: the more classes there are, the smaller the effect this will probably have.
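For anyone else wanting to poke at the "gzip" row, the core method is tiny; here's a simplified 1-NN sketch of the paper's NCD-plus-kNN approach (the actual repo uses k-NN with its own tie-breaking, and concatenation details may differ):

    import gzip

    def clen(s):
        # Compressed length of a string, in bytes.
        return len(gzip.compress(s.encode("utf-8")))

    def ncd(x, y):
        # Normalized Compression Distance between two texts.
        cx, cy, cxy = clen(x), clen(y), clen(x + " " + y)
        return (cxy - min(cx, cy)) / max(cx, cy)

    def classify(test_text, train_texts, train_labels):
        # Nearest neighbour under NCD (k=1 for simplicity).
        distances = [ncd(test_text, t) for t in train_texts]
        return train_labels[min(range(len(distances)),
                                key=distances.__getitem__)]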
- does not allow for easy and clean importing of modules/libraries
- is not easy to write tests for
- has limited support for a debugger
- lacks a consistent style for such large queries (plus most textbooks cover fairly simple stuff), which means it's harder for a developer to start reading someone else's code than in other languages
- clearly indicates in its name that it is a Query language.
Save yourself the trouble and all your collaborators the pain of working with this code in the future, of trying to add new features, of trying to reuse it in another project.
If you want to operate near the data, use PL/Python for PostgreSQL.
- PostgreSQL extensions are easy to include and use.
- pgTAP exists for testing.
- A large query in SQL is not made smaller by translating it into an ORM DSL.
- If "Query" in "SQL" means it's for querying data, then evidently "Query" not being in, say, Java or Python means those languages are NOT meant for querying data. If that's true, then why would you use them for querying data?
> If "Query" in "SQL" means it's for querying data, then evidently "Query" not being in say Java or Python means those languages are NOT meant for querying data
"If X then Y" does not imply "if not X then not Y"; that's denying the antecedent. Java and Python do not indicate a purpose in their names because they are general-purpose.
Re modules/libraries: I meant it is not easy to write a piece of SQL code and then import it into several queries to reuse it, or lend it to someone else for use on their own schema. It is possible, yes, but seldom done, because it is hell. PostgreSQL extensions could be used for this purpose, but developing an extension requires a different set of SQL statements (or, luckily, Python or C) than those used by the user of the extension, which makes composing them a bit hard. Not impossible, just hard to maintain.
About your last point, I don't think that was my line of reasoning, but yes, for the love of what is precious, don't open SQL files as Python/Java file objects and then parse and rummage through them to find the data you are looking for. Not impossible, just hard to maintain.
Thanks for pointing out pgTAP, didn't know about this.
For some reason, data-science folks haven't yet caught up with ORMs. I don't know if this is good or bad, but (as the OP shows) they are more used to rows and columns (or graphs) than objects. Maybe that will change one day.
As for sharing SQL, that's easy to do within a database using views. Across databases with possibly different data models, that's not something I personally ever want to do.
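For completeness, the views approach in a sketch (psycopg2 again; the table, columns and view name are invented):

    import psycopg2  # placeholder connection string below

    with psycopg2.connect("dbname=mydb") as conn, conn.cursor() as cur:
        # Encapsulate the query logic once...
        cur.execute("""
            CREATE OR REPLACE VIEW recent_orders AS
            SELECT customer_id, total
            FROM orders
            WHERE created_at > now() - interval '30 days'
        """)
        # ...then any collaborator reuses it without copying the WHERE clause.
        cur.execute("SELECT avg(total) FROM recent_orders")
        print(cur.fetchone()[0])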