First, let me admit I'm slow to understand, and I also may have explained the above poorly.
The PDF is thousands of lines, with multiple entries on each, cluttered with lots of garbage formatting. The only data to be preserved are the winning num, date, midday/evening draw, and fireball number (introduced in 2020-ish?). The fireball's introduction and the change from one to two daily draws (2001-ish) are factored in as acceptable anomalies in the data set, which has been done, I believe.
The difficult part of this, actually, was cleaning the squalid pdf.
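To illustrate the cleaning step, here is a minimal sketch of pulling several entries out of one noisy line with a regular expression. The line layout (`MM/DD/YYYY`, an optional `Midday`/`Evening` label, hyphen-separated digits, an optional `Fireball N`) is an assumption made up for the example, not the real file's format, and `Draw`, `ENTRY`, and `parse_line` are hypothetical names.

```python
import re
from typing import List, NamedTuple, Optional


class Draw(NamedTuple):
    date: str                # kept as text, e.g. "01/02/2021"
    draw: str                # "midday", "evening", or "single" (no label found)
    number: str              # winning digits with separators removed
    fireball: Optional[str]  # None when the entry has no fireball


# Assumed layout: date, optional draw label, digits joined by "-",
# optional "Fireball N". Anything else on the line is ignored as noise.
ENTRY = re.compile(
    r"(?P<date>\d{2}/\d{2}/\d{4})\s+"
    r"(?:(?P<draw>midday|evening)\s+)?"
    r"(?P<number>\d(?:-\d)+)"
    r"(?:\s+fireball\s+(?P<fireball>\d))?",
    re.IGNORECASE,
)


def parse_line(line: str) -> List[Draw]:
    """Return every entry found in one line, skipping surrounding garbage."""
    draws = []
    for m in ENTRY.finditer(line):
        draws.append(
            Draw(
                date=m.group("date"),
                draw=(m.group("draw") or "single").lower(),
                number=m.group("number").replace("-", ""),
                fireball=m.group("fireball"),
            )
        )
    return draws


if __name__ == "__main__":
    sample = "xx 01/02/2021 Evening 4-5-6 Fireball 7 ## 03/04/1999 1-2-3 junk"
    for entry in parse_line(sample):
        print(entry)
```

Entries without a draw label come back as `"single"`, which is one possible way to mark the years before the second daily draw existed.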
In the end, after the work was all done, the script using OCR would successfully append a number to the cleaned text/PDF, but usually not the correct num. The only reason I used OCR was that I couldn't find the right frames in the webpage that contained the latest winning numbers, so getting HTML extraction to work in a script failed.
I must admit, although I have used JSON files, I don't know much about them. Additionally, I'm ignorant enough that it's probably best not to attempt to advise me too much here for the sake of thread sanitation - it could get bloated and off topic :)
I think with renewed inspiration, I could figure out a successful method to keep the public file updated, but I primarily need surefire methods of analysis on the nums for my anomaly detection. That is a challenge for a caveman who never went to middle/high school and didn't resume school beyond 4th grade until community college much later. Of course, the fact that such an animal can twiddle with statistics and data analysis is a big testament to the positive attributes of LLMs, without which the pursuit would be a vague thought at most.
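One common starting point for this kind of anomaly check is a chi-square goodness-of-fit test on digit frequencies. The sketch below is only an illustration under stated assumptions: that every digit 0-9 should be equally likely, and that the winning numbers are stored as digit strings. `digit_chi_square` and `CRITICAL_DF9_P05` are hypothetical names, not part of any existing script.

```python
from collections import Counter
from typing import Iterable

# Chi-square critical value for 9 degrees of freedom at the 5% level.
CRITICAL_DF9_P05 = 16.919


def digit_chi_square(numbers: Iterable[str]) -> float:
    """Chi-square statistic of observed digit counts against a uniform 0-9 model."""
    counts = Counter(d for n in numbers for d in n if d.isdigit())
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no digits to analyse")
    expected = total / 10  # each digit expected equally often (assumption)
    return sum(
        (counts.get(str(d), 0) - expected) ** 2 / expected for d in range(10)
    )


if __name__ == "__main__":
    # Placeholder values only, not real draw results.
    sample = ["123", "456", "789", "012", "345"]
    stat = digit_chi_square(sample)
    verdict = "unusual" if stat > CRITICAL_DF9_P05 else "consistent with uniform"
    print(f"chi-square = {stat:.3f} ({verdict} at the 5% level)")
```

A statistic above the threshold is a flag to look closer rather than proof of anything: perfectly fair draws will exceed the 5% threshold about one time in twenty, and the test needs a reasonably large number of draws to be meaningful.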
Although I welcome and appreciate any feedback, I'm pretty sure it isn't too welcome here. I'll try to make sense of your suggestions though.