
*large* datasets: SQL, JSON with a framework, or plain JSON?

Discussion in 'Programming' started by seductiveapps.com, Mar 3, 2018.

  1. #1
    I'm opting for plain PHP+JSON.
    Here's why:

    My newsApp (seductiveapps.com/news) is an RSS live-blog app, and I want to add a search feature.
    I happen to use PHP, but I guess you could use any web server and server-side scripting language for this.

    It crawls many RSS addresses for the latest news items every minute, parses them in a uniform way, respects RSS metadata about not crawling an address more than N times per hour, and so on.

    It stores all data and results as JSON on the server's filesystem, in a relative path that starts with YYYY/MM/DD/ and, under that, the entire menu structure (again as recursive folders per sub-menu title/key), plus all the files needed to show news for that menu structure as currently defined in seductiveapps/apps/seductiveapps/news/appContent/newsApp, cron_getFreshContent.php and newsItems/
    (see below for a link to the open-source repository where you can find these files).
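    The YYYY/MM/DD layout described above is easy to sketch. Here is a minimal Python version (the thread uses PHP, but as the OP notes, any server-side language works; the root path and item fields are illustrative, not the repository's actual format):

```python
import json
import os
from datetime import datetime, timezone

def store_item(root, item):
    """Store one parsed news item as JSON under root/YYYY/MM/DD/."""
    day_dir = os.path.join(root, datetime.now(timezone.utc).strftime("%Y/%m/%d"))
    os.makedirs(day_dir, exist_ok=True)  # create the date folders on first write of the day
    path = os.path.join(day_dir, item["id"] + ".json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(item, f, ensure_ascii=False)
    return path

path = store_item("/tmp/newsItems", {"id": "item001", "title": "hello", "description": "world"})
print(path)
```

    Backups then reduce to copying whatever date folders the destination does not have yet.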

    For a list of RSS sources large enough to be interesting and to provide fresh news every 2 minutes, I get about 20MB of data per day at the time of writing. I estimate that 10 to 15MB of this has to be searched through (the rest is used as recent data for the live-blog feature) in a fairly slow PHP way to get search results for that day for a single searchKey.
    With an SSD and a quad-core Core i5 CPU this should be doable (I'm writing this document before embarking on the code, partly to wrap my head around the problem properly).
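    That "fairly slow PHP way" is essentially a linear scan over a day's JSON files. A sketch in Python (the file layout and field names are assumptions):

```python
import json
import os

def search_day(day_dir, search_key):
    """Linearly scan every JSON news item in one day's folder for a search key."""
    hits = []
    needle = search_key.lower()
    for name in sorted(os.listdir(day_dir)):
        if not name.endswith(".json"):
            continue
        with open(os.path.join(day_dir, name), encoding="utf-8") as f:
            item = json.load(f)
        haystack = (item.get("title", "") + " " + item.get("description", "")).lower()
        if needle in haystack:
            hits.append(item)
    return hits

# Tiny demo against a throwaway day folder:
os.makedirs("/tmp/demo_day", exist_ok=True)
with open("/tmp/demo_day/a.json", "w", encoding="utf-8") as f:
    json.dump({"title": "Rocket launch", "description": "SpaceX"}, f)
with open("/tmp/demo_day/b.json", "w", encoding="utf-8") as f:
    json.dump({"title": "Election news", "description": "votes"}, f)
print([h["title"] for h in search_day("/tmp/demo_day", "rocket")])  # → ['Rocket launch']
```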

    This saves the trouble of keeping an SQL server configured and backed up: one can simply copy files, via the soon-to-be-included backup script, automatically and regularly across a network to another machine, always copying the smallest set of files needed to get a complete backup at the destination.
    And let it be noted that without a lot of indexing effort (nearly impossible if you want to search description fields too, as would be needed), SQL is not going to outperform PHP+JSON by enough to be worth the effort of an SQL solution. I'd have to spend too much time fitting the data into an SQL server and then extracting it quickly. It's a learning curve I'm most unwilling to climb, especially because I've found that with the right approach (the small-bites approach: processing only small chunks of data before reporting the results of what you've processed), PHP+JSON is easier to program for and just as efficient from an end-user perspective.

    The end-user needs a smooth, quick experience, and the server needs as little extra stress as possible.
    This is achievable by executing the search-feature scripts on the server as "nice -n 9 php /path/to/script.php", launched automatically and periodically whenever the server is on, via a crontab entry ("crontab -e"). This ensures that fetching and serving new items always takes priority over searches being performed.
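    The cron + nice setup described above would look something like this in "crontab -e" (the script path is illustrative):

```shell
# m h  dom mon dow  command
# Run the search worker every minute at lowered priority (nice 9),
# so fetching and serving fresh news always wins the CPU:
* * * * * nice -n 9 php /path/to/script.php
```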

    The end-user also needs a quick search feature, of course, and this is facilitated by offering the search keys already searched for in a tag cloud.
    Whenever a new search is performed, it is sent as a todo-item to the server, where a script reads these todo-files from the server's filesystem and, as many times as necessary, searches a single day's worth of news for a given searchKey, then records the searched date in the todo-list to indicate that those results are ready to be downloaded and displayed by the browser. The results themselves are stored under newsItems/YYYY/MM/DD/searches/searchKeyID, and the todo-files go in newsItems/settings/searches/searchKeyID.json; the searchKeyID allows for complex, long searches without cluttering the filesystem with long filenames that can no longer be copied to other filesystem formats (like NTFS on Windows).
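    One way to derive a filesystem-safe searchKeyID from an arbitrary search string, which is what the scheme above relies on, is to hash it. A sketch (the hash choice and todo-file fields are my assumptions, not the OP's actual format):

```python
import hashlib
import json
import os

def search_key_id(search_key):
    """Map an arbitrarily long search string to a short, filesystem-safe ID."""
    return hashlib.sha1(search_key.encode("utf-8")).hexdigest()[:16]

def queue_search(settings_dir, search_key):
    """Write a todo-file for the server-side worker to pick up."""
    key_id = search_key_id(search_key)
    os.makedirs(settings_dir, exist_ok=True)
    todo = {"searchKey": search_key, "daysDone": [], "lastUsed": None}
    with open(os.path.join(settings_dir, key_id + ".json"), "w", encoding="utf-8") as f:
        json.dump(todo, f)
    return key_id

key_id = queue_search("/tmp/newsItems/settings/searches", 'a very long "complex" search / with ugly chars')
print(key_id)
```

    The hex ID stays short and portable (safe even on NTFS), no matter how long or strange the original query is.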

    Todo-files are kept around, and whenever a new day starts (in the server's timezone), the previous day is searched through for all search keys.

    Todo-files have a last-used date, which indicates the last time this search was searched for by any end-user.
    This gives the site operator the option of deleting the results for searches no longer used. Doing so does mean you have to start a new backup from scratch afterwards.

    See also github.com/seductiveapps/seductiveapps,
    https://github.com/seductiveapps/seductiveapps/commits/master for the change log,
    and seductiveapps.com/news for a live demo.

    P.S.
    (And please, can we skip the complaints about the slow loading of my framework: I can't do anything about throttling, I also have a lot on my todo-list, and while tablet and phone compatibility has now largely been handled, I haven't yet gotten around to optimizing the site's load speed. It is on the agenda, though: I want to start advertising my news app within a few weeks to months, and that does require time spent on page-load speed, which can indeed improve by quite a bit.)

    P.S.2
    This is a thread about SQL vs JSON, and whether proper JSON use needs MongoDB or a similar solution at all. What I'm trying to say is that using Mongo seems to me like strapping yourself into another straitjacket, like SQL. Some people like that jacket; I don't. All I'm saying is that something like SQL or Mongo isn't needed. Not if you handle your data and filesystem properly.
     
    seductiveapps.com, Mar 3, 2018 IP
  2. seductiveapps.com

    seductiveapps.com Member

    Messages:
    174
    Likes Received:
    6
    Best Answers:
    0
    Trophy Points:
    35
    #2
    I gave the matter some more thought and came up with this practical solution for my newsApp's upcoming search feature:


    A default Ubuntu install uses nearly all 4GB of memory on my relatively cheap Core i5 machine.
    But Lubuntu runs very happily on 512MB of memory, leaving between 3 and 3.5GB available for PHP apps like my newsApp version 2 :)

    Playing it safe, and assuming up to 25MB of news data per day, 3GB of the Core i5's 4GB used by the app, 512MB by the OS, and 512MB spare RAM,

    that's 3000/25 = 120 days' worth of news that can be kept in RAM; news older than that can be searched on all searchKeys currently searched for by end-users, two or three days at a time.

    Since everything will have to run from one thread, it is important that I optimize this code properly. That means using references in PHP rather than making copies of data in RAM; I've got the experience required for that.
    It also means using a data structure with shorter key names and fewer fields duplicated in each news item. Hopefully that way, even longer periods of recent news can be kept decoded in RAM, ready to be searched quickly.

    Finally, for things like search-status displays (which will be necessary and very frequently updated), a small RAM disk of about 10MB on the server should do the trick. That spares the SSD unnecessary wear and tear.
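    A RAM disk like that can be a tmpfs mount; a possible /etc/fstab line (the mount point is illustrative):

```shell
# 10MB tmpfs for frequently rewritten search-status files;
# writes land in RAM, so the SSD is spared the churn
tmpfs  /var/newsapp/status  tmpfs  size=10m,mode=0755  0  0
```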

    There will also be facilities to let the code do clean-up tasks and search tasks in between fetching new news items, plus a few dozen seconds of pause to let the system cool down a bit.
    As a rule of thumb, you don't let your CPU run at 100% all the time; it wears out the server too fast.

    With most of the old news already JSON-decoded and in RAM instead of stored as JSON on disk, it's much simpler to write out much smaller *non-overlapping* files containing news items per small time segment of each day (meaning no duplication from file to file, as there is now in version 1),
    which ensures news can be loaded in small increments, even for time ranges well into the past. This decreases page-initialization time by a lot as well.
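    Bucketing the in-RAM items into non-overlapping per-segment files could be sketched like this (the segment length and item fields are assumptions):

```python
from collections import defaultdict

SEGMENT_SECONDS = 15 * 60  # one output file per 15-minute slice of the day

def bucket_items(items):
    """Group items by time segment; each item lands in exactly one bucket, so files never overlap."""
    buckets = defaultdict(list)
    for item in items:
        segment = item["timestamp"] - (item["timestamp"] % SEGMENT_SECONDS)
        buckets[segment].append(item)
    return dict(buckets)

items = [
    {"id": "a", "timestamp": 100},     # falls in segment 0
    {"id": "b", "timestamp": 800},     # also segment 0
    {"id": "c", "timestamp": 950000},  # a much later segment
]
buckets = bucket_items(items)
print(sorted(buckets))  # → [0, 949500]
```

    The browser can then request exactly the segments covering the time range it is showing, instead of one big overlapping blob.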

    And finally, once a certain number of news items has been shown in the browser, older ones should be removed as new ones come in,
    or the browser becomes dead-slow after a few hours; even the operating system needs a reboot then, as I've noticed in my tests.

    So I'll be busy for the next few weeks again :)
    But I'm going to end up with a newsApp worth advertising before summer.
    Version 1 was written in less than a month and a half, and half that time was spent fixing things in the core system rather than in the news app itself.

    I'll update my GitHub source code at https://github.com/seductiveapps/se...ps/apps/seductiveapps/news/appContent/newsApp
    to split into a 1.0.0 and a 2.0.0 branch, and then I'll start 2.0.0 as a complete rewrite.

    Time to completion of version 2.0.0: about 3 to 8 weeks.
    I'll update this thread when it's fully debugged on all platforms and up on the live server.
     
    seductiveapps.com, Mar 6, 2018 IP
  3. seductiveapps.com

    seductiveapps.com Member

    #3
    Version 2.0.0 is done, well, up to the point of correctly displaying news.
    See http://seductiveapps.com/news(section'English_News')
    and https://github.com/seductiveapps/se...s/seductiveapps/news/appContent/newsApp/2.0.0

    2.0.0 does a bunch of things a lot better than 1.0.0:
    - causes a lot less disk activity
    - caches all news in RAM
    - searches the past 3 days' worth of news to prevent duplicates from being added
    - uses PHP references (a lot faster than the default copy-on-assignment behavior) and traverses fairly deep recursive arrays efficiently too
    - treats RAM as a filesystem
    - can update the browser view of the news every minute instead of every 2 minutes,
    thanks to the proper use of the curl_multi_* PHP routines
    - has plenty of time, even when RSS sources are slow, to search news
    (a search can even run while PHP waits for curl results to flow in)
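    The curl_multi_* pattern (firing many slow fetches in parallel and doing other work while the I/O is in flight) looks roughly like this in Python; the fetch is simulated here so the sketch runs offline, and the feed URLs are placeholders:

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

FEEDS = ["feeds.example.com/a.rss", "feeds.example.com/b.rss", "feeds.example.com/c.rss"]

def fetch(url):
    # Placeholder for a real HTTP request; a slow feed simply takes longer.
    time.sleep(0.1)
    return url, "<rss>...</rss>"

def run_pending_searches():
    # In the real app this would chip away at queued search todo-items
    # while the network I/O is in flight.
    return "searched"

results = {}
with ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(fetch, url) for url in FEEDS]
    side_work = run_pending_searches()  # useful work while fetches run, like curl_multi allows
    for fut in as_completed(futures):
        url, body = fut.result()
        results[url] = body

print(len(results), side_work)  # → 3 searched
```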

    But 2.0.0 can't yet:
    - search news
    - release RAM on the server and in the browser

    I have some other things to spend my time on as well, including the updated to-do list at http://seductiveapps.com/TODO.txt
    but I have a real need to keep my newsApp up, so at the very least the RAM-release routines will get built, fairly soon too.

    And finally, I should add that I have changed my http://seductiveapps.com/LICENSE.txt as well.
    The price has dropped, but the terms have been sharpened up a bit to make them practical.

    The PHP class for newsApp-2.0.0 should be especially interesting for intermediate-level PHP programmers, and the techniques there should be portable to any server or client language.
    Enjoy :)
     
    seductiveapps.com, Mar 24, 2018 IP
  4. Barti1987

    Barti1987 Well-Known Member

    Messages:
    2,683
    Likes Received:
    112
    Best Answers:
    0
    Trophy Points:
    155
    #4
    I thought you were serious about that fine, until I read that last part. Got a good laugh!
     
    Barti1987, Mar 24, 2018 IP
    malky66 likes this.
  5. malky66

    malky66 Prominent Member

    Messages:
    3,248
    Likes Received:
    1,981
    Best Answers:
    73
    Trophy Points:
    390
    #5
    Yeah, like anyone's gonna copy his shite code...:rolleyes:
     
    malky66, Mar 24, 2018 IP
  6. deathshadow

    deathshadow Acclaimed Member

    Messages:
    8,569
    Likes Received:
    1,529
    Best Answers:
    223
    Trophy Points:
    515
    #6
    If by "correctly displaying" you mean a minute and a half of goofy loading animations followed by a mostly blank page with a menu.... where MOST people would likely bounce to some other site that actually bothers starting out with CONTENT during the 20+ seconds of the white screen with barely legible text saying "this site will load soon".

    JUST another laundry list of endless pointless garbage BROKEN JavaScript, zero content of value, and a giant middle finger to accessibility and usability.

    "Caches all news in RAM" -- WHAT news?

    Honestly your ENTIRE concept reeks of putting megabytes or more of crap client side that has ZERO business being processed client side! EVER!!! AGAIN, back the f*** away from the JavaScript!

    You've got 18 megabytes of bloated rubbish spanning 366 files to deliver a broken menu that's not even 1k of plaintext. In handshaking ALONE that's a real world best case load time of 73 seconds and a worst case of FIVE MINUTES OR MORE REGARDLESS of connection speed! If you don't understand what's wrong with that, you have NO business using HTML, CSS, or JavaScript! PARTICULARLY when 12 freaking megabytes and over 230 files of that is the scripttardery!

    But then that seems to be part and parcel of everything you've ever posted here. Even when you have good ideas (and there were a few) your implementations are bloated messes that will NEVER be deployable on a real website!
     
    Last edited: Mar 26, 2018
    deathshadow, Mar 26, 2018 IP
    malky66 likes this.
  7. seductiveapps.com

    seductiveapps.com Member

    #7
    Oh, you troll, go back to your cave and your 14k4 modem :p
     
    seductiveapps.com, Mar 27, 2018 IP
  8. seductiveapps.com

    seductiveapps.com Member

    #8
    Oh, I see I get more ridicule from just about everyone else who decided to post something back, too.

    Well, then I won't be wasting my time on this forum anymore.
     
    seductiveapps.com, Mar 27, 2018 IP
  9. seductiveapps.com

    seductiveapps.com Member

    #9
    You go and laugh at people without even bothering to tell them where and how you think they're on the wrong track.

    That's a great attitude for a forum like this: ridicule and trolling...
    No wonder there are no real discussions happening here;
    trolls can't stand real or civilized discussions.
     
    seductiveapps.com, Mar 27, 2018 IP
  10. deathshadow

    deathshadow Acclaimed Member

    #10
    I find it hard to believe you aren't getting the EXACT same response everywhere else... unless you've found some magical sycophant-land that specializes in blowing smoke up your backside.

    Though that could be how you ended up with this bloated mess of broken methodologies in the first place.

    What part of "regardless of connection speed" do you fail to grasp? I've got 120mbps downstream now and your pages STILL suck down a minute or more thanks to handshaking cache-empty! HENCE the comment about the sheer number of separate files and why that is an epic /FAIL/ at web development.

    On top of the fact that your pages STILL fail to load or display content -- just now it fails in EVERY browser, instead of "well it only works in Firefox" or whatever magical browser you were ONLY coding for.

    JUST like how the link in your signature -- on my 120mbps connection -- takes over two minutes to load an empty page with no music player. You'd almost think it was 7 megabytes in 426 separate files not even counting whatever media it is you're trying to load for something that probably doesn't warrant the presence of more than 96k of code in three files!

    But you don't want to hear that 90%+ of your code belongs in the trash and isn't real world deployable for at least half a century... so naturally we're "trolling" you instead of trying to give you the bloody wake-up call you so desperately need!
     
    deathshadow, Mar 27, 2018 IP