23 August 2019

Establishment of Regis Healthcare Informatics Database - Continued

Regis MSDS 696 Practicum II Presentation.

Hi everyone, and thank you for listening to this presentation.

My name is Gilbert Bernal, and this is a continuation project of my first practicum, called the Establishment of the Regis Healthcare Informatics Database. Here's an outline of what we're going to go over for this project. Just so everyone's aware, my degree is Data Science with a Data Engineering focus, so this project is going to be more data engineering focused to make sure I'm meeting the data engineering goals. The goal is to complete the Regis health informatics database. When working on the previous project, I ran into a few issues trying to complete it. One was that I wasn't able to actually get the database onto the server: when I did connect, I'd either get disconnected randomly after being on for a few minutes or get a timeout error when trying to connect to the server itself. I also only had the code working for a single file format, so it really wasn't that versatile and couldn't be used for just anything. And there was no data governance code created for the database, so the database risked getting messy and out of control once people started using it and data started to stack up.

The requirements and goals for this project were to create a way to allow multiple file formats to be used instead of just one (CSV), to create the PostgreSQL database on the Regis server, to create an upload form that lets users simply upload their data to the server, and to automate all of this on the server itself so that nothing has to be controlled or maintained through any kind of manual process. The overall goal was to complete the creation of the Regis server for the health informatics school using a PostgreSQL database and to automate the whole process using a UNIX scheduler called cron, which we'll get to in a little bit. The tools used for this project were Python, HTML, PostgreSQL, cron, and Tableau. Python is a coding language with a ton of libraries; it was used because of those libraries, how versatile it is with them, and the fact that it works really well with PostgreSQL.

PostgreSQL was used simply because it's an open-source tool, meaning I didn't have to sign up or provide any information and it was completely free, which made things really easy. Another reason it was chosen is its COPY command, which gives you a simple way to load CSV files directly into a chosen table in the database. HTML was simply used to create a web form for uploading files to the server. cron is the UNIX scheduler; it uses bash commands inside a script that runs on a schedule you set up, and here it runs the code needed to go and create the tables. Tableau is used as a way to do the analysis and show a proof of concept: cron runs the ksh files with the bash scripts to create the tables, and Tableau can then connect to them to build reports and run an analysis on top.
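As a quick illustration of that COPY command (this is a sketch, not the project's import code; the database, user, table, and file names are placeholders), loading a CSV into an existing table from Python with psycopg2 looks something like this:

    import psycopg2

    # Placeholder connection details for illustration only.
    conn = psycopg2.connect(dbname="hinfo", user="student", host="localhost")
    cur = conn.cursor()

    # COPY ... FROM STDIN streams the CSV straight into the chosen table;
    # the table is assumed to already exist with matching columns.
    with open("dengue_cases.csv") as f:
        cur.copy_expert(
            "COPY dengue_cases FROM STDIN WITH (FORMAT csv, HEADER true)",
            f,
        )

    conn.commit()
    cur.close()
    conn.close()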

The first code that was created was convert.py. It simply goes through different data formats, JSON or Excel, and converts them to CSV files. These two formats were chosen because, when going through different healthcare organization databases, I noticed that CSV was the most common and the other two were Excel and JSON. So I created a function that looks at whether a file is a JSON, CSV, or Excel file and makes the changes needed. If it's a JSON file, pandas is used to create a DataFrame and convert it to a CSV file. If it's Excel, xlrd is used; xlrd is a Python library that works with Microsoft Excel. The code goes through the Excel file, looks at every single sheet, and creates a unique CSV file named after each sheet. This was done so each of these sheets can be turned into a table, and from there the user can take the unique tables created from the sheets, do joins, and work with the data in SQL to create one single source of truth, instead of having to do any kind of VLOOKUPs in Excel to build one source. Once all these changes are done, the Python code moves the file to the new path and archives it so it can be seen and used by the data governance code, which we're going to go over next.
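Here's a rough sketch of that conversion logic in Python. It's illustrative only: the function name and output directory are made up, it skips error handling, and newer xlrd releases only read .xls files (the .xlsx case would need openpyxl or pandas instead).

    import csv
    import os
    import pandas as pd
    import xlrd

    def convert_to_csv(path, out_dir):
        """Hypothetical sketch of the convert.py idea: JSON/Excel in, CSV(s) out."""
        name, ext = os.path.splitext(os.path.basename(path))
        ext = ext.lower()
        if ext == ".json":
            # JSON -> pandas DataFrame -> CSV (assumes a flat, table-like structure)
            df = pd.read_json(path)
            df.to_csv(os.path.join(out_dir, name + ".csv"), index=False)
        elif ext == ".xls":
            # Excel -> one CSV per sheet, each named after the sheet,
            # so each sheet can become its own table in the database.
            book = xlrd.open_workbook(path)
            for sheet in book.sheets():
                with open(os.path.join(out_dir, sheet.name + ".csv"), "w", newline="") as f:
                    writer = csv.writer(f)
                    for row in range(sheet.nrows):
                        writer.writerow(sheet.row_values(row))
        # Plain CSV files pass straight through and are archived afterwards.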

The data governance code simply goes through the archive folder and looks at the last modified date of each file. The last modified date shows when the code last ran on that file and how old it is. If the age is past a certain threshold, the code goes through and deletes the table, using the PostgreSQL DROP command and the name of the file. Right now, as you can see in the code, it's only set to nine days; the files I was using were pretty young and I didn't have any older than that at the time, but this could be changed to whatever is needed. The reason I'm using age and not last access is that this database is meant to be used by students who are in class and need a database to work with, and a typical online class runs eight weeks. So you could change the threshold to eight weeks' worth of days, or a bit past that in case students want to access their data afterwards. This way the database stays clean and there's no unused data or storage being consumed unnecessarily, so having this code creates a consistent way to keep the database clean.
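As a rough sketch (the path, connection details, and threshold handling here are placeholders, not the real governance code), the cleanup logic looks something like this:

    import os
    import time
    import psycopg2
    from psycopg2 import sql

    ARCHIVE_DIR = "/path/to/archive"   # placeholder path
    MAX_AGE_DAYS = 9                   # threshold mentioned in the talk; adjust as needed

    # Placeholder connection details.
    conn = psycopg2.connect(dbname="hinfo", user="student", host="localhost")
    cur = conn.cursor()

    now = time.time()
    for fname in os.listdir(ARCHIVE_DIR):
        age_days = (now - os.path.getmtime(os.path.join(ARCHIVE_DIR, fname))) / 86400
        if age_days > MAX_AGE_DAYS:
            # The archived file name doubles as the table name.
            table = os.path.splitext(fname)[0]
            cur.execute(sql.SQL("DROP TABLE IF EXISTS {}").format(sql.Identifier(table)))

    conn.commit()
    cur.close()
    conn.close()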

Once these codes were created, I went and created the database. To make sure everything was working as expected, I kept an SCP session and PuTTY connected to the server at the same time, and while doing so I noticed I wasn't getting any timeout errors or random disconnects. Once I noticed this, I went ahead and ran the UNIX command to install PostgreSQL. Once it installed, I created my own username and then created the hinfo database. After creating the database, I simply went through and opened up all connections, which allowed me to connect to the database outside of the sudo user, which, as you can see right here on the left, is postgres. Doing this lets me run the Python codes from my own username and connect to the database with no issue. It also lets me connect from pgAdmin 4, which is PostgreSQL's GUI, to look at the database as a user and run SQL commands. Once this was done, I simply copied over the directories and made sure everything was consistent with what I was testing with, so there weren't that many updates that needed to be made on the server.
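To illustrate that point about connecting from my own username rather than the postgres user (the connection details here are placeholders), a quick connectivity check from Python looks like this:

    import psycopg2

    # Connecting as a regular user instead of the postgres superuser;
    # database, user, and host names are placeholders.
    conn = psycopg2.connect(dbname="hinfo", user="gbernal", host="localhost")
    cur = conn.cursor()
    cur.execute("SELECT version();")
    print(cur.fetchone()[0])  # prints the PostgreSQL server version string
    cur.close()
    conn.close()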

Once that was done, I did an ls to make sure my directories matched and all my files had actually transferred over. After this I ran the codes with the data that was copied over and made sure everything worked as expected. Once it did, as you can see on the left, the table was created; I went onto the database itself, ran a simple SELECT *, everything worked as expected, and as you can see the data itself is in there. Once this was done, I had to find a way to automate the entire process, and to do that I used something called cron. cron is a UNIX scheduler: it runs scripts, in this case ksh files, which are simply bash scripts that run the Python code using the python command. Each script also looks for the trigger files my Python code writes at the very end of a run; if a trigger file isn't there, we know something is wrong, because it's only created once the code finishes. The script also keeps track of how long the job has been running; if it goes past about ten minutes, we know something is wrong, because these codes run pretty quickly. Then it checks for errors: it pulls any errors out of the log the ksh file created, moves the log to a directory, and sends an email that simply says there is an error, please investigate, so you can go to the logs, see what the error was, and correct the issue. This had to be created for both the import CSV code, which was built in the previous practicum, and the data governance code. On the right you can see the schedules: every 15 minutes cron runs each of the ksh files and writes a log to the log folder showing any errors that occur.
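The actual checks live in the ksh scripts, but as a hypothetical Python rendering of the same idea (the paths, addresses, and mail setup are all made up for illustration), the monitoring logic amounts to something like this:

    import os
    import smtplib
    import time
    from email.message import EmailMessage

    TRIGGER = "/path/to/triggers/import_csv.done"  # written by the job when it finishes
    LOG = "/path/to/logs/import_csv.log"
    MAX_RUNTIME = 10 * 60                          # ten minutes, per the talk

    def check_job(started_at):
        problems = []
        # 1. The trigger file is written at the very end of the job,
        #    so a missing trigger means the job never finished.
        if not os.path.exists(TRIGGER):
            problems.append("trigger file missing")
        # 2. Jobs normally finish quickly; anything past ten minutes is suspicious.
        if time.time() - started_at > MAX_RUNTIME:
            problems.append("job ran longer than ten minutes")
        # 3. Scan the log the script produced for error lines.
        if os.path.exists(LOG):
            with open(LOG) as f:
                if any("error" in line.lower() for line in f):
                    problems.append("errors found in log")
        if problems:
            msg = EmailMessage()
            msg["Subject"] = "There is an error, please investigate"
            msg["From"] = "server@example.edu"      # placeholder addresses
            msg["To"] = "admin@example.edu"
            msg.set_content("; ".join(problems))
            with smtplib.SMTP("localhost") as smtp:
                smtp.send_message(msg)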

After this was done, I created a web form that simply uploads files to the server and shows the name of the file, its size, and so on. Once all of this was in place, the analysis had to be tested to show a proof of concept. What I did was find some healthcare data on Kaggle that looked at dengue cases in the Philippines. Dengue fever is a vector-borne disease spread by mosquitoes; it has no cure, it is contracted by millions of people every year, and it can be fatal if the case is severe enough.

As you can see on the right-hand side, I uploaded the file to the server, waited the 15 minutes, and checked it using pgAdmin 4 over my connection to the server with a SELECT *. Once I saw the data was in there, I created a query, connected Tableau with that query, and imported the data. I also made it a live connection. Live connections can be really useful in Tableau if you're going to update data consistently: some people want to keep their models the same and watch them change as time goes on, and people like to see consistently live data when doing any kind of daily operational reporting. That's the best part about Tableau; the hardest part is making sure the data is clean, consistent, and doesn't have many errors or issues. Once the reports were built, you can see they really do tell a story about how dengue fever is occurring in the Philippines. Here we're simply looking at the overall results for dengue fever in the Philippines for the file provided by Kaggle.

The first chart, on the top left, is our regression model. Looking at it, this really isn't that great of a fit: visually, the data pretty much cuts the regression line in half. Looking at the forecasting, though, we do see that it follows the same trend as the overall results; cases spiked in what looks to be 2013 and then slowly start to drop, and the forecast predicts them continuing along that path. The bottom-left chart gives an overall view of the different regions using a pie chart, the middle shows the actual cases per year for each region, and the details table just shows the numerical results by year. To the right we see our p-value and our R-squared value; looking at them, we don't really have that good of a fit, so we can say the model is not consistent. To correct this we could do better fitting and more data manipulation, which wasn't completely done for this project apart from filtering out unnecessary regions: some of the regions had duplicates across different years, so I filtered those out in the SQL query by picking only specific region types (Region 1, Region 2, Region 3, and so on) to try to create consistency.
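The exact query depends on how the Kaggle file names its columns, but the filter was along these lines (the table and column names here are illustrative, not the real schema):

    import psycopg2

    # Placeholder connection and schema details; the real table comes from the
    # uploaded Kaggle file, so the column names may differ.
    conn = psycopg2.connect(dbname="hinfo", user="student", host="localhost")
    cur = conn.cursor()

    cur.execute("""
        SELECT region, year, cases
        FROM dengue_cases
        WHERE region IN ('Region 1', 'Region 2', 'Region 3')  -- keep only the canonical region labels
        ORDER BY region, year;
    """)
    rows = cur.fetchall()

    cur.close()
    conn.close()

The same kind of SELECT is what gets handed to Tableau as the query behind the live connection described earlier.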

When looking at just a single region, though, we do see that the model fits a little better within the confidence bands: on the top left, the regression line cuts through the data a little better, and the forecasting matches a little more closely. The region was chosen by using the tooltip on the Tableau report to find which region had the highest counts and take a deeper dive into it. We also see that the forecasting is better, but we still don't have that great of a fit; even though the values are better, it's still not the best overall. Using this proof of concept, we can see that we met our goals: we created a way to handle multiple file types for the previous codes, the database was created on the server, data governance was added to keep the database from getting too dirty, and everything was automated using Python and cron.

Automation allows the database to stand on its own and run as needed, requiring little maintenance unless there's an error. Tableau was used to do the analysis and provide a proof of concept of how this database and these tools can be used together. Overall, I think this gives students a way to store data without having to go out of their way to something like Amazon's or Oracle's cloud storage, so it helps everyone have a place to store data and use it with different analytical tools and processes. Here are my references. Thank you for listening to this presentation, and have a good day.