¶ Intro / Opening
If you use Excel for something, there is a lot of stereotyping that you are not so good, that you are not the typical data science profile that people expect you to be.
Getting started with Python itself is a big challenge. I mean, I'm a Python programmer, and I don't want people to get angry at me, but I think the initial friction that you have with Python is really, really high.
But from the beginner to the intermediate stage, you see very little content on the internet. No boot camp trains them, no course teaches them. There is very little content in the middle space, and I think because of that gap, a lot of people actually get stuck in the beginner space.
Hello, and welcome to Data Chatter, the podcast on all things data. This podcast is a series of conversations with experts and industry leaders in data. Each week, we aim to unpack a different compartment of the data universe. I am your host, Karthik Shashidhar. I'm a blogger, newspaper columnist, book author, and a former data and strategy consultant, and I currently head analytics and business intelligence for Delhivery, one of India's largest logistics companies. You can follow me on Twitter at @karthiks, that is K-A-R-T-H-I-K-S, and read my blog at noenthuda.com, that is N-O-E-N-T-H-U-D-A dot com.
All opinions expressed in this podcast belong to me and my podcast guests, and do not reflect the views of any organizations we might be associated with. Nothing discussed in this podcast should be taken as investment advice.
My initial plan was to record a podcast on R versus Python, and I imagined a little fist fight in which a guest and I would battle about whether R or Python is more suited for data science. Over time, better sense prevailed, and I decided to be more constructive. So what we have now is a rather healthy conversation about the merits and demerits of the two big languages used for doing data science. When do you use R, and when do you use Python? What are the advantages of being proficient in both languages? How do you integrate both into your workflow? Our guest is Abdul Majed Raja, a data scientist at Atlassian. He's a heavy user of both R and Python, and a founder of the Bangalore R users group. He also offers tutorials in programming in both languages.
So today we will be talking about R and Python, probably the two most popularly used programming languages, statistical packages, whatever you call them, for analytics.
¶ How Abdul got into analytics
So before we start, can you take us through your journey into analytics, how you got into it, and also how you got introduced to both R and Python?
So I come from a digital analytics background. I was predominantly into digital analytics, and once the data volume started increasing a lot, there was a need for me to use some programming language. I actually learned Python first. After I learned Python, after a point, the packages that I wanted to use for digital analytics were easier in R: for example, the Google Analytics library, making R Markdown reports, making R Shiny applications, doing path analysis with something like a Sankey diagram. So a lot of things became very easy in R, and I started slowly migrating into R. And then it slowly became clear to me that using one language was not enough.
So I started equipping myself with both the languages. Formally, I learned Python first by myself, then I moved into R, and then I went back and forth between both the languages.
And what was your background like before you got into digital analytics, Python and so on, in terms of programming? Were you a programmer already? If so, what were you programming in?
I was a trained graduate in computer science engineering. But, like most computer science engineers, I was not good at programming when I was at university. I struggled because there was Java, and I didn't quite like Java at the time. I think I still hate Java. Probably that is why I ended up in analytics. Otherwise, I would have become a traditional IT programmer. So I didn't code. And after university, I was trained in COBOL. I didn't like COBOL either, so I left my job as soon as I could. This actually frustrated me. That's where digital analytics gave me a good impression. It was a time when analytics was not a big thing; there was no data science at the time. I slowly migrated into the space as the space was growing, and that's how I moved into R and Python.
But initially, if you ask me, I think I ended up in R and Python only because I didn't like any other programming language.
I think my background is sort of similar, right? Again, I did my undergrad in computer science, was asked to do a lot of coding in Java, and so I decided I didn't want to be a programmer. So I went down a completely different path. Later on, when I learned Python, I thought: had they introduced Python when I was studying computer science, maybe I might have stayed a programmer, because as a programming language it is, at least in my opinion, far easier to use than Java, right? So that's sort of my background as well. So did you have any
¶ MS Excel in data science
experience with any other statistical software, or with Excel or something, before you got into analytics?
Okay, Excel. I was quite good at Excel all the time. I still think Excel is quite useful, even if you are an R or Python programmer. But there is a problem that I see in industry. I don't know if I'm jumping ahead.
But the problem that I feel is, if you use Excel for something, there is a lot of stereotyping that you are not so good, that you are not the typical data science profile that people expect you to be. But I think having Excel as part of your data science workflow is very helpful. Maybe you're not going to build a dashboard or something. I use a lot of Google Sheets in my current profile, and it is quite helpful for me to convert something small. Not every time am I going to use ggplot for making charts, so sometimes a sheet gives me the flexibility to quickly turn around a very powerful visualization. So I think Excel is still useful, and I was using Excel at the time as well.
Okay. As we talk about R and Python today, I think we'll also keep Excel in the loop, because I'm also a big fan of Excel. The downside of Excel, in my opinion, is that you can't automate stuff easily. I hear that some of the Microsoft tools are making it easier now, but still, I think you can't automate easily. But in terms of just analyzing data, in terms of just looking at data, right: with either R or Python, you don't necessarily need to look at the data. Excel forces you to look at the data, and sometimes that will give you a lot of insight that R or Python doesn't necessarily provide.
Jeremy Howard, the co-founder of fast.ai, who is very well respected in the deep learning community, actually emphasizes looking at the data a lot. Even when you are doing a deep learning problem, he says you have to actually look at the raw data and at how the neural network progresses. So I think your point is very valid in that case.
Yeah, because one of my first experiences, again, wasn't an analytics job. I was asked to do forecasting, and because the company was using Java for all its coding, they were like, why don't you write your forecast programs in Java? I spent six months and got nowhere. Then suddenly one day I was like, okay, I am going to try Excel. I tried Excel, and within two days it was clear where I was going wrong, and the whole thing was turned upside
¶ When to use R and when to use Python
down. So now, apart from Excel, which I assume you continue to use as well, within R and Python, do you have a preference, a default for what you use? Or are there certain situations or certain kinds of problem statements where you use R, and others where you use Python? Can you take us through what you use in what context?
Yep. So I will just define the workflow that I follow. For data collection from the big data system, I use SQL, because it's Spark SQL. I try to write most of the things there, like creating a temporary table for myself. Then for EDA, I mostly prefer R. And for actual modeling, if I have to do machine learning, I prefer Python. The reason I prefer Python for machine learning is that I think Python's API for machine learning, the scikit-learn API, is quite unmatchable. There is a very clear four-step process: define, fit, predict, and then evaluate. I think tidymodels is coming closer to that in the R universe, but still, scikit-learn gives me the ability that, if I want to change the model, I just have to change the class that I'm defining at the top, and then I have a new model. So there is a very strong preference for using Python, especially scikit-learn, for machine learning.
But if you ask me about EDA, I am one of those people who still struggle with matplotlib. For every simple thing, I have to go online and search. So I strongly prefer using R's tidyverse, which is amazing. I don't want to miss the opportunity of using the tidyverse, writing pipe-based data pipelines to do some analysis or EDA. The other place where I prefer R is if you have to create a web report, a simple report, not a server-side report. I'm not talking about Shiny or Streamlit, but if you want to prepare a client-side report, like a small analysis report, then I think R Markdown is again quite unmatchable. The other option that I have in the Python world is that I write something in a Jupyter notebook, I format everything, I export it as HTML, and then I try to publish it. But I don't find that workflow that interesting or that good. There are a couple of libraries, like fastpages, which again comes from the fast.ai community. There you can write a Jupyter notebook and publish it as a blog post. It's there, but I think the workflow that you have in R is really amazing. So I have a website now called programmingwithr.com. The entire website is actually created with R Markdown.
I create an R Markdown file and push it to GitHub, there are GitHub Actions running, and then it gets published. I think it's quite seamless. And if your organization uses RStudio Connect, R Markdown is like a superpower for sharing internal analysis reports. So that is a place where I would strongly prefer R.
Yeah. So why do you think it's like that? I don't know how much you know of the history of the development of the two languages, or software, if you were to call it that. Why do you think R is optimized more in terms of usability and user-friendliness, while Python, if you think about it, is more developer-friendly in some way, right? Because it's easier to integrate into the overall tech workflow and things like that. So why do you think this difference has come about?
Yeah, I think it's a very valid point that you have brought up. In my mind, R was developed for people who are not technically strong. I could be wrong. But if you look at the people who are using R, people from humanities, people from social science, people who work in bioscience, I don't think they're core programmers in the way programmers tend to be, but they can get their work done using the tool. To give you an example.
I actually teach in a local university in Bangalore, and I've tried Python before. This is for economics students. These students have never coded in their life at all. Even if you tell them to install a piece of software, somebody has to sit next to them and guide them. It's easy for them to use Facebook, but it's very hard for them to use anything technical, because they have this prejudice in mind. So I've seen them learning R much, much faster than Python. I'll give you an example: getting started with Python itself is a big challenge. I mean, I'm a Python programmer, and I don't want people to get angry at me, but I think the initial friction that you have with Python is really, really high. In fact, that is why platforms like Replit, which reduce that initial friction, are being used in a lot of schools internationally. But if you want to set it up on your own machine: if you're on a Mac, do you know which Python to use? Because the Mac comes with one Python by default, and that Python doesn't help you a lot. If you install Anaconda, now you probably have three Pythons on your machine. Do you open an Anaconda prompt and try something, or do you open your existing terminal and do something? Then when you use pip, do you use pip or pip3? So you just run into this range of problems. Then you have your code editor: which code editor do you use? Somebody would be very strong on PyCharm. Somebody would have a strong opinion about Spyder. Somebody would have a strong opinion about Visual Studio. So at the end of this process, somebody who tries to learn, let's say I'm a doctor and I want to learn data science for something that I'm doing, would probably drop off. I don't expect anybody to stay until this point and then still go ahead and learn a language. With R, it's quite simple. Go to R, download R. Go to RStudio, download RStudio.
You don't have ten different versions. Install it and you're up and running, immediately. In the next five minutes, you are up and running.
Yeah, actually, forget doctors and economists. Even with a degree in computer science, I personally found it hard to get started with Python. I think there was a time when I was doing a machine learning workshop in collaboration with someone, and they wanted to do it in Python. So that person actually taught me Python over two or three days, and that's how I learnt Python. Otherwise, the initial learning curve was actually very, very steep for me. And later, I got involved with a few organizations which were completely in the Python world.
So they asked me to use Python there. Again, what I remember is that they had virtual machines which had to be set up, and things like that, which I had no clue how to use. And then there was this AWS space where you had Jupyter, but how you got files in and out of it was not very intuitive, and things like that. So even as somebody with a fairly strong programming, if not software engineering, background, I found it very difficult to get started in Python. So I'm very comforted by the things you were talking about, Anaconda and Spyder and things like that.
The other thing is that within Python, syntax varies significantly, right? There is the NumPy syntax, there is the pandas syntax, and then there is the pure Python syntax. In some places there are small differences, and you can end up writing fairly inefficient code. And even in R, if you think about it, you have the tidyverse way of doing things and also the base R way of doing things, and a lot of people, especially in academia, I think still do base R
and stuff. So what happens when you have these different versions of the language? Does it make it easier and more user-friendly for coders, or is it just that the communication between coders drops, and things like that?
No, I think if you have been a coder for quite some time, these kinds of changes do not trouble you much. But if you're somebody who is starting new, like you said, there's dot notation in Python. There is a function that you use for something, there's a method that you use for something else, there's an object that you define. And even in the R world you have the same problem, right? What is the pipe operator? Some people do not understand why there are two pipe operators, one from the tidyverse universe and one from core R. So I think these are the problems that beginners definitely face. But I strongly believe that once you get started, especially if you're a programmer who learns from the documentation, every time there is a new release you understand what is in the release. And I think the tidyverse team actually does a great job of communicating that, especially if they're deprecating a function. Any primary library in Python, for that matter, also communicates that. The problem arises when most people do not look into it.
Take what most people do when there is a new release. For example, last week, I think, there was a new scikit-learn release, scikit-learn 1.0. I don't know how many programmers, especially those working in the data science community, have had a look at it. I strongly believe data scientists still don't think like software developers a lot.
I don't know which universe they live in, but mostly, if you see, they are always in the race of improving the accuracy of their model, of the latest algorithm. In this process, they leave out the good practices that software engineering can actually teach us. We have started realizing it only now, after maturity has increased in this domain. But still, a lot of people leave that out.
I think that is a problem. Otherwise, developers who release libraries are doing a great job. It's just that we don't notice it.
Actually, just to take a little digression. So you think that
¶ What data scientists can learn from software engineers
there are things that data scientists can learn from the software engineering community, right? What are some of those things?
So there are a lot of things data scientists can learn from software engineering. To start with, one of the problems that I have is that I write very bad code, and what I'm trying to overcome right now is the Jupyter environment. You can look it up: Joel Grus has given a very popular talk about why the Jupyter notebook is not a great tool for writing software code, for development. I completely agree with that. A Jupyter notebook is like a REPL environment. You write something, you iterate on it, you write something and embed a visualization along with it, and then you can publish a report. But a lot of people use the Jupyter notebook as the final tool, right? They write a Jupyter notebook, and when they want Python code, they don't write the Python code from scratch. They just go to the notebook, download the Python script, and then try to start embedding it. I think that's a very bad practice. What I started doing is using Visual Studio Code, VS Code, extensively. VS Code has Jupyter notebooks, and VS Code lets me write Python code as well. The advantage is that I use VS Code's Jupyter notebook as a REPL: whatever small things I want to iterate on, I do there, and then I just copy it and paste it into the Python script. So I have a decent Python script at the end. This way I am not ruining my Python script development experience, and at the same time I'm not losing my ability to iterate on something, like if I want to make a chart and see how it looks. This is one thing that I would strongly suggest. And the second thing that software developers do a lot is going through documentation. This is what I touched upon slightly before. The data science community seems to be so obsessed with the new algorithm, with cutting-edge tools. If you go online, you would see a lot of people asking questions like, how do I become a data scientist?
How do I become a data scientist? Everybody is trying the initial part, becoming a data scientist. But from the beginner to the intermediate stage, you see very little content on the internet. No boot camp trains them, no course teaches them. There is very little content in the middle space, and I think because of that gap, a lot of people actually get stuck in the beginner space. They are only searching for a job, and then they never move ahead. And that again leads to a lot of bad code that we end up writing. Technical debt is a big problem in our world, when we write Jupyter notebooks that nobody can understand. A lot of people still do not lay out their dependencies properly. If you are using Python, for that matter, a lot of people don't have a dependencies file, right? If you have a library dependency, what is the library version? We still do not know. If there is a vulnerability and we are upgrading, is there any tracking? We don't know.
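One way to make the "what is the library version?" question answerable is to record pinned versions for the libraries a project imports. The following is a minimal sketch using only the standard library; the function name `pinned_requirements` and the package names in the example call are illustrative assumptions, not something mentioned in the conversation.

```python
from importlib import metadata


def pinned_requirements(packages):
    """Return one 'name==version' line per package installed in this environment.

    Packages that are not installed are reported as a comment line instead,
    so a missing dependency is visible rather than silently skipped.
    """
    lines = []
    for name in packages:
        try:
            lines.append(f"{name}=={metadata.version(name)}")
        except metadata.PackageNotFoundError:
            lines.append(f"# {name} is not installed in this environment")
    return lines


if __name__ == "__main__":
    # Hypothetical project dependencies; replace with the libraries you import.
    for line in pinned_requirements(["pandas", "scikit-learn", "plotly"]):
        print(line)
```

Saving this output as a requirements file and committing it next to the notebook gives a record of which versions the analysis actually ran against.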
A lot of these problems come in because we just jump directly into a Jupyter notebook, and we don't do anything before that. These are a couple of things, but there are a lot more. Joel Grus's talk about why the Jupyter notebook is bad for software development will actually tell you a lot more.
Okay, I think we'll link to that talk in the show notes, so that listeners can go and check it out. And you mentioned about online
support and things like that. In my two years with Python, what I found is that, especially in terms of online help, Stack Overflow and so on, the support for Python seemed much harder to use than the support for R. For example, one day, coming from the R world, I was like, how do I melt a data frame with Python? I asked some of the people I was working with, and they didn't understand what melting a data frame means. And then Stack Overflow, again, was very difficult. It gave me either some R links or something, but nothing related to pandas. So is my perception of the online support for the two correct? And if so, why do you think that is?
Regarding online support, I strongly believe Python has better support, because of the nature of the language and its data science community. The support for Python is really good, I would say. But the problem arises when some R developer comes to the Python world and tries to use the same keywords that they used to use in R. So, for example, melt and reshape: it is not melt in Python. It's different, like pivot and pivoting. I think that is the first gap. And the second thing is, Python is not one entity, right?
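To make the vocabulary gap concrete, here is a small sketch of the wide-to-long and long-to-wide reshaping that an R user might search for as "melt" or "reshape". It assumes the R user ends up in pandas, where the relevant calls are `DataFrame.melt` and `DataFrame.pivot`; the data frame and column names are made up for illustration.

```python
import pandas as pd

# Made-up wide table: one row per city, one column per year.
wide = pd.DataFrame(
    {
        "city": ["Bangalore", "Chennai"],
        "2020": [10, 20],
        "2021": [15, 25],
    }
)

# Wide to long: one row per (city, year) pair.
long = wide.melt(id_vars="city", var_name="year", value_name="sales")

# Long back to wide: "pivot" is the pandas term for the reverse operation.
back = long.pivot(index="city", columns="year", values="sales").reset_index()
back.columns.name = None  # drop the leftover "year" label on the column index

print(long)
print(back)
```

Searching with pandas terms such as "pivot" or "wide to long" tends to surface these functions more reliably than the R package names do.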
For example, take the R tidyverse. One of the complaints that people make is that RStudio is tightly controlling the tidyverse. But because they are tightly controlling the tidyverse, everything in the tidyverse has some unified theme. That is not how it is in the Python universe, right? pandas is from a different group, NumPy is from a different group, scikit-learn is from a different group, even though all of these are part of the scientific stack. It is all from different entities or different developers. So that makes it slightly different from what they have been using, and that is why you don't see that uniformity there.
So what do you do? As an R user, let's say, how do I seek help in Python, assuming I use the right keywords, the keywords that are more relevant to Python? Is it Googling till you get the answer, with Google taking you to Stack Overflow links, like in the R world? Or is it that Python has a different kind of support system?
Yeah. So mostly these days I use DuckDuckGo, only for one reason: I think Google's answers are mostly cluttered with a lot of strongly SEO-optimized blog posts. So I use DuckDuckGo for that purpose.
A lot of blogs are not optimized for DuckDuckGo, at least I think so. So I use DuckDuckGo, and mostly I go to the Stack Overflow answer, and I would go through at least a couple of answers. One of the problems that you would probably see in Python is that most of the time, if you go to the top Stack Overflow answer, you would see an answer that is not very recent. That is probably because the feature was not available a few years back, when somebody asked the question, somebody gave an answer, and somebody marked that answer as right. So I would strongly encourage anybody to go through a couple of answers below that also, to understand if a new feature has been added. To give you an example, take one of my Stack Overflow answers in the R community, on how to fill missing values. fill_na is a function that was added in the tidyverse lately, I think within the last two years. If you go to that Stack Overflow question, you would see a lot of answers that have nothing to do with this, but this is the easiest way.
And once I added this answer, I think it probably became my most upvoted answer or something, because a lot of people then started seeing it as one of the best answers. But still, it would not be marked as the accepted answer. So anybody who's looking for an answer on Stack Overflow should look at not just the marked answer. They should scroll down, because all these libraries keep on enhancing things. And when this gets added as a new answer, a lot of beginners, or even practitioners, might miss out on the most efficient way of doing it, just by looking at some four- or five-year-old answer.
Super. And one other thing, I think you touched upon this some time back. As you and everybody else know, I come from the R world, and have experimented with Python and sort of gone back.
So I'm very visual in the way I think. If somebody sends me a large Excel sheet, unless it's properly formatted with conditional formatting and things, it's very difficult for me to really make sense of it. So the way I make sense of data is to plot it graphically. And I found the plotting support in Python to be fairly weak in some sense. So how do you navigate this sort of difference in
¶ Graphics and visualisations in R and Python
visualizations between the two platforms? Or is it that, if you are heavy on analysis and visualization, you should just use R?
Yeah, I feel it's the thing that you mentioned last, and I do that. Every platform has advantages and disadvantages in that way. I strongly believe that if you want to do very extensive visualization, at least I am very comfortable doing that in R. I'll give an example. I tried plotly when just plotly was there, a couple of years back. You had to build a go object; the code itself looked like a JSON script or something. So I completely disregarded plotly. But now plotly has come up with Plotly Express, which they say was inspired by ggplot. Now, for every single plot, you have the same syntax: you have the data frame, you have the x-axis, you have the y-axis, and a couple of other attributes.
You don't touch any of the other arguments, and with just three things, the data frame, x and y, you can make a bar chart. You can convert that bar chart to a scatter plot. You can do all of these things, and that is very much how ggplot looks, right, at least until you get to the geom layer. So I think libraries have now also started realizing that the R world is the best. And I don't think anybody would disagree with the statement that ggplot and the tidyverse universe are the best for data analysis. I've even seen a lot of Python data science influencers accepting this fact. So if you ask me, if I want to make an analysis report, make something shareable that I can share across the internal organization, something interactive, I would still stick to R for that, rather than going to Python and trying my luck. That is a place where I would strongly use R.
Got it, got it. Okay. So that said, apart from machine learning models, I mean. For example, tidymodels has, for me, been anything but tidy. It just seems a very, very verbose way of defining a model. Having used scikit-learn's three lines of code, it's very difficult to come to terms with tidymodels. One more thing: recently, for some work, I had to write a random forest, and I found that the default random forest package in R has been adapted from Fortran.
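As a rough illustration of the scikit-learn pattern the speakers keep returning to (define, fit, predict, evaluate, with the model swapped by changing one class), here is a minimal sketch. The iris dataset, the two model classes and the helper name `fit_and_score` are illustrative assumptions, not something taken from the conversation.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Small built-in dataset, split into training and test parts.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)


def fit_and_score(model):
    """Run the fit, predict and evaluate steps for any scikit-learn classifier."""
    model.fit(X_train, y_train)                  # fit
    predictions = model.predict(X_test)          # predict
    return accuracy_score(y_test, predictions)   # evaluate


# Define: changing the model means changing only the class used here.
rf_accuracy = fit_and_score(RandomForestClassifier(n_estimators=100, random_state=0))
lr_accuracy = fit_and_score(LogisticRegression(max_iter=1000))

print(f"random forest accuracy: {rf_accuracy:.3f}")
print(f"logistic regression accuracy: {lr_accuracy:.3f}")
```

Because every estimator exposes the same `fit` and `predict` methods, the helper does not need to know which model it was given.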
¶ Machine learning in R and Python
That R implementation is, I think, 32-bit code or something, so if you have a large dataset, it just hangs, right? So basically, I guess machine learning is one thing where, with Python, I don't think there's any comparison. Again, as a fan of R, do you see R catching up in this? Or do you think it's best for people to be bilingual, to straddle these two worlds?
Now, if you're bilingual, you should definitely stick to scikit-learn. There are no two ways about it. But I see RStudio doing a lot of work in terms of tidymodels. To give you an example, Julia Silge, one of the creators of tidytext for text mining, creates a video every week, and I am seeing a pattern: every week she uses tidymodels.
So this is a pattern, and I see RStudio doing a lot of webinars related to tidymodels. Initially, only caret was there; Max Kuhn had developed it. But tidymodels is something that they have now pushed. Like the tidyverse, they've created a universe around tidymodels.
So if you are in the R universe, I think there is strong value in getting into tidymodels rather than, like you said, right, having three different ways to do a random forest in R. Which one do you pick? I think that is a problem we didn't have initially, but we have now. At least practitioners should solve it by just getting into one universe. That would help them do everything in a standardized manner. But if you are bilingual, there is no question of leaving scikit-learn, especially given deep learning libraries, right? There are a couple of deep learning bindings in R. For example, you have the Keras binding in R, you have the TensorFlow binding in R, but it's still a binding. You still have to have Python installed on your machine, put reticulate on top of it, and then do it. I'd rather do it in the same language, Python itself, right? So why use bindings?
So again, as you know, we have been recruiting fairly heavily over the last one year, and since I personally use R, I have been specifying R for all the people that I recruit. But what I find is that in India, among all the CVs that I get, a very, very large majority of the data scientists
Seem to only know, python even for Eda, they use Python and in my opinion, produce a lot of ugly crafts. And, and so like, I mean, why why? Why is it that like sister of? I mean, it's, it's only been now Lego the last one or two months that I found at least some sort
¶ Why the Indian market in Data Science leans towards Python
of a critical mass of people who are actually using a fair amount of R in their regular work. So why do you think the data science community in India especially has leaned so heavily towards Python? I mean, that is just my empirical experience. And is there a difference in the background of people who, in your opinion, seem to prefer Python for the work they do?
To your first question: I think there was a myth, and that myth has become like a self-fulfilling prophecy. The myth was that Python is a better language for data science, which I strongly disagree with even today. But this myth was told again and again, and everybody who was getting started as a data scientist believed this myth and learned only Python.
That was what they bought into, this myth, and now it has become a self-fulfilling prophecy, because everybody is now a Python developer. And now everybody will say, "See, everybody is a Python developer; that's why Python is best." It has become like that. I think that myth has been told multiple times in the Indian R and data science community.
For the record, I run the Bangalore R user group. We have partnered with PyData Bangalore before and run events. We have people coming; I have spoken to a lot of people who use R in their daily workflow. But still, people believe what they want to believe. So this myth has become like a fact, and that is one of the primary reasons why you would see a lot of Python-based resumes.
The second thing is, I don't know, somehow I feel that people who are technically good are always bashing R, for whatever reason. I am a computer science engineer. I use Python, but I still love R, and if you give me a choice, for a lot of things I would still prefer R. But I still see this. This is a big problem, and it is one of the reasons why a lot of people are not learning R at all.
I mean, you don't lose anything by learning a new language, honestly speaking. It's a new tool in your arsenal, so you can do a lot of things. But yeah, that is why a lot of people have not learned R. If you look at a lot of veterans, though, you would see them liking the other tool as well, appreciating what the tool is offering, whether they use the tool or not.
At least you should appreciate what the tool is offering in the world that we live in. So these are the two reasons why I strongly think you would see a lot of Python-based resumes, at least because of that myth that people started talking about.
Does it also have to do with background? For example, before I got into R or Python, a very long time back, I was mostly an Excel guy, and I used to write VBA code to do some interesting things on top of Excel. I even briefly used SAS when one of my employers had a license for it. On the other hand, do you think the Python guys are mostly the sort of people who come from more of a software engineering background rather than an analytics background?
Yeah, you're right. Actually, you're not just speculating.
If you see students coming from engineering colleges (I wouldn't necessarily say only an IT background), they do well with Python, again because their peers are using Python, so they can talk to somebody who uses Python and get the answer. But for people from a non-engineering background, I strongly encourage them to use R. Like I told you, everywhere I teach, I try to use R, because for people who are not from an engineering background, I think it's very easy to get started with R; the initial friction is less.
And again, people like that applying for a typical data science role is not very common in India. That is another reason why you wouldn't see it. For example, if you look outside India, somebody who has done a bioscience course might be looking into data science, maybe in pharmaceuticals and so on. But in India you mostly see engineers. We are all engineers, and that is how we have been. That is why you see this strong Python presence here.
Now, I mean, let's talk about that
database access. This is something I have literally only discovered here, because in my nine years of consulting before I joined Delhivery, people kept asking me if I knew how to use big data, and I would say, I know the principles of big data, but I don't think your org really needs big data. And then I joined Delhivery, and here, obviously, our data size is absolutely humongous.
So for the first time I've had to continuously have database access and so on. The way I have solved that problem for the last 10 months of being here is with an R package called dbplyr, where I just write code in R. It's okay if the code it generates is maybe slightly inefficient compared to writing it natively in SQL, but I have personally become a very big fan of it, because I find it far easier to debug compared to SQL.
With SQL, because of its, I don't know if we should call it an infix format or something, you have to keep looking up and down if you have to debug an SQL statement. And if it is a 100-line statement, which happens very easily with the kind of data that we have in my company, I find it mentally taxing. But for the benefit of the Python guys in my team, I have been trying to figure out a dbplyr equivalent for Python, and have so far not been able to find one. So how easy is database access in Python?
¶ Working with databases
I mean, I assume you have done a fair bit of this.
Yeah. So dbplyr is amazing for a lot of reasons, like you mentioned, and SQL debugging is also very difficult, like you said. What is currently working out for me, at least in my current organization, the greatest thing that I have found, is Databricks. We have one big data system that we have connected; it's like one cluster, and within that one cluster I can write Spark SQL, I can write R, I can write Python. So I have the ability to write all three languages in the same notebook environment, and I can switch between languages much more easily. I don't have to make new connections. I think that is a really good option.
But if you do not have this option, your only other solution is, let's say you have Spark, to use PySpark, make a separate connection, and then do things separately. PySpark is closer to what you would see with pandas. Koalas, I think it's called, is something people are using; Dask is something people are using. All of these have a very similar, pandas-like syntax. But in my job, I don't have to do all those things because I have this flexibility.
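The idea the host describes, writing a pipeline of small steps and letting a tool generate the SQL (as dbplyr does in R), can be sketched in plain Python. This is only an illustration under stated assumptions: `LazyQuery` is a hypothetical helper written for this example, not an existing library, the `shipments` table and its columns are made up, and the standard-library `sqlite3` module stands in for a real warehouse connection.

```python
import sqlite3


class LazyQuery:
    """Hypothetical helper: records pipeline steps and builds SQL on demand."""

    def __init__(self, table):
        self.table = table
        self.conditions = []
        self.group_cols = []
        self.aggregates = []

    def filter(self, condition):
        # Each condition is kept as its own step, so it can be checked in isolation.
        self.conditions.append(condition)
        return self

    def group_by(self, *columns):
        self.group_cols.extend(columns)
        return self

    def summarise(self, **named_exprs):
        # total_weight="SUM(weight)" becomes "SUM(weight) AS total_weight".
        for name, expr in named_exprs.items():
            self.aggregates.append(f"{expr} AS {name}")
        return self

    def to_sql(self):
        # Nothing touches the database until the SQL string is requested.
        # Plain string building: fine for a sketch, unsafe for untrusted input.
        select = ", ".join(self.group_cols + self.aggregates) or "*"
        sql = f"SELECT {select} FROM {self.table}"
        if self.conditions:
            sql += " WHERE " + " AND ".join(f"({c})" for c in self.conditions)
        if self.group_cols:
            sql += " GROUP BY " + ", ".join(self.group_cols)
        return sql

    def collect(self, conn):
        return conn.execute(self.to_sql()).fetchall()


# Made-up sample data in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE shipments (city TEXT, status TEXT, weight REAL)")
conn.executemany(
    "INSERT INTO shipments VALUES (?, ?, ?)",
    [
        ("Bangalore", "delivered", 2.0),
        ("Bangalore", "delivered", 3.5),
        ("Chennai", "pending", 1.0),
        ("Chennai", "delivered", 4.0),
    ],
)

query = (
    LazyQuery("shipments")
    .filter("status = 'delivered'")
    .group_by("city")
    .summarise(total_weight="SUM(weight)")
)
generated_sql = query.to_sql()
rows = sorted(query.collect(conn))
print(generated_sql)
print(rows)
```

Because each step is a separate method call, a long query can be debugged by dropping or printing one step at a time, which is the readability benefit described in the conversation.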
But other than that, I don't think you have a lot of options to write native Python, at least from what I know. And the other thing, like you spoke about: making the database connection itself using RStudio is quite straightforward, very easy. You see the connection, and it's like how you would see it if you were using a SQL client. It has a very similar nature if you use RStudio for making database connections, and you can have multiple connections; that is again very easy to do. In Python, it may not be as easy.
But the advantage in Python, if you ask me, is that with most of these database solution companies, you will easily get starter code in Python, but you may not easily get starter code in R. So that is another place where I actually use Python a lot. Any time I see an API, I go to the documentation, and you would get starter code in Python fairly simply; you can easily get started with Python. It may not be as easy in R.
One other thing, and this is again something I've been doing a lot of: as part of my role as head of analytics and BI, I end up having to build a lot of dashboards and reports. I have, in my life, never figured out these BI packages like Tableau or Qlik or Power BI. I am probably too much of a control freak for that. So I do things using Shiny, or I generate reports using R Markdown and things like that. I have come across a few blog posts and videos that you have made on Streamlit, which I think is a Python-based dashboard creator. So can you talk a
¶ Building dashboards in R and Python
little bit about creating dashboards, and how that world works in both R and Python? Or rather, to put it this way: dashboards outside, broadly speaking, the Tableau universe. How do you program a dashboard? How easy is it in both R and Python, and where would you pick something like Streamlit, and where would you pick something like Shiny?
Yeah, I have a strong opinion on this as well. Maybe I'll start with why you should use a programming language for a dashboard over Tableau or Power BI or any of these proprietary tools. Forget for a moment that this is open source. Even if it were not, why do I think you should use it? The first and most important reason, again, is technical debt. In Tableau, you can have a table and make some changes in the table. Second, you can add another column or make some changes in Tableau when you import the data. Third, you can have calculated fields, and fourth, you can have aliases. And even after all this, in the chart itself, you could have changed the label. Now, if somebody comes to debug, they will first see the sheet, and then they have to go back through all these places to understand what happened. If I have to change a field name, this is a very difficult task.
In terms of technical debt, it just keeps accumulating, and that is a very bad thing for an organization. What R and Python offer is simple. If my colleague has built a dashboard, all I have to do is open the code in RStudio or in Visual Studio Code. I can see where all these variables are referenced, and then I can start using it.
So it is very straightforward in terms of technical debt and in terms of reusability. I have built something; you take it, keep it for yourself, change the data source, change the column names, and you have a dashboard up and running. It might look easier in the Tableau universe, but because of all the complications that I mentioned, it is not very straightforward either.
I prefer Tableau in only one place, which is a lot of CXO dashboards. People at the higher level like things visually appealing. I still strongly believe a lot of Shiny developers and a lot of Streamlit developers do not have enough tools to make their dashboards look as good as Tableau dashboards. I should acknowledge that R Shiny recently, at least a year back, tried to give you a UI where you can select the theme of the elements: buttons should be like this, this should be like that, like a dynamic theme creator. That is there in Shiny, but still, I don't see a lot of Shiny developers using it. So most of the time, Shiny applications or Shiny dashboards wouldn't look as great as Tableau. So only for visual aesthetics, Tableau is fine, but for everything else, I think people should use either Shiny or Streamlit.
Okay, now coming back to Shiny, let's talk about the R universe. I think the R universe has one thing that is amazing, which is R Markdown. Especially if you want to build only a client-side dashboard, you can use flexdashboard. You don't need a server to host it; it is simply an HTML file that looks like a dashboard. To demonstrate this, I actually built a Kaggle notebook. A Kaggle notebook is only a client-side notebook, but ultimately it renders a dashboard at the front end.
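To make the "client-side dashboard is just an HTML file" idea concrete, here is a minimal sketch in Python using only the standard library. It is not how flexdashboard works internally; it only illustrates the general pattern of rendering data into one self-contained page that needs no server. The function name `render_dashboard`, the sample metrics, and the output file name are all assumptions made for this example.

```python
import tempfile
from html import escape
from pathlib import Path

# Inline CSS, so the page has no external dependencies.
STYLE = """
body { font-family: sans-serif; margin: 2rem; }
.cards { display: flex; gap: 1rem; }
.card { border: 1px solid #ccc; border-radius: 6px; padding: 1rem; }
table { border-collapse: collapse; margin-top: 1.5rem; }
td, th { border: 1px solid #ccc; padding: 0.4rem 0.8rem; }
"""


def render_dashboard(title, metrics, rows):
    """Return one self-contained HTML page: metric cards plus a data table."""
    cards = "".join(
        f'<div class="card"><h2>{escape(str(value))}</h2><p>{escape(label)}</p></div>'
        for label, value in metrics.items()
    )
    columns = list(rows[0])
    header = "".join(f"<th>{escape(col)}</th>" for col in columns)
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(str(row[col]))}</td>" for col in columns) + "</tr>"
        for row in rows
    )
    return (
        "<!DOCTYPE html>\n"
        f"<html><head><meta charset='utf-8'><title>{escape(title)}</title>"
        f"<style>{STYLE}</style></head>\n"
        f"<body><h1>{escape(title)}</h1>\n"
        f'<div class="cards">{cards}</div>\n'
        f"<table><tr>{header}</tr>{body}</table>\n"
        "</body></html>\n"
    )


# Made-up sample data.
rows = [
    {"city": "Bangalore", "orders": 120},
    {"city": "Chennai", "orders": 85},
]
metrics = {"Total orders": 205, "Cities": 2}

page = render_dashboard("Orders <weekly>", metrics, rows)
out_path = Path(tempfile.gettempdir()) / "dashboard.html"
out_path.write_text(page, encoding="utf-8")
print(f"Wrote {out_path}")
```

The resulting file can be emailed or dropped on a shared drive and opened in any browser. The trade-off is the one raised in the conversation: nothing on the page can call back to a server for fresh data.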
So flexdashboard is really amazing. If you do not want server-side interaction, you should always prefer flexdashboard, where you can just share the HTML file. Anybody can open it; it opens in a browser. And if your organization can afford RStudio Connect: I've spoken to a couple of people who use RStudio Connect, and they have said it's magical. With one click, you can publish your report to RStudio Connect, and then somebody can make use of it. Flexdashboard, though, is a free solution.
Now, coming to R Shiny: this was the sole king for quite a long time before Streamlit came in. I have made multiple presentations on why R Shiny is a very important, unmatchable tool. I think this was the time when, slowly, people in the Python universe started realizing that they needed some alternative. So two powerful tools came, I would say. One is Plotly Dash.
Plotly is a visualization company; they also pushed Dash, and now they're pushing a lot of enterprise stuff, especially around Dash. And then we have Streamlit. I still prefer Streamlit to Plotly Dash. The reason is that I find Streamlit quite similar to how I used R Shiny. Not in the way you write the code, but in how simple it is to spin something up.
See, the thing is, if you are a data scientist, your focus is not to develop a full-stack application. That is something a lot of people forget when they develop these kinds of applications. A lot of companies have a software developer or a data engineering team tied up with the data science team. So as a typical data scientist, you are not necessarily going to develop a full-stack application every time.
But if there is a need for you to develop such an application, do you have tools with very little friction to help you make something like an MVP? I would see this more like an MVP. That's where Streamlit and R Shiny do a great job. But that doesn't mean you cannot develop a full-stack, production-grade application. Now, the question is, what is production grade? Do I want to build the next Facebook using Shiny or Streamlit? No, of course not. I would probably recruit somebody who has a React background, a modern stack, and let them develop it.
But if I want to pitch something in my organization, let's say I want to give a tool to some internal team as part of an initiative, because data science has figured out something, I don't have to find a resource to do this pilot. I can develop this tool myself and give it to them. So it's not that you cannot develop production grade; what production grade means is the first question.
The second point is, you can still make these tools work much better than how most people are developing with them. With Shiny, two years back, the RStudio conference focused a lot on scalable applications. There is a library by Colin Fay called golem, G-O-L-E-M. golem is an amazing library that tries to bring some standardization. Recently I came across another library, brochure; I think that is also by Colin Fay, and it is again trying to do more optimization in this area. So, to quickly sum up, R Shiny and Streamlit are quite equivalent from a developer perspective.
If you ask me, they are very similar; it's just that they are available in different universes. But I don't think there is anything like R Markdown in the Python universe, and R Markdown is really amazing, even if you are a software engineer.
I see. Why don't you talk a little bit more about R Markdown? I think there might be a lot of our listeners who may not be very familiar with the tool. So can you talk about how you use R Markdown as sort of a client-side dashboarding tool?
Yeah. So there is a very nice blog post written by minimaxir; we can link that in the show notes as well. It compares R Notebooks and Jupyter Notebooks and does a very good comparison. Now, an R Notebook is nothing but a slight extension of R Markdown, so we'll just talk about R Markdown alone. R Markdown is a script type available in R that lets you write R code and also lets you write Markdown. So R plus Markdown is what R Markdown is, but that is an oversimplification.
What R Markdown primarily lets you do is create a lot of different formats, and it's not just the typical standard report that you can do in your company. For example, let's say you have a doc; you want to build a simple analytics report, and you want that report to be customized. For example, it could be a stock market report, and based on the stock market value, you want to make some natural-language changes within the document, and you want it sent out to a bunch of people every day. This is quite easy to implement in the R universe. You just need an R Markdown file and probably a package like sendmailR, and that R Markdown file has an R script that gets customized.
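The pattern behind that daily report, a template whose wording changes with the data and which is then wrapped in an email, can be sketched in Python with the standard library alone. This is not R Markdown and not the guest's actual setup; the template text, the 0.5% "quiet session" threshold, the addresses, and the function names are all assumptions for illustration. The message is only constructed here, not sent.

```python
from email.message import EmailMessage
from string import Template

# Report skeleton; the $-placeholders are filled in per run.
REPORT = Template(
    "Market note for $date\n\n"
    "The index closed at $close, $direction $change_pct% from the previous close.\n"
    "$comment\n"
)


def build_report(date, prev_close, close):
    """Fill the template, choosing the wording from the size and sign of the move."""
    change_pct = (close - prev_close) / prev_close * 100
    direction = "up" if change_pct >= 0 else "down"
    if abs(change_pct) < 0.5:  # illustrative threshold, not a market convention
        comment = "It was a quiet session."
    elif change_pct > 0:
        comment = "Buyers were clearly in control."
    else:
        comment = "Sellers were clearly in control."
    return REPORT.substitute(
        date=date,
        close=f"{close:,.2f}",
        direction=direction,
        change_pct=f"{abs(change_pct):.2f}",
        comment=comment,
    )


def build_email(body, sender, recipients, subject):
    """Wrap the report text in an email message (constructed, not sent)."""
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    msg.set_content(body)
    return msg


# Made-up closing values and placeholder addresses.
report_up = build_report("2021-03-01", 100.0, 102.0)
report_down = build_report("2021-03-02", 100.0, 97.0)
message = build_email(
    report_up,
    sender="reports@example.com",
    recipients=["a@example.com", "b@example.com"],
    subject="Daily market note",
)
print(report_up)
print(report_down)
```

Actually delivering the message would be a separate step, for example with `smtplib.SMTP(...).send_message(message)` against a mail server you control, scheduled once a day.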
Let's say you're building a linear regression model, or some model that does forecasting, and all the results are there, visualized very nicely, and you send that as a PDF by email to the customer. Very simple. This is one of the use cases of R Markdown. The organizational part is that you build a client-side dashboard, you build an analytics report, you build HTML pages, you build PDFs. But leaving out this part, R Markdown has still grown a lot. You can publish blog posts using R Markdown. You can write books using R Markdown. A lot of people have started writing books using R Markdown with a library called bookdown; bookdown is an extension of R Markdown. The advantage is mini books: I see a couple of people I respect in the industry, who are not very famous, actually writing a mini book. That mini book is a reference for me as well as for the wider community.
The greatest thing that I like about R Markdown is this: it is a classic example of how, once you give a tool to somebody, they leverage it in ways that you cannot even imagine. I wouldn't have imagined publishing a blog post using R Markdown, but I'm doing it. So blogdown is there, bookdown is there; there are a lot of extensions of R Markdown. But basically, R Markdown is one script where you can write Markdown, you can write R, you can write anything, and that gets rendered into some format. How you render it is controlled by different parameters: does it render like a book, like a blog post, or like a dashboard, with subtle changes in it. And that's something that the Python community definitely misses.
Correct, correct. Okay. So when I initially planned this episode, I mean, before I decided to go ahead with you on it,
¶ Working with R *and* Python at the same time
I thought we should have a sort of R versus Python debate on this. But what has happened now is that you have provided a very nice intro into what it is like to use both R and Python, because I see that you use both in your regular work. So can we talk about, I mean, you have spoken a little bit about this, but how do you normally get the best of both worlds? If you know both R and Python, how do you harness that? You were talking about how you do your EDA in R and then the modeling in Python, but my immediate thought was, okay, how do you transfer the data from one to the other? I can't imagine writing a CSV or something as a go-between. So how do you integrate these two into your workflow to become an even better data scientist?
Yeah, I have actually called this a superpower previously, in a couple of my talks. I don't know if you have seen superheroes; they usually have combinations. In the Justice League, they have multiple powers together and they form a league. In the same way, R has some superpower, Python has a superpower, and if you can combine both of these, it's a massive superpower.
And that is what the reticulate library is for. R has a very nice library called reticulate that lets you combine both worlds. Like I said at the start, after reticulate came in, you started seeing a lot of Python libraries coming into the R world just by using bindings. And like we said at the start, if you are bilingual, it is better to use that language itself. So if you don't want to use a binding but want to use the language itself, you can write Python code inside RStudio using reticulate, and you can interchange objects between the two languages.
Let me give you a very simple example. Let's say I want to do NLP, natural language processing, and I strongly think a couple of libraries like spaCy and Hugging Face Transformers are super awesome at natural language processing. Even if I have tidytext, I would probably prefer these libraries for natural language processing. But before I do natural language processing, I want to import data; like you said, I would probably use dbplyr. I want to do some EDA, unigrams, bigrams, whatever, and some kind of cleaning. I would probably use tidytext to do the cleaning, because tidytext has a very English-like syntax when you compare it with NLTK or something.
So, after I do all these things, I would take this data and immediately, in the same code, pass it on to my Python environment. If you use reticulate, everything is in the same code. Within Python, I would probably do POS tagging, NER (named entity recognition), and let's say I'm even going to use scikit-learn and XGBoost to build a text classification model. Now I have the result. I can interpret it. So in one piece of code, I used R and Python and managed to use the best of both worlds. This is possible only because I'm using reticulate. reticulate handles these things: I have a data frame in R; how should it be in the Python world? I don't have to worry about it.
reticulate has a simple map, like a table, that would say, okay, if you have a list in Python, it's a vector in R. They give you the mapping so that you can mentally understand what this object is here and what it is there. But other than that, you as a developer do not have to do anything; you basically just need the installation.
reticulate also lets you call a specific Python version. What does that mean? Take the initial problem that I talked about. Let's say you have a conda environment with a bunch of libraries there, or let's say you have created a virtual environment; I think it is quite common for Python developers to create a virtual environment to keep the libraries within that environment. reticulate lets you specifically call only that environment, use libraries from that environment, and keep your entire session within that environment.
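When a session is pinned to one environment like this (on the R side, reticulate provides functions such as `use_virtualenv()` and `use_condaenv()` for that), it can help to confirm from inside Python which interpreter is actually running. The following is a small sketch using only the standard library; the helper name `describe_environment` is made up for this example, and the conda check relies on the `CONDA_DEFAULT_ENV` variable that `conda activate` sets, so it reports `None` when no conda environment has been activated.

```python
import os
import sys


def describe_environment():
    """Summarise which Python interpreter and environment this session uses."""
    return {
        "executable": sys.executable,
        "version": ".".join(str(part) for part in sys.version_info[:3]),
        "prefix": sys.prefix,
        # In an environment created by venv, sys.prefix differs from sys.base_prefix.
        "in_virtualenv": sys.prefix != sys.base_prefix,
        # Set by `conda activate`; None when no conda environment is active.
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
    }


env_info = describe_environment()
for key, value in env_info.items():
    print(f"{key}: {value}")
```

Running this at the top of a mixed R and Python report is a cheap way to catch the case where the code silently picked up a different Python installation than the one holding your libraries.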
And the greatest thing about reticulate, again, is that it can be combined with R Shiny and R Markdown as well, which means I can finally create a report, or send an email, that uses both R and Python code.
Okay. That's very interesting. I have personally been aware of the existence of reticulate, but I haven't really used it, though over time I am finding cases where I will have to use it, because I run a team that works both in R and in Python. So at some stage we will have to integrate our code and things like that. So as we near the end of the episode, and since we were talking about using R and Python together, one of the other things is, what about other
¶ What about Excel and Julia?
languages? For example, I think right at the beginning of the episode we spoke a little bit about Excel. So can we talk a little bit about how you interface between, let's say, R and Python on one side and Excel on the other, and what you use each one for? And also, I don't know if you have any experience with Julia, which I think is again becoming popular with a lot of data scientists.
Yeah, okay. Starting with Julia: I tried it a little bit. If you ask me personally, I'm more of an applied person; whatever I do is mostly driven by a business question. So I still don't see Julia fitting into my line of work anytime soon. For probably the next two or three years, I think R and Python would keep me completely safe, at least in the job market, maybe even for the next half a decade. So I don't use Julia. And Julia's pitch, if you see it, is to solve the two-language problem: one language for prototyping, one language for production. So I think Julia would be more applicable for people who are facing the two-language problem. Let's say I'm an IoT developer or something, in edge computing. I develop something in Python, and then I have to translate it into that device's code.
Maybe Julia will help me. But again, if you look at the TensorFlow ecosystem or the PyTorch ecosystem, a lot of these are also developing tools to help you stick to the Python ecosystem and then export the model to edge computing or that side of the world. So both are growing a lot. I don't see Julia being a main tool for a typical data scientist like me; maybe it is different for a machine learning engineer.
That is about Julia. Sorry, what was the first one? Yeah, Excel. I think Excel is good, except if you are using dates. You would probably have seen a lot of memes around Excel dates; Excel dates are painful many times. I would again put Excel alongside Tableau. The problem that I face with Excel is reusability. A lot of the things that you do in Excel, if I have to redo them, it's very difficult.
But there are places where I would strongly prefer Excel. The first point is, like you said, if I have to do df.head in pandas, or head on a data frame in R, to see, say, the top ten rows, I think I can make better use of the same time if I do the same thing in Excel. So if I have to look at a part of the data, like you said, conditional formatting: I'm known for doing a lot of conditional formatting and making things colorful, but it actually gives you a lot of nuances that you might otherwise miss when you simply look at the plain screen that you use R or Python on. So there are places where I want to look at data. Simple pivoting is sometimes much easier to do in Excel.
We all have to accept that. If the data is not very big, I can just use a pivot and make a simple chart. Excel is so useful for that. These days I also find making themed charts easy in Excel. Whenever I say Excel, it's either Google Sheets or Excel; I don't differentiate between spreadsheet-based tools. So making charts that fit my organization's scheme or the theme that I want, producing minimalistic charts; maybe it's because I didn't put much effort into creating a theme myself, but I have started finding Excel easier for that. So these are certain places.
But again, I would strongly discourage using Excel in a data science workflow. The reason is that I still see a lot of people using an Excel library in R, or pandas read_excel, to read XLS files. I think we should try to keep things as clean as possible, which means we should deal mostly with CSV. We shouldn't get into the Excel world, because that world is really messy in terms of putting something inside your pipeline if something goes wrong. For example, let's say there is a Google Sheet, and the output of the Google Sheet is what you are using inside your pandas workflow. Now, the problem is that the Google Sheet has different authorization. If somebody leaves the organization, what do you do with the sheet? The more proprietary solutions you bring into your data workflow, the messier it becomes, and the technical debt increases. So keeping it simple is the goal. For an automated workflow, I think we should keep it as simple and as standardized as possible: R, Python, CSV.
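One concrete reason Excel dates are painful in a pipeline is that a date cell is stored as a serial day count, so a raw export can show 44197 where a person expected 1 January 2021. The sketch below converts such serials to ISO dates and writes plain CSV, in the spirit of keeping the automated workflow on simple formats. It assumes Excel's default 1900 date system (workbooks using the 1904 system need a different epoch), and the invoice rows are made up for illustration.

```python
import csv
import io
from datetime import date, timedelta

# In the 1900 date system Excel treats 1900 as a leap year, so for serials
# from 61 (1 March 1900) onward the effective day zero is 30 December 1899.
EXCEL_EPOCH = date(1899, 12, 30)


def excel_serial_to_date(serial):
    """Convert an Excel 1900-system serial day number to a date."""
    if serial < 61:
        raise ValueError("serials below 61 fall in Excel's phantom leap-day range")
    # int() drops any fractional part, which Excel uses for the time of day.
    return EXCEL_EPOCH + timedelta(days=int(serial))


# Made-up rows as they might arrive from a spreadsheet export.
raw_rows = [("INV-001", 44197), ("INV-002", 44256)]

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["invoice_id", "invoice_date"])
for invoice_id, serial in raw_rows:
    writer.writerow([invoice_id, excel_serial_to_date(serial).isoformat()])

csv_text = buffer.getvalue()
print(csv_text)
```

Doing this conversion once, at the boundary where spreadsheet data enters the pipeline, means everything downstream only ever sees unambiguous `YYYY-MM-DD` strings.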
But if you want to use Excel, these are the cases I would use Excel for. The other has been great. Is we re done is give a fascinating conversation adding to end. I mean, is the question that I in most of the most of my conversations with. So let's say if you are, if you are kind of an aspiring data scientist and from the programming perspective lately,
how would you? Or let's say you have your seen some sort of very interesting statistics in yourself or like some sort of interested in looking at numbers and so on and you want to be a data scientist. So how do you approach it from the programming perspective? Like what you learn? What's how do you sequence a tank as you get into the work environment? How do you navigate between the different programming languages that are available?
Yeah, the very first thing that I tell everybody is: don't get into language wars. There are people taking part in language wars, you know, language fanatics. You shouldn't be a language fanatic. You may like both languages, you may like one language, but just don't get into language wars. It doesn't help anybody, and the people who are doing it are already professionals; probably they have already achieved something. So if you're a beginner and you get into language wars, you don't get anything out of it. You're basically losing out on a superpower that is available in some other language. That's the first thing.
The second thing is, there is no point in trying to figure out which is better. Just get started with some language, whether it is R or Python. If you want some clue or guidance, I would say: if you are not from an engineering background, pick R; if you are from an engineering background, pick Python, purely because you will have a community. Learning something with a community is much better than learning something individually.
So pick R or Python: R if you are not from engineering, Python if you are from engineering. And after you pick that, I would strongly suggest first going through the language basics. This is again a mistake that I've seen a lot of data science enthusiasts make: just straight away getting started with machine learning. Don't do that. Just get started with the language basics.
Understand the data types, understand how to write a for loop, how to write if-else, in whatever language you have picked. I think, like you said, these things come in very handy, probably sometime later in your career. You don't want to be, later in your career, sitting there thinking, "How do I write this? How do I write this?" I mean, that is a very basic part of the language, right? You're not just learning the language for data science.
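For a sense of what "language basics" covers here, the following is a small Python sketch touching data types, a for loop, and if-else. The variable names, the sample scores, and the `label_scores` helper are made up for illustration.

```python
# A few of the basic built-in data types.
name = "Asha"              # str
age = 29                   # int
height_m = 1.62            # float
is_student = False         # bool
scores = [72, 45, 88, 60]  # list of ints


def label_scores(values, pass_mark=50):
    """Label each score as 'pass' or 'fail' using a for loop and if-else."""
    labels = []
    for value in values:
        if value >= pass_mark:
            labels.append("pass")
        else:
            labels.append("fail")
    return labels


labels = label_scores(scores)
print(type(name).__name__, type(age).__name__, type(height_m).__name__)
print(labels)
```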
You're actually learning the language. It's not like I'm asking you to do DevOps with Python, but at least you should be able to read Python code and make some sense out of it. That is what it means to learn the basics of a language. Once you learn the basics of the language, get started with basic data manipulation: in R, the basic tidyverse, especially dplyr and tidyr; and in Python, pandas and NumPy.
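Basic data manipulation in pandas and NumPy might look something like the sketch below: adding a derived column, filtering rows, and aggregating by group. The `sales` table and its column names are invented for the example.

```python
import numpy as np
import pandas as pd

# Invented sample data for illustration.
sales = pd.DataFrame(
    {
        "city": ["Bangalore", "Chennai", "Bangalore", "Mumbai"],
        "units": [10, 4, 6, 8],
        "unit_price": [25.0, 30.0, 25.0, 20.0],
    }
)

# Add a derived column with vectorized arithmetic.
sales["revenue"] = sales["units"] * sales["unit_price"]

# Filter rows with a boolean mask.
big_orders = sales[sales["units"] >= 6]

# Aggregate by group.
revenue_by_city = sales.groupby("city")["revenue"].sum()

# Drop down to a NumPy array for numeric work.
mean_revenue = np.mean(sales["revenue"].to_numpy())

print(big_orders)
print(revenue_by_city)
print(mean_revenue)
```

The rough dplyr counterparts of these steps in R would be `mutate`, `filter`, and `group_by` followed by `summarise`.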
And after you do that, I would suggest you immediately get started with data visualization, not machine learning. The reason I'm saying this is from my teaching experience: I've seen that when people get into machine learning, it just goes deeper and deeper. It's like a rabbit hole. You can never get out of it, because now you will try to learn linear regression, logistic regression, then you start going into decision trees. So it's like a big rabbit hole. We, as human beings, need some gratification, and visualizations are a very good way to give you gratification when you are trying to learn.
So get started with a visualization library. Start from a static visualization library, then move into an interactive visualization library. Now switch back to machine learning: start from classical algorithms and go deep. And then learn full-stack development also. I think we are in 2021, and we, as data scientists, should have the ability to build at least an MVP. If you are in R, learn R Markdown and R Shiny. If you are in Python, learn either Dash or Streamlit; I have a slightly stronger preference for Streamlit. I think this would make you really competitive in the market.
And then, let's say you get the job. After you get the job, I think your primary duty is to go through documentation and understand how people have written code. The internet is an amazing place. You have a lot of open source code. I think at this point we should stop watching tutorials. I make tutorials, but I still say that at this point we should stop watching tutorials and start reading somebody else's code. Spend more time on Stack Overflow. That will make you an intermediate or better programmer.
Thank you for listening to Data Chatter. If you like the show, please leave a comment, share, and subscribe to the podcast. You can find this podcast on Apple Podcasts, Spotify, or wherever else you go to get your podcasts. Once again, this is Karthik Shashidhar. Thank you.
