Course Intro

Course Intro

First Week Orientation


Weekly content


Content


Course description


About Lecturer

Changjun LEE

Associate Professor (Head of Culture & Tech)

School of Convergence. SKKU.

As a computational social scientist, I bring a unique interdisciplinary perspective to the fields of economics, innovation studies, and convergence technologies. With a background in natural sciences, including a bachelor’s degree in biology and chemistry, I went on to earn a Ph.D. in technology management, economics, and policy. My research focuses on utilizing computational methods to tackle a wide range of social phenomena, including technology evolution & regional growth, knowledge management, and technology & convergence innovation. I am passionate about using technology and data to drive innovation and solve real-world problems.

  • Research interest: Media & Innovation, Immersive media and users’ behavior, Technology management, Anything can be solved using Data Science

  • Teaching: Culture & Technology | Data Science in CNT | STAT101

  • Things I love

    • Research

    • Chat & Coffee & MBTI

    • Travel

    • Play (Things)


Weekly Design


Research & Statistics

Entering grad school (Doing your master / Ph.D.) is an exciting milestone. It’s a chance to dive deep into your field, contribute to groundbreaking research, and refine your expertise. However, regardless of your discipline—be it psychology, biology, economics, or even education—one skill remains a constant necessity: statistics.


Why is Statistics Important in Grad School?

Graduate school is often synonymous with research. Whether you’re collecting data, interpreting results, or reading existing literature, you’ll need to understand how to analyze and make sense of numbers. That’s where statistics comes in. Statistics provides the tools to turn raw data into meaningful insights, enabling you to make evidence-based conclusions.

1. Informed Decision-Making1
Research is all about making informed decisions. Is the difference between two groups significant? (How likely is it that the results you found are due to chance? 우연히 일어난 일이 아니라는 근거는?) Statistics gives you the methods to answer these questions with confidence.


2. Data Interpretation (논문을 읽으려면 꼭 알아야 하는 통계)
Understanding research papers is a core part of grad school. But to critically evaluate these papers, you need to grasp the statistics used in them. Whether it’s regression analysis, ANOVA, or chi-square tests, you need to know how the data was analyzed to assess whether the conclusions are sound. (읽다보면 패턴이 보일 것)


3. Research Design (끝까지 상상하고 디자인 하라)
When designing your own studies, statistical knowledge helps you choose the right sample size, measurement tools, and analysis methods. It ensures that your research is both scientifically rigorous and ethical, minimizing biases and errors.


The Role of Statistics in Research

Statistics isn’t just about crunching numbers. It plays a pivotal role in every stage of the research process, from conceptualization to publication.

  • Formulating Hypotheses: At the start of your research, you’ll develop hypotheses that are testable using statistical methods. Understanding the statistical underpinnings of these hypotheses helps you design studies that can accurately assess the relationships you’re interested in.

    잘못된 가설: “새로운 약물이 혈압에 영향을 줄 것이다.”

    • 구체성이 부족: “영향을 줄 것이다”라는 표현이 모호해서 혈압을 올릴 것인지, 낮출 것인지, 또는 그 효과가 얼마나 클 것인지 구체적으로 언급하지 않았음.

    • 통계적 검증의 어려움: 명확한 가설이 없으면 어떤 통계 분석 방법을 사용해야 할지도 불분명해짐. 혈압이 증가하거나 감소할 것이라는 구체적인 예측이 없으면 통계적으로 검증하기가 어렵기 때문.

    올바른 가설: “새로운 약물이 고혈압 환자의 혈압을 유의미하게 감소시킬 것이다.”

  • Data Collection: Statistical principles guide the way you collect data, ensuring that your methods minimize bias and maximize reliability. Sampling techniques, for example, are essential in ensuring that your study results can be generalized to the broader population.

  • Data Analysis: Once you’ve collected your data, statistics helps you determine the best way to analyze it. Different research questions require different statistical tests, and knowing which test to use—and how to use it—can make or break your study.

  • Reporting Results: Properly reporting your findings involves more than just stating whether your results were significant. You’ll need to provide effect sizes, confidence intervals, and possibly even power analyses to show that your results are both meaningful and reliable. Statistics helps you communicate your findings clearly and effectively. (논문을 읽을 때와 마찬가지로 쓸 때 깊이가 생기기 시작함. 아는 만큼 보이기 때문)


How to Learn Statistics as a Grad Student

For many, statistics can be intimidating. However, like any skill, it can be learned with dedication and practice. Here are some tips to help you master statistics during your graduate studies.


1. Start with the Basics
If you’re new to statistics, don’t worry. Start with the basics—mean, median, mode, standard deviation—and gradually build your knowledge. Courses, textbooks, and online resources like Coursera can help you get started. (온라인 에듀 플랫폼? 도움이 된다!)


2. Apply What You Learn
Statistics is best learned by doing. Apply what you learn to real datasets. Whether it’s data from your research, a class project, or publicly available datasets, practicing with real data will help solidify your understanding. (응용 없는 배움은? 고급 취미)


3. Use Statistical Software
In grad school, you’ll likely need to use statistical software like R and Python. Learning to use these tools is essential, as they’ll help you manage and analyze large datasets efficiently. Don’t be afraid to dive into tutorials and online forums to help you get comfortable with the software.


4. Seek Help (Create a group study)
Grad school can be overwhelming, and it’s okay to seek help. Attend statistics workshops, study groups, or even work with a tutor if needed. Collaborating with peers and seeking guidance from professors can help clarify complex concepts.


5. Make Statistics a Daily Habit
Just like any other skill, becoming proficient in statistics requires daily practice. Dedicate time each day to work on problems, analyze data, or review concepts. Consistency is key to retaining what you learn and becoming comfortable with statistical methods.


The Importance of Daily Grinding

The most effective way to master statistics—and research methods more broadly—is through consistent, daily effort. By integrating statistics into your daily routine, you’ll gradually build confidence and proficiency. Daily grinding may not be glamorous, but it’s a proven strategy for success in grad school.

Here’s why:

  • Retention: Daily practice helps reinforce what you learn, making it easier to recall concepts when you need them.

  • Problem-Solving: Regular exposure to different statistical problems improves your problem-solving skills and allows you to tackle more complex analyses over time.

  • Confidence: As you practice more, your confidence grows, reducing the anxiety that often accompanies statistical analysis.


Data Science & R

Data Science is a modern academic area helping people to draw useful information and intuition so that making them to make reasonable decisions

Why Data Science is So Important to Learn?

In the age of information, data science has emerged as a pivotal skill set across industries, transforming the way decisions are made and offering insights that were previously inaccessible. Its importance stems from the ability to analyze and derive meaningful insights from data, which can drive strategic business decisions, lead to technological innovations, and solve complex problems. Here’s why learning data science is crucial, even for those not pursuing a career in computer science.

Data-Driven Decision Making

One of the foremost reasons for learning data science is its role in data-driven decision making. With an increasing amount of data generated every day, the ability to sift through, analyze, and interpret this data can be the difference between staying ahead or falling behind in any industry. Data science skills enable individuals and organizations to make informed decisions based on empirical evidence rather than intuition or speculation.

Cross-Disciplinary Application

Data science is not confined to the tech industry. It finds applications in healthcare, finance, retail, education, and more. For instance, in healthcare, data science can predict disease outbreaks, in finance, it can assess risk and in retail, it can help understand consumer behavior. This cross-disciplinary nature of data science means that learning it can open up opportunities in virtually any field, making it a valuable skill set for all majors.

Enhancing Problem-Solving Skills

Learning data science enhances critical thinking and problem-solving skills. It involves identifying patterns, making predictions, and solving complex problems using a data-driven approach. These skills are transferable and valuable in any career path, making data science learning beneficial beyond the technical aspects.

Future Job Market

The demand for data science skills is growing rapidly. According to the U.S. Bureau of Labor Statistics, the job market for data science and analytical roles is projected to grow much faster than the average for all occupations. Learning data science can thus provide a competitive edge in the job market, even for those in non-computer science fields.

Empowering Innovation

Data science is at the heart of innovation today. From developing new products and services to improving existing processes, the insights derived from data science can lead to significant advancements and efficiencies. By understanding data science, individuals in non-computer science fields can contribute to innovation within their industries.

For non-computer science majors: You are the one who has to learn this.

The importance of data science transcends traditional boundaries of computer science and technology. It is a critical skill for the future, enabling individuals to navigate and excel in a data-driven world. For non-computer science majors, learning data science not only enhances employability and problem-solving skills but also opens up new avenues for innovation and impact in their respective fields. In essence, data science is a universal language of the future, and learning it is key to unlocking potential across the spectrum of professional endeavors.


  • Again, Data Never Sleeps

https://www.domo.com/data-never-sleeps


So Data Scientist is..

  • 데이터를 가지고 길을 만드는 사람.

  • Requirements

    • Domain Expertise is getting more important as the world is getting more complicated

    • This course will cover coding part and a little bit of statistical skills

    • Don’t worry! Practice makes perfect.


Why R is the Best Language for Non-Computer Majors to Learn Data Science

In the realm of data science, the choice of programming language is pivotal. For non-computer majors venturing into this field, R stands out as a particularly accessible and powerful tool. Here’s why R is often considered the best language for those new to data science.

1. Designed for Statistics and Data Analysis

R was specifically created for statistical analysis and data visualization. Unlike general-purpose programming languages that can be used for data science, R’s syntax and functions are tailored to statistical analysis, making it more intuitive for analyzing data. For students and professionals without a computer science background, this focus makes R an excellent entry point into data science.

2. Comprehensive Libraries and Packages

R boasts a vast ecosystem of packages designed for data science tasks, including dplyr for data manipulation, ggplot2 for data visualization, and caret for machine learning. These packages simplify complex tasks into more manageable functions, allowing users to perform sophisticated data analysis with relatively simple commands. This extensive library support means that non-computer majors can accomplish more with less coding expertise.

3. Strong Community and Support

The R community is known for its inclusivity and support, especially for beginners. Numerous online forums, such as R-bloggers and Stack Overflow, provide a platform for learners to seek help, exchange ideas, and stay updated on the latest developments in R. This community support is invaluable for non-computer majors who may require guidance as they navigate their data science journey.

4. Free and Open Source

R is a free, open-source software, making it accessible to everyone without the need for expensive licenses or subscriptions. This democratizes the learning process, allowing individuals from diverse backgrounds to explore data science without financial barriers. Moreover, being open-source encourages users to contribute to the development of R, further enriching its capabilities and resources.

5. Versatile Data Visualization Capabilities

Visual data representation is crucial in data science for understanding complex datasets and communicating findings effectively. R’s superior data visualization capabilities, particularly through the ggplot2 package, allow users to create high-quality, publication-ready graphs and plots. This is particularly beneficial for non-computer majors, who might rely more heavily on visual representations to understand and present data insights.

6. Industry Adoption and Academic Support

R is widely adopted in both academia and industry for research and data analysis. This widespread use means that learning R can open up opportunities in various fields, including biostatistics, epidemiology, economics, and social sciences. For non-computer majors, the ability to apply their data science skills directly to their field of study or industry is a significant advantage.

For non-computer majors looking to delve into data science, R offers a unique blend of accessibility, specialized functionality, and community support. Its design for statistical analysis, along with a rich ecosystem of packages and resources, makes R an ideal starting point for those new to programming and data analysis. By learning R, non-computer majors can not only gain valuable data science skills but also apply these skills directly to their fields of interest, enhancing their research capabilities and career prospects.


How About Python?

A good quesiton! R and Python are both powerful and popular programming languages in the data science community, each with its unique strengths. When comparing R to Python, especially from the perspective of those with non-computer science backgrounds or specific analytical needs, several aspects of R stand out, highlighting its strengths:

1. Specialization in Statistical Analysis

  • R’s Foundation: R is specifically designed for statistical analysis and graphical models. It was developed by statisticians for statisticians. This specialization gives it an edge in handling complex statistical data analyses out of the box, without the need for extensive programming knowledge or additional libraries.

  • Comprehensive Statistical Packages: R has a comprehensive collection of packages for various statistical analyses, including linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, and more. These packages are often written by statisticians and subject matter experts, ensuring they are tailored for rigorous statistical analysis.

2. Superior Data Visualization

  • ggplot2 and Beyond: R’s ggplot2 package is renowned for its capability to create advanced and beautiful data visualizations with ease. The grammar of graphics implemented by ggplot2 allows for the layering of data, aesthetic mappings, and statistical transformations, facilitating the creation of complex plots intuitively.

  • Rich Ecosystem for Visualization: Beyond ggplot2, R offers a rich ecosystem of visualization packages like lattice, plotly (for interactive plots), and shiny (for interactive web apps), providing a wide range of options for presenting data insights effectively.

3. Integrated Development Environment (IDE) Support

  • RStudio: RStudio is a powerful and popular IDE specifically designed for R. It provides an integrated environment that makes data analysis, visualization, and programming more accessible. RStudio enhances the R programming experience with features like code completion, easy package management, and markdown (quarto, nowadays) support for reproducible research.

4. Community and Support

  • Dedicated Community: R has a vibrant and welcoming community, particularly attractive to those in academia, research, and various scientific disciplines. The community offers extensive resources, forums, and groups for support, making it easier for newcomers to get started and for experts to dive deep into specific statistical challenges.

  • CRAN Repository: The Comprehensive R Archive Network (CRAN) is a repository of over 16,000 R packages (as of my last update in April 2023). This vast and well-organized repository ensures that R users have access to tools and libraries for nearly every statistical and graphical method imaginable.

5. Data Handling and Analysis

  • Data Wrangling: While Python’s pandas library is powerful for data manipulation, R’s dplyr and data.table packages offer syntax that is arguably more intuitive for data transformation and aggregation, especially for those with a background in SQL or those new to programming.

  • Analysis Workflow: R is designed with an analysis-first approach, emphasizing data exploration and analysis workflows. This makes R particularly suited for iterative data exploration, hypothesis testing, and modeling in a way that is highly accessible to researchers and analysts.

  • R is intuitive for analysis: R may not work with a wide variety of projects, but it is the best choice for analysis and inference work. If you plan to work in a specialized field, you’ll want a specialized programming language. R also offers a powerful environment ideally suited to the types of data visualizations data scientists employ.

Comparative Consideration

While Python is a general-purpose programming language with broad applications ranging from web development to software engineering and has significant strengths in machine learning and deep learning with libraries like TensorFlow and PyTorch, R’s focused design for statistical analysis and data visualization makes it exceptionally well-suited for data scientists, statisticians, and researchers. This focus, combined with its comprehensive statistical packages, superior data visualization capabilities, and supportive community, positions R as a strong choice for those particularly interested in the statistical and analytical aspects of data science.

Source: https://www.edx.org/resources/r-vs-python-for-data-science-explainer-learning-tips


Install gadgets

  • Install R, R Studio, & Rtools

  • Things you need to know

    • Don’t Use OneDrive.

      • Use Github instead

      • Many people get an error when installing because of OneDrive

    • Set Windows user name to English

      • If Korean characters are mixed in the installation path, there is a high probability of error occurrence
  • Installation Order

    • Step 1 ‑ Download the file

      • Download R, Rtools, Rstudio installation files
    • Step 2 - Install R

      • Unified installation path: All will be installed in the C:/R folder

      • Run in administrator mode when running the R installation file

      • After installing R, grant write permission to the R folder, Right-click and turn off read only

    • Step 3 ‑ Install Rtools

      • Rtools? RTools is a collection of software tools and libraries that are necessary for building and compiling packages in the R programming language, especially on Windows.

      • Administrator mode execution installation and folder setting as C:\R\rtools40

      • Create environment variable RTOOLS40_HOME after installation: Value - C:\R\rtools40\

      • Add %RTOOLS40_HOME%\usr\bin\ to the Path variable.

    • Step 4 - Install Rstudio

      • Right-click and run as administrator - installation path C:\R\Rstudio

      • Check rtools connection with Sys.which(“make”) command after installation


Footnotes

  1. An informed decision is a choice that individuals make once they have all the information related to the decision topic. It involves analyzing potential outcomes, benefits and risks associated with each option, then deciding which choice is the best for you.↩︎