diff --git a/.gitignore b/.gitignore index 536c0ce..02c4c48 100644 --- a/.gitignore +++ b/.gitignore @@ -36,6 +36,9 @@ * In repkit, ignore reproot files in test files /**/tests/**/reproot.yaml +* in reprun, unignore the images folder +!src/vignettes/img/*.png + ####################### # For performance reasons, if a folder is already ignored, then # GitHub does not check the content for that folder for matches @@ -56,6 +59,9 @@ !src/stata.toc !src/*.pkg +# Python +!/**/*.py + # Markdown and web docs files !/**/*.md !src/dev/assets/*.png diff --git a/README.md b/README.md index 3f218fd..40c4977 100644 --- a/README.md +++ b/README.md @@ -18,6 +18,7 @@ Currently, this toolkit has the following commands: | [reproot](https://worldbank.github.io/repkit/reference/reproot.html) | This command allows teams to dynamically set root-paths with no manual user-specific set-up, in both single-rooted and multi-rooted projects. | | [reproot_setup](https://worldbank.github.io/repkit/reference/reproot_setup.html) | This command helps setting up the environment setting file used in `reproot` | | [reprun](https://worldbank.github.io/repkit/reference/reprun.html) | This command is used to automate reproducibility checks by running a do-file or a set of do-files and compare all state values (RNG-value, datasignature etc.) between the two runs. This command is currently only release as a beta-version. | +| lint | `lint` is an opinionated detector that attempts to improve the readability and organization of Stata do files. The command is written based on the good coding practices of the Development Impact Evaluation Unit at The World Bank.| # Installation diff --git a/src/ado/lint.ado b/src/ado/lint.ado new file mode 100644 index 0000000..93bbb7f --- /dev/null +++ b/src/ado/lint.ado @@ -0,0 +1,562 @@ +*! version 3.0 20240923 - DIME Analytics & LSMS Team, The World Bank - dimeanalytics@worldbank.org, lsms@worldbank.org + +capture program drop lint + program lint + + version 14.1 + + syntax anything [using/], /// + /// Options + [ /// + Verbose /// + NOSUMmary /// + Indent(string) /// + Linemax(string) /// + Space(string) /// + Correct(string) /// + Excel(string) /// + AUTOmatic /// + replace /// + force /// + debug /// + ] + +/******************************************************************************* +******************************************************************************** + + PART 1: Prepare inputs + +******************************************************************************** +*******************************************************************************/ + +/******************************************************************************* + Set defaults +*******************************************************************************/ + + * set indent size = 4 if missing + if missing("`indent'") local indent "4" + + * set whitespaces for tab (space) = indent size if space is missing + if missing("`space'") local space "`indent'" + + * set linemax = 80 if missing + if missing("`linemax'") local linemax "80" + + * if !missing("`excel'") cap erase `excel' + if !missing("`excel'") cap rm `excel' + + * set excel = "" if excel is missing + if missing("`excel'") local excel "" + + * set a constant for the suppress option being used + local suppress_flag "1" + if !missing("`verbose'") local suppress_flag "0" + + * set a constant for the summary option being used + local summary_flag "1" + if !missing("`nosummary'") local summary_flag "0" + + * In debug mode, print status + if !missing("`debug'") di "Inputs prepared" + + +/******************************************************************************* + Prepare file paths +*******************************************************************************/ + +// Check format of do-file to be linted ---------------------------------------- + + * File or Folder to be detected + gettoken anything : anything + + * Check if main input is a file or a folder + local input = `"`anything'"' + + _testpath "`input'", ext(`"".do", ".ado""') argument(lint's main argument) exists `debug' + local folder = "`r(folder)'" + local file = "`r(file)'" + +// Check do-file with corrections ---------------------------------------------- + + if !missing("`using'") { + + * Can only be used when linting a do-file + if missing("`file'") { + di as error "{phang}Option [using] cannot be used when linting a directory. To use this option, specify a do-file as lint's main argument.{p_end}" + error 198 + } + + _testpath "`using'", ext(`"".do", ".ado""') argument(lint's [using] argument) `debug' + local output = "`r(file)'" + + * Unless force is used, the output file should have a different name than the input + if missing("`force'") & ("`input'" == "`output'") { + di as error "{phang}It is recommended to use different file names for lint's main argument and its [using] argument. This is because it is slightly possible that the corrected do-file created by the command will break something in your code, and you may want to keep a backup. If you want still to replace the current do-file with the do-file corrected by lint, use the option [force]. {p_end}" + error 198 + } + } + +// Check Excel with corrections ------------------------------------------------ + + if !missing("`excel'") { + + _checkopenpyxlinstall + + _testpath "`excel'", ext(`"".xls", ".xlsx""') argument(lint's [excel] argument) `debug' + local excel = "`r(file)'" + } + +// In debug mode, print file paths --------------------------------------------- + + if !missing("`debug'") { + di "Folder: `folder'" + di "File: `file'" + di "Excel: `excel'" + di "Input: `input'" + di "Output: `output'" + } + +// Check if python is installed ------------------------------------------------ + + _checkpyinstall + + * Check that the Python function is defined + qui: findfile stata_linter_detect.py + if c(os) == "Windows" { + local ado_path = subinstr(r(fn), "\", "/", .) + } + else { + local ado_path = r(fn) + } + +// Check that versions of all auxiliary files are the same --------------------- + +_checkversions + +/******************************************************************************* +******************************************************************************** + + PART 2: Execute linter + +******************************************************************************** +*******************************************************************************/ + +/******************************************************************************* + Detect issues +*******************************************************************************/ + + * Check a single do-file + if !missing("`file'") { + + if missing("`using'") { + local header header + } + + if (!missing("`verbose'") | (`summary_flag' == 1) | !missing("`excel'") | !missing("`using'")) { + local footer footer + } + + _detect, /// + file("`file'") excel("`excel'") ado_path("`ado_path'") /// + indent("`indent'") linemax("`linemax'") space("`space'") /// + suppress_flag("`suppress_flag'") summary_flag("`summary_flag'") /// + `header' `footer' + } + + * Check all do-files in a folder + else if !missing("`folder'") { + + local files: dir "`folder'" files "*.do" + + foreach file of local files { + + _detect, /// + file("`folder'/`file'") excel("`excel'") ado_path("`ado_path'") /// + indent("`indent'") linemax("`linemax'") space("`space'") /// + suppress_flag("`suppress_flag'") summary_flag("`summary_flag'") /// + header footer + } + } + + * In debug mode, print status + if !missing("`debug'") noi di "Exiting detect function" + +/******************************************************************************* + Correct issues +*******************************************************************************/ + + if !missing("`using'") { + + _correct, /// + input("`input'") output("`output'") /// + indent("`indent'") space("`space'") linemax("`linemax'") /// + `replace' `force' `automatic' `debug' + + } + +end + +/******************************************************************************* +******************************************************************************** + + PART 3: Auxiliary functions + +******************************************************************************** +*******************************************************************************/ + +// Correct --------------------------------------------------------------------- + +capture program drop _correct + program _correct + + syntax, /// + input(string) output(string) /// + indent(string) space(string) linemax(string) /// + [replace force automatic debug] + + * Check that the Python function is defined + qui: findfile stata_linter_correct.py + if c(os) == "Windows" { + local ado_path = subinstr(r(fn), "\", "/", .) + } + else { + local ado_path = r(fn) + } + + * Display a message if the correct option is added, so the output can be separated + display as text " " + display as result _dup(60) "-" + display as result "Correcting {bf:do-file}" + display as result _dup(60) "-" + display as text " " + + * Import relevant python libraries + python: import sys, os + python: from sfi import Macro + python: sys.path.append(os.path.dirname(r"`ado_path'")) + python: from stata_linter_correct import * + python: import stata_linter_detect as sld + python: import stata_linter_utils as slu + + * Checking which issues are present in the dofile so we ask for their correction + python: Macro.setLocal('_delimiter', str(slu.detect_delimit_in_file(r"`input'"))) + python: Macro.setLocal('_hard_tab', str(slu.detect_hard_tab_in_file(r"`input'"))) + python: Macro.setLocal('_bad_indent', str(slu.detect_bad_indent_in_file(r"`input'", "`indent'", "`space'"))) + python: Macro.setLocal('_long_lines', str(slu.detect_line_too_long_in_file(r"`input'", "`linemax'"))) + python: Macro.setLocal('_no_space_before_curly', str(slu.detect_no_space_before_curly_bracket_in_file(r"`input'"))) + python: Macro.setLocal('_blank_before_curly', str(slu.detect_blank_line_before_curly_close_in_file(r"`input'"))) + python: Macro.setLocal('_dup_blank_line', str(slu.detect_duplicated_blank_line_in_file(r"`input'"))) + + * If no issue was found, the function ends here. + * Otherwise _correct continues. + if ("`_delimiter'" == "False" & /// + "`_hard_tab'" == "False" & /// + "`_bad_indent'" == "False" & /// + "`_long_lines'" == "False" & /// + "`_no_space_before_curly'" == "False" & /// + "`_blank_before_curly'" == "False" & /// + "`_dup_blank_line'" == "False") { + display as result `"{phang}Nothing to correct.{p_end}"' + display as result `"{phang}The issues lint is able to correct are not present in your dofile.{p_end}"' + display as result `"{phang}No output files were generated.{p_end}"' + } + else { + + * Counter of number of issues being corrected + local _n_to_correct 0 + + * Correct the output file, looping for each python command + foreach fun in delimit_to_three_forward_slashes /// + tab_to_space /// + indent_in_bracket /// + too_long_line /// + space_before_curly /// + remove_blank_lines_before_curly_close /// + remove_duplicated_blank_lines { + + * If the issue is not present, we continue with the next one + if ("`_delimiter'" == "False" & "`fun'" == "delimit_to_three_forward_slashes") { + continue + } + else if ("`_hard_tab'" == "False" & "`fun'" == "tab_to_space") { + continue + } + else if ("`_bad_indent'" == "False" & "`fun'" == "indent_in_bracket") { + continue + } + else if ("`_long_lines'" == "False" & "`fun'" == "too_long_line") { + continue + } + else if ("`_no_space_before_curly'" == "False" & "`fun'" == "space_before_curly") { + continue + } + else if ("`_blank_before_curly'" == "False" & "`fun'" == "remove_blank_lines_before_curly_close") { + continue + } + else if ("`_dup_blank_line'" == "False" & "`fun'" == "remove_duplicated_blank_lines") { + continue + } + + if missing("`automatic'") { + + noi di "" + global confirmation "" //Reset global + + while (upper("${confirmation}") != "Y" & upper("${confirmation}") != "N" & upper("${confirmation}") != "BREAK") { + if ("${confirmation}" != "") { + noi di as txt "{pstd} Invalid input. {p_end}" + noi di as txt "{pstd} Please type {bf:Y} or {bf:N} and hit enter. Type {bf:BREAK} and hit enter to exit. {p_end}" + noi di "" + } + if ("`fun'" == "delimit_to_three_forward_slashes") { + di as result "{pstd} Avoid using [delimit], use three forward slashes (///) instead. {p_end}" + } + else if ("`fun'" == "tab_to_space") { + di as result "{pstd} Avoid using hard tabs, use soft tabs (white spaces) instead. {p_end}" + } + else if ("`fun'" == "indent_in_bracket") { + di as result "{pstd} Indent commands inside curly brackets. {p_end}" + } + else if ("`fun'" == "space_before_curly") { + di as result "{pstd} Use white space before opening curly brackets. {p_end}" + } + else if ("`fun'" == "too_long_line") { + di as result "{pstd} Limit line length to `linemax' characters. {p_end}" + } + else if ("`fun'" == "remove_blank_lines_before_curly_close") { + di as result "{pstd} Remove redundant blank lines before closing brackets. {p_end}" + } + else if ("`fun'" == "remove_duplicated_blank_lines") { + di as result "{pstd} Remove duplicated blank lines. {p_end}" + } + noi di as txt "{pstd} Do you want to correct this? To confirm type {bf:Y} and hit enter, to abort type {bf:N} and hit enter. Type {bf:BREAK} and hit enter to stop the code. See option {help lint:automatic} to not be prompted before creating files. {p_end}", _request(confirmation) + } + + // Copy user input to local + local createfile = upper("${confirmation}") + + // If user wrote "BREAK" then exit the code + if ("`createfile'" == "BREAK") error 1 + } + + // if automatic is used, always run the corresponding function + else { + local createfile "Y" + } + + * If option [manual] was used and input was [N], function won't be used for this issue + if ("`createfile'" == "N") { + noi di as result "" + } + * If option input was [Y], or if option [automatic] was used, run the function + else if ("`createfile'" == "Y") { + + local _n_to_correct = `_n_to_correct' + 1 + + * If this is the first issue to correct, create the output file + if `_n_to_correct' == 1 { + + if (missing("`force'")) { + qui copy "`input'" "`output'", replace + } + } + + python: `fun'(r"`output'", r"`output'", "`indent'", "`space'", "`linemax'") + } + } + + * Print link to corrected output file if it was created + if `_n_to_correct' > 0 { + display as result `"{phang}Corrected do-file saved to {browse "`output'":`output'}.{p_end}"' + } + } + + +end + +// Detect ---------------------------------------------------------------------- + +capture program drop _detect + program _detect + + syntax , /// + file(string) ado_path(string) /// + indent(string) linemax(string) space(string) /// + suppress_flag(string) summary_flag(string) /// + [excel(string) header footer] + + * Import relevant python functions + python: import sys, os + python: sys.path.append(os.path.dirname(r"`ado_path'")) + python: from stata_linter_detect import * + + * Stata result header + if !missing("`header'") { + di as result "" + di as result "Linting file: `file'" + di as result "" + } + + * Actually run the Python code + python: r = stata_linter_detect_py("`file'", "`indent'", "`suppress_flag'", "`summary_flag'", "`excel'", "`linemax'", "`space'") + + * Stata result footer + if !missing("`footer'") { + + display as result _dup(85) "-" + + if "`excel'" != "" { + display as result `"{phang}File {browse "`excel'":`excel'} created.{p_end}"' + } + + display as result `"{phang}For more information about coding guidelines visit the {browse "https://dimewiki.worldbank.org/Stata_Linter":Stata linter wiki.}{p_end}"' + } + + + +end + +// File Paths ------------------------------------------------------------------ + +cap program drop _testpath + program _testpath, rclass + + syntax anything, argument(string) ext(string) [details(string) debug exists] + + if !missing("`debug'") di "Entering subcommand _filepath" + + * Standardize file path + local path = subinstr(`"`anything'"', "\", "/", .) + + * If a folder, test that folder exists + if !regex(`"`path'"', "\.") { + _testdirectory `path' , argument(`argument') details(`details') `debug' + local folder `path' + } + + * If a file, parse information + else { + _testfile `path' , argument(`argument') ext(`"`ext'"') `exists' `debug' + local file `path' + } + + return local folder "`folder'" + if !missing("`debug'") di `"Folder: `folder'"' + + return local file "`file'" + if !missing("`debug'") di `"File: `file'"' + + if !missing("`debug'") di "Exiting subcommand _filepath" + +end + +// Test file format ------------------------------------------------------------ + +cap program drop _testfile + program _testfile, rclass + + syntax anything, ext(string) argument(string) [debug exists] + + if !missing("`debug'") di "Entering subcommand _testfile" + + + if !missing("`exists'") { + confirm file `anything' + } + + * Get index of separation between file name and file format + local r_lastdot = strlen(`anything') - strpos(strreverse(`anything'), ".") + + * File format starts at the last period and ends at the end of the string + local suffix = substr(`anything', `r_lastdot' + 1, .) + + if !inlist("`suffix'", `ext') { + di as error `"{phang}File `anything' is not a valid input for `argument'. Only the following file extensions are accepted: `ext'.{p_end}"' + error 198 + } + +end + +// Check if folder exists ------------------------------------------------------ + +cap program drop _testdirectory + program _testdirectory + + syntax anything, argument(string) [details(string) debug] + + if !missing("`debug'") di "Entering subcommand _testdirectory" + + * Test that the folder for the report file exists + mata : st_numscalar("r(dirExist)", direxists(`anything')) + if `r(dirExist)' == 0 { + noi di as error `"{phang}Directory `anything', used `argument', does not exist. `details'{p_end}"' + error 601 + } + +end + + +// Error checks ---------------------------------------------------------------- + +capture program drop _checkpyinstall + program _checkpyinstall + + * Check if python is installed + cap python search + if _rc { + noi di as error `"{phang}For this command, Python installation is required. Refer to {browse "https://blog.stata.com/2020/08/18/stata-python-integration-part-1-setting-up-stata-to-use-python/":this page} for how to integrate Python to Stata. {p_end}"' + exit + } + + * Check if pandas package is installed + cap python which pandas + if _rc { + noi di as error `"{phang}For this command to run, the Python package "pandas" needs to be installed. Refer to {browse "https://blog.stata.com/2020/09/01/stata-python-integration-part-3-how-to-install-python-packages/":this page} for how to install Python packages. {p_end}"' + exit + } + +end + +capture program drop _checkopenpyxlinstall + program _checkopenpyxlinstall + + * Check if openpyxl package is installed + cap python which openpyxl + if _rc { + noi di as error `"{phang}For this command to run, the Python package "openpyxl" needs to be installed. Refer to {browse "https://blog.stata.com/2020/09/01/stata-python-integration-part-3-how-to-install-python-packages/":this page} for how to install Python packages. {p_end}"' + exit + } + +end + +// Check that version of lint.ado and Python scripts are the same + +capture program drop _checkversions + program _checkversions + + * IMPORTANT: Every time we have a package update, update the version number here + * Otherwise we'd be introducing a major bug! + local version_ado 1.02 + + * Check versions of .py files + python: from sfi import Macro + python: import stata_linter_detect as sld + python: import stata_linter_correct as slc + python: Macro.setLocal('version_detect', sld.VERSION) + python: Macro.setLocal('version_correct', slc.VERSION) + + * Checking that versions are the same + cap assert "`version_ado'" == "`version_detect'" + if _rc { + noi di as error `"{phang}For this command to run, the versions of all its auxiliary files need to be the same. Please update the command to the newest version with: {bf:ssc install stata_linter, replace} , restart Stata, and try again{p_end}"' + error + } + cap assert "`version_ado'" == "`version_correct'" + if _rc { + noi di as error `"{phang}For this command to run, the versions of all its auxiliary files need to be the same. Please update the command to the newest version with: {bf:ssc install stata_linter, replace} , restart Stata, and try again{p_end}"' + error + } + +end + +************************************************************* Have a lovely day! diff --git a/src/ado/repado.ado b/src/ado/repado.ado index 328e875..a7d97d0 100644 --- a/src/ado/repado.ado +++ b/src/ado/repado.ado @@ -1,4 +1,4 @@ -*! version 2.2 20240730 - DIME Analytics & LSMS Team, The World Bank - dimeanalytics@worldbank.org, lsms@worldbank.org +*! version 3.0 20240923 - DIME Analytics & LSMS Team, The World Bank - dimeanalytics@worldbank.org, lsms@worldbank.org cap program drop repado program define repado, rclass diff --git a/src/ado/repadolog.ado b/src/ado/repadolog.ado index 095a27a..9a5e8ec 100644 --- a/src/ado/repadolog.ado +++ b/src/ado/repadolog.ado @@ -1,4 +1,4 @@ -*! version 2.2 20240730 - DIME Analytics & LSMS Team, The World Bank - dimeanalytics@worldbank.org, lsms@worldbank.org +*! version 3.0 20240923 - DIME Analytics & LSMS Team, The World Bank - dimeanalytics@worldbank.org, lsms@worldbank.org cap program drop repadolog program define repadolog diff --git a/src/ado/repkit.ado b/src/ado/repkit.ado index 40e3e16..a5a0c01 100644 --- a/src/ado/repkit.ado +++ b/src/ado/repkit.ado @@ -1,4 +1,4 @@ -*! version 2.2 20240730 - DIME Analytics & LSMS Team, The World Bank - dimeanalytics@worldbank.org, lsms@worldbank.org +*! version 3.0 20240923 - DIME Analytics & LSMS Team, The World Bank - dimeanalytics@worldbank.org, lsms@worldbank.org cap program drop repkit program define repkit, rclass @@ -6,8 +6,8 @@ cap program drop repkit version 14.1 * UPDATE THESE LOCALS FOR EACH NEW VERSION PUBLISHED - local version "2.2" - local versionDate "20240730" + local version "3.0" + local versionDate "20240923" local cmd "repkit" syntax [anything] diff --git a/src/ado/reproot.ado b/src/ado/reproot.ado index 36b6728..55d3e40 100644 --- a/src/ado/reproot.ado +++ b/src/ado/reproot.ado @@ -1,4 +1,4 @@ -*! version 2.2 20240730 - DIME Analytics & LSMS Team, The World Bank - dimeanalytics@worldbank.org, lsms@worldbank.org +*! version 3.0 20240923 - DIME Analytics & LSMS Team, The World Bank - dimeanalytics@worldbank.org, lsms@worldbank.org cap program drop reproot program define reproot, rclass diff --git a/src/ado/reproot_parse.ado b/src/ado/reproot_parse.ado index cb6ccb8..17e6cae 100644 --- a/src/ado/reproot_parse.ado +++ b/src/ado/reproot_parse.ado @@ -1,4 +1,4 @@ -*! version 2.2 20240730 - DIME Analytics & LSMS Team, The World Bank - dimeanalytics@worldbank.org, lsms@worldbank.org +*! version 3.0 20240923 - DIME Analytics & LSMS Team, The World Bank - dimeanalytics@worldbank.org, lsms@worldbank.org cap program drop reproot_parse program define reproot_parse, rclass diff --git a/src/ado/reproot_search.ado b/src/ado/reproot_search.ado index c556f26..5a9b5a2 100644 --- a/src/ado/reproot_search.ado +++ b/src/ado/reproot_search.ado @@ -1,4 +1,4 @@ -*! version 2.2 20240730 - DIME Analytics & LSMS Team, The World Bank - dimeanalytics@worldbank.org, lsms@worldbank.org +*! version 3.0 20240923 - DIME Analytics & LSMS Team, The World Bank - dimeanalytics@worldbank.org, lsms@worldbank.org cap program drop reproot_search program define reproot_search, rclass diff --git a/src/ado/reproot_setup.ado b/src/ado/reproot_setup.ado index 9328655..645b7b7 100644 --- a/src/ado/reproot_setup.ado +++ b/src/ado/reproot_setup.ado @@ -1,4 +1,4 @@ -*! version 2.2 20240730 - DIME Analytics & LSMS Team, The World Bank - dimeanalytics@worldbank.org, lsms@worldbank.org +*! version 3.0 20240923 - DIME Analytics & LSMS Team, The World Bank - dimeanalytics@worldbank.org, lsms@worldbank.org cap program drop reproot_setup program define reproot_setup diff --git a/src/ado/reprun.ado b/src/ado/reprun.ado index 815a6d6..af8b207 100644 --- a/src/ado/reprun.ado +++ b/src/ado/reprun.ado @@ -1,4 +1,4 @@ -*! version 2.2 20240730 - DIME Analytics & LSMS Team, The World Bank - dimeanalytics@worldbank.org, lsms@worldbank.org +*! version 3.0 20240923 - DIME Analytics & LSMS Team, The World Bank - dimeanalytics@worldbank.org, lsms@worldbank.org cap program drop reprun program define reprun, rclass @@ -130,21 +130,31 @@ qui { output_writetitle , outputcolumns("`outputcolumns'") noi write_and_print_output, h_smcl(`h_smcl') /// l1(`"{phang}Done checking file:{p_end}"') /// - l2(`"{pstd}{c BLC}{hline 1}> `dofile'{p_end}"') l3("{hline}") - file close `h_smcl' + l2(`"{pstd}{c BLC}{hline 1}> `dofile'{p_end}"') l3("{hline}") + + if "`mmmflag'" != "" { + noi write_and_print_output, h_smcl(`h_smcl') /// + l1(" ") /// + l2(`"{pstd}{red:Reproducibility Warning:} Your code contains many-to-many merges on lines:`mmmflag'.{p_end}"') /// + l3(`"{pstd}As the {mansection D merge:Stata Manual} says: {it:if you think you need to perform an m:m merge, then we suspect you are wrong}.{p_end}"') /// + l4(`"{pstd}Reference the above section of the Stata Manual for troubleshooting.{p_end}"') + + } + + if "`sssflag'" != "" { + noi write_and_print_output, h_smcl(`h_smcl') /// + l1(" ") /// + l2(`"{pstd}{red:Reproducibility Warning:} Your code set the sortseed on lines:`sssflag'.{p_end}"') /// + l3(`"{pstd}As the {mansection D sort:Stata Manual} says: {it:You must be sure that the ordering really does not matter. If that is the case, then why did you sort in the first place?}{p_end}"') /// + l4(`"{pstd}Reference the above section of the Stata Manual for troubleshooting.{p_end}"') + + } - if "`mmmflag'" != "" { - noi di as res `"{pstd}{red:Reproducibility Warning:} Your code contains many-to-many merges on lines:`mmmflag'.{p_end}"' - noi di as res `"{pstd}As the {mansection D merge:Stata Manual} says: {it:if you think you need to perform an m:m merge, then we suspect you are wrong}.{p_end}"' - noi di as res `"{pstd}Reference the above section of the Stata Manual for troubleshooting.{p_end}"' - } + - if "`sssflag'" != "" { - noi di as res `" "' - noi di as res `"{pstd}{red:Reproducibility Warning:} Your code set the sortseed on lines:`sssflag'.{p_end}"' - noi di as res `"{pstd}As the {mansection D sort:Stata Manual} says: {it:You must be sure that the ordering really does not matter. If that is the case, then why did you sort in the first place?}{p_end}"' - noi di as res `"{pstd}Reference the above section of the Stata Manual for troubleshooting.{p_end}"' - } + + + file close `h_smcl' /***************************************************************************** Write smcl file to disk and clean up intermediate files unless debugging diff --git a/src/ado/reprun_dataline.ado b/src/ado/reprun_dataline.ado index c2fe6e8..19612ec 100644 --- a/src/ado/reprun_dataline.ado +++ b/src/ado/reprun_dataline.ado @@ -1,4 +1,4 @@ -*! version 2.2 20240730 - DIME Analytics & LSMS Team, The World Bank - dimeanalytics@worldbank.org, lsms@worldbank.org +*! version 3.0 20240923 - DIME Analytics & LSMS Team, The World Bank - dimeanalytics@worldbank.org, lsms@worldbank.org * Command intended to exclusively be run from the run files * that the command iedorep is generating diff --git a/src/ado/stata_linter_correct.py b/src/ado/stata_linter_correct.py new file mode 100644 index 0000000..affb708 --- /dev/null +++ b/src/ado/stata_linter_correct.py @@ -0,0 +1,437 @@ +# version 1.02 06apr2023 DIME Analytics dimeanalytics@worldbank.org +# Import packages ============ +import os +import re +import sys +import stata_linter_detect as sld + +# Version Global +## VERY IMPORTANT: Update the version number here every time there's an update +## in the package. Otherwise this will cause a major bug +VERSION = "1.02" + +# Function to update comment delimiter ============= +# (detection works only when comment delimiter == 0) +def update_comment_delimiter(comment_delimiter, line): + ''' + This function detects if a line is opening a comment section + in a Stata dofile. Comment sections are delimited by the + charaters "/*" and "*/" + ''' + # if "/*" and "*/" are in the same line, never mind + if re.search(r"\/\*.*\*\/", line): + comment_delimiter += 0 + # if "/*" (opening) detected, add 1 + elif re.search(r"\/\*", line): + comment_delimiter += 1 + # if "*/" (closing) detected, subtract 1 + elif (re.search(r"\*\/", line) != None) & (comment_delimiter > 0): + comment_delimiter -= 1 + return(comment_delimiter) + +# Functions for auto-correction =================== + +# Convert delimit to three forward slashes ------------------- +def delimit_to_three_forward_slashes(input_file, output_file, indent, tab_space, linemax): + output_list = [] + with open(input_file, "r") as reader: + input_lines = reader.readlines() + delimit_on = 0 + comment_delimiter = 0 + for line_index, line in enumerate(input_lines): + # update comment_delimiter + comment_delimiter = update_comment_delimiter(comment_delimiter, line) + if comment_delimiter > 0: + output_list.append(line) + elif comment_delimiter == 0: + # check if "#delimit (something other than cr)" is included in a line + if re.search(r"^#delimit(?! cr)", line.lstrip()): + delimit_on = 1 + # store the character used for line breaks (ignoring comments) + # (if not specified, default is ";") + line_split = re.split(r"//", line)[0].strip().split(" ") + if len(line_split) > 1: + delimit_symbol = line_split[1] + else: + delimit_symbol = ";" + # check if "#delimit cr" appears in a line, which means + # the end of delimit function + elif re.search(r"^#delimit cr", line.lstrip()): + delimit_on = 0 + # for other lines, if delimit_on = 0, then just use the line, and + # if delimit_on = 1, then add "///" at the end of line but before + # any comments + else: + if delimit_on == 0: + output_list.append(line) + elif delimit_on == 1: + # get any non-comment part of the line and + # strip any redundant whitespaces at the end + line_split_for_comment = re.split(r"//", line) + line_main = line_split_for_comment[0] + if len(line_split_for_comment) > 1: + line_comment = line_split_for_comment[1] + line_main_rstrip = line_main.rstrip() + # if the line is not blank, add appropriate line break commands (///) + if len(line_main_rstrip) > 0: + # if the line does not end with the delimit symbol (such as ";"), + # then that means the command continues to the next line, + # so add a line break + if line_main_rstrip[-1] != delimit_symbol: + output_line = line_main_rstrip + " ///" + # if the line does end with the delimit symbol, then + # just remove the last symbol in the line + elif line_main_rstrip[-1] == delimit_symbol: + output_line = line_main_rstrip[:-1] + + # replace all the remaining delimit symbols to "\n" + output_line = re.sub(delimit_symbol, "\n", output_line) + + # if there is any comment in the line, then + # just append the comment + if len(line_split_for_comment) > 1: + output_line = output_line + " //" + line_comment + # if there is no comment in the line, then + # just add a newline command (\n) at the end + elif len(line_split_for_comment) == 1: + output_line = output_line + " \n" + + output_list.append(output_line) + + # if the line is blank, just append the blank line + elif len(line_main_rstrip) == 0: + output_list.append(line) + + with open(output_file, "w") as writer: + for output_line in output_list: + writer.write(output_line) + + +# Convert hard tabs to soft tabs (= whitespaces) ---------------------- +def tab_to_space(input_file, output_file, indent, tab_space, linemax): + output_list = [] + with open(input_file, "r") as reader: + input_lines = reader.readlines() + comment_delimiter = 0 + for line_index, line in enumerate(input_lines): + # replace the hard tabs detected in a line to soft tabs (whitespaces) + spaces = ' ' * int(tab_space) + pattern = r'^( *)(\t+)([^\t].*\n{0,1})' + match = re.match(pattern, line) + if match: + output_list.append(match.group(1) + + match.group(2).replace('\t', spaces) + + match.group(3)) + else: + output_list.append(line) + with open(output_file, "w") as writer: + for output_line in output_list: + writer.write(output_line) + +# Use indents in brackets after for and while loops or if/else conditions -------------------- +def indent_in_bracket(input_file, output_file, indent, tab_space, linemax): + with open(input_file, "r") as reader: + input_lines = reader.readlines() + loop_start = [] + bracket_start = [] + bracket_pair = [] + nest_level = 0 + max_nest_level = 0 + comment_delimiter = 0 + for line_index, line in enumerate(input_lines): + # update comment_delimiter + comment_delimiter = update_comment_delimiter(comment_delimiter, line) + if comment_delimiter == 0: + # get the main command of the line (ignoring comments at the end) and remove + # redundant whitespaces + line_rstrip = re.sub(r"(\/\/)|(\/\*).*", r"", line).rstrip() + # if the line is not blank or has any command other than comments, + # do the followings + if len(line_rstrip) > 0: + # check if the line starts with commands that potentially have curly brackets + # (but ignore if this line is the continuation from the previous line, + # because then the expression here should not have curly brackets) + if ( + (re.search(r"^(qui[a-z]*\s+)?(foreach |while |forv|if |else |cap)", line.lstrip()) != None) & + (re.search(r"\/\/\/", input_lines[max(line_index - 1, 0)]) == None) + ): + # if the line ends with an open curly bracket, + # then tag it (here the depth of the nests are stored as well) + if line_rstrip[-1] == "{": + loop_start.append(line_index) + bracket_start.append(line_index) + nest_level += 1 + max_nest_level = max(max_nest_level, nest_level) + # if the line does not end with an open curly bracket but includes line breaks, + # then search for the line including the open curly bracket in the following lines + # and tag the line + elif (line_rstrip[-1] != "{") & (re.search(r"\/\/\/", line) != None): + loop_start.append(line_index) + for i in range(line_index, len(input_lines)): + temp_line_rstrip = re.sub(r"\/\/.*", r"", input_lines[i]).rstrip() + if temp_line_rstrip[-1] == "{": + bracket_start.append(i) + break + nest_level += 1 + max_nest_level = max(max_nest_level, nest_level) + # check if the line ends with a closing curly bracket + # (ignore it if that is not used for global macro) + if (line_rstrip[-1] == "}") & (not re.search(r"\$.?{", line)): + bracket_pair.append([loop_start.pop(), line_index, nest_level, bracket_start.pop()]) + nest_level -= 1 + # for each depth of nests, add appropriate indentations + for nest_level in range(1, max_nest_level + 1): + for pair in bracket_pair: + if pair[2] == nest_level: + # get the position of where to start indentations + start_indent = len(input_lines[pair[0]]) - len(input_lines[pair[0]].lstrip()) + # for each line in the nest, do the followings + for j in range(pair[0] + 1, pair[1]): + # if the line is blank, ignore it + if len(input_lines[j].lstrip()) == 0: + pass + # if the line is not blank, then add indentations at the beginning of the line + elif len(input_lines[j].lstrip()) > 0: + input_lines[j] = " " * (start_indent + int(indent)) + (input_lines[j].lstrip()) + with open(output_file, "w") as writer: + for output_line in input_lines: + writer.write(output_line) + +# Split too long line (> linemax characters) to multiple lines +# (but do not break strings in double quotes (""), parentheses, or curly brackets) -------------------- +def too_long_line(input_file, output_file, indent, tab_space, linemax): + output_list = [] + with open(input_file, "r") as reader: + input_lines = reader.readlines() + newline_flag = 0 + comment_delimiter = 0 + for line_index, line in enumerate(input_lines): + # update comment_delimiter + comment_delimiter = update_comment_delimiter(comment_delimiter, line) + if comment_delimiter > 0: + output_list.append(line) + elif comment_delimiter == 0: + # do nothing if any of the following conditions are met + if ( + (len(line) <= int(linemax)) | # the line is not too long, or + ((line.lstrip() + " ")[0] == "*") | # the line is a comment + ((line.lstrip() + " ")[:2] == "//") # line contains a comment + ): + output_list.append(line) + # otherwise, do the followings + else: + # separate the comment part and the command part of the line + line_split_for_comment = re.split(r"//", line) + line_main = line_split_for_comment[0] + if "\n" in line_main: + line_main = line_main.rstrip() + "\n" + else: + line_main = line_main.rstrip() + if len(line_split_for_comment) > 1: + line_comment = line_split_for_comment[1] + line_indent = ( + len(line_main.rstrip()) - + len(line_main.rstrip().expandtabs(int(indent)).lstrip()) + ) + + i = 0 + break_line = [] + potential_break_line = [] + double_quote_count = 0 + parenthesis_count = 0 + curly_count = 0 + # looking at each character of a line, tag where to break the line + for j, c in enumerate(line_main.lstrip()): + + position = j + len(line_main) - len(line_main.lstrip()) + + if c == '''"''': + double_quote_count = 1 - double_quote_count + elif c == "(": + parenthesis_count += 1 + elif c == ")": + parenthesis_count -= 1 + elif c == "{": + curly_count += 1 + elif c == "}": + curly_count -= 1 + + # We check "potential" break lines first + if ((c == "," or c == " ") and # break line at "," or " " + (double_quote_count == 0) and # ignore if in double quotes + (parenthesis_count == 0) and # ignore if in parentheses + (curly_count == 0)# ignore if in curly brackets + ): + + if c == " ": + + position2 = line_indent + i + 4 + potential_break_line.append(position) + + # If the soon-to-be new line is equal to the linemax, + # we add the last potential line break position + if position2 >= int(linemax): + break_line.append(potential_break_line[-1]) + i = int(indent) + position - potential_break_line[-1] + else: + i += 1 + + elif c == ",": + + position2 = line_indent + i + 5 + + # If the soon-to-be new line is equal to the linemax, + # we add the last potential line break position + if position2 >= int(linemax): + break_line.append(potential_break_line[-1]) + i = int(indent) + position - potential_break_line[-1] + else: + i += 1 + + potential_break_line.append(position + 1) + + else: + + position2 = line_indent + i + 4 + if position2 >= int(linemax): + break_line.append(potential_break_line[-1]) + i = int(indent) + position - potential_break_line[-1] + else: + i += 1 + + # break lines + line_split = [] + break_line_index = [0] + break_line_index.extend(break_line) + break_line_index.append(len(line_main)) + for k in range(len(break_line_index) - 1): + # if no line break is needed, just append the line + if (break_line_index == 2): + line_split.append( + line_main[break_line_index[k]:break_line_index[k + 1]].rstrip() + ) + # otherwise, break the line according to the positions of characters tagged above + else: + line_split.append(line_main[break_line_index[k]:break_line_index[k + 1]]) + + # if no line break is needed, then just append the line + # with appropriate indentations (and commends if needed) + if len(line_split) == 1: + if len(line_split_for_comment) > 1: + output_list.append( + " " * line_indent + line_split[0].lstrip() + " //" + line_comment + ) + elif len(line_split_for_comment) == 1: + output_list.append(" " * line_indent + line_split[0].lstrip() + "\n") + # otherwise, break the line + elif len(line_split) > 1: + for i, temp_line in enumerate(line_split): + # the first line + if i == 0: + new_line = " " * line_indent + temp_line.lstrip() + " ///\n" + # from the second to the last to the second line + elif (i > 0) & (i < len(line_split) - 1): + # if the previous line does not include a line break, then + # add an appropriate indentations + if newline_flag == 0: + new_line = " " * (line_indent + int(indent)) + temp_line.lstrip() + " ///\n" + # if the previous line does include a line break, then + # assuming that the indentation is correctly done, + # add no indentations + elif newline_flag == 1: + new_line = " " * (line_indent) + temp_line.lstrip() + " ///\n" + # the last line + elif (i == len(line_split) - 1): + # if the previous line does not include a line break, then + # add an appropriate indentations + if newline_flag == 0: + new_line = " " * (line_indent + int(indent)) + temp_line.lstrip() + # if the previous line does include a line break, then + # assuming that the indentation is correctly done, + # add no indentations + elif newline_flag == 1: + new_line = " " * (line_indent) + temp_line.lstrip() + # if there is any comment in the original line, add it at the end + if len(line_split_for_comment) > 1: + new_line = new_line + " //" + line_comment + output_list.append(new_line) + # flag if the line includes a line break, which will be used + # in the next line + if "///" in line: + newline_flag = 1 + else: + newline_flag = 0 + with open(output_file, "w") as writer: + for output_line in output_list: + writer.write(output_line) + +# Add a white space before a curly bracket +# (but not if the curly bracket is used for global macro, as in "${}") -------------------- +def space_before_curly(input_file, output_file, indent, tab_space, linemax): + output_list = [] + with open(input_file, "r") as reader: + input_lines = reader.readlines() + comment_delimiter = 0 + for line_index, line in enumerate(input_lines): + # update comment_delimiter + comment_delimiter = update_comment_delimiter(comment_delimiter, line) + if comment_delimiter > 0: + output_list.append(line) + elif comment_delimiter == 0: + # replace "{" with " {" if there is no whitespace + # before an open curly bracket, but ignore if + # "${" since this is for global macro + output_list.append(re.sub(r"([^ $]){", r"\1 {", line)) + with open(output_file, "w") as writer: + for output_line in output_list: + writer.write(output_line) + +# Remove blank lines before curly brackets are closed -------------------- +def remove_blank_lines_before_curly_close(input_file, output_file, indent, tab_space, linemax): + output_list = [] + with open(input_file, "r") as reader: + input_lines = reader.readlines() + comment_delimiter = 0 + for line_index, line in enumerate(input_lines): + # update comment_delimiter + comment_delimiter = update_comment_delimiter(comment_delimiter, line) + if comment_delimiter > 0: + output_list.append(line) + elif comment_delimiter == 0: + if len(line.strip()) == 0: + for i in range(line_index + 1, len(input_lines)): + if len(input_lines[i].strip()) == 0: + pass + elif len(input_lines[i].strip()) > 0: + line_rstrip = " " + re.sub(r"//.*", r"", input_lines[i]).rstrip() + if (line_rstrip[-1] == "}") & (not re.search(r"\$.*{", input_lines[i])): + break + else: + output_list.append(line) + break + elif len(line.strip()) > 0: + output_list.append(line) + with open(output_file, "w") as writer: + for output_line in output_list: + writer.write(output_line) + + +# Remove duplicated blank lines -------------------- +def remove_duplicated_blank_lines(input_file, output_file, indent, tab_space, linemax): + output_list = [] + with open(input_file, "r") as reader: + input_lines = reader.readlines() + comment_delimiter = 0 + for line_index, line in enumerate(input_lines): + # update comment_delimiter + comment_delimiter = update_comment_delimiter(comment_delimiter, line) + if comment_delimiter > 0: + output_list.append(line) + elif comment_delimiter == 0: + if sld.detect_duplicated_blank_line(line_index, line, input_lines): + pass + else: + output_list.append(line) + with open(output_file, "w") as writer: + for i, output_line in enumerate(output_list): + writer.write(output_line) diff --git a/src/ado/stata_linter_detect.py b/src/ado/stata_linter_detect.py new file mode 100644 index 0000000..0716c4b --- /dev/null +++ b/src/ado/stata_linter_detect.py @@ -0,0 +1,838 @@ +# version 1.02 06apr2023 DIME Analytics dimeanalytics@worldbank.org +# Import packages ==================== +import os +import re +import sys +import pandas as pd +import argparse + +# Version Global +## VERY IMPORTANT: Update the version number here every time there's an update +## in the package. Otherwise this will cause a major bug +VERSION = "1.02" + +# simple run entry point +def run(): + parser = argparse.ArgumentParser(description='Lint a Stata do-file.') + parser.add_argument('filename', metavar='file', type=str, nargs='?', + help='The name of the file to lint.') + parser.add_argument('--indent', type=int, nargs='?', default=4, + help="Number of spaces to use for each indentation" + ) + parser.add_argument('--suppress', action='store_true', + help="Suppress line item printout" + ) + parser.add_argument('--summary', action='store_true', + help="Print a summary of bad practices detected" + ) + parser.add_argument('--linemax', type=int, nargs='?', default=80, + help="Maximum number of characters per line" + ) + parser.add_argument('--excel_output', type=str, nargs='?', default="", + help="If specified, save results to Excel workbook" + ) + + + args=parser.parse_args() + return stata_linter_detect_py( + input_file=args.filename, + indent=args.indent, + suppress="1" if args.suppress else "0", + summary="1" if args.summary else "0", + excel=args.excel_output, + linemax=args.linemax, + tab_space=args.indent + ) + +# Style =================== + +# Avoid to use abstract index names ---------------- +def abstract_index_name( + line_index, line, input_lines, indent, + suppress, style_dictionary, excel_output_list, + tab_space + ): + + if re.search(r"^(qui[a-z]*\s+)?(foreach|forv)", line.lstrip()): + list_of_words = line.split() + # get the index used in for loops + for word in list_of_words: + if re.search(r"^(foreach)", word): + index_in_loop = list_of_words[list_of_words.index(word) + 1] + break + elif re.search(r"^(forv)", word): + index_in_loop = list_of_words[list_of_words.index(word) + 1].split("=")[0] + break + # warn if the number of characters in the index is just 1 + if len(set(index_in_loop)) == 1: + print_output = ( + '''In for loops, index names should describe what the code is looping over. ''' + + '''Do not use an abstract index such as "{:s}".'''.format(index_in_loop) + ) + if suppress != "1": + print( + '''(line {:d}): '''.format(line_index + 1) + + print_output + ) + + style_dictionary["abstract_index_name"] += 1 + excel_output_list.append([line_index + 1, "style", print_output]) + + return([style_dictionary, excel_output_list]) + +def loop_open(line): + + ''' + Detect if a line is opening a loop + ''' + line_rstrip = re.sub(r"((\/\/)|(\/\*)).*", r"", line).rstrip() + if len(line_rstrip) > 0: + # check if the line includes for-loop, while-loop, or if/else statements + if ( + (re.search(r"^(qui[a-z]*\s+)?(foreach |forv|if |else )", line.lstrip()) != None) & + (line_rstrip[-1] == "{") + ): + return True + return False + + +def loop_close(line): + + ''' + Detects if a line is closing a loop + ''' + relevant_part = re.split('//', line)[0].rstrip() + + if len(relevant_part) > 0: + + if relevant_part[-1] =='}': + return True + else: + return False + + else: + return False + +def bad_indent_in_loop(line, open_loop_line, indent, tab_space): + + ''' + Detect if a line is correctly indented by checking the indentation of + the first line of the loop + ''' + line_ws = line.expandtabs(tab_space) + line_left_spaces1 = len(open_loop_line) - len(open_loop_line.lstrip()) + line_left_spaces2 = len(line_ws) - len(line_ws.lstrip()) + if (line_left_spaces2 - line_left_spaces1 < indent) & (len(line_ws.strip()) > 0): + return True + else: + return False + +# Use proper indentations in for-loops, while-loops, and if/else statements ---------------- +def detect_bad_indent(line_index, line, input_lines, indent, tab_space): + + if loop_open(line): + line_ws = line.expandtabs(tab_space) + j = 1 + embedded_loops = 0 + + # Checking the lines inside the loop + while j + line_index < len(input_lines): + next_line = input_lines[line_index + j] + + # (next) line is opening another loop + if loop_open(next_line): + embedded_loops += 1 + j += 1 + continue + + # (next) line is closing a loop + if loop_close(next_line): + if embedded_loops > 0: + # closing an embedded loop + embedded_loops -= 1 + else: + # closing the main loop + break + + # (next) line is inside an embedded loop, we don't check it here. + # it will be checked when this function is applied on its + # correcponding loop level + if embedded_loops > 0: + j += 1 + continue + + # for other cases, we check they're non-blank lines and then + # correct indentation + if ( + (len(next_line.strip()) > 0) & + (re.search(r"^(\*|\/\/)", next_line.lstrip()) == None) + ): + if bad_indent_in_loop(next_line, line_ws, indent, tab_space): + return True + + j += 1 + + # No bad indentations detected + return False + +def proper_indent( + line_index, line, input_lines, indent, + suppress, style_dictionary, excel_output_list, + tab_space + ): + + if detect_bad_indent(line_index, line, input_lines, indent, tab_space): + + print_output = ( + '''After declaring for loop statement or if-else statement, ''' + + '''add indentation ({:d} whitespaces).'''.format(indent) + ) + + if suppress != "1": + print( + '''(line {:d}): '''.format(line_index + 1) + + print_output + ) + + style_dictionary["proper_indent"] += 1 + excel_output_list.append([line_index + 1, "style", print_output]) + + return([style_dictionary, excel_output_list]) + +# Use indentations after line breaks (///) ---------------- +def indent_after_newline( + line_index, line, input_lines, indent, + suppress, style_dictionary, excel_output_list, + tab_space + ): + + # check if the previous line doesn't have "///" or if it's first line in dofile + if not re.search(r"\/\/\/", input_lines[max(line_index - 1, 0)]) or line_index == 0: + # no "///" found, the function finishes here + return([style_dictionary, excel_output_list]) + + else: + # Now we check which of the previous lines contained "///" + # we then check indentation spaces with respect of the first + # line with "///" + i = 0 + while re.search(r"\/\/\/", input_lines[line_index - (i + 1)]): + i += 1 + pass + + first_line = input_lines[line_index - i].expandtabs(tab_space) + first_line_indent = len(first_line) - len(first_line.lstrip()) + + line_ws = line.expandtabs(tab_space) + line_left_spaces = len(line_ws) - len(line_ws.lstrip()) + + if line_left_spaces - first_line_indent < indent: + print_output = ( + '''After new line statement ("///"), add indentation ({:d} whitespaces).'''.format(indent) + ) + + if suppress != "1": + print( + '''(line {:d}): '''.format(line_index + 1) + + print_output + ) + + style_dictionary["indent_after_newline"] += 1 + excel_output_list.append([line_index + 1, "style", print_output]) + + return([style_dictionary, excel_output_list]) + +# No whitespaces around math symbols ---------------- +def no_space_before_symbol(line): + + line = line.split('///')[0] + groups = line.split('"') + pattern = r"(?:[a-z]|[A-Z]|[0-9]|_|\)|')(?:<|>|=|\+|-|\*|\^)" + + for i, group in enumerate(groups): + + if i % 2 == 0: + if re.search(pattern, group): + return True + + return False + +def no_space_after_symbol(line): + + line = line.split('///')[0] + groups = line.split('"') + pattern = r"(?:(?:<|>|=|\+|-|\*|\^)(?:[a-z]|[A-Z]|_|\(|`|\.|$))|(?:(?:<|>|=|\+|\*|\^)(?:[0-9]))" + + for i, group in enumerate(groups): + + if i % 2 == 0: + if re.search(pattern, group): + return True + + return False + +def whitespace_symbol( + line_index, line, input_lines, indent, + suppress, style_dictionary, excel_output_list, + tab_space + ): + + # warn if no whitespaces around math symbols + if no_space_before_symbol(line) or no_space_after_symbol(line): + print_output = ( + '''Before and after math symbols (>, <, =, +, etc), it is recommended to use whitespaces. ''' + + '''(For example, do "gen a = b + c" instead of "gen a=b+c".)''' + ) + if suppress != "1": + print( + '''(line {:d}): '''.format(line_index + 1) + + print_output + ) + + style_dictionary["whitespace_symbol"] += 1 + excel_output_list.append([line_index + 1, "style", print_output]) + return([style_dictionary, excel_output_list]) + +# For missing values "var < ." or "var != ." are used (!missing(var) is recommended) ---------------- +def has_condition_missing(line): + + if re.search(r"(<|<=|!=|~=)( )*(\.(?![0-9]))", line): + return True + else: + return False + +def condition_missing( + line_index, line, input_lines, indent, + suppress, style_dictionary, excel_output_list, + tab_space + ): + + # warn if "var < ." or "var != ." or "var ~= ." are used + if has_condition_missing(line): + print_output = ( + '''Use "!missing(var)" instead of "var < ." or "var != ." or "var ~= ."''' + ) + if suppress != "1": + print( + '''(line {:d}): '''.format(line_index + 1) + + print_output + ) + + style_dictionary["condition_missing"] += 1 + excel_output_list.append([line_index + 1, "style", print_output]) + return([style_dictionary, excel_output_list]) + +# Using "#delimit" should be avoided +def detect_delimit(line): + + if re.search(r"#delimit(?! cr)", line): + return True + else: + return False + +def dont_use_delimit( + line_index, line, input_lines, indent, + suppress, style_dictionary, excel_output_list, + tab_space + ): + + # warn if "#delimit" is used + if detect_delimit(line): + print_output = ( + '''Avoid to use "delimit". For line breaks, use "///" instead.''' + ) + if suppress != "1": + print( + '''(line {:d}): '''.format(line_index + 1) + + print_output + ) + + style_dictionary["dont_use_delimit"] += 1 + excel_output_list.append([line_index + 1, "style", print_output]) + return([style_dictionary, excel_output_list]) + +def check_cd(line): + + if re.search(r"^cd\s", line.lstrip()): + return True + else: + return False + +# Using "cd" should be avoided +def dont_use_cd( + line_index, line, input_lines, indent, + suppress, style_dictionary, excel_output_list, + tab_space + ): + + # warn if "#cd" is used + if check_cd(line): + print_output = ( + '''Do not use "cd" but use absolute and dynamic file paths.''' + ) + if suppress != "1": + print( + '''(line {:d}): '''.format(line_index + 1) + + print_output + ) + + style_dictionary["dont_use_cd"] += 1 + excel_output_list.append([line_index + 1, "style", print_output]) + return([style_dictionary, excel_output_list]) + +# If a line is too lone, it should be broken into multiple lines +def detect_line_too_long(line, linemax): + + # if the last char is a line break, we leave it out + if len(line) > 0 and line[-1] == '\n': + line = line[:-1] + + if (len(line) > linemax): + return True + else: + return False + +def too_long_line( + line_index, line, input_lines, indent, linemax, + suppress, style_dictionary, excel_output_list, + tab_space + ): + + # warn if the line is too long (and line breaks are not used yet) + if detect_line_too_long(line, linemax): + print_output = ( + '''This line is too long ({:d} characters). '''.format(len(line)) + + '''Use "///" for line breaks so that one line has at most {:d} characters.'''.format(linemax) + ) + if suppress != "1": + print( + '''(line {:d}): '''.format(line_index + 1) + + print_output + ) + + style_dictionary["too_long_line"] += 1 + excel_output_list.append([line_index + 1, "style", print_output]) + return([style_dictionary, excel_output_list]) + +# "if" condition should be explicit +def detect_implicit_if(line): + + search_if = re.search(r"(?:^|\s)(?:if|else if)\s", line.lstrip()) + + if search_if != None: + + line = line[search_if.span()[0]:] + if ( + (re.search(r"missing\(", line) == None) & + (re.search(r"inrange\(", line) == None) & + (re.search(r"inlist\(", line) == None) & + (re.search(r"=|<|>", line) == None) + ): + return True + + return False + +def explicit_if( + line_index, line, input_lines, indent, + suppress, style_dictionary, excel_output_list, + tab_space + ): + + # warn if "if" statement is used but the condition is not explicit + if detect_implicit_if(line): + print_output = ( + '''Always explicitly specify the condition in the if statement. ''' + + '''(For example, declare "if var == 1" instead of "if var".) ''' + ) + if suppress != "1": + print( + '''(line {:d}): '''.format(line_index + 1) + + print_output + ) + style_dictionary["explicit_if"] += 1 + excel_output_list.append([line_index + 1, "style", print_output]) + + return([style_dictionary, excel_output_list]) + +# Use parentheses for global macros +def parentheses_for_global_macro( + line_index, line, input_lines, indent, + suppress, style_dictionary, excel_output_list, + tab_space + ): + + # warn if global macros are used without parentheses + if re.search(r"\$[a-zA-Z]", line): + print_output = ( + '''Always use "${}" for global macros. ''' + ) + if suppress != "1": + print( + '''(line {:d}): '''.format(line_index + 1) + + print_output + ) + + style_dictionary["parentheses_for_global_macro"] += 1 + excel_output_list.append([line_index + 1, "style", print_output]) + return([style_dictionary, excel_output_list]) + +# Check =================== + +# Ask if missing variables are properly taken into account +def check_missing_expression(line): + + if re.search(r"(<|!=|~=)( )*(\.(?![0-9]))|!missing\(.+\)", line): + return True + else: + return False + +def check_expression(line): + + if re.search(r"(~=|!=|>|>=)(?! *\.(?![0-9]))", line): + return True + else: + return False + + +def check_missing( + line_index, line, input_lines, indent, + suppress, check_dictionary, excel_output_list, + tab_space + ): + # ask if missing variables are properly taken into account + + expression = check_expression(line) + missing_expression = check_missing_expression(line) + + if expression and not missing_expression: + print_output = ( + '''Are you taking missing values into account properly? ''' + + '''(Remember that "a != 0" or "a > 0" include cases where a is missing.)''' + ) + if suppress != "1": + print( + '''(line {:d}): '''.format(line_index + 1) + + print_output + ) + + check_dictionary["check_missing"] += 1 + excel_output_list.append([line_index + 1, "check", print_output]) + return([check_dictionary, excel_output_list]) + +# Ask if the user may be using backslashes in file paths +def check_global(line): + + if re.search(r"^global\s", line.lstrip()): + return True + else: + return False + +def check_local(line): + if re.search(r"^local\s", line.lstrip()): + return True + else: + return False + +def check_backslash(line): + if re.search(r"\\", line): + return True + else: + return False + +def backslash_in_path( + line_index, line, input_lines, indent, + suppress, check_dictionary, excel_output_list, + tab_space + ): + # warn if anything is sandwiched by backslashes, + # which suggests that the user may be using backslashes for file paths + changes_dir = check_cd(line) + is_local = check_local(line) + is_global = check_global(line) + has_backslash = check_backslash(line) + + if (changes_dir | is_local | is_global) & has_backslash: + print_output = ( + '''Are you using backslashes ("\\") for a file path? ''' + + '''If so, use forward slashes ("/") instead.''' + ) + if suppress != "1": + print( + '''(line {:d}): '''.format(line_index + 1) + + print_output + ) + + check_dictionary["backslash_in_path"] += 1 + excel_output_list.append([line_index + 1, "check", print_output]) + return([check_dictionary, excel_output_list]) + +def bang_not_tilde( + line_index, line, input_lines, indent, + suppress, check_dictionary, excel_output_list, + tab_space + ): + + # warn if tilde is used, which suggests + # that the user may be using tilde for negation + if re.search(r"~=\s*([^\s.]|\.[0-9]+)", line): + print_output = ( + '''Are you using tilde (~) for negation? ''' + + '''If so, for negation, use bang (!) instead of tilde (~).''' + ) + + if suppress != "1": + print( + '''(line {:d}): '''.format(line_index + 1) + + print_output + ) + + check_dictionary["bang_not_tilde"] += 1 + excel_output_list.append([line_index + 1, "check", print_output]) + return([check_dictionary, excel_output_list]) + +def detect_hard_tab(line): + + if re.search(r"\t", line): + return True + else: + return False + +def detect_no_space_before_curly_bracket(line): + + if re.search(r"([^ $]){", line): + return True + else: + return False + +def detect_blank_line_before_curly_close(line_index, line, dofile_lines): + + if len(line.strip()) > 0 or line_index == len(dofile_lines) - 1: + # non-blank lines or last line in the dofile + return False + + # only blank lines from this point + else: + next_line = dofile_lines[line_index+1] + next_line_rstrip = " " + re.sub(r"//.*", r"", next_line).rstrip() + + # Checking if next line is a closing bracket + if (next_line_rstrip[-1] == "}") & (not re.search(r"\$.*{", next_line)): + return True + else: + return False + +def detect_duplicated_blank_line(line_index, line, dofile_lines): + + #if len(line.strip()) > 0 or line_index == len(dofile_lines) - 1: + if len(line.strip()) > 0: + # non-blank lines + return False + + # only blank lines from this point + else: + # Check if there is not next line -- note that Python doesn't show + # empty next lines as an empty last element + if line_index+1 >= len(dofile_lines): + return True + + # Check if next line is also blank: + next_line = dofile_lines[line_index+1] + if len(next_line.strip()) == 0: + return True + else: + return False + +# Function to update comment delimiter ====================== +# (detection works only when comment delimiter == 0) +def update_comment_delimiter(comment_delimiter, line): + # if "/*" and "*/" are in the same line, never mind + if re.search(r"\/\*.*\*\/", line): + pass + # if "/*" (opening) detected, add 1 + elif re.search(r"\/\*", line): + comment_delimiter += 1 + # if "*/" (closing) detected, subtract 1 + elif (re.search(r"\*\/", line) != None) & (comment_delimiter > 0): + comment_delimiter -= 1 + return(comment_delimiter) + +# Run linter program to detect bad coding practices =================== +def stata_linter_detect_py( + input_file, indent, + suppress, summary, excel, linemax, + tab_space + ): + + excel_output_list = [] + + # style ============ + # Any hard tabs in the do file + with open(input_file, "r") as f: + input_lines = f.readlines() + comment_delimiter = 0 + for line_index, line in enumerate(input_lines): + + comment_delimiter = update_comment_delimiter(comment_delimiter, line) + + if comment_delimiter == 0: + hard_tab = "No" + if detect_hard_tab(line): + hard_tab = "Yes" + print_output = ( + '''Use {:d} white spaces instead of tabs. '''.format(int(indent)) + + '''(This may apply to other lines as well.)''' + ) + excel_output_list.append([line_index + 1, "style", print_output]) + if suppress != "1": + print( + '''(line {:d}): '''.format(line_index + 1) + + print_output + ) + break + + # Other line-by-line bad practices + style_dictionary = { + "abstract_index_name": 0, + "proper_indent": 0, + "indent_after_newline": 0, + "whitespace_symbol": 0, + "condition_missing": 0, + "explicit_if": 0, + "dont_use_delimit": 0, + "dont_use_cd": 0, + "too_long_line": 0, + "parentheses_for_global_macro": 0 + } + + with open(input_file, "r") as f: + input_lines = f.readlines() + comment_delimiter = 0 + for line_index, line in enumerate(input_lines): + # update comment delimiter + comment_delimiter = update_comment_delimiter(comment_delimiter, line) + + if re.search(r"^(\*|\/\/)", line.lstrip()) != None: + pass + elif comment_delimiter > 0: + pass + else: + style_dictionary, excel_output_list = abstract_index_name( + line_index, line, input_lines, int(indent), + suppress, style_dictionary, excel_output_list, + int(tab_space) + ) + style_dictionary, excel_output_list = proper_indent( + line_index, line, input_lines, int(indent), + suppress, style_dictionary, excel_output_list, + int(tab_space) + ) + style_dictionary, excel_output_list = indent_after_newline( + line_index, line, input_lines, int(indent), + suppress, style_dictionary, excel_output_list, + int(tab_space) + ) + style_dictionary, excel_output_list = whitespace_symbol( + line_index, line, input_lines, int(indent), + suppress, style_dictionary, excel_output_list, + int(tab_space) + ) + style_dictionary, excel_output_list = condition_missing( + line_index, line, input_lines, int(indent), + suppress, style_dictionary, excel_output_list, + int(tab_space) + ) + style_dictionary, excel_output_list = explicit_if( + line_index, line, input_lines, int(indent), + suppress, style_dictionary, excel_output_list, + int(tab_space) + ) + style_dictionary, excel_output_list = dont_use_delimit( + line_index, line, input_lines, int(indent), + suppress, style_dictionary, excel_output_list, + int(tab_space) + ) + style_dictionary, excel_output_list = dont_use_cd( + line_index, line, input_lines, int(indent), + suppress, style_dictionary, excel_output_list, + int(tab_space) + ) + style_dictionary, excel_output_list = too_long_line( + line_index, line, input_lines, int(indent), int(linemax), + suppress, style_dictionary, excel_output_list, + int(tab_space) + ) + style_dictionary, excel_output_list = parentheses_for_global_macro( + line_index, line, input_lines, int(indent), + suppress, style_dictionary, excel_output_list, + int(tab_space) + ) + # check ============ + check_dictionary = { + "check_missing": 0, + "backslash_in_path": 0, + "bang_not_tilde": 0, + } + + with open(input_file, "r") as f: + input_lines = f.readlines() + comment_delimiter = 0 + for line_index, line in enumerate(input_lines): + + # update comment delimiter + comment_delimiter = update_comment_delimiter(comment_delimiter, line) + + if re.search(r"^(\*|\/\/)", line.lstrip()) != None: + pass + elif comment_delimiter > 0: + pass + else: + check_dictionary, excel_output_list = check_missing( + line_index, line, input_lines, int(indent), + suppress, check_dictionary, excel_output_list, + int(tab_space) + ) + check_dictionary, excel_output_list = backslash_in_path( + line_index, line, input_lines, int(indent), + suppress, check_dictionary, excel_output_list, + int(tab_space) + ) + check_dictionary, excel_output_list = bang_not_tilde( + line_index, line, input_lines, int(indent), + suppress, check_dictionary, excel_output_list, + int(tab_space) + ) + print("") + + if summary == "1": + print("-------------------------------------------------------------------------------------") + print("{:69s} {:30s}".format("Bad practice", "Occurrences")) + print("-------------------------------------------------------------------------------------") + + print("{:69s} {:10s}".format("Hard tabs used instead of soft tabs: ", hard_tab)) + print("{:60s} {:10d}".format("One-letter local name in for-loop: ", style_dictionary["abstract_index_name"])) + print("{:60s} {:10d}".format("Non-standard indentation in { } code block: ", style_dictionary["proper_indent"])) + print("{:60s} {:10d}".format("No indentation on line following ///: ", style_dictionary["indent_after_newline"])) + print("{:60s} {:10d}".format("Use of . where missing() is appropriate: ", style_dictionary["condition_missing"])) + print("{:60s} {:10d}".format("Missing whitespaces around operators: ", style_dictionary["whitespace_symbol"])) + print("{:60s} {:10d}".format("Implicit logic in if-condition: ", style_dictionary["explicit_if"])) + print("{:60s} {:10d}".format("Delimiter changed: ", style_dictionary["dont_use_delimit"])) + print("{:60s} {:10d}".format("Working directory changed: ", style_dictionary["dont_use_cd"])) + print("{:60s} {:10d}".format("Lines too long: ", style_dictionary["too_long_line"])) + print("{:60s} {:10d}".format("Global macro reference without { }: ", style_dictionary["parentheses_for_global_macro"])) + print("{:60s} {:10d}".format("Potential omission of missing values in expression: ", check_dictionary["check_missing"])) + print("{:60s} {:10d}".format("Backslash detected in potential file path: ", check_dictionary["backslash_in_path"])) + print("{:60s} {:10d}".format("Tilde (~) used instead of bang (!) in expression: ", check_dictionary["bang_not_tilde"])) + + output_df = pd.DataFrame(excel_output_list) + if excel != "": + if (output_df.empty == True): + output_df = pd.DataFrame(columns = ["Line", "Type", "Problem"]) + output_df.columns = ["Line", "Type", "Problem"] + if os.path.exists(excel): + with pd.ExcelWriter(excel, engine = "openpyxl", mode = "a") as writer: + output_df.to_excel(writer, index = False, sheet_name = os.path.basename(input_file)[:20]) + else: + with pd.ExcelWriter(excel) as writer: + output_df.to_excel(writer, index = False, sheet_name = os.path.basename(input_file)[:20]) + + return( not output_df.empty ) diff --git a/src/ado/stata_linter_utils.py b/src/ado/stata_linter_utils.py new file mode 100644 index 0000000..1eab14d --- /dev/null +++ b/src/ado/stata_linter_utils.py @@ -0,0 +1,118 @@ +# version 1.02 06apr2023 DIME Analytics dimeanalytics@worldbank.org +# Import packages ==================== +import re +import pandas as pd +import stata_linter_detect as sld + +# functions + +def read_dofile(file, include_comments=False): + + ''' + Returns a list of the lines in the dofile + Omits comment lines or commented-out code by default + ''' + + with open(file, "r") as f: + dofile_lines = f.readlines() + + if include_comments: + return dofile_lines + + dofile_lines2 = [] + comment_delimiter = 0 + + for line in dofile_lines: + + comment_delimiter = sld.update_comment_delimiter(comment_delimiter, line) + + if comment_delimiter == 0: + # Removing end-of-line comments + filtered_line = re.sub(r"\s*((\/\/)|(\/\*)).*", r"", line) + dofile_lines2.append(filtered_line) + + return dofile_lines2 + +def detect_duplicated_blank_line_in_file(file): + + dofile_lines = read_dofile(file, include_comments=True) + + for line_index, line in enumerate(dofile_lines): + + if sld.detect_duplicated_blank_line(line_index, line, dofile_lines): + return True + + return False + +def detect_blank_line_before_curly_close_in_file(file): + + dofile_lines = read_dofile(file, include_comments=True) + + for line_index, line in enumerate(dofile_lines): + + if sld.detect_blank_line_before_curly_close(line_index, line, dofile_lines): + return True + + return False + +def detect_no_space_before_curly_bracket_in_file(file): + + dofile_lines = read_dofile(file) + + for line in dofile_lines: + + if sld.detect_no_space_before_curly_bracket(line): + return True + + return False + +def detect_line_too_long_in_file(file, linemax): + + dofile_lines = read_dofile(file) + linemax = int(linemax) + + for line in dofile_lines: + + if sld.detect_line_too_long(line, linemax): + return True + + return False + +def detect_bad_indent_in_file(file, indent, tab_space): + + dofile_lines = read_dofile(file) + indent = int(indent) + tab_space = int(tab_space) + + for line_index, line in enumerate(dofile_lines): + + if sld.detect_bad_indent(line_index, line, dofile_lines, indent, tab_space): + return True + + return False + +def detect_hard_tab_in_file(file): + + dofile_lines = read_dofile(file) + + for line in dofile_lines: + + if sld.detect_hard_tab(line): + return True + + # No hard tabs detected in any line + return False + +def detect_delimit_in_file(file): + + dofile_lines = read_dofile(file) + + for line in dofile_lines: + + if sld.detect_delimit(line): + # whenever the first delimiter is detected, return True + # and interrupt script + return True + + # if delimiters were never detected, return False + return False diff --git a/src/mdhlp/lint.md b/src/mdhlp/lint.md new file mode 100644 index 0000000..f76e5ca --- /dev/null +++ b/src/mdhlp/lint.md @@ -0,0 +1,245 @@ +# Title + +__lint__ - detects and corrects bad coding practices in Stata do-files following the [DIME Analytics Stata Style Guide](https://worldbank.github.io/dime-data-handbook/coding.html#the-dime-analytics-stata-style-guide). + +For this command to run, you will need Stata version 16 or greater, Python, and the Python package [Pandas](https://pandas.pydata.org) installed. + +To install Python and integrate it with Stata, refer to [this page](https://blog.stata.com/2020/08/18/stata-python-integration-part-1-setting-up-stata-to-use-python/). + +To install Python packages, refer to [this page](https://blog.stata.com/2020/09/01/stata-python-integration-part-3-how-to-install-python-packages/). + +# Syntax + +__lint__ "_input_file_" [using "_output_file_"] , [_options_] + +The lint command can be broken into two functionalities: + +1. __Detection__ identifies bad coding practices in a Stata do-files + +2. __Correction__ corrects bad coding practices in a Stata do-file. + +If an _output_file_ is specified with __using__, then the linter will apply the __Correction__ functionality and will write a new file with corrections. If not, the command will only apply the __Detection__ functionality, returning a report of suggested corrections and potential issues of the do-file in Stata's Results window. Users should note that not all the bad practices identified in __Detection__ can be amended by __Correction__. + + +| _options_ | Description | +|-----------|-------------| +| __**v**erbose__ | Report bad practices and issues found on each line of the do-file. | +| __**nosum**mary__ | Suppress summary table of bad practices and potential issues. | +| __**i**ndent__(_integer_) | Number of whitespaces used when checking indentation coding practices (default: 4). | +| __**s**pace__(_integer_) | Number of whitespaces used instead of hard tabs when checking indentation practices (default: same as indent). | +| __**l**inemax__(_integer_) | Maximum number of characters in a line when checking line extension practices (default: 80). | +| __**e**xcel__(_filename_) | Save an Excel file of line-by-line results. | +| __force__ | Allow the output file name to be the same as the name of the input file; overwriting the original do-file. __The use of this option is not recommended__ because it is slightly possible that the corrected do-file created by the command will break something in your code and you should always keep a backup of it. | +| __**auto**matic__ | Correct all bad coding practices without asking if you want each bad coding practice to be corrected or not. By default, the command will ask the user about each correction interactively after producing the summary report. | +| __replace__ | Overwrite any existing _output_ file. | + + +# Description + +## Detect functionality + +__Bad style practices and potential issues detected:__ + +__Use whitespaces instead of hard tabs__ + +- Use whitespaces (usually 2 or 4) instead of hard tabs. + + +__Avoid abstract index names__ + +- In for-loop statements, index names should describe what the code is looping over. For example, avoid writing code like this: + +``` + foreach i of varlist cassava maize wheat { } +``` + +- Instead, looping commands should name the index local descriptively: + +``` + foreach crop of varlist cassava maize wheat { } +``` + + +__Use proper indentations__ + +- After declaring for-loop statements or if-else statements, add indentation with whitespaces (usually 2 or 4) in the lines inside the loop. + + +__Use indentations after declaring newline symbols (///)__ + +- After a new line statement (///), add indentation (usually 2 or 4 whitespaces). + + +__Use the "!missing()" function for conditions with missing values__ + +- For clarity, use `!missing(var)` instead of `var < .` or `var != .` + + +__Add whitespaces around math symbols (+, =, <, >)__ + +- For better readability, add whitespaces around math symbols. For example, do `gen a = b + c if d == e` instead of `gen a=b+c if d==e`. + + +__Specify the condition in an "if" statement__ + +- Always explicitly specify the condition in the if statement. For example, declare `if var == 1` instead of just using `if var`. + + +__Do not use "#delimit", instead use "///" for line breaks__ + +- More information about the use of line breaks [here](https://worldbank.github.io/dime-data-handbook/coding.html#line-breaks). + + +__Do not use cd to change current folder__ + +- Use absolute and dynamic file paths. More about this [here](https://worldbank.github.io/dime-data-handbook/coding.html#writing-file-paths). + + +__Use line breaks in long lines__ + +- For lines that are too long, use `///` to divide them into multiple lines. It is recommended to restrict the number of characters in a line to 80 or less. + + +__Use curly brackets for global macros__ + +- Always use `${ }` for global macros. For example, use `${global_name}` instead of `$global_name`. + + +__Include missing values in condition expressions__ + +- Condition expressions like `var != 0` or `var > 0` are evaluated to true for missing values. Make sure to explicitly take missing values into account by using `missing(var)` in expressions. + + +__Check if backslashes are not used in file paths__ + +- Check if backslashes `(\)` are not used in file paths. If you are using them, then replace them with forward slashes `(/)`. Users should note that the linter might not distinguish perfectly which uses of a backslash are file paths. In general, this flag will come up every time a backslash is used in the same line as a local, glocal, or the cd command. + + +__Check if tildes (~) are not used for negations__ + +- If you are using tildes `(~)` are used for negations, replace them with bangs `(!)`. + +## Correct functionality + +__Coding practices to be corrected:__ + + +Users should note that the Correct feature does not correct all the bad practices detected. It only corrects the following: + +- Replaces the use of `#delimit` with three forward slashes `(///)` in each line affected by `#delimit` + +- Replaces hard tabs with soft spaces (4 by default). The amount of spaces can be set with the `tab_space()` option + +- Indents lines inside curly brackets with 4 spaces by default. The amount of spaces can be set with the `indent()` option + +- Breaks long lines into multiple lines. Long lines are considered to have more than 80 characters by default, but this setting can be changed with the option `linemax()`. Note that lines can only be split in whitespaces that are not inside parentheses, curly brackets, or double quotes. If a line does not have any whitespaces, the linter will not be able to break a long line. + +- Adds a whitespace before opening curly brackets, except for globals + +- Removes redundant blank lines after closing curly brackets + +- Removes duplicated blank lines + +If the option `automatic` is omitted, Stata will prompt the user to confirm that they want to correct each of these bad practices only in case they are detected. If none of these are detected, it will show a message saying that none of the bad practices it can correct were detected. + + +# Examples + +The following examples illustrate the basic usage of lint. Additional examples can be found [here](https://github.com/worldbank/repkit/blob/add-linter/src/vignettes/lint-examples.md) + +## Detecting bad coding practices + +The basic usage is to point to a do-file that requires revision as follows: + +``` +lint "test/bad.do" +``` + +For the detection feature you can use all the options but _automatic_, _force_, and _replace_, which are part of the correction functionality. + +__**Options:**__ + +1. Show bad coding practices line-by-line + +``` +lint "test/bad.do", verbose +``` + + 2. Remove the summary of bad practices + +``` +lint "test/bad.do", nosummary +``` + +3. Specify the number of whitespaces used for detecting indentation practices (default: 4): + +``` +lint "test/bad.do", indent(2) +``` + + +4. Specify the number of whitespaces used instead of hard tabs for detecting indentation practices (default: same value used in indent): + +``` +lint "test/bad.do", tab_space(6) +``` + + +5. Specify the maximum number of characters in a line allowed when detecting line extension (default: 80): + +``` +lint "test/bad.do", linemax(100) +``` + + +6. Export to Excel the results of the line by line analysis + +``` +lint "test/bad.do", excel("test_dir/detect_output.xlsx") +``` + +7. You can also use this command to test all the do-files in a folder: + +``` +lint "test/" +``` + +## Correcting bad coding practices + +The basic usage of the correction feature requires to specify the input do-file and the output do-file that will have the corrections. If you do not include any options, the linter will ask you confirm if you want a specific bad practice to be corrected for each bad practice detected: + +1. Basic correction use (the linter will ask what to correct): + +``` +lint "test/bad.do" using "test/bad_corrected.do" +``` + + +2. Automatic use (Stata will correct the file automatically): + +``` +lint "test/bad.do" using "test/bad_corrected.do", automatic +``` + + +3. Use the same name for the output file (note that this will overwrite the input file, this is not recommended): + +``` +lint "test/bad.do" using "test/bad.do", automatic force +``` + + +4. Replace the output file if it already exists + +``` +lint "test/bad.do" using "test/bad_corrected.do", automatic replace +``` + + +# Feedback, bug reports and contributions + +Read more about these commands on [this repo](https://github.com/worldbank/repkit) where this package is developed. Please provide any feedback by [opening an issue](https://github.com/worldbank/repkit/issues). PRs with suggestions for improvements are also greatly appreciated. + +# Authors + +DIME Analytics, The World Bank dimeanalytics@worldbank.org diff --git a/src/mdhlp/reprun.md b/src/mdhlp/reprun.md index 9d37df7..7cb2688 100644 --- a/src/mdhlp/reprun.md +++ b/src/mdhlp/reprun.md @@ -53,6 +53,12 @@ The __**d**ebug__ option allows the user to save all of the underlying materials By default, __reprun__ invokes __clear__ and __set seed 12345__ to match the default Stata state before beginning Run 1. __**noc**lear__ prevents this behavior. It is not recommended unless you have a rare issue that you need to check at the very beginning of the file, because most projects should very quickly set these states appropriately for reproducibility. +## Note on Reproducibility of certain commands + +`by` and `bysort`: Users will often use `by` and `bysort` or equivalent commands to produce “group-level” statistics. The syntax used is usually something like `bysort groupvarname : egen newvarname = function(varlist)`. However, we note that such an approach necessarily introduces an instability in the sort order within each group. `reprun` will flag these instances as indeterminate sorts, since they can introduce issues later in the code when code is order-dependent; and will do so right away, for functions like `rank()` or other approaches like `bysort groupvarname : egen newvarname = n`. To avoid this, and to write truly reproducible code, users should use the less common but fully reproducible unique sorting syntax of `bysort groupvarname (uniqueidvar) ...` to ensure a unique sort with by-able commands. For commands with `by()` options, users should check whether this syntax is available, or remember to re-sort uniquely before any further processes are done. If `bysort` or the equivalent is called in intermediate or user-written commands that cannot be made to return the data sorted uniquely, those lines will continue to be flagged by ‘reprun‘. There is not a technical solution to this, to the best of our knowledge; therefore, the flag will remain as a reminder that the user should implement a unique sort after the indicated lines. + +`merge m:m` and `set sortseed`: These commands will be flagged interactively by `reprun` with warnings following the results table, regardless of whether any instability is obviously introduced according to the Stata RNG states. This is because `merge m:m` and `set sortseed`, while they often appear to work reproducibly, generally have the function of creating false stability that masks underlying issues in the code. In the case of `merge m:m`, the data that is produced is always sort-dependent in both datasets, and almost always meaningless as a result. In the case of `set sortseed`, the command often works to hide an instability in the underlying code that is sort-dependent. Users should instead remove all instances of these commands, and fix whatever issues in the process are causing their results to depend on the (indeterminate) sort order of the data + # Examples ## Example 1 diff --git a/src/repkit.pkg b/src/repkit.pkg index 59b330f..e3d7742 100644 --- a/src/repkit.pkg +++ b/src/repkit.pkg @@ -1,6 +1,6 @@ * This package file is generated in the adodown workflow. Do not edit directly. *** version -v 2.2 +v 3.0 *** title d 'REPKIT': a module facilitating collaboration and computational reproducibility *** description @@ -21,9 +21,10 @@ d Contact: dimeanalytics@@worldbank.org, lsms@@worldbank.org d URL: https://github.com/worldbank/repkit d *** date -d Distribution-Date: 20240730 +d Distribution-Date: 20240923 d *** adofiles +f ado/lint.ado f ado/reproot_setup.ado f ado/repadolog.ado f ado/reproot_search.ado @@ -35,6 +36,7 @@ f ado/repado.ado f ado/repkit.ado *** helpfiles +f sthlp/lint.sthlp f sthlp/reproot_setup.sthlp f sthlp/repadolog.sthlp f sthlp/reproot.sthlp @@ -43,6 +45,9 @@ f sthlp/repado.sthlp f sthlp/repkit.sthlp *** ancillaryfiles +f ado/stata_linter_detect.py +f ado/stata_linter_correct.py +f ado/stata_linter_utils.py *** end e diff --git a/src/sthlp/lint.sthlp b/src/sthlp/lint.sthlp new file mode 100644 index 0000000..d5e0ca2 --- /dev/null +++ b/src/sthlp/lint.sthlp @@ -0,0 +1,268 @@ +{smcl} +{* *! version 3.0 20240923}{...} +{hline} +{pstd}help file for {hi:lint}{p_end} +{hline} + +{title:Title} + +{phang}{bf:lint} - detects and corrects bad coding practices in Stata do-files following the {browse "https://worldbank.github.io/dime-data-handbook/coding.html#the-dime-analytics-stata-style-guide":DIME Analytics Stata Style Guide}. +{p_end} + +{phang}For this command to run, you will need Stata version 16 or greater, Python, and the Python package {browse "https://pandas.pydata.org":Pandas} installed. +{p_end} + +{phang}To install Python and integrate it with Stata, refer to {browse "https://blog.stata.com/2020/08/18/stata-python-integration-part-1-setting-up-stata-to-use-python/":this page}. +{p_end} + +{phang}To install Python packages, refer to {browse "https://blog.stata.com/2020/09/01/stata-python-integration-part-3-how-to-install-python-packages/":this page}. +{p_end} + +{title:Syntax} + +{phang}{bf:lint} {c 34}{it:input_file}{c 34} [using {c 34}{it:output_file}{c 34}] , [{it:options}] +{p_end} + +{phang}The lint command can be broken into two functionalities: +{p_end} + +{phang}1. {bf:Detection} identifies bad coding practices in a Stata do-files +{p_end} + +{phang}2. {bf:Correction} corrects bad coding practices in a Stata do-file. +{p_end} + +{phang}If an {it:output_file} is specified with {bf:using}, then the linter will apply the {bf:Correction} functionality and will write a new file with corrections. If not, the command will only apply the {bf:Detection} functionality, returning a report of suggested corrections and potential issues of the do-file in Stata{c 39}s Results window. Users should note that not all the bad practices identified in {bf:Detection} can be amended by {bf:Correction}. +{p_end} + +{synoptset 16}{...} +{p2coldent:{it:options}}Description{p_end} +{synoptline} +{synopt: {bf:{ul:v}erbose}}Report bad practices and issues found on each line of the do-file.{p_end} +{synopt: {bf:{ul:nosum}mary}}Suppress summary table of bad practices and potential issues.{p_end} +{synopt: {bf:{ul:i}ndent}({it:integer})}Number of whitespaces used when checking indentation coding practices (default: 4).{p_end} +{synopt: {bf:{ul:s}pace}({it:integer})}Number of whitespaces used instead of hard tabs when checking indentation practices (default: same as indent).{p_end} +{synopt: {bf:{ul:l}inemax}({it:integer})}Maximum number of characters in a line when checking line extension practices (default: 80).{p_end} +{synopt: {bf:{ul:e}xcel}({it:filename})}Save an Excel file of line-by-line results.{p_end} +{synopt: {bf:force}}Allow the output file name to be the same as the name of the input file; overwriting the original do-file. {bf:The use of this option is not recommended} because it is slightly possible that the corrected do-file created by the command will break something in your code and you should always keep a backup of it.{p_end} +{synopt: {bf:{ul:auto}matic}}Correct all bad coding practices without asking if you want each bad coding practice to be corrected or not. By default, the command will ask the user about each correction interactively after producing the summary report.{p_end} +{synopt: {bf:replace}}Overwrite any existing {it:output} file.{p_end} +{synoptline} + +{title:Description} + +{dlgtab:Detect functionality} + +{pstd}{bf:Bad style practices and potential issues detected:} +{p_end} + +{pstd}{bf:Use whitespaces instead of hard tabs} +{p_end} + +{pstd}- Use whitespaces (usually 2 or 4) instead of hard tabs. +{p_end} + +{pstd}{bf:Avoid abstract index names} +{p_end} + +{pstd}- In for-loop statements, index names should describe what the code is looping over. For example, avoid writing code like this: +{p_end} + +{input}{space 8} foreach i of varlist cassava maize wheat { } +{text} +{pstd}- Instead, looping commands should name the index local descriptively: +{p_end} + +{input}{space 8} foreach crop of varlist cassava maize wheat { } +{text} +{pstd}{bf:Use proper indentations} +{p_end} + +{pstd}- After declaring for-loop statements or if-else statements, add indentation with whitespaces (usually 2 or 4) in the lines inside the loop. +{p_end} + +{pstd}{bf:Use indentations after declaring newline symbols (///)} +{p_end} + +{pstd}- After a new line statement (///), add indentation (usually 2 or 4 whitespaces). +{p_end} + +{pstd}{bf:Use the {c 34}!missing(){c 34} function for conditions with missing values} +{p_end} + +{pstd}- For clarity, use {inp:!missing(var)} instead of {inp:var < .} or {inp:var != .} +{p_end} + +{pstd}{bf:Add whitespaces around math symbols (+, =, <, >)} +{p_end} + +{pstd}- For better readability, add whitespaces around math symbols. For example, do {inp:gen a = b + c if d == e} instead of {inp:gen a=b+c if d==e}. +{p_end} + +{pstd}{bf:Specify the condition in an {c 34}if{c 34} statement} +{p_end} + +{pstd}- Always explicitly specify the condition in the if statement. For example, declare {inp:if var == 1} instead of just using {inp:if var}. +{p_end} + +{pstd}{bf:Do not use {c 34}#delimit{c 34}, instead use {c 34}///{c 34} for line breaks} +{p_end} + +{pstd}- More information about the use of line breaks {browse "https://worldbank.github.io/dime-data-handbook/coding.html#line-breaks":here}. +{p_end} + +{pstd}{bf:Do not use cd to change current folder} +{p_end} + +{pstd}- Use absolute and dynamic file paths. More about this {browse "https://worldbank.github.io/dime-data-handbook/coding.html#writing-file-paths":here}. +{p_end} + +{pstd}{bf:Use line breaks in long lines} +{p_end} + +{pstd}- For lines that are too long, use {inp:///} to divide them into multiple lines. It is recommended to restrict the number of characters in a line to 80 or less. +{p_end} + +{pstd}{bf:Use curly brackets for global macros} +{p_end} + +{pstd}- Always use {inp:} for global macros. For example, use {inp:} instead of {inp:}. +{p_end} + +{pstd}{bf:Include missing values in condition expressions} +{p_end} + +{pstd}- Condition expressions like {inp:var != 0} or {inp:var > 0} are evaluated to true for missing values. Make sure to explicitly take missing values into account by using {inp:missing(var)} in expressions. +{p_end} + +{pstd}{bf:Check if backslashes are not used in file paths} +{p_end} + +{pstd}- Check if backslashes {inp:(%} are not used in file paths. If you are using them, then replace them with forward slashes {inp:(/)}. Users should note that the linter might not distinguish perfectly which uses of a backslash are file paths. In general, this flag will come up every time a backslash is used in the same line as a local, glocal, or the cd command. +{p_end} + +{pstd}{bf:Check if tildes (~) are not used for negations} +{p_end} + +{pstd}- If you are using tildes {inp:(~)} are used for negations, replace them with bangs {inp:(!)}. +{p_end} + +{dlgtab:Correct functionality} + +{pstd}{bf:Coding practices to be corrected:} +{p_end} + +{pstd}Users should note that the Correct feature does not correct all the bad practices detected. It only corrects the following: +{p_end} + +{pstd}- Replaces the use of {inp:#delimit} with three forward slashes {inp:(///)} in each line affected by {inp:#delimit} +{p_end} + +{pstd}- Replaces hard tabs with soft spaces (4 by default). The amount of spaces can be set with the {inp:tab_space()} option +{p_end} + +{pstd}- Indents lines inside curly brackets with 4 spaces by default. The amount of spaces can be set with the {inp:indent()} option +{p_end} + +{pstd}- Breaks long lines into multiple lines. Long lines are considered to have more than 80 characters by default, but this setting can be changed with the option {inp:linemax()}. Note that lines can only be split in whitespaces that are not inside parentheses, curly brackets, or double quotes. If a line does not have any whitespaces, the linter will not be able to break a long line. +{p_end} + +{pstd}- Adds a whitespace before opening curly brackets, except for globals +{p_end} + +{pstd}- Removes redundant blank lines after closing curly brackets +{p_end} + +{pstd}- Removes duplicated blank lines +{p_end} + +{pstd}If the option {inp:automatic} is omitted, Stata will prompt the user to confirm that they want to correct each of these bad practices only in case they are detected. If none of these are detected, it will show a message saying that none of the bad practices it can correct were detected. +{p_end} + +{title:Examples} + +{pstd}The following examples illustrate the basic usage of lint. Additional examples can be found {browse "https://github.com/worldbank/repkit/blob/add-linter/src/vignettes/lint-examples.md":here} +{p_end} + +{dlgtab:Detecting bad coding practices} + +{pstd}The basic usage is to point to a do-file that requires revision as follows: +{p_end} + +{input}{space 8}lint "test/bad.do" +{text} +{pstd}For the detection feature you can use all the options but {it:automatic}, {it:force}, and {it:replace}, which are part of the correction functionality. +{p_end} + +{pstd}{bf:{ul:Options:}} +{p_end} + +{pstd}1. Show bad coding practices line-by-line +{p_end} + +{input}{space 8}lint "test/bad.do", verbose +{text} +{pstd} 2. Remove the summary of bad practices +{p_end} + +{input}{space 8}lint "test/bad.do", nosummary +{text} +{pstd}3. Specify the number of whitespaces used for detecting indentation practices (default: 4): +{p_end} + +{input}{space 8}lint "test/bad.do", indent(2) +{text} +{pstd}4. Specify the number of whitespaces used instead of hard tabs for detecting indentation practices (default: same value used in indent): +{p_end} + +{input}{space 8}lint "test/bad.do", tab_space(6) +{text} +{pstd}5. Specify the maximum number of characters in a line allowed when detecting line extension (default: 80): +{p_end} + +{input}{space 8}lint "test/bad.do", linemax(100) +{text} +{pstd}6. Export to Excel the results of the line by line analysis +{p_end} + +{input}{space 8}lint "test/bad.do", excel("test_dir/detect_output.xlsx") +{text} +{pstd}7. You can also use this command to test all the do-files in a folder: +{p_end} + +{input}{space 8}lint "test/" +{text} +{dlgtab:Correcting bad coding practices} + +{pstd}The basic usage of the correction feature requires to specify the input do-file and the output do-file that will have the corrections. If you do not include any options, the linter will ask you confirm if you want a specific bad practice to be corrected for each bad practice detected: +{p_end} + +{pstd}1. Basic correction use (the linter will ask what to correct): +{p_end} + +{input}{space 8}lint "test/bad.do" using "test/bad_corrected.do" +{text} +{pstd}2. Automatic use (Stata will correct the file automatically): +{p_end} + +{input}{space 8}lint "test/bad.do" using "test/bad_corrected.do", automatic +{text} +{pstd}3. Use the same name for the output file (note that this will overwrite the input file, this is not recommended): +{p_end} + +{input}{space 8}lint "test/bad.do" using "test/bad.do", automatic force +{text} +{pstd}4. Replace the output file if it already exists +{p_end} + +{input}{space 8}lint "test/bad.do" using "test/bad_corrected.do", automatic replace +{text} +{title:Feedback, bug reports and contributions} + +{pstd}Read more about these commands on {browse "https://github.com/worldbank/repkit":this repo} where this package is developed. Please provide any feedback by {browse "https://github.com/worldbank/repkit/issues":opening an issue}. PRs with suggestions for improvements are also greatly appreciated. +{p_end} + +{title:Authors} + +{pstd}DIME Analytics, The World Bank dimeanalytics@worldbank.org +{p_end} diff --git a/src/sthlp/repado.sthlp b/src/sthlp/repado.sthlp index f011097..124aaa2 100644 --- a/src/sthlp/repado.sthlp +++ b/src/sthlp/repado.sthlp @@ -1,5 +1,5 @@ {smcl} -{* *! version 2.2 20240730}{...} +{* *! version 3.0 20240923}{...} {hline} {pstd}help file for {hi:repado}{p_end} {hline} diff --git a/src/sthlp/repadolog.sthlp b/src/sthlp/repadolog.sthlp index 3528196..12e6dff 100644 --- a/src/sthlp/repadolog.sthlp +++ b/src/sthlp/repadolog.sthlp @@ -1,5 +1,5 @@ {smcl} -{* *! version 2.2 20240730}{...} +{* *! version 3.0 20240923}{...} {hline} {pstd}help file for {hi:repadolog}{p_end} {hline} diff --git a/src/sthlp/repkit.sthlp b/src/sthlp/repkit.sthlp index f6aa3b5..0942afb 100644 --- a/src/sthlp/repkit.sthlp +++ b/src/sthlp/repkit.sthlp @@ -1,5 +1,5 @@ {smcl} -{* *! version 2.2 20240730}{...} +{* *! version 3.0 20240923}{...} {hline} {pstd}help file for {hi:repkit}{p_end} {hline} diff --git a/src/sthlp/reproot.sthlp b/src/sthlp/reproot.sthlp index 83db911..96b421a 100644 --- a/src/sthlp/reproot.sthlp +++ b/src/sthlp/reproot.sthlp @@ -1,5 +1,5 @@ {smcl} -{* *! version 2.2 20240730}{...} +{* *! version 3.0 20240923}{...} {hline} {pstd}help file for {hi:reproot}{p_end} {hline} diff --git a/src/sthlp/reproot_setup.sthlp b/src/sthlp/reproot_setup.sthlp index 2f946c0..a90adbb 100644 --- a/src/sthlp/reproot_setup.sthlp +++ b/src/sthlp/reproot_setup.sthlp @@ -1,5 +1,5 @@ {smcl} -{* *! version 2.2 20240730}{...} +{* *! version 3.0 20240923}{...} {hline} {pstd}help file for {hi:reproot_setup}{p_end} {hline} diff --git a/src/sthlp/reprun.sthlp b/src/sthlp/reprun.sthlp index 3b0df57..d829216 100644 --- a/src/sthlp/reprun.sthlp +++ b/src/sthlp/reprun.sthlp @@ -1,5 +1,5 @@ {smcl} -{* *! version 2.2 20240730}{...} +{* *! version 3.0 20240923}{...} {hline} {pstd}help file for {hi:reprun}{p_end} {hline} @@ -75,6 +75,14 @@ {pstd}By default, {bf:reprun} invokes {bf:clear} and {bf:set seed 12345} to match the default Stata state before beginning Run 1. {bf:{ul:noc}lear} prevents this behavior. It is not recommended unless you have a rare issue that you need to check at the very beginning of the file, because most projects should very quickly set these states appropriately for reproducibility. {p_end} +{dlgtab:Note on Reproducibility of certain commands} + +{pstd}{inp:by} and {inp:bysort}: Users will often use {inp:by} and {inp:bysort} or equivalent commands to produce “group-level” statistics. The syntax used is usually something like {inp:bysort groupvarname : egen newvarname = function(varlist)}. However, we note that such an approach necessarily introduces an instability in the sort order within each group. {inp:reprun} will flag these instances as indeterminate sorts, since they can introduce issues later in the code when code is order-dependent; and will do so right away, for functions like {inp:rank()} or other approaches like {inp:bysort groupvarname : egen newvarname = n}. To avoid this, and to write truly reproducible code, users should use the less common but fully reproducible unique sorting syntax of {inp:bysort groupvarname (uniqueidvar) ...} to ensure a unique sort with by-able commands. For commands with {inp:by()} options, users should check whether this syntax is available, or remember to re-sort uniquely before any further processes are done. If {inp:bysort} or the equivalent is called in intermediate or user-written commands that cannot be made to return the data sorted uniquely, those lines will continue to be flagged by ‘reprun‘. There is not a technical solution to this, to the best of our knowledge; therefore, the flag will remain as a reminder that the user should implement a unique sort after the indicated lines. +{p_end} + +{pstd}{inp:merge m:m} and {inp:set sortseed}: These commands will be flagged interactively by {inp:reprun} with warnings following the results table, regardless of whether any instability is obviously introduced according to the Stata RNG states. This is because {inp:merge m:m} and {inp:set sortseed}, while they often appear to work reproducibly, generally have the function of creating false stability that masks underlying issues in the code. In the case of {inp:merge m:m}, the data that is produced is always sort-dependent in both datasets, and almost always meaningless as a result. In the case of {inp:set sortseed}, the command often works to hide an instability in the underlying code that is sort-dependent. Users should instead remove all instances of these commands, and fix whatever issues in the process are causing their results to depend on the (indeterminate) sort order of the data +{p_end} + {title:Examples} {dlgtab:Example 1} diff --git a/src/tests/lint/lint.do b/src/tests/lint/lint.do new file mode 100644 index 0000000..2b43df6 --- /dev/null +++ b/src/tests/lint/lint.do @@ -0,0 +1,42 @@ + * Always important to version control built-in Stata functions + version 14.1 + + noi di "`c(username)'" + + * Do not use reproot when testing commands in repkit + + if "`c(username)'" == "WB462869" { + global repkit_clone "C:\Users\wb462869\github\repkit" + } + * Fill in your root path here + if "`c(username)'" == "bbdaniels" { + global repkit_clone "/Users/bbdaniels/GitHub/repkit" + } + + if "`c(username)'" == "wb558768" { + global repkit_clone "C:/Users/wb558768/Documents/GitHub/repkit" + } + + * Use locals for all non-root paths + local testfldr "${repkit_clone}/src/tests" + + * Use the /dev-env folder as a dev environment + cap mkdir "`testfldr'/dev-env" + repado using "`testfldr'/dev-env" + + * Make sure repkit is installed also in the dev environment + cap which repkit + if _rc == 111 ssc install repkit + + * Make sure the version of repkit in the dev environment us up to date with all edits. + cap net uninstall repkit + net install repkit, from("${repkit_clone}/src") replace + + ************************ + * Run tests + + lint "`testfldr'/lint/test-files/bad.do" + + lint "`testfldr'/lint/test-files/simple.do" + + diff --git a/src/tests/lint/test-files/bad.do b/src/tests/lint/test-files/bad.do new file mode 100644 index 0000000..c8b86b8 --- /dev/null +++ b/src/tests/lint/test-files/bad.do @@ -0,0 +1,84 @@ +* Rules ===================== +* Hard tabs should not be used +* "delimit" should not be used +* In brackets after "for" or "if", indentation should be used +* Too long lines should be divided into multiple lines +* Before an opening curly bracket "{", put a whitespace +* Remove blank lines before closing brackets +* Remove duplicated blank lines + +* Stata codes to be corrected ================= + +* All hard tabs are replaced with soft tabs (= whitespaces) + + * delimit is corrected and three forward slashes will be used instead + #delimit ; + + foreach something in something something something something something something + something something{ ; // some comment + do something ; + } ; + + #delimit cr + + * Add indentation in brackets + if something { + do something + if another == 1 { + do that + } + } + + foreach ii in potato potato cassava maize potato /// + cassava maize potato cassava maize potato cassava maize potato cassava maize potato cassava maize potato cassava maize potato cassava maize { + if something ~= 1 & something != . { + do something // some very very very very very very very very very very very very very very very very very very very very very very long comment + } + } + + * Split a long line into multiple lines + * (for now, too long comments are not corrected) + foreach ii in potato potato cassava maize potato cassava maize potato cassava maize potato cassava maize potato cassava maize potato cassava maize potato cassava maize potato cassava maize { + if something ~= 1 & something != . { + do something // some very very very very very very very very very very very very very very very very very very very very very very long comment + } + } + + * Add a whitespace before an opening curly bracket "{" + if something ~= 1 & something != .{ + do something + } + + * Remove blank lines before a closing bracket "}" + if something ~= 1 & something != .{ + + do something + + } + + * Remove duplicated blank lines + if something ~= 1 & something != .{ /* some comment */ + + + do something + + + } + + * Forvalues with quietly option + qui forv i = 1/`theN' { + ivregress 2sls indiv_theta_mean hh_faultdist /// + ( m_indiv_edu_binary m_edu_fault = instrument i_d ) /// + `fault_controls' `other_controls' `mother_controls' /// + if group != `i' /// + , cl(village_code) + noi noi di "`i'/`theN' done!" + + mat a = r(table) + local lower = a[5,2] + local upper = a[6,2] + + replace b_alt = _b[m_edu_fault] if group == `i' + replace b_min = `lower' if group == `i' + replace b_max = `upper' if group == `i' + } diff --git a/src/tests/lint/test-files/bad_corrected.do b/src/tests/lint/test-files/bad_corrected.do new file mode 100644 index 0000000..9213ca8 --- /dev/null +++ b/src/tests/lint/test-files/bad_corrected.do @@ -0,0 +1,82 @@ +* Rules ===================== +* Hard tabs should not be used +* "delimit" should not be used +* In brackets after "for" or "if", indentation should be used +* Too long lines should be divided into multiple lines +* Before an opening curly bracket " {", put a whitespace +* Remove blank lines before closing brackets +* Remove duplicated blank lines + +* Stata codes to be corrected ================= + +* All hard tabs are replaced with soft tabs (= whitespaces) + + * delimit is corrected and three forward slashes will be used instead + + foreach something in something something something something something something /// + something something { // some comment + do something + } + + * Add indentation in brackets + if something { + do something + if another == 1 { + do that + } + } + + foreach ii in potato potato cassava maize potato /// + cassava maize potato cassava maize potato cassava maize /// + potato cassava maize potato cassava maize potato cassava maize /// + potato cassava maize { + if something ~= 1 & something != . { + do something // some very very very very very very very very very very very very very very very very very very very very very very long comment + } + } + + * Split a long line into multiple lines + * (for now, too long comments are not corrected) + foreach ii in potato potato cassava maize potato cassava maize /// + potato cassava maize potato cassava maize potato cassava maize potato /// + cassava maize potato cassava maize potato cassava maize { + if something ~= 1 & something != . { + do something // some very very very very very very very very very very very very very very very very very very very very very very long comment + } + } + + * Add a whitespace before an opening curly bracket " {" + if something ~= 1 & something != . { + do something + } + + * Remove blank lines before a closing bracket "}" + if something ~= 1 & something != . { + + do something + } + + * Remove duplicated blank lines + if something ~= 1 & something != . { /* some comment */ + + do something + } + + * Forvalues with quietly option + qui forv i = 1/`theN' { + ivregress 2sls indiv_theta_mean hh_faultdist /// + ( m_indiv_edu_binary m_edu_fault = instrument i_d ) /// + `fault_controls' `other_controls' `mother_controls' /// + if group != `i' /// + , cl(village_code) + noi noi di "`i'/`theN' done!" + + mat a = r(table) + local lower = a[5,2] + local upper = a[6,2] + + replace b_alt = _b[m_edu_fault] if group == `i' + replace b_min = `lower' if group == `i' + replace b_max = `upper' if group == `i' + } + diff --git a/src/tests/lint/test-files/simple.do b/src/tests/lint/test-files/simple.do new file mode 100644 index 0000000..5c39b96 --- /dev/null +++ b/src/tests/lint/test-files/simple.do @@ -0,0 +1,6 @@ +set obs 3 +gen x = _n + +summary x, det + +exit, clear \ No newline at end of file diff --git a/src/tests/lint/test-files/stata_tests.do b/src/tests/lint/test-files/stata_tests.do new file mode 100644 index 0000000..ce2d94e --- /dev/null +++ b/src/tests/lint/test-files/stata_tests.do @@ -0,0 +1,4 @@ +lint bad.do + +lint simple.do + diff --git a/src/tests/lint/test-files/test_detect.py b/src/tests/lint/test-files/test_detect.py new file mode 100644 index 0000000..2482d74 --- /dev/null +++ b/src/tests/lint/test-files/test_detect.py @@ -0,0 +1,42 @@ +from stata_linter_detect import stata_linter_detect_py +import subprocess + +class TestCLI: + def test_cli_bad(self): + assert subprocess.run(["stata_linter_detect", "test/bad.do"]).returncode == 1 + def test_cli_simple(self): + assert subprocess.run(["stata_linter_detect", "test/simple.do"]).returncode == 0 + +class TestDetect: + def test_basic(self): + assert stata_linter_detect_py( + input_file="test/bad.do", + indent=4, + suppress="0", + summary="0", + excel="", + linemax=80, + tab_space=4 + ) == 1 + + def test_excel(self): + assert stata_linter_detect_py( + input_file="test/bad.do", + indent=4, + suppress="0", + summary="0", + excel="linter.xlsx", + linemax=80, + tab_space=4 + ) == 1 + + def test_simple(self): + assert stata_linter_detect_py( + input_file="test/simple.do", + indent=4, + suppress="0", + summary="0", + excel="", + linemax=80, + tab_space=4 + ) == 0 diff --git a/src/vignettes/ado-management-with-repado.md b/src/vignettes/ado-management-with-repado.md index 10da9a1..2ad6c42 100644 --- a/src/vignettes/ado-management-with-repado.md +++ b/src/vignettes/ado-management-with-repado.md @@ -372,7 +372,7 @@ is tracked in the repository, this difference becomes less significant. * Set user root folder global root "C:\Users\user123\github\myproject" -* Set PLUS to adopath and list it first, then list BASE first. +* Set the project adopath as an unnamed path and list it first, then list BASE first. * This means that BASE is first and PLUS is second. adopath ++ "${root}/code/ado" adopath ++ BASE diff --git a/src/vignettes/img/reprun-ex-3-fix.png b/src/vignettes/img/reprun-ex-3-fix.png new file mode 100644 index 0000000..554b263 Binary files /dev/null and b/src/vignettes/img/reprun-ex-3-fix.png differ diff --git a/src/vignettes/img/reprun-ex-3.png b/src/vignettes/img/reprun-ex-3.png new file mode 100644 index 0000000..0d1bca8 Binary files /dev/null and b/src/vignettes/img/reprun-ex-3.png differ diff --git a/src/vignettes/img/reprun-ex-4.png b/src/vignettes/img/reprun-ex-4.png new file mode 100644 index 0000000..f6bd58f Binary files /dev/null and b/src/vignettes/img/reprun-ex-4.png differ diff --git a/src/vignettes/img/reprun-ex-5.png b/src/vignettes/img/reprun-ex-5.png new file mode 100644 index 0000000..9c90530 Binary files /dev/null and b/src/vignettes/img/reprun-ex-5.png differ diff --git a/src/vignettes/img/reprun-ex-6.png b/src/vignettes/img/reprun-ex-6.png new file mode 100644 index 0000000..49d1259 Binary files /dev/null and b/src/vignettes/img/reprun-ex-6.png differ diff --git a/src/vignettes/img/reprun-ex-7.png b/src/vignettes/img/reprun-ex-7.png new file mode 100644 index 0000000..7727a2c Binary files /dev/null and b/src/vignettes/img/reprun-ex-7.png differ diff --git a/src/vignettes/lint-examples.md b/src/vignettes/lint-examples.md new file mode 100644 index 0000000..bddc780 --- /dev/null +++ b/src/vignettes/lint-examples.md @@ -0,0 +1,191 @@ +# lint - Stata command for do file linter + +This section presents examples for the `lint` command. For more information on the command, please refer to the [helpfile.](https://github.com/worldbank/repkit/blob/add-linter/src/mdhlp/lint.md) + +## Installation + +### Installing in Stata + +To install `lint`, type `ssc install repkit` and restart Stata. + + +### Python stand-alone installation + +To install the linter to run directly with Python and not via Stata, clone this repository and then run the following command on your terminal: + +```python +pip install -e src/ +``` + +This will also install `pandas` and `openpyxl` if they are not currently installed. + +## Requirements + +1. Stata version 16 or higher. +2. Python 3 or higher + +For setting up Stata to use Python, refer to [this web page](https://blog.stata.com/2020/08/18/stata-python-integration-part-1-setting-up-stata-to-use-python/). +`lint` also requires the Python package `pandas` and `openpyxl`. +Refer to [this web page](https://blog.stata.com/2020/09/01/stata-python-integration-part-3-how-to-install-python-packages/) to know more about installing Python packages. + +## Content + +`lint` is an opinionated detector that attempts to improve the readability and organization of Stata do files. +The command is written based on the good coding practices of the Development Impact Evaluation Unit at The World Bank. +For these standards, refer to [DIME's Stata Coding practices](https://dimewiki.worldbank.org/wiki/Stata_Coding_Practices) and _Appendix: The DIME Analytics Coding Guide_ of [Development Research in Practice](https://worldbank.github.io/dime-data-handbook/). + +The `lint` command can be broken into two functionalities: + +1. **detection** identifies bad coding practices in one or multiple Stata do-files +2. **correction** corrects a few of the bad coding practices detected in a Stata do-file + +> _Disclaimer_: Please note that this command is not guaranteed to correct codes without changing results. +It is strongly recommended that after using this command you check if results of the do file do not change. + +## Syntax and basic usage + +```stata +lint "input_file" using "output_file", options +``` + +### 1. Detection + +To detect bad practices in a do-file you can run the following: + +```stata +lint "test/bad.do" +``` + +and on your Stata console you will get a summary of bad coding practices that were found in your code: + +```stata +------------------------------------------------------------------------------------- +Bad practice Occurrences +------------------------------------------------------------------------------------- +Hard tabs used instead of soft tabs: Yes +One-letter local name in for-loop: 3 +Non-standard indentation in { } code block: 7 +No indentation on line following ///: 1 +Missing whitespaces around operators: 0 +Implicit logic in if-condition: 1 +Delimiter changed: 1 +Working directory changed: 0 +Lines too long: 5 +Global macro reference without { }: 0 +Use of . where missing() is appropriate: 6 +Backslash detected in potential file path: 0 +Tilde (~) used instead of bang (!) in expression: 5 +------------------------------------------------------------------------------------- +``` + +If you want to get the lines where those bad coding practices appear you can use the option `verbose`. For example: + +```stata +lint "test/bad.do", verbose +``` + +Gives the following information before the regular output of the command. + +```stata +(line 14): Use 4 white spaces instead of tabs. (This may apply to other lines as well.) +(line 15): Avoid to use "delimit". For line breaks, use "///" instead. +(line 17): This line is too long (82 characters). Use "///" for line breaks so that one line has at m +> ost 80 characters. +(line 25): After declaring for loop statement or if-else statement, add indentation (4 whitespaces). +(line 25): Always explicitly specify the condition in the if statement. (For example, declare "if var +> == 1" instead of "if var".) +... +``` + +You can also pass a folder path to detect all the bad practices in all the do-files that are in the same folder. + +### 2. Correction + +If you would like to correct bad practices in a do-file you can run the following: + +```stata +lint "test/bad.do" using "test/bad_corrected.do" +``` + +In this case, the lint command will create a do-file called `bad_corrected.do`. +Stata will ask you if you would like to perform a set of corrections for each bad practice detected, one by one. +You can add the option `automatic` to perform the corrections automatically and skip the manual confirmations. +It is strongly recommended that the output file has a different name from the input file, as the original do-file should be kept as a backup. + +As a result of this command, a piece of Stata code as the following: + +```stata +#delimit ; + +foreach something in something something something something something something + something something{ ; // some comment + do something ; +} ; + +#delimit cr + +``` + +becomes: + +```stata +foreach something in something something something something something something /// + something something { // some comment + do something +} +``` + +and + +```stata +if something ~= 1 & something != . { +do something +if another == 1 { +do that +} +} +``` + +becomes + +```stata +if something ~= 1 & something != . { + do something + if another == 1 { + do that + } +} +``` + +### Other options + +You can use the following options with the `lint` command: + +- Options related to the **detection** feature: + - `verbose`: show all the lines where bad practices appear. + - `nosummary`: suppress the summary of bad practices. + - `excel()`: export detection results to Excel. + +- Options exclusive to the **correction** feature: + - `automatic`: correct all bad coding practices without asking if you want each bad coding practice detected to be corrected or not. + - `replace`: replace the existing output file. + - `force`: allow the output file name to be the same as the name of the input file (not recommended). + +- Options for **both** features: + - `indent()`: specify the number of whitespaces used for indentation (default is 4). + - `linemax()`: maximum number of characters in a line (default: 80) + - `tab_space()`: number of whitespaces used instead of hard tabs (default is 4). + +## Recommended use + +To minimize the risk of crashing a do-file, the `correction` feature works based on fewer rules than the `detection` feature. +That is, we can can detect more bad coding practices with `lint "input_file"` in comparison to `lint "input_file" using "output_file"`. +Therefore, after writing a do-file, you can first `detect` bad practices to check how many bad coding practices are contained in the do-file and later decide whether you would like to use the correction feature. + +If there are not too many bad practices, you can go through the lines flagged by the `detection` feature and manually correct them. +This also avoids potential crashes by the `correction` feature. + +If there are many bad practices detected, you can use the `correction` feature first to correct some of the flagged lines, and then you can `detect` again and `correct` the remaining bad practices manually. +We strongly recommend not overwriting the original input do-file so it can remain as a backup in case `correct` introduces unintended changes in the code. +Additionally, we recommend checking that the results of the do-file are not changed by the correction feature. + diff --git a/src/vignettes/reprun-examples.md b/src/vignettes/reprun-examples.md new file mode 100644 index 0000000..4ebcc31 --- /dev/null +++ b/src/vignettes/reprun-examples.md @@ -0,0 +1,154 @@ +# reprun - examples + +This section presents examples and outputs for the `reprun` command, demonstrating its use for ensuring reproducibility in analyses. For detailed information on the command, please refer to the [helpfile.](https://worldbank.github.io/repkit/reference/reprun.html) + +## Example 1 + +This is the most basic usage of `reprun`. Specified in any of the following ways, either in the Stata command window or as part of a new do-file, `reprun` will execute the complete do-file "_myfile.do_" once (Run 1), and record the "seed RNG state", "sort order RNG", and "data checksum" after the execution of every line, as well as the exact data in certain cases. `reprun` will then execute "_myfile.do_" a second time (Run 2), and find all _changes_ and _mismatches_ in these states throughout Run 2. A table of mismatches will be reported in the Results window, as well as in a SMCL file in a new directory called `/reprun/` in the same location as "_myfile.do_". + + +```stata +reprun "path/to/folder/myfile.do" +``` + +or + +```stata +local myfolder "/path/to/folder" +reprun "`myfolder'/myfile.do" +``` + +## Example 2 + +This example is similar to example 1, but the `/reprun/` directory containing the SMCL file will be stored in the location specified by the `using` argument. + +```stata +reprun "path/to/folder/myfile.do" using "path/to/report" +``` + +or + +```stata +local myfolder "/path/to/folder" +reprun "`myfolder'/myfile.do" using "`myfolder'/report" +``` + +## Example 3 + +Assume "_myfile1.do_" contains the following code: + +```stata +sysuse census, clear +isid state, sort +gen group = runiform() < .5 +``` + +Running a reproducibility check on this do-file using `reprun` will generate a table listing _mismatches_ in Stata state between Run 1 and Run 2. + +``` +reprun "path/to/folder/myfile1.do" +``` + +A table of mismatches will be reported in the Results window, as well as in a SMCL file in a new directory called `/reprun/` in the same location as "_myfile1.do_" and will look like: + +![](img/reprun-ex-3.png) + + +The table shows that Line 3 is flagged. Line 3 (`gen group = runiform() < .5`) generates a new variable `group` based on a random uniform distribution. The RNG state will differ between Run 1 and Run 2 unless the random seed is explicitly set before this command. As a result, a mismatch in the "seed RNG state" as well as "data checksum" will be flagged. + +The issue can be resolved by setting a seed before the command: + +```stata +sysuse census, clear +isid state, sort +set seed 346290 +gen group = runiform() < .5 +``` + +Running the reproducibility check on the modified do-file using `reprun` will confirm that there are no mismatches in Stata state between Run 1 and Run 2: + +![](img/reprun-ex-3-fix.png) + +## Example 4 + +Using the `verbose` option generates more detailed tables where any lines across Run 1 and Run 2 mismatch **or** change for any value. + +```stata +reprun "path/to/folder/myfile1.do", verbose +``` + +In addition to the output in Example 3, it will also report line 2 for **changes** in "sort order RNG" and "data checksum: + +![](img/reprun-ex-4.png) + +## Example 5 + +Assume "_myfile2.do_" contains the following code: + +```stata +sysuse auto, clear +sort mpg +gen sequence = _n +``` + +Running a reproducibility check on this do-file using __reprun__ will generate a table listing _mismatches_ in Stata state between Run 1 and Run 2. + +```stata +reprun "path/to/folder/myfile2.do" +``` + +In "_myfile2.do_", Line 2 sorts the data by the non-unique variable `mpg`, causing the sort order to vary between runs. This results in a mismatch in the "sort order RNG". Consequently, Line 2 and Line 3 (`gen sequence = _n`) will be flagged for "data checksum" mismatches due to the differences in sort order, leading to discrepancies in the generated `sequence` variable, as shown in the results below: + +![](img/reprun-ex-5.png) + +The issue can be resolved by sorting the data on a unique combination of variables: + +```stata +sysuse auto, clear +sort mpg make +gen sequence = _n +``` + +## Example 6 + +Using the `compact` option generates less detailed tables where only lines with mismatched seed or sort order RNG changes during Run 1 or Run 2, and mismatches between the runs, are flagged and reported. + +``` +reprun "path/to/folder/myfile2.do", compact +``` + +The output will be similar to Example 5, except that line 3 will no longer be flagged for "data checksum": + +![](img/reprun-ex-6.png) + +## Example 7 + +`reprun` will perform a reproducibility check on a do-file, including all do-files it calls recursively. For example, the main do-file might contain the following code that calls on "_myfile1.do_" (Example 3) and "_myfile2.do_" (Example 5): + +```stata +local myfolder "/path/to/folder" +do "`myfolder'/myfile1.do" +do "`myfolder'/myfile2.do" +``` + +```stata +reprun ""path/to/folder/main.do" +``` + +`reprun` on "_main.do_" performs reproducibility checks across "_main.do_", as well as "_myfile1.do_", and "_myfile2.do_" and the result will look like: + +![](img/reprun-ex-7.png) + +The output will include tables for each do-file, illustrating the following process: + +- __main.do__: The initial check reveals no mismatches in "_main.do_", indicating no discrepancies introduced directly by it. + +- __Sub-file 1__ ("_myfile1.do_") : `reprun` steps into "_myfile1.do_", where Line 3 is flagged for mismatches, as shown in Example 3. This table will show the issues specific to "_myfile1.do_". + +- __Return to "main.do"__" : After checking "_myfile1.do_", `reprun` returns to "_main.do_". Here, Line 2 is flagged because it calls "_myfile1.do_", reflecting the issues from the sub-file. + +- __Sub-file 2__ ("_myfile2.do_"): `reprun` then steps into "_myfile2.do_", where Line 2 is flagged for mismatches, as detailed in Example 5. + +- __Return to "main.do" (final check) __: After checking "_myfile2.do"_, `reprun` returns to "_main.do_". Line 3 in "_main.do_" is flagged due to the issues in "_myfile2.do_" propagating up. + +In summary, `reprun` provides a comprehensive view by stepping through each do-file, showing where mismatches occur and how issues in sub-files impact the main do-file.