calyx_estimates.txt

4th Feb:-
    Time sheet:-             Completed                               2 hour
    Calyx setup:-            Completed        
    CP-4143:-                Completed
    Reading Documentation:-  In progress (scoring algo pending)

5th Feb:-
    CP-4140:-
        WiFi not working.
        Company reporting google docs and visualization.
        Understanding changes made in sg_DS-212.
        Comparing scripts
        Understanding Manage.py
        CP-4198
        Fix rdp_stage_order

6th Feb:-
    Review points of jp_CP-4198
    tested changes of branches CP-3218(Done), CP-3222(No need), CP-3254(Done) ,CP-3260(Already merged)
    Validating datapoints in company reporting.
    Understanding data(datapoint view, datapoint model)
    Understanding AppTableImporter, test_data.py
    Resolved Conflicts
    KT
    sql folders to cover:-
        denormalization  DONE
        clean            DONE
        transform
        deliver


    to-do:-
        Understand Engagement reports

7th Feb:-
    Validating datapoints.

8th Feb:-
    Validating datapoints.

11th Feb:- 
    Validating datapoints
    Making necessary changes to merge code into master
    Updated metadata
    Metadata for 1 datapoint_view found missing.(BEST_SOURCE_TIME_RD)
    read about emr cluster
    JPL prize distribution.
    ML documentation(intro and supervised learning part 1)

12th Feb:-
    ML Documentation(SL part2,3, Unsupervised learning, neural nets and deep learning, reinforcement learning)
    scikit-learn

13th Feb:-
    read about different libraries for machine learning(scikit, tensorflow, theano, keras)
    Scikit-learn documentation
    Discussing approaches for college clustering.
    CP-4154:-
        Understanding scoring algo

14th Feb:-
    Understanding scoring algo
    Understanding scripts
    found cause of the problem

18th Feb:-
    CP-4154:-
        testing solution.
        insert problem.
        elastic search problem.
        Tested changes.

    CP-4298:-
        Completed and tested.

    network issue.(Discussion with prabhasis)

19th Feb:-
    CP-4298:-
        Review changes
        tested changes

    CP-4297:-
        read about and setup mongo
        setup db
        got and imported into mongodb data
        Suggestions:-
            Can use location
    
    resolved wifi issue.

20th Feb:-
    CP-4297:-
        Investigating factors which can be used in suggesting pools to companies.
        explored mongodb query language
        explored data cleaning
        Suggestions:-
            make graphs for location, skills, education

21st Feb:-
    make sql for files to be delivered at s3 bucket.
    
22nd Feb:-    
    resolved sys.ps2 error
    resolved metadata issue
    resolved column mismatch issue
    CP-4412:-
    CP-4429:-
        BEST_SOURCE_TIME_POS

25th Feb:-
    CP-4429:- 
        There are some datapoints for which we don't have any data.
        I am setting them inactive:
            BEST_SOURCE_TIME_POS, Data Point : BEST_SOURCE_WRT_TIME, for deliver table : Reporting_Deliver.data_points_014 is 0
            OFFERED_CANDIDATES_POS_SOURCE, Data Point : OFFERED_CANDIDATES, for deliver table : Reporting_Deliver.data_points_016 is 0
            TOTAL_TIME_INVESTED_POS_SOURCE, Data Point : TOTAL_TIME_INVESTED, for deliver table : Reporting_Deliver.data_points_016 is 0
            OFFERED_CANDIDATES_POS_TENTH_ACADEMIC_GROUP, Data Point : OFFERED_CANDIDATES, for deliver table : Reporting_Deliver.data_points_021 is 0
            OFFERED_CANDIDATES_POS_WORK_EX, Data Point : OFFERED_CANDIDATES, for deliver table : Reporting_Deliver.data_points_022 is 0
            OFFERED_CANDIDATES_POS_GENDER, Data Point : OFFERED_CANDIDATES, for deliver table : Reporting_Deliver.data_points_023 is 0
            OFFERED_CANDIDATES_POS_COLLEGE, Data Point : OFFERED_CANDIDATES, for deliver table : Reporting_Deliver.data_points_025 is 0
            TIME_INVESTED_SELECTIONS_POS_SOURCE, Data Point : TIME_INVESTED_SELECTIONS, for deliver table : Reporting_Deliver.data_points_016 is 0
            TIME_INVESTED_REJECTIONS_POS_SOURCE, Data Point : TIME_INVESTED_REJECTIONS, for deliver table : Reporting_Deliver.data_points_016 is 0
            OFFERED_CANDIDATES_POS_TWELFTH_ACADEMIC_GROUP, Data Point : OFFERED_CANDIDATES, for deliver table : Reporting_Deliver.data_points_021 is 0
            OFFERED_CANDIDATES_POS_GRADUATION_ACADEMIC_GROUP, Data Point : OFFERED_CANDIDATES, for deliver table : Reporting_Deliver.data_points_021 is 0
            OFFERED_CANDIDATES_POS_TENTH_NORM_ACADEMIC_GROUP, Data Point : OFFERED_CANDIDATES, for deliver table : Reporting_Deliver.data_points_021 is 0
            OFFERED_CANDIDATES_POS_TWELFTH_NORM_ACADEMIC_GROUP, Data Point : OFFERED_CANDIDATES, for deliver table : Reporting_Deliver.data_points_021 is 0
            OFFERED_CANDIDATES_POS_GRADUATION_NORM_ACADEMIC_GROUP, Data Point : OFFERED_CANDIDATES, for deliver table : Reporting_Deliver.data_points_021 is 0
            PCT_OF_TOTAL_APPLICATIONS_POS_SOURCE, Data Point : PCT_OF_TOTAL_APPLICATIONS, for deliver table : Reporting_Deliver.data_points_016 is 0
            PCT_OF_SELECTED_APPLICATONS_POS_SOURCE, Data Point : PCT_OF_SELECTED_APPLICATONS, for deliver table : Reporting_Deliver.data_points_016 is 0
        CP-4429:- resolved conflicts

26th Feb:-
    CP-4429:-
        foreign key constraint issue.

    CP-4412:-


    Understand ML models
    ask shubham to correct regex in degree and specialization
    clean_degree_specilization code can be optimized by adding courses regex and specialization regex into a tuple
    add ctc
    Remove common students between candidate and student tables
    ask varun about null entries of resume_id in candidate_application
    investigate candidates data.(check join in communities)

    ask varun about candidates data in offered_calyx_data.sql
    replace null values with ''

27th feb:-
    fixing sql queries.
    started on understanding models in job_category_model
    final counts:-
        -- 12303
        -- 12345
        
        -- 174825
        -- 164436
    Meeting with Amit sir.

28th Feb:-
    Understanding models in job_category_model branch

1st March:-
    Testing models for demo purpose
    CP-4504:- added salary_range support

4th March:-
    Testing models for demo purpose.
    Updated postgres
    New db dump
    trained models again.
    made predictions again.

5th March:-
    investigated duplicate students in offered candidates(because one student can be placed multiple times)
    added degree filter in get_final_results
    investigated diff in count of all_students data in old dump and new dump
    generated results again
    Demo
    Discussion with Amit sir

6th March:-
    Figuring out approaches for skills nullifying
    Discussion with Amit sir
    Making knn per job category(clean data)
    Discussion with Amit sir(Demo)
    Continue on knn

7th March:-
    Fixing cleaning degree.
    Fixed Memory Error.
    Working on similar colleges algo.
    mongo issue(data not showing)

8th March:-
    working on similar colleges algo(reviewing shubham's code)
    working on tf-idf approach for job_category_model
    meeting with amit sir

12th March:-
    explored tf-idf from medium.com
    generated results
    generated results are not so good
        reason:- tfidf also depends on the length of the document
    random sampling data
    meeting with amit sir
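
The note above about tf-idf depending on document length can be illustrated with a small standard-library sketch. The toy documents and helper names are assumptions made up for illustration, not the project's data or code. It only shows that raw tf-idf weights grow with document length, while L2-normalized vectors do not.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (raw count * smoothed idf)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def l2_normalize(vector):
    """Scale a {term: weight} dict to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in vector.values()))
    return {t: w / norm for t, w in vector.items()} if norm else dict(vector)


# Hypothetical job descriptions: long_doc is short_doc repeated five times.
short_doc = "python sql"
long_doc = " ".join([short_doc] * 5)
other_doc = "java spring"

raw_short, raw_long, _ = tfidf_vectors([short_doc, long_doc, other_doc])
norm_short, norm_long = l2_normalize(raw_short), l2_normalize(raw_long)
```

Here `raw_long["python"]` is five times `raw_short["python"]` even though both documents say the same thing, while the normalized weights are equal.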

13th March:-
    Review points on CP-4504 and CP-4298
    made job category results again
    oversampled data and made predictions on it(Not possible because tfidf itself already tries to normalize by giving more weightage to documents having fewer terms)
    Understanding college clustering
    read correlation(how can we normalize our data in correlation algo we use on colleges clustering)
    understanding PCA(ask why use PCA)
    Understand Kmeans

14th March:-
    review point of CP-4298:-
        place placement drive != 3 at one place
    place status!=10 at one place(do not include it in ticket)
    add is_active check at all places.

    ask shubham about the doubt.

    meeting with amit sir
    similar colleges algo:-
        tried StandardScaler
    
    meeting with Amit sir
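
For reference, what StandardScaler does to a single feature column can be sketched with the standard library: subtract the column mean and divide by the population standard deviation. The feature values and the function name below are made up for illustration and are not taken from the similar colleges code.

```python
from statistics import fmean, pstdev


def standardize(column):
    """Z-score a list of numbers: (x - mean) / population standard deviation."""
    mean = fmean(column)
    std = pstdev(column)
    if std == 0:
        # Constant column: centre it and leave the scale alone.
        return [0.0 for _ in column]
    return [(x - mean) / std for x in column]


# Hypothetical per-college feature, e.g. number of offers.
offers = [2, 4, 4, 4, 5, 5, 7, 9]
scaled = standardize(offers)
```

After scaling, the column has mean 0 and unit variance, so features measured on large scales no longer dominate distance or correlation calculations.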

15th March:-
    scrape entrance exams data.
        useful links:-
            https://engineering.careers360.com/articles/jee-main-cutoff
    
    use linkedin skills to improve skills set
    improved algo
    worked on improving tfidf approach

18th March:-
    CP-4298:- review points(unnecessary changes asked by colleague).
    improving skill extraction from job description.
    similar colleges approach.
    explored tensorflow to improve similar_colleges.

19th March:-
    working on similar college algo(approach including courses, companies and job titles)

26th March:-
    worked on similar colleges algo
        Found nan issue
        worked on how to one hot encode large data
    resolved issues in company reporting
    discussion with amit sir
    meeting over feedback forums
    CP-4731:- check why data mismatch in different datapoints(academic group, gender, total)
        datapoint 10, 11, 13

27th March:-
    CP-4731:- Fixed data mismatch in different datapoints(academic group, gender, total)
    investigated use of stage name in data point views
    replaced stage name with stage_id and related changes and testing
    set up new laptop
    read about models
    investigated why fewer students in round3 than offered round(data bad from app side)

28th March:-
    investigated why fewer students in round3 than offered round(data bad from app side)
    reviewed shubham's change
    investigated primary, secondary fields for job category.
    why different offered data in 23, 25, 28
    why no data in 24
    why 0-40 academic group hiring by JTG


29th March:-
    to-do:-
        why different offered data in 23, 25, 28
            diff in 23 and 25 is because of null values
            count in 23:- 46
            count in 25:- 37 + 8 null values = 44 because there are 2 repeated in 023 and in this they are in null values
            count in 28:- 44
        why maximum offered in PCT_40
			  meeting with rishu sir
        

1st April:-
		CP_4759:- why maximum offered in PCT_40 and fixed it
		why different offered data in 23, 25, 28(22873, 21078 community_candidate_id)
		set up new laptop
		to-do:-
				set up laptop
				complete time sheet
		to-do in future:-
				why different offered data in 23, 25, 28(22873, 21078 community_candidate_id)
				add unknown college in place of null colleges
		setup laptop
		time sheet

2nd April:-
	setup laptop
	CP-4790:- debug data_point 014
	CP-4787:- AttributeError: 'CollegeReporter' object has no attribute 'pre_defined_stages'
	read intro to ML book
	to-do:-
		CP-4789:-
		make spell checker using google

3rd April:-
	CP-4789:- 
        dependency on college reporting for running company reporting
        figured which tables to move
        fixed Error
        tested changes
        fixed reporting_raw.colleges_collegecourse in scoring algo
        fixed more tables
        fixed deprecated tables issue:- colleges_university
    CP-4796:- fixed data_points_021 for OFFERED_CANDIDATES_GRADUATION
    read intro to ML book

4th April:-
    misc discussion on similar colleges with shubham.
    CP-4747:- added primary and secondary job category suggestions in pool recommendation.
        cleaned data.
        made primary secondary job fields for degree, stream
        integrated primary, secondary job_fields in existing model
            set up ML models in new laptop
    sprint planning

5th April:-
    CP-4747:- added primary and secondary job category suggestions in pool recommendation
        testing results with abhishek
    dumped files to s3:-
        read about how to dump files to s3
    sync up for hackathon
    planning with team regarding company reporting
    KT on company reporting.

to-do:-
    make a ticket to:- fix request_approval step being called in create_weights

8th April:-
    Feedback forms self and peers.(self, saurabh, pankaj)
    dump files to s3:-
        tested code changes.
        debugged .aws/credentials error

9th April:-
    explored sheet made by abhishek.
    company_reporting:-
        explored and documented Pentaho Mondrian.
    worked on offered_candidates table.
    worked on interview table.
    discussions with team.(wbs plans)

10th April:-
    Fetching data for interview dashboard.
    Validating data.

11th April:-
    Making fact and dimension tables for Interview datapoints.
    Validating data.
    meeting with team.

12th April:-
    Merged all datapoints into 1 table.
    changes in sql scripts.
    review changes and data validation.

15th April:-
    Meeting regarding Paper checking.
    Minor review points.
    Tested merger code.
    Interview data validation.(Made charts)
    Paper checking.
    Discussion with team regarding data loss in existing reporting_clean.candidate_position_stage

16th April:-
    Wrote new merger with all scripts logic in one file.
    Data Validation.
    discussion with team regarding data issues.
    Data duplication detection.
    next_stage_call data issue(number of records with next_stage_call is very low.)

17th April:-
    next_stage_call data issue(number of records with next_stage_call is very low.)
    paper checking.
    self feedback form review.
    mcq meeting.

18th April:-
    MCQ/Output questions for written round.
    bringing company reporting onto a single platform.(using same clean tables)
    Data Validation.
    Misc discussions with team.

19th April:-
    CP-4925:- 
        Code.
        Testing.
    Data Validation.
    Skills cleaning.

22nd April:- 
    data validation.
        interview_date
        boards data
    Skills cleaning.
        setup mongodb and indeed data.
        extract skills from indeed_v6
    CP-4412:-
        Review Points.
    Fixed create/replace function in db_connection.py

23rd April:-
    Skill cleaning.
	fuzzywuzzy.
	Decision tree based approach.
    Misc discussion with team.
    mcq discussion.

24th April:-
    Verification of algo and cleaning.
    skill cleaning optimization and trying other techniques.
    combining skill cleaning with tfidf.
    Updated engagement reports:-
	understood report
	changes sql, excel, python
	tested changes
	Validated Changes
    KT

25th April:-
    skills_cleaning:-
    	skill cleaning optimization and trying other techniques.
    	combining skill cleaning with tfidf to prioritize results.(Stemming and Lemmatization using nltk && can use tfidf to rank)
    	Verification of algo and cleaning.
    CP-5003:- Review Points.
	
26th April:-
    Skills_cleaning:-
	skill cleaning testing/validation of results.
        iteration_2:- ranking of top results(5 or 3 results).

30th April:-
   Skills cleaning:-
	Ranking of top results.
	tuning of ratios.
	updated partial ratio in fuzzywuzzy.
	team meeting and discussion.
	Finalizing code.
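
The ranking of top results can be sketched as below. This is a minimal sketch and not the finalized code: `difflib` from the standard library stands in for fuzzywuzzy so that it has no dependencies, and it does not reproduce fuzzywuzzy's partial ratio. The skill names, the `cutoff` value and the function name are assumptions for illustration.

```python
from difflib import SequenceMatcher


def rank_skill_matches(raw_skill, verified_skills, top_n=3, cutoff=0.6):
    """Return up to top_n (skill, score) pairs scoring at least cutoff, best first."""
    query = raw_skill.strip().lower()
    scored = [
        (skill, SequenceMatcher(None, query, skill.lower()).ratio())
        for skill in verified_skills
    ]
    # Drop weak candidates, then keep only the best few.
    scored = [pair for pair in scored if pair[1] >= cutoff]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Hypothetical verified list and a misspelled raw skill.
verified = ["JavaScript", "Java", "Python", "PostgreSQL", "MySQL"]
matches = rank_skill_matches("javascrpt", verified)
```

Tuning `cutoff` and `top_n` plays the same role as tuning the ratios and choosing between the top 5 or top 3 results.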

1st May:-
    updated partial ratio in fuzzywuzzy.
    code cleaning.

2nd May:-
    result verifications.
    CP-5066:-
	can use:-
	    A value from another closely related record.
	    A value estimated by another predictive model.
    misc discussions.
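
Assuming the two options listed under CP-5066 refer to filling in missing values, the first one (a value from another closely related record) can be sketched as a nearest-record lookup. The field names, sample records and function name are hypothetical and only illustrate the idea.

```python
def impute_from_nearest(records, target_key, feature_keys):
    """Fill missing target_key values from the closest complete record.

    Closeness is the squared Euclidean distance over feature_keys.
    """
    complete = [r for r in records if r.get(target_key) is not None]
    filled = []
    for record in records:
        if record.get(target_key) is not None or not complete:
            filled.append(dict(record))
            continue
        nearest = min(
            complete,
            key=lambda c: sum((c[k] - record[k]) ** 2 for k in feature_keys),
        )
        filled.append({**record, target_key: nearest[target_key]})
    return filled


# Hypothetical student records; the last one has no graduation percentage.
students = [
    {"tenth_pct": 90, "twelfth_pct": 88, "graduation_pct": 82},
    {"tenth_pct": 60, "twelfth_pct": 62, "graduation_pct": 58},
    {"tenth_pct": 87, "twelfth_pct": 85, "graduation_pct": None},
]
filled = impute_from_nearest(students, "graduation_pct", ["tenth_pct", "twelfth_pct"])
```

The second option (a value estimated by another predictive model) would replace the nearest-record lookup with a trained model's prediction.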

3rd May:-
    CP-5066:-
	verified validity of solution.
	how does it work.
	how to improve results.
    misc discussions on dbtos3.
    mcq meeting.

6th May:-
    CP-5066:-
	Improving results.
    which is better psql or python to import big csvs into database.
    code review
    CP-5133:- fixed engagement reports.

7th May:-
    CP-5133:- fixed engagement reports.
    Time sheet.
    investigated difference in processing power taken when using the copy command of sql vs a dataframe in python to load a csv file to db.
    discussion with abhishek on loading data into chunks vs using psql command.
    sprint planning.
    Made functions in db_connection.py for importing/exporting large data

8th May:-
    Made functions in db_connection.py for importing/exporting large data
    made import function and testing.
    made export function and testing.(found ways to export large files.)
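
A minimal sketch of the chunked import idea follows. It is not the db_connection.py code: `sqlite3` stands in for Postgres so the example is self-contained, and the table, column names and function name are assumptions. Rows are read and inserted `chunk_size` at a time so the whole file never has to sit in memory.

```python
import csv
import io
import sqlite3
from itertools import islice


def import_csv_in_chunks(conn, table, csv_file, chunk_size=1000):
    """Insert rows from an open CSV file into table, chunk_size rows at a time."""
    reader = csv.reader(csv_file)
    header = next(reader)
    columns = ", ".join(header)
    placeholders = ", ".join("?" for _ in header)
    # Table and column names are interpolated, so they must come from trusted input.
    sql = f"INSERT INTO {table} ({columns}) VALUES ({placeholders})"
    total = 0
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            break
        conn.executemany(sql, chunk)
        conn.commit()
        total += len(chunk)
    return total


# Hypothetical table and in-memory CSV used only to exercise the function.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE skills (id INTEGER, name TEXT)")
data = io.StringIO("id,name\n1,python\n2,sql\n3,java\n")
inserted = import_csv_in_chunks(conn, "skills", data, chunk_size=2)
```

Against Postgres, the database's own COPY command is the usual alternative to row-wise inserts, which is the comparison noted on 7th May.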

9th May:-
    tested export function.
    tested import function.
    updated dbtos3.py code.
    tested db_to_s3.py changes.
    handling missing values in training data.(explored MICE algo to impute missing data)
    analyzed knn results.
    a way to bypass memory error.

10th May:-
    Missing values result discussion with Abhishek.
    Drive Recommendation data analyzing.
	Helpful Links:-
	    https://www.goodworklabs.com/travel-recommendation-engine-ai-models/
    	    https://www.hindawi.com/journals/cin/2016/1291358/
    kmeans and haversine distance.
    collect different location codes.
	Discussion on 2 approaches:-
	    make clusters based on company centroids.
            make clusters based on other factors(ranking, location, number of colleges).
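
The haversine distance mentioned above, together with the first approach (assigning colleges to the nearest company centroid), can be sketched as follows. The coordinates are approximate city locations chosen for illustration, and the function names and the mean Earth radius constant are assumptions, not values from the project.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (latitude, longitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    # Clamp to guard against rounding slightly above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to point by haversine distance."""
    return min(
        range(len(centroids)),
        key=lambda i: haversine_km(point[0], point[1], centroids[i][0], centroids[i][1]),
    )


# Approximate coordinates: two hypothetical company centroids and one college.
centroids = [(28.6139, 77.2090), (19.0760, 72.8777)]  # near Delhi, near Mumbai
college = (28.4595, 77.0266)                          # near Gurgaon
cluster = nearest_centroid(college, centroids)
```

Plain k-means assumes Euclidean distance, so combining it with haversine distance usually means either projecting coordinates first or using a clustering method that accepts a custom distance.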

13th May:-
    Drive Recommendation:-
	collecting geocodes to test.
	    made google api_key(2 Hour.)
    MCQ and output questions.
    MCQ meeting.
	(Reviewed ankur's mcq and output questions)
	got questions reviewed.
    discussion with team on recruitment questions.

14th May:-
    made output_questions(1 hour)
    clustering colleges(reading and implementing)
	trying agglomerative hierarchical clustering with some modifications.(modifications to include count.)
    Understood scikit-learn's hierarchical.py to make modifications into it
    setup python3 to include distance_threshold
	try college_density
    segmenting results

15th May:-
    clustering based on count in colleges.
    mcq meeting
    try neural networks.
    can recommend multiple pools even with fewer candidates than required.
    can skip non 

16th may:-
    clustering count based.

17th may:-
    clustering count based.

20th May:-
    clustering count based
    review points of CP-4412
    review points of CP-5043

21st May:-
    review points of CP-5043
    resolved python package issue.
    resolved unicode decode error.
    manually cleaned all skills file for non ascii chars
    wrote wrapper to incrementally import files from s3 to db.
    sprint planning.

22nd May:-
    CP-5248:-
	added wrapper function for incremental import.
        tested wrapper function.
	import wrapper to import all types of files.
   CP-5252:-
	set up aws and synced data folder.
	coded whole thing

23rd May:-
    Team Retro.
    CP-5252:-
    CP-5247:-

24th May:-
    CP-5247:- Figured when to replace null values and when not to.(max how many values can predict missing values.)
	Figuring where to put missing values code. As ideally it should be placed in calyx side.
	tried various combinations to predict missing values.
	implemented code for finding missing values.

27th May:-
    CP-5247:- Implementing code. and review points.
    CP-5249:- the offers has been made for different offer jobs ids.

28th May:-
    (fixed offered data query)CP-5249:-
    (missing_values)CP-5247:- review fixes.
    made demo results.
    CP-4412:- review changes.
    CP-5309:-

29th May:-
    CP-5309:- Understanding metric.py
    figuring ways to incorporate and/or in metrics.py
    figured all the cases of where condition.

30th may:-
    CP-5309:- implementing code
    Prepared results with new data.
    CP-5251:- 

3rd June:-
    CP-xxxx:- Improve drive recommendation results.
	Removed cluster_on_count function call.

    Explored more clustering algos.
    DBSCAN and OPTICS.
    Paper checking. 1 Hour

4th June:-
    CP-xxxx:- Improve drive recommendation results.
	Clustering only on the basis of distance and stopping on count.
    CP-5309:- Review Points.
    Meeting with Rishu Sir.
    Team Meet.

5th June:-
    Team meet.
    Decide work.
    Investigated careers360.com for data scraping.
    Task defining.
    Meeting with Rishu sir.
    Sprint plan with team.

6th June:-
    CP-5392:-

7th June:-
    CP-5392:- 
	Make a list with verified skills.
	Mark skills to delete and their new names in batches.

10th June:-
    CP-5392:-
	Stuck on an error.(Problem was that for each skill I have to move the file pointer back to the start of the file.)
    Tasks to do:-
        Make a list of verified skills.Done
	Cluster skills which are equal case-insensitively.(make a dict) Done
        Break skills on comma and output an array.(Handled multiple skills in a line.)(don't do anything in that case.)Done
	run script on skills in batches.
        Manually verify verified skills.
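
Two of the points above, grouping skills that are equal ignoring case and rewinding the file pointer for each skill, can be sketched as below. This is an illustration under assumed names and sample data, not the CP-5392 script.

```python
import io
from collections import defaultdict


def group_case_insensitive(skills):
    """Map each lowercased skill to all original spellings seen for it."""
    groups = defaultdict(list)
    for skill in skills:
        cleaned = skill.strip()
        if cleaned:
            groups[cleaned.lower()].append(cleaned)
    return dict(groups)


def count_occurrences(skill_file, skills):
    """Count matching lines per skill, re-reading the file for every skill."""
    counts = {}
    for skill in skills:
        # Rewind first; otherwise every skill after the first reads an exhausted file.
        skill_file.seek(0)
        counts[skill] = sum(
            1 for line in skill_file if line.strip().lower() == skill
        )
    return counts


# Hypothetical skills file held in memory.
raw = io.StringIO("Python\npython\nSQL\nsql \nJava\n")
groups = group_case_insensitive(raw.read().splitlines())
counts = count_occurrences(raw, list(groups))
```

Without the `seek(0)` call the counts for every skill would be zero here, because `raw.read()` has already moved the pointer to the end of the file.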

11th June:-
    CP-5392:-
	run script on skills in batches.(Done)
        Manually verify verified skills.(Done)
	Remove duplicate data from list.(Done)

    Commit neat code in calyx repo
    sprint planning.

12th June:- 
    Commit neat code in calyx repo.
    CP-5447:- Find standardized list of boards
	Find list of school boards.
	Find list of diploma boards.
	It is assisted by 10 Statutory Boards of Studies, namely, UG Studies in Eng. & Tech., PG and Research in Eng. and Tech., Management Studies, Vocational Education, Technical Education, Pharmaceutical Education, Architecture, Hotel Management and Catering Technology, Information Technology, Town and Country Planning.

13th June:-
   CP-5447:-
	made sheet of verified school boards.(Mark extra boards as verified.)
	made sheet of verified diploma boards.

14th June:-
	make sheet of verified diploma boards.
	to-do:-
		Make sprint plan for college reporting.
		Explore college scraping.

17th June:-
	Chat bot questions
	review each other questions
	web scraping:-
		intro
		install and set selenium with chromium driver.

18th June:-
	scraper how to.
	LinkedIn data scraper.
	explored inspecting elements.
	scraped ymca students data.

20th June:-
	Finalize skill cleaning code.(Done)
	change req.txt in calyx.(add fuzzywuzzy, nltk)
	meeting with rishu sir.(Done)
	Try to scrape data from linkedin.

21st June:-
	change req.txt in calyx.(add fuzzywuzzy, nltk)
	Try to scrape data from linkedIn.
		tried scraping by simply fetching the page from a url.
	Investigate different ways to scrape data from web.
	scrape college ratings from web.

24th June:-
	college data scraping(do within 4 hours.)

25th June:-
	doing previous sprints pending tickets.
		Gave PR for skill cleaning.
	to-do:-
		Finalized db to s3 code.
	CP-5633:-
		Company names cleaning
	Meeting with Amit sir.

26th June:- 
	CP-5633:-
		Company names cleaning.
		to do:-
			Make robust cleaning structure.
				store results in a file.
				start from a particular chunk.
				store actual uncleaned value in dict
	meeting with amit sir and rishu sir
	discussion with team.

27th June:-
	CP-5674:- Manually verify all skills